Meta introduces SAM Audio, a unified multimodal model that isolates sounds from complex audio mixtures using text, visual, or temporal prompts. Built on the Perception Encoder Audiovisual (PE-AV), it achieves state-of-the-art performance across speech, music, and general sound separation. The release includes SAM Audio-Bench (the first in-the-wild audio separation benchmark), SAM Audio Judge (an automatic evaluation model), and integration into the Segment Anything Playground. The model operates faster than real-time and supports flexible prompting combinations, though it cannot separate highly similar audio events or perform complete separation without prompts.