Translatotron 3 is a novel unsupervised speech-to-speech translation architecture that learns the translation task from monolingual data alone. This addresses the scarcity of parallel speech data, and it also allows non-textual attributes of speech to be carried over into the translation. The model combines pre-training, unsupervised embedding mapping, and a reconstruction loss based on back-translation. In Spanish-English speech-to-speech translation experiments, it outperforms a baseline cascade system.
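The back-translation reconstruction loss is what lets the model train without parallel pairs. An input is translated into the other language, the result is translated back, and the round trip is penalized for drifting from the original. Below is a minimal, framework-free sketch of that idea. It assumes utterances are plain lists of floats (standing in for spectrogram frames) and uses mean squared error as the distance. The names `back_translation_loss`, `forward`, `backward`, and `mse` are hypothetical illustrations and are not Translatotron 3's actual code or loss definition.

```python
from typing import Callable, List

# Hypothetical stand-in for a speech representation (e.g. flattened spectrogram frames).
Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def back_translation_loss(
    batch: List[Vector],
    forward: Callable[[Vector], Vector],   # hypothetical: source -> other language
    backward: Callable[[Vector], Vector],  # hypothetical: other language -> source
) -> float:
    """Average round-trip reconstruction error over a batch (illustrative sketch).

    In a real training setup the first (forward) pass would usually be treated
    as a fixed pseudo-label, with gradients flowing only through the backward
    pass; here both are plain Python functions, so no gradients are involved.
    """
    if not batch:
        raise ValueError("empty batch")
    total = 0.0
    for x in batch:
        pseudo_target = forward(x)                 # translate into the other language
        reconstruction = backward(pseudo_target)   # translate back
        total += mse(x, reconstruction)            # penalize round-trip drift
    return total / len(batch)


if __name__ == "__main__":
    batch = [[1.0, 2.0], [0.5, -1.5]]
    # Toy "translators": a perfect round trip gives zero loss; a lossy one does not.
    perfect = back_translation_loss(batch, lambda v: [2 * x for x in v], lambda v: [x / 2 for x in v])
    lossy = back_translation_loss(batch, lambda v: [2 * x for x in v], lambda v: [0.4 * x for x in v])
    print(perfect, lossy)
```

With an exactly invertible pair of toy translators the loss is zero, while a lossy return path yields a positive loss that training would push down. This is the signal that substitutes for parallel supervision.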