Hibiki is a decoder-only model that performs simultaneous speech-to-speech translation by processing source and target speech synchronously through a multistream language model. The system addresses the challenge of real-time translation by using a weakly-supervised method that leverages perplexity to identify optimal delays and create aligned synthetic data. Hibiki achieves state-of-the-art performance on French-English translation tasks while maintaining speaker fidelity and naturalness, with inference simple enough for batched translation and real-time on-device deployment.
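The summary does not spell out how perplexity is turned into delays, so the following is only a minimal sketch of one plausible reading: for each target word, find the point in the source where hearing one more source word makes that target word most predictable, and treat that as the earliest moment the word may be emitted. The function name `align_delays`, the `logp` table layout, the toy numbers, and the monotonicity constraint are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def align_delays(logp: List[List[float]]) -> List[int]:
    """Pick an emission delay for each target word from log-likelihoods.

    logp[i][j] is a (hypothetical) log-probability of target word i given
    the first j source words, for j = 0..S. In practice these scores would
    come from a translation model; here they are supplied by hand.

    Returns, for each target word, how many source words should be heard
    before that word is emitted.
    """
    delays: List[int] = []
    previous = 0
    for row in logp:
        # Gain in log-likelihood from revealing source word j (1-based).
        gains = [row[j] - row[j - 1] for j in range(1, len(row))]
        # Source position whose arrival helps this target word the most.
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        # Assumption: output is spoken in order, so delays never decrease.
        best = max(best, previous)
        delays.append(best)
        previous = best
    return delays


# Toy table: 3 target words, 3 source words (columns j = 0..3).
example_logp = [
    [-5.0, -1.0, -0.9, -0.9],  # becomes predictable after source word 1
    [-6.0, -5.5, -5.0, -0.5],  # needs source word 3
    [-4.0, -3.8, -0.7, -0.6],  # needs source word 2, but must follow word 2 of the target
]
delays = align_delays(example_logp)
print(delays)  # [1, 3, 3]
```

In this toy case the third target word would be ready after two source words, but it is held back to three so that it does not precede the second target word. Delays of this kind could then be used to time-shift synthetic target speech against the source when building aligned training pairs.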

From arxiv.org