Large language modeling has transformed fields such as text and speech generation. This review compares four popular open-source text-to-speech (TTS) models: Kokoro, SparkTTS, F5-TTS, and Sesame CSM. Kokoro is lightweight and efficient but lacks voice cloning. SparkTTS offers customizable voices but proved less effective in practice. F5-TTS excels in both output quality and voice cloning. Sesame CSM, despite impressive demonstrations, does not match F5-TTS in output quality. Overall, F5-TTS is recommended as the best TTS model of the four.
Table of contents
Kokoro
SparkTTS
F5-TTS
Run F5-TTS on GPU Droplets
Sesame CSM
Choosing the Best Model for TTS