Mistral AI Releases Voxtral Open-Weight Text-to-Speech Model

• Mistral AI launches Voxtral TTS, an open-weight text-to-speech model for non-commercial use.
• The 4-billion-parameter model supports voice cloning using only 3 seconds of reference audio.
• Voxtral TTS achieves low-latency performance designed for real-time conversational agents.

Mistral AI has released Voxtral TTS, its first text-to-speech model. Built upon the architecture of the Ministral 3B model, this 4-billion-parameter system is designed to run efficiently on consumer hardware. It supports nine languages, including English, French, and Hindi, and is available under a CC BY-NC 4.0 license, which allows for non-commercial, research, and academic use.

A standout capability of the model is its zero-shot voice cloning, which requires just three seconds of reference audio to capture a speaker's unique characteristics, such as intonation and emotional tone. In human evaluation tests, the model outperformed ElevenLabs Flash v2.5 in a majority of blind assessments.

Designed for real-time use cases, the system achieves a time-to-first-audio of approximately 100 milliseconds. Users can run the model locally with quantization (storing weights at lower numerical precision to reduce memory use) or access it through the Mistral API for commercial applications. The technical implementation takes a hybrid approach, combining semantic token generation with flow matching to separate speech content from voice style.
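
To give a rough sense of why quantization matters for local deployment, the sketch below estimates how much memory the weights alone would occupy at several precisions. It assumes a round figure of exactly 4 billion parameters and ignores activations, caches, and runtime overhead, so real usage will be higher. The function name `weight_memory_gb` and the chosen bit widths are illustrative, not taken from Mistral's documentation.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 4 billion parameters, as reported for Voxtral TTS.
NUM_PARAMS = 4e9

# Hypothetical precisions for comparison; which formats are actually
# supported for this model is not stated in the article.
footprints = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint from about 8 GB to about 2 GB, which is the kind of reduction that makes a model of this size practical on consumer hardware.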
