Mistral AI unveils Voxtral Transcribe 2, undercutting competitors on speech recognition pricing. The second-generation speech recognition model starts at $0.003 per minute, and Mistral claims it is more accurate than GPT-4o mini Transcribe, Gemini 2.5 Flash, and Deepgram Nova. The family comprises two variants: Voxtral Mini Transcribe V2, for processing larger audio files, and Voxtral Realtime, for applications that need latency below 200 milliseconds. Voxtral Realtime costs twice as much, or $0.006 per minute, and uses a proprietary streaming architecture that transcribes audio as it arrives. It is aimed at voice assistants, real-time captions, and call center analytics.
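To put the per-minute prices in perspective, the following is a minimal Python sketch of the cost arithmetic. It assumes the $0.003 starting price applies to Voxtral Mini Transcribe V2 and that Voxtral Realtime is billed at exactly twice that rate; the function and constant names are illustrative only and are not part of any Mistral SDK. Actual invoices may differ because of rounding or minimum charges.

```python
from decimal import Decimal

# Rates as reported in the article (USD per minute of audio).
# Assumption: the $0.003 starting price is the Mini Transcribe V2 rate.
BATCH_RATE_PER_MIN = Decimal("0.003")
# The article states the realtime model costs twice the base rate.
REALTIME_RATE_PER_MIN = BATCH_RATE_PER_MIN * 2


def transcription_cost(minutes, realtime=False):
    """Estimate the cost in USD of transcribing `minutes` of audio.

    Hypothetical helper for illustration; it ignores any rounding,
    minimum charges, or volume discounts a provider might apply.
    """
    rate = REALTIME_RATE_PER_MIN if realtime else BATCH_RATE_PER_MIN
    # Convert via str() so float inputs such as 2.5 stay exact.
    return Decimal(str(minutes)) * rate


if __name__ == "__main__":
    for minutes in (1, 60, 1000):
        batch = transcription_cost(minutes)
        live = transcription_cost(minutes, realtime=True)
        print(f"{minutes} min: batch ${batch:.3f}, realtime ${live:.3f}")
```

Under these assumptions, one hour of audio would cost about $0.18 with the batch model and $0.36 with the realtime model.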
Both models support 13 languages, including German, English, and Chinese. New features include speaker diarization, word-level timestamps, and support for recordings up to three hours long. Voxtral Realtime is available via API and as open weights on Hugging Face under the Apache 2.0 license, while Voxtral Mini Transcribe V2 is available only through Le Chat, the Mistral API, and the Playground. Mistral released the first-generation Voxtral in July 2025.