Alibaba's New Qwen Model Clones Voice from Three-Second Audio Clip

2025-12-24

Alibaba Cloud's Tongyi Qianwen team has unveiled two new AI voice models: one designs voices from text descriptions, and the other clones them from short audio samples. The first, Qwen3-TTS-VD-Flash, lets users create highly customized voices from detailed descriptions, with precise control over characteristics such as emotion and speaking speed. For instance, users can generate a "loud baritone from a middle-aged man: energetic, fast-paced TV shopping-style speech with exaggerated intonation and strong sales appeal." According to official reports, this model outperforms OpenAI's gpt-4o-mini-tts interface, introduced this spring.

The second model, Qwen3-TTS-VC-Flash, needs only a three-second audio sample to clone a voice and can reproduce its vocal characteristics in ten languages. The Tongyi Qianwen team claims that the model achieves a lower error rate than competing solutions such as ElevenLabs and MiniMax. It can also handle complex texts, simulate animal sounds, and extract target voices from recordings. Both models are accessible via Alibaba Cloud's API, and demo versions are available for testing on Hugging Face.