The new voice control feature is built on the recently launched synchronous video-audio generation model, Kling 2.6. Similar to Google's Veo 3 or OpenAI's Sora 2, this model can generate sound effects that match the visual content, including speech and music.
Kling AI states that the functionality supports various types of human voices—such as speaking, dialogue, narration, singing, and rapping—and is also capable of handling ambient noise and synthesizing scene-specific sounds. The model accepts both plain text descriptions and combinations of text with images as input.
Kling AI has demonstrated a wide range of application scenarios: product demos, lifestyle vlogs, news reporting, sports commentary, documentaries, interview formats, short dramas, and musical performances—including solo singing and even polyphonic choral arrangements.
Custom Voice Training Enhances Character Consistency
The updated voice control allows users to upload their own voice samples to train the model, or directly import audio files. Trained or uploaded voices can then be applied across text-to-video creations.
This significantly improves character consistency—characters in generated videos can now speak in a distinct and recognizable voice, enabling the creation of unified personas across multiple video clips.
Enhanced Motion Control for Complex Movements
The second major upgrade focuses on motion control. According to Kling AI, the system now captures full-body movements with greater precision. Even fast and intricate actions such as martial arts or dance routines are processed more accurately than before.
The company specifically highlights improvements in two areas traditionally challenging for AI-generated video: hand gestures appear sharper and free from blurring, while facial expressions and lip-syncing remain natural and lifelike.
Users can upload motion references between 3 and 30 seconds long to create seamless sequences. Scene details can additionally be fine-tuned using text prompts.
Social media already features impressive examples showing how AI creators are capitalizing on these accessible tools, especially as platform algorithms favor engaging short-form content. As a result, the volume of AI-generated video is set to keep growing, alongside the emergence of genuinely creative applications.
Competitive Pricing Strategy
Besides its native platform, Kling is available through third-party services like Fal.ai, Artlist, and Media.io. API pricing on these platforms ranges from approximately $0.07 to $0.14 per second of generated video, making it competitively priced. Costs vary based on generation speed, duration, and resolution. Kling AI itself operates on a credit-based system.
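To put the reported per-second range in concrete terms, a minimal sketch of the cost arithmetic follows. The $0.07–$0.14/s figures come from the article's stated range; the function and clip length are illustrative assumptions, and actual pricing depends on generation speed, duration, and resolution.

```python
# Rough cost estimate for Kling video generation via third-party APIs.
# Rates below reflect the article's reported $0.07-$0.14 per second range;
# real-world pricing varies by speed, duration, and resolution.

def estimate_cost(duration_s: float, rate_per_s: float) -> float:
    """Return the estimated USD cost for a clip of the given length."""
    return round(duration_s * rate_per_s, 2)

# A 10-second clip at the low and high ends of the reported range:
low = estimate_cost(10, 0.07)   # $0.70
high = estimate_cost(10, 0.14)  # $1.40
print(f"10s clip: ${low:.2f} to ${high:.2f}")
```

At these rates, a one-minute video would land somewhere between roughly $4.20 and $8.40, which helps explain why short-form clips dominate current usage.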
In early December, Kuaishou unveiled Video O1, which the company describes as the "world’s first unified multimodal video model," integrating both video generation and editing into one framework. Video O1 enables text-based editing of existing videos, allowing changes to the main subject, weather conditions, or overall visual style.
With these latest Kling 2.6 capabilities, Kuaishou is positioning itself competitively against Western tech leaders like Google, OpenAI, and Runway, as well as domestic rivals such as Hailuo, Seedance, and Vidu.
Kuaishou operates Kwai, one of the world’s largest short-video platforms and a direct competitor to TikTok. This gives the company immediate access to vast amounts of synchronized video-audio pairs and real-world motion data, which are instrumental in training models for realistic audiovisual and movement synthesis.