Lightricks Open-Sources AI Video Model LTX-2 to Challenge Sora and Veo

2026-01-12


Lightricks, an Israeli AI startup, has open-sourced its 19-billion-parameter model LTX-2, a unified system capable of generating synchronized audiovisual content from text prompts. The company claims the model significantly outperforms existing solutions in both speed and multimodal coherence.

According to the technical documentation, LTX-2 can produce up to 20 seconds of high-fidelity video with spatial audio from a single text input. This includes lip-synced speech, ambient sound effects, foley, and scene-adaptive background music—all generated simultaneously. The full version supports output resolution up to 4K and frame rates as high as 50 fps.

The research team highlights fundamental limitations in current audiovisual generation approaches. Most systems operate sequentially, either generating video first and then adding audio, or vice versa. Such decoupled pipelines fail to capture the true joint distribution between the modalities: lip movement is driven by the audio track, for instance, while the acoustic environment is conditioned on what appears in the scene. Only an integrated architecture can model these bidirectional dependencies effectively.

Why Asymmetric Architecture Matters for Audiovisual Synthesis

LTX-2 is built on an asymmetric dual-stream transformer with a total of 19 billion parameters. The video stream accounts for 14 billion parameters, substantially larger than the 5 billion allocated to the audio stream. This imbalance reflects the differing information density across modalities, according to the developers.

Each stream processes its modality through a separate variational autoencoder. This separation enables modality-specific positional encodings: 3D rotary position embeddings (RoPE) for the spatiotemporal structure of video, and 1D RoPE for the purely temporal structure of audio. Bidirectional cross-attention layers bridge the two streams, precisely aligning visual events with their corresponding sounds, such as a ball hitting the ground.
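
To illustrate the idea, here is a minimal sketch of such a dual-stream block. The hidden sizes, head counts, and token counts are placeholders, and rotary embeddings, layer norms, and MLPs are omitted, so this is not the actual LTX-2 implementation.

```python
# Minimal sketch of one asymmetric dual-stream block with bidirectional
# cross-attention. Hidden sizes, head counts, and token counts are
# placeholders; rotary embeddings, layer norms, and MLPs are omitted.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d_video=2048, d_audio=1024, n_heads=16):
        super().__init__()
        # Per-modality self-attention (video tokens would carry 3D RoPE,
        # audio tokens 1D RoPE; RoPE application is left out here).
        self.video_self = nn.MultiheadAttention(d_video, n_heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        # Bidirectional cross-attention: each stream queries the other;
        # kdim/vdim allow the two streams to have different widths.
        self.video_to_audio = nn.MultiheadAttention(
            d_video, n_heads, kdim=d_audio, vdim=d_audio, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(
            d_audio, n_heads, kdim=d_video, vdim=d_video, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        v = video_tokens + self.video_self(video_tokens, video_tokens, video_tokens)[0]
        a = audio_tokens + self.audio_self(audio_tokens, audio_tokens, audio_tokens)[0]
        # Video attends to audio and audio attends to video, so events such as
        # an impact frame and its sound can align in both directions.
        v = v + self.video_to_audio(v, a, a)[0]
        a = a + self.audio_to_video(a, v, v)[0]
        return v, a

# Example: a clip represented by 121 video tokens and 200 audio tokens.
block = DualStreamBlock()
video, audio = block(torch.randn(1, 121, 2048), torch.randn(1, 200, 1024))
```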


For text understanding, LTX-2 leverages Gemma3-12B as its multilingual encoder. Rather than relying solely on the final layer of the language model, it aggregates features across all decoder layers. The system also employs "thinking tokens"—additional placeholders in the input sequence that provide extended processing space for complex prompts before generation begins.
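
In code, the aggregation step might look like the sketch below. It uses a small stand-in checkpoint rather than Gemma3-12B, and the uniform mixing weights and the literal placeholder string for the thinking tokens are illustrative assumptions, not the components LTX-2 actually trains.

```python
# Sketch of all-layer feature aggregation for text conditioning.
# "gpt2" is a small stand-in checkpoint; LTX-2 reportedly uses Gemma3-12B.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "gpt2"  # stand-in encoder for illustration
tok = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id, output_hidden_states=True)

prompt = "A glass shatters on a tiled kitchen floor."
# Hypothetical "thinking tokens": extra placeholders appended to the prompt
# to give the model additional processing room before generation starts.
inputs = tok(prompt + " [think]" * 8, return_tensors="pt")

with torch.no_grad():
    out = encoder(**inputs)

# out.hidden_states holds one tensor per layer (plus the embedding layer).
layers = torch.stack(out.hidden_states)                    # (L, B, T, D)
mix = torch.softmax(torch.zeros(layers.shape[0]), dim=0)   # learned in practice
text_features = (mix[:, None, None, None] * layers).sum(dim=0)  # (B, T, D)
```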

Performance Gains Put LTX-2 Ahead of Rivals

Benchmark results show LTX-2 excels in inference efficiency. On an Nvidia H100 GPU, it takes just 1.22 seconds per step to generate 121 frames at 720p resolution. In contrast, the comparable Wan2.2-14B model (which produces silent video only) requires 22.30 seconds for the same output—making LTX-2 approximately 18 times faster.
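
For reference, the reported speedup follows directly from the two per-step timings:

```python
# Speedup implied by the reported per-step timings (values from the article).
ltx2_step_s = 1.22   # LTX-2, 121 frames at 720p, one step on an H100
wan_step_s = 22.30   # Wan2.2-14B, same output, silent video only
print(f"{wan_step_s / ltx2_step_s:.1f}x")  # ~18.3x
```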

The maximum 20-second duration also surpasses competing models: Google’s Veo 3 reaches 12 seconds, OpenAI’s Sora 2 manages 16 seconds, and Character.AI’s open Ovi model caps at 10 seconds. In human evaluation studies, LTX-2 was rated “significantly better” than open alternatives like Ovi and achieved performance on par with proprietary leaders such as Veo 3 and Sora 2.

However, the team acknowledges several limitations. Output quality varies by language, with speech synthesis less accurate for underrepresented languages or dialects. In multi-speaker scenes, the model occasionally misassigns dialogue to incorrect characters. Sequences exceeding 20 seconds may suffer from temporal drift and degraded audiovisual synchronization.

Open Release Challenges Closed API Models

Lightricks framed the open release as a critique of the current industry landscape. Co-founder and CEO Zeev Farbman said in the launch video: “I just don’t see how closed APIs can deliver on the promise of generative video, because they lack the control professionals actually need.” He pointed to a growing gap: while demos look impressive, real-world usability remains limited.

The company also emphasized ethical considerations. “AI should augment human creativity and intelligence,” Farbman added. “But I’m concerned when someone else owns my augmentation.” The vision is to let users run AI on their own hardware, on their own terms, and to make ethical decisions collectively with a broad creator community rather than outsourcing those choices to a few powerful entities with conflicting interests.

Beyond model weights, the release includes a lightweight version, multiple LoRA adapters, and a modular training framework supporting multi-GPU setups. Optimized for Nvidia’s RTX ecosystem, LTX-2 runs on consumer-grade GPUs like the RTX 5090 as well as enterprise systems. Model weights and source code are available on GitHub and Hugging Face, with a live demo accessible through free registration on the company’s creative platform.
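
For readers who want to fetch the weights locally, a minimal download sketch is shown below; the repository identifier is an assumption and should be checked against Lightricks’ official listings.

```python
# Hypothetical download of the released weights via the Hugging Face Hub.
# The repo id is an assumption; use the identifier Lightricks actually publishes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Lightricks/LTX-2")
print("Model files downloaded to:", local_dir)
```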