Meta Releases V-JEPA 2 AI Model for Video Understanding

2025-06-12

Meta Platforms Inc.'s AI research division today unveiled a new artificial intelligence model that helps robots and AI agents learn about and understand the physical world by interpreting video data, much as humans perceive their surroundings.

The model is named V-JEPA 2, short for Video Joint Embedding Predictive Architecture 2. It builds on the company's prior work with V-JEPA and is designed to enable AI agents and robots to "think before they act."

"As humans, we often believe language is crucial for intelligence, but that’s not necessarily true," said Yann LeCun, Meta’s VP and Chief AI Scientist. "Humans and animals navigate the world by constructing mental models of reality. What if AI could develop this common sense and predict future events within an abstract space?"

According to Meta, the model is a world model trained on video, which allows robots and other AI systems to understand the physical world and anticipate the outcomes of actions.

A world model helps AI agents and robots build representations of the physical world and understand the consequences of actions, so they can plan a course of action for a given task. With world models, companies and organizations can avoid running millions of real-world trials: the models can simulate environments for AI systems, often within minutes, and train them on an understanding of how the world operates.

World models can also be used to predict what will happen after a given action is taken, allowing robots or sensor-connected AI to anticipate what comes next. Humans rely on the same kind of planning all the time, whether weaving around pedestrians on a busy sidewalk or positioning themselves during a game of ice hockey.
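One way to make that planning loop concrete: the world model imagines the outcome of each candidate action, and the agent picks the action whose predicted outcome looks best. The sketch below is a minimal illustration of that idea under assumed interfaces; the `WorldModel`, `encode`, `predict`, and `cost_to_goal` names are placeholders for this example, not Meta's released code or API.

```python
import numpy as np

# Illustrative sketch only: placeholder names for the general idea of a
# world model, not Meta's V-JEPA 2 implementation.

class WorldModel:
    def encode(self, observation: np.ndarray) -> np.ndarray:
        """Map raw video frames to an abstract state representation."""
        raise NotImplementedError

    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        """Predict the representation of the state that follows an action."""
        raise NotImplementedError

def plan_one_step(model: WorldModel, observation: np.ndarray,
                  candidate_actions: list[np.ndarray], cost_to_goal) -> np.ndarray:
    """Pick the candidate action whose predicted outcome is closest to the goal.

    cost_to_goal: a function scoring how far a predicted state is from the goal.
    """
    state = model.encode(observation)
    predicted_states = [model.predict(state, a) for a in candidate_actions]
    costs = [cost_to_goal(s) for s in predicted_states]
    return candidate_actions[int(np.argmin(costs))]
```

Because the rollout happens inside the model's learned representation rather than in the real world, an agent can evaluate many candidate actions cheaply before committing to one.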

AI systems can use this form of planning to help prevent workplace accidents, guiding robots along safe paths as they interact with other machines and with people.

V-JEPA 2 helps AI agents understand the physical world and its interactions by analyzing patterns in how people engage with objects, how objects move, and how objects interact with one another.
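In broad strokes, a joint-embedding predictive architecture of this kind is trained by hiding parts of a video and asking a predictor to reconstruct the representations of the hidden parts in an abstract embedding space, rather than predicting raw pixels. The sketch below shows what such an objective can look like under simplified, assumed interfaces; `context_encoder`, `target_encoder`, and `predictor` are placeholders, not Meta's released training code.

```python
import torch
import torch.nn.functional as F

def jepa_style_loss(context_encoder, target_encoder, predictor,
                    video_patches: torch.Tensor,
                    visible_idx: torch.Tensor,
                    masked_idx: torch.Tensor) -> torch.Tensor:
    """Simplified joint-embedding prediction objective (assumed interfaces).

    The model sees only the visible patches of a video clip and is trained to
    predict the embeddings of the masked patches, not their raw pixels.
    """
    with torch.no_grad():
        # Target embeddings for the hidden patches; no gradients flow to targets.
        targets = target_encoder(video_patches)[masked_idx]

    context = context_encoder(video_patches[visible_idx])  # encode visible patches
    predictions = predictor(context, masked_idx)           # guess hidden embeddings

    # Regression in representation space rather than pixel space.
    return F.smooth_l1_loss(predictions, targets)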

The company said that when the model was deployed on robots in its labs, it enabled them to perform tasks such as reaching for objects, picking them up, and placing them in new locations.

"Undoubtedly, world models are essential for autonomous vehicles and robotics," stated LeCun. "In fact, we believe world models will usher in a new era of robotics, empowering real-world AI agents to assist with household chores and physical tasks without requiring extensive robot training data."

In addition to releasing V-JEPA 2, Meta has introduced three new benchmarks that the research community can use to evaluate how well existing models reason about the physical world from video.