Humans possess an innate understanding of how the world operates. We expect dropped objects to fall, hidden items to continue to exist, and solid bodies not to pass through one another. This "intuitive physics" forms a cornerstone of human cognition. However, replicating this common sense in artificial intelligence remains a significant challenge.
Recent work by researchers at Meta AI demonstrates how a specific type of deep learning model can develop an understanding of intuitive physics simply by watching vast amounts of unlabeled video data.
This breakthrough offers valuable insights into building better world models, marking a critical step toward more advanced and general AI systems.

Intuitive physics refers to our fundamental grasp of physical laws: we expect objects to behave predictably, not to vanish or reappear suddenly, pass through solid barriers, or change shape arbitrarily. Such understanding develops early in humans and is even present in many animal species.
Despite rapid progress in solving complex tasks like coding, mathematics, and language generation, current AI systems still struggle significantly with commonsense reasoning about physical principles. This highlights an ongoing gap often referred to as "Moravec's Paradox": what’s trivial for biological organisms can be incredibly challenging for machines.
Two primary approaches are used to instill physics-based understanding in AI. Structured models employ hand-coded representations of objects, their attributes, and spatial relationships, essentially constructing a "game engine" within the AI to simulate physics. This aligns with theories suggesting humans have innate "core knowledge" systems. The alternative involves generative pixel-based models that take a more generalized approach by predicting future video frames directly at the pixel level without relying on predefined structures.
V-JEPA: A Middle Path for Learning Physics

Meta AI’s paper explores a third method that strikes a balance between these extremes: Joint Embedding Predictive Architectures (JEPAs). JEPA was first introduced in 2022 by Yann LeCun, Meta’s Chief AI Scientist and co-author of the new study. At its core, JEPA posits that predicting future states should occur within abstract internal representations learned by the model itself rather than through low-level feature prediction or reliance on hand-coded structures. Unlike structured models, JEPA learns its own representations from data.
The study focuses on a video-centric version of this architecture called V-JEPA. The model learns about the world by observing videos and predicting missing parts. Crucially, instead of operating at the pixel level, V-JEPA works within an abstract representation space it has learned, one that captures higher-level regularities such as how objects interact with each other and their surroundings.

Conceptually, V-JEPA consists of two main components: an encoder and a predictor. The encoder analyzes videos and extracts abstract representations of their content. During training, portions of input videos are deliberately masked (e.g., random spatial or temporal blocks, or future frames). The predictor's task is to forecast the representations of these occluded sections based on the visible parts provided by the encoder. Through this process, the encoder learns to capture essential, predictable information while discarding irrelevant details.
A key advantage of this training method is that it is self-supervised, meaning no human-labeled annotations are required for the video frames. Once V-JEPA completes training on large-scale video datasets, its learned encoder and predictor can be utilized to investigate its grasp of physics without further fine-tuning.
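To make this training setup more concrete, here is a minimal PyTorch sketch of a JEPA-style step in which a predictor regresses the representations of masked video patches from the visible context. The module sizes, the flat patch-token input, the pooling-based predictor, and the detached encoder pass used for targets are simplifying assumptions for illustration, not the actual V-JEPA implementation described in the paper.

```python
# Minimal sketch of a JEPA-style masked-representation-prediction training step.
# Module sizes, the flat patch-token input, the pooling-based predictor, and the
# detached target pass are simplifying assumptions, not Meta AI's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Maps a sequence of video patch tokens to abstract representations."""
    def __init__(self, patch_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                  # tokens: (B, N, patch_dim)
        return self.backbone(self.proj(tokens))  # (B, N, embed_dim)

class TinyPredictor(nn.Module):
    """Predicts representations of masked tokens from the visible context."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, context):                 # context: (B, N_visible, embed_dim)
        pooled = context.mean(dim=1, keepdim=True)  # summarize visible tokens
        return self.net(pooled)                     # (B, 1, embed_dim)

def jepa_training_step(encoder, predictor, optimizer, video_tokens, mask):
    """One self-supervised step: predict representations of masked video patches."""
    visible = video_tokens[:, ~mask]            # context the encoder actually sees
    context = encoder(visible)

    # Targets are representations of the hidden tokens, not their pixels.
    # Joint-embedding methods typically compute them with a momentum (EMA)
    # copy of the encoder and a stop-gradient; a detached pass is used here
    # purely to keep the sketch short.
    with torch.no_grad():
        targets = encoder(video_tokens)[:, mask]      # (B, N_masked, embed_dim)

    preds = predictor(context).expand_as(targets)     # predicted representations
    loss = F.l1_loss(preds, targets)                  # regression in latent space

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random tensors standing in for video patch tokens.
encoder, predictor = TinyEncoder(), TinyPredictor()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)
tokens = torch.randn(2, 16, 768)                      # 2 clips, 16 patch tokens each
mask = torch.zeros(16, dtype=torch.bool)
mask[8:] = True                                       # hide the "future" half
print(jepa_training_step(encoder, predictor, optimizer, tokens, mask))
```

Because the loss lives in the learned representation space rather than in pixel space, the encoder is free to drop unpredictable low-level detail and keep only what it needs to anticipate what happens next.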
Researchers adopted a technique inspired by developmental psychology known as the “violation-of-expectation” paradigm. In studies involving human infants, researchers present two scenarios: one physically plausible and one impossible (such as an object passing through a solid wall). Longer gaze durations toward the impossible event indicate surprise, suggesting the infant understands the violated physical principle.
Similarly, AI models can be shown pairs of videos: one feasible, one implausible. As outlined in the paper: “By prompting the model to imagine the (representation-based) future of the video and comparing its predictions to actual observed outcomes, we obtain a quantitative measure of ‘surprise’ indicative of violations in intuitive physics concepts.” Higher surprise scores for impossible videos suggest the model has learned the relevant physical principles.
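To illustrate how a frozen encoder and predictor can be turned into such a surprise probe, the sketch below measures prediction error in representation space and compares it across a matched pair of clips. The function names and the use of raw prediction error as the surprise signal are assumptions for illustration; the paper's exact protocol may differ in its details.

```python
# Sketch of a violation-of-expectation probe that reuses the frozen encoder and
# predictor from the training sketch above. Prediction error in representation
# space stands in for "surprise"; this is an illustrative choice, not the
# paper's exact measurement.
import torch
import torch.nn.functional as F

@torch.no_grad()
def surprise_score(encoder, predictor, video_tokens, mask):
    """How far the model's expectation is from what the video actually shows."""
    context = encoder(video_tokens[:, ~mask])        # encode the observed part
    targets = encoder(video_tokens)[:, mask]         # what actually happened, in latent space
    preds = predictor(context).expand_as(targets)    # what the model expected
    return F.l1_loss(preds, targets).item()          # higher = more surprised

@torch.no_grad()
def finds_impossible_more_surprising(encoder, predictor, possible, impossible, mask):
    """Pairwise check over two matched clips that differ only in physical plausibility."""
    return (surprise_score(encoder, predictor, impossible, mask) >
            surprise_score(encoder, predictor, possible, mask))
```

One natural way to aggregate this signal is to average the pairwise comparison over many matched clips, giving an accuracy-style score for each physical concept being tested.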
Performance Comparison Between V-JEPA and Other Models

Researchers tested V-JEPA’s grasp of intuitive physics using three benchmark datasets designed to evaluate specific concepts like object permanence (objects continuing to exist when hidden), continuity (movement along connected paths), shape and color constancy, solidity (non-interpenetration of objects), gravity, support, and inertia. They compared V-JEPA with other model categories: a representative pixel-prediction model (VideoMAEv2) and state-of-the-art multimodal large language models such as Qwen2-VL and Gemini 1.5 Pro, which interpret videos via textual reasoning. The results were striking.