World Labs Releases New Marble Model, Showcasing Vision for World Modeling
World Labs, a spatial intelligence startup led by Stanford AI pioneer Fei-Fei Li, has revealed new advancements in its generative model. The system can construct persistent, navigable 3D environments from a single image and text prompt.
The company's mission is to develop "large world models" capable of perceiving, generating, and interacting with 3D physical environments, moving beyond the current AI focus on 2D and language-based systems.
With a new preview version called Marble, users can generate and export these environments, positioning the technology in direct competition with major research labs like Google DeepMind. Compared to previous versions, the model now creates larger, more stylistically diverse worlds with clearer geometric structures.
How Marble Works
World Labs has not disclosed specific details about the model's architecture, and its official blog post offers minimal technical insight. Users simply input a raw image, and the model constructs a virtual world from it.

An important hint lies in the export format: generated worlds can be exported as Gaussian splats for use in other projects. This suggests the model relies on Gaussian splatting, a modern technique for real-time photorealistic scene rendering. At its core is a rasterization method that builds scenes from millions of 3D Gaussians instead of traditional polygons or triangles. Each Gaussian is defined by its position, scale, color, and opacity. The pipeline typically begins with structure-from-motion (SfM), which generates a 3D point cloud from a series of 2D images. Each point is then converted into a Gaussian and refined through an optimization process similar to neural network training, in which Gaussians are continuously adjusted, split, or pruned until the rendered views match the source images. The result is a highly detailed scene representation that can be rendered quickly.

What makes Marble unique is its ability to "imagine" content beyond the visible frame of a single image. I had early access to the model and tested it on several images. For instance, when I provided a photo of a modern office space, the model rendered additional desks and meeting rooms beyond the original frame (you can explore the virtual world here). This is what World Labs refers to as the "world model" capability. I suspect the model builds latent object representations from the input image and then expands the environment according to patterns learned during training, yielding a complete 3D scene.
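To make the splat representation concrete, here is a toy Python sketch of how such a scene might be stored and naively composited. This is not World Labs' code: the class, the simplified pinhole projection, and the center-only splatting are illustrative assumptions (real renderers rasterize each Gaussian's full anisotropic footprint, and per-Gaussian rotation is omitted here for brevity).

```python
# Illustrative sketch of a Gaussian-splat scene representation.
# NOT World Labs' implementation; names and the naive renderer
# below are assumptions for explanation only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    position: np.ndarray  # (3,) world-space center
    scale: np.ndarray     # (3,) per-axis extent (rotation omitted)
    color: np.ndarray     # (3,) RGB in [0, 1]
    opacity: float        # alpha in [0, 1]

def naive_splat_render(gaussians, width=64, height=64, focal=50.0):
    """Project Gaussian centers onto an image plane and alpha-blend
    them back-to-front. Real renderers splat the full elliptical
    footprint of each Gaussian; this sketch splats only the center."""
    image = np.zeros((height, width, 3))
    # Sort far-to-near so closer Gaussians are composited on top.
    for g in sorted(gaussians, key=lambda g: -g.position[2]):
        x, y, z = g.position
        if z <= 0:
            continue  # behind the camera
        u = int(width / 2 + focal * x / z)   # pinhole projection
        v = int(height / 2 + focal * y / z)
        if 0 <= u < width and 0 <= v < height:
            image[v, u] = g.opacity * g.color + (1 - g.opacity) * image[v, u]
    return image

# A scene is just a large collection of these primitives; exported
# splat files store the same kind of per-Gaussian attributes.
scene = [Gaussian3D(np.array([0.1, 0.0, 2.0]), np.ones(3) * 0.05,
                    np.array([0.9, 0.2, 0.2]), 0.8)]
img = naive_splat_render(scene)
```

Because the whole scene reduces to a flat list of per-Gaussian attributes, it is straightforward to serialize and hand off to other tools, which is what makes the export workflow described above possible.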
Applications and Limitations
The current model is designed to generate full 3D environments rather than isolated objects. While it is not ideal for creating individual characters or animals, it excels at building virtual scenes and stages. Early adopters are already exploring its potential in game asset creation and VR film production, with some reporting that tasks that once took weeks can now be completed in minutes.

However, the model is not without challenges. To use it effectively, users must understand the kinds of data it was trained on. For example, the model performed well when generating an office scene from a real-world photo, but when I input a fantasy-style illustration of a tavern, the output was rough and problematic (you can view the result here). This may be because the illustration style differs from the training data: Marble performs better with images of static 3D scenes, likely because it was trained extensively on 3D renderings. Additionally, the further the output extends beyond the original image, the less detailed the objects become.

Beyond creative applications, this technology has significant implications for training embodied AI agents. By generating realistic and diverse digital twins of the real world, developers can train and test robotics and autonomous-vehicle models in simulation. Nvidia is already using neural reconstruction and Gaussian-based rendering to transform real-world driving sensor data into high-fidelity simulations for autonomous vehicle development. Such simulations can be integrated into platforms like the open-source CARLA autonomous driving simulator to test new scenarios and generate rare edge-case data.
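As a rough illustration of that simulation workflow, the snippet below sets up a driving scenario with CARLA's Python API. The town and blueprint choices are placeholders, and actually importing a Marble-generated scene into CARLA would require converting it into a CARLA-compatible map first, a step not shown here.

```python
# Illustrative CARLA scenario setup (requires a running CARLA server).
# Town name and vehicle blueprint are placeholder choices; converting
# a generated 3D scene into a CARLA map is assumed and not shown.
import carla

client = carla.Client("localhost", 2000)  # connect to the simulator
client.set_timeout(10.0)
world = client.load_world("Town03")       # load a built-in test town

# Spawn an autopilot vehicle at a predefined spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)
vehicle.set_autopilot(True)

# Attach an RGB camera and record frames for later analysis.
cam_bp = world.get_blueprint_library().find("sensor.camera.rgb")
cam_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(cam_bp, cam_transform, attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk(f"out/{image.frame}.png"))
```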
What Defines a World Model?
World Labs' approach contrasts with that of competitors like Google DeepMind. World Labs provides a tool that generates explicit, exportable 3D assets (Gaussian splat files), which can then be imported into other applications such as game engines or simulators. DeepMind's Genie 3, in contrast, is an end-to-end generative world model: it uses an autoregressive architecture to generate and simulate interactive environments in real time based on text prompts and user actions. In this approach, environmental consistency emerges dynamically rather than being derived from a pre-existing 3D structure. The entire world and its interactions live inside the model, and no static, exportable 3D assets are produced (Genie 3 is not currently available to the public).

This distinction highlights a broader discussion in the AI community about the definition of "world models." The term currently refers to two distinct concepts. The first, represented by systems like World Labs' Marble and DeepMind's Genie 3, refers to generative models capable of creating and simulating external environments. These models aim to generate settings where AI agents can be trained or where users can engage in interactive experiences.

The second concept describes an internal predictive system that an AI agent uses to interpret its surroundings. This mirrors how humans and animals operate: instead of predicting pixels, we rely on abstract representations to anticipate possible outcomes. Models like Meta's Joint Embedding Predictive Architecture (JEPA) are designed for this purpose. They learn the latent features governing interactions in the world, enabling agents to make effective predictions and take actions without requiring full, photorealistic simulations (see the sketch below).

I believe the future of embodied AI may combine both approaches: generative models like Marble will create expansive, complex virtual worlds to train agents equipped with efficient predictive world models like V-JEPA.
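To illustrate that second notion of a world model, here is a highly simplified PyTorch sketch of the JEPA idea. This is a conceptual toy, not Meta's V-JEPA architecture: the dimensions and modules are arbitrary choices. The point it demonstrates is that the prediction loss lives in latent space, so the model never has to reconstruct pixels.

```python
# Simplified sketch of a JEPA-style predictive world model.
# Conceptual illustration only, not Meta's V-JEPA; all dimensions
# and module choices here are assumptions made for the example.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw observation to an abstract latent representation."""
    def __init__(self, obs_dim=128, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class Predictor(nn.Module):
    """Predicts the next latent state from the current latent and an action."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 64),
                                 nn.ReLU(), nn.Linear(64, latent_dim))
    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

encoder, predictor = Encoder(), Predictor()
obs_t, obs_next = torch.randn(8, 128), torch.randn(8, 128)  # dummy batch
action = torch.randn(8, 4)

# The key JEPA property: the loss is computed between predicted and
# actual latents, so the agent anticipates abstract states, not pixels.
z_t = encoder(obs_t)
z_next_target = encoder(obs_next).detach()  # stop-gradient on the target
z_next_pred = predictor(z_t, action)
loss = nn.functional.mse_loss(z_next_pred, z_next_target)
loss.backward()
```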