Ai2 Launches Open AI Model to Empower Robot Action Planning in 3D Space

2025-08-12

Seattle-based Allen Institute for Artificial Intelligence (Ai2) today announced the launch of MolmoAct 7B, an open-source embodied AI model that enables robots to "think" before executing physical actions, significantly enhancing their decision-making capabilities. The 7-billion-parameter model represents a shift in robotic intelligence by introducing action reasoning capabilities that go beyond traditional vision-language models.

Conventional robotics models rely on visual inputs to interpret their environment, for example analyzing images to guide furniture assembly or to execute simple object-manipulation tasks. MolmoAct instead introduces Ai2's Action Reasoning Model (ARM) architecture, a framework that transforms natural-language commands into precise physical action sequences through 3D spatial mapping and trajectory planning. As computer vision lead Ranjay Krishna explains: "Once it perceives the environment, the model generates a 3D representation and formulates a motion plan before executing any physical movement."

The model was trained on a curated 18-million-sample dataset of real-world scenarios from kitchens and bedrooms, focusing on goal-oriented tasks like making beds and folding laundry. This contrasts with the black-box approaches common among industry competitors, such as NVIDIA's GR00T-N2-2B (trained on 6 million samples across 1,024 H100 GPUs) or Physical Intelligence's pi-zero: Ai2 provides full transparency by releasing the code, weights, and evaluation metrics as open source.

Key innovations include pre-execution trajectory visualization, which lets users review a planned motion and modify it through natural-language instructions or touchscreen adjustments before the robot moves. On SimPLER benchmark tests simulating common household tasks, MolmoAct achieved a 72.1% success rate, outperforming models from Physical Intelligence, Google, Microsoft, and NVIDIA.

CEO Ali Farhadi emphasized that the team is "not just releasing a model, but laying the foundation for AI's new era in physical world applications," with Krishna adding, "Our mission is real-world deployment - anyone can download and customize our model for specific purposes." This open approach addresses the industry's transparency problem by avoiding the opaque, end-to-end black-box designs typical of commercial solutions. On the infrastructure side, pre-training used 256 NVIDIA H100 GPUs and completed in about a day, while fine-tuning took roughly two hours on 64 GPUs, a notably more efficient training methodology than existing solutions.
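To make the perceive-plan-act flow described above concrete, here is a minimal Python sketch of that kind of loop. It is purely illustrative: the function names, data structures, and placeholder logic are assumptions for this example and are not taken from MolmoAct's actual code or API.

```python
# Illustrative sketch of an action-reasoning loop in the spirit of what the
# article describes: perceive -> build a 3D representation -> plan a
# trajectory -> execute. All names and values here are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float, float]  # x, y, z in the robot's workspace (metres)


@dataclass
class SpatialPlan:
    """A planned motion: the instruction plus the waypoints to visit."""
    instruction: str
    waypoints: List[Waypoint]


def perceive(image_tokens: List[int]) -> List[Waypoint]:
    """Stand-in for perception: turn image features into a sparse 3D scene."""
    # Placeholder keypoints; a real model would regress depth-aware scene tokens.
    return [(0.10, 0.20, 0.05), (0.25, 0.20, 0.05), (0.40, 0.35, 0.10)]


def plan(instruction: str, scene_points: List[Waypoint]) -> SpatialPlan:
    """Stand-in for reasoning: produce a trajectory the user can inspect first."""
    # Placeholder plan: visit the detected keypoints in order.
    return SpatialPlan(instruction=instruction, waypoints=scene_points)


def execute(motion: SpatialPlan) -> None:
    """Stand-in for the low-level controller that follows the trajectory."""
    for i, (x, y, z) in enumerate(motion.waypoints):
        print(f"step {i}: move end-effector to ({x:.2f}, {y:.2f}, {z:.2f})")


if __name__ == "__main__":
    scene = perceive(image_tokens=[])        # perceive the environment
    motion = plan("fold the towel", scene)   # reason and plan before acting
    # A user could review and edit motion.waypoints here, mirroring the
    # natural-language or touchscreen adjustments the article mentions.
    execute(motion)                          # only then act
```

The point of the sketch is the ordering: the trajectory exists as an inspectable object before any motor command is issued, which is what makes pre-execution review and editing possible.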
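Since the release is open, downloading the weights would presumably follow the usual Hugging Face workflow. The snippet below is a sketch under that assumption; the repository id is a placeholder, and the exact id, processor interface, and prompting format should be taken from Ai2's published model card rather than from this example.

```python
# Minimal sketch of pulling open weights with the standard transformers API.
# The model id below is a placeholder, not a confirmed repository name.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/MolmoAct-7B"  # placeholder; check the actual release

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,    # custom model code ships with the checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(f"Loaded {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```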