Nvidia has unveiled a new foundation model for gaming agents. NitroGen is an open visual-action model trained on 40,000 hours of gameplay footage drawn from over 1,000 different games. The research team tapped an underused data reservoir: publicly available videos from YouTube and Twitch that display controller overlays on screen. Using template matching and a fine-tuned SegFormer model, they extracted player inputs directly from these recordings.
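To illustrate the template-matching half of that extraction step, here is a minimal numpy sketch of normalized cross-correlation: sliding a button glyph over a frame and reporting where it matches best. This is an illustrative toy, not Nvidia's pipeline; the synthetic overlay, the glyph, and all names are assumptions, and a real system would use an optimized implementation such as OpenCV's `matchTemplate` plus the segmentation model for harder cases.

```python
import numpy as np

def match_template(frame, template):
    """Brute-force normalized cross-correlation (single channel).

    Returns ((row, col), score) for the best match; score is in [-1, 1],
    with 1.0 meaning a pixel-perfect match up to brightness/contrast.
    """
    fh, fw = frame.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -1.0, (0, 0)
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            patch = frame[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom == 0:
                continue  # flat patch, correlation undefined
            score = (p * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Synthetic example: paste a "pressed button" glyph into a dim overlay region,
# then recover its location by matching.
rng = np.random.default_rng(0)
button = rng.random((8, 8))            # stand-in for a button glyph
overlay = rng.random((40, 40)) * 0.1   # mostly dark background noise
overlay[12:20, 25:33] = button         # glyph drawn at row 12, col 25

pos, score = match_template(overlay, button)
```

In a real extraction pipeline, a detection like this would be run per frame and per button template, with a score threshold deciding whether the button counts as pressed in that frame.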
NitroGen is built on Nvidia's GR00T N1.5 robotics foundation model. The researchers describe it as the first demonstration that a robotics foundation model can function as a general-purpose agent across diverse virtual environments, spanning different physics engines and graphical styles. The model performs well across multiple game genres, including action RPGs, platformers, and roguelikes. On games unseen during training, NitroGen achieves up to 52 percent higher success rates than models trained from scratch.
The research is a collaboration between Nvidia, Stanford University, Caltech, and other institutions. The team has publicly released the dataset, model weights, research paper, and source code.