Model Predictive Control (MPC), also known as receding horizon control, is a well-established technique for planning and control. At each step it uses a dynamics model and a planner to search for the action sequence that maximizes an objective function over a finite planning horizon, executes the first action of that sequence, and then replans. Because the objective is supplied at planning time, MPC can adapt to novel reward functions at test time, in stark contrast to policy learning methods that are trained against a fixed reward.
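To make the receding-horizon idea concrete, here is a minimal Python sketch of a generic MPC loop. Everything in it is a placeholder assumed for illustration: `env`, `dynamics`, and `reward_fn` are hypothetical callables, and the random candidate sampling stands in for a real planner; none of this is the paper's implementation.

```python
import numpy as np

def plan(state, dynamics, reward_fn, horizon=10, n_candidates=64, action_dim=4):
    """Pick the first action of the candidate sequence with the best predicted return."""
    best_return, best_seq = -np.inf, None
    for _ in range(n_candidates):
        # Placeholder proposal: random action sequences over the planning horizon.
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)        # hypothetical learned one-step dynamics model
            total += reward_fn(s, a)  # accumulate predicted reward over the horizon
        if total > best_return:
            best_return, best_seq = total, seq
    return best_seq[0]  # execute only the first action, then replan next step

def run_mpc(env, dynamics, reward_fn, steps=100):
    # Receding-horizon loop: act, observe, replan from the new state.
    state = env.reset()
    for _ in range(steps):
        action = plan(state, dynamics, reward_fn)
        state, _, done, _ = env.step(action)  # assumes a Gym-style step() signature
        if done:
            break
```

Because the reward function is passed in at planning time, swapping in a new `reward_fn` changes the behavior without retraining, which is exactly the flexibility contrasted with fixed-reward policies above.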
To further improve MPC, researchers have introduced diffusion models that learn world dynamics and propose action sequences from offline data, strengthening the planning step. For selecting actions, the "Sample, Score, and Rank" (SSR) procedure offers a simple and effective alternative to more complex optimization techniques: sample candidate action sequences, score them under the learned model, and keep the best.
Model-based approaches such as Dyna-style methods use learned models to train policies online or offline, whereas MPC methods use the model for planning at runtime. Recent diffusion-based methods such as Diffuser and Decision Diffuser fit joint trajectory models that predict state-action sequences, further extending MPC's capabilities. Decomposing the model into separate dynamics and action-proposal components adds flexibility, and modeling trajectories with multi-step diffusion improves adaptation to new environments and rewards.
Building on this, researchers at Google DeepMind proposed Diffusion Model Predictive Control (D-MPC), an online MPC method that uses diffusion models to learn both a multi-step action proposal and a multi-step dynamics model. On the D4RL benchmark, D-MPC outperforms existing model-based offline planning methods and is competitive with state-of-the-art reinforcement learning approaches. More importantly, D-MPC can adapt to new dynamics and optimize novel reward functions at runtime.
The key components of D-MPC are a multi-step dynamics model, a multi-step action proposal, and an SSR planner. Each element is effective on its own, and combining them yields robust performance. During execution, D-MPC alternates between taking an action in the environment and invoking the planner to generate the next action sequence: the SSR planner samples multiple candidate action sequences from the proposal, evaluates them under the learned dynamics model, and selects the best one, as sketched below.
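The following sketch illustrates one SSR planning step under stated assumptions: `propose_actions(state, k)` is a hypothetical diffusion action proposal returning k candidate action sequences, `predict_states(state, actions)` is a hypothetical multi-step dynamics model, and `reward_fn` scores a state-action pair. These names are illustrative, not the paper's API.

```python
import numpy as np

def ssr_plan(state, propose_actions, predict_states, reward_fn, k=32):
    # Sample: draw k candidate multi-step action sequences from the diffusion proposal.
    candidates = propose_actions(state, k)       # shape: (k, horizon, action_dim)

    # Score: roll each sequence out with the learned multi-step dynamics model
    # and sum the predicted rewards along the imagined trajectory.
    scores = []
    for actions in candidates:
        states = predict_states(state, actions)  # shape: (horizon, state_dim)
        scores.append(sum(reward_fn(s, a) for s, a in zip(states, actions)))

    # Rank: keep the best-scoring sequence and return its first action;
    # the next action comes from replanning at the newly observed state.
    best = int(np.argmax(scores))
    return candidates[best][0]
```

Because scoring happens at planning time, a new reward function or a fine-tuned dynamics model can be dropped into this loop directly, which is what enables the runtime adaptation described below.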
Experimental evaluations show that D-MPC performs well on several fronts. Compared to offline MPC methods it achieves stronger returns, while also adapting well to new reward functions and changed dynamics. On D4RL locomotion, Adroit, and Franka Kitchen tasks, D-MPC outperforms methods such as MBOP and is on par with approaches such as Diffuser and IQL. Notably, D-MPC generalizes well under novel rewards and hardware defects, and its performance improves significantly after fine-tuning.
Ablation studies indicate that using multi-step diffusion models for both action proposals and dynamics prediction significantly improves long-horizon prediction accuracy and overall task performance compared to single-step or Transformer baselines, further supporting the D-MPC design.
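The intuition can be illustrated with a small sketch (an assumption of this write-up, not code from the paper): a single-step model must be composed autoregressively, so its prediction errors are fed back as inputs and compound over the horizon, whereas a multi-step model predicts the entire future state sequence jointly.

```python
import numpy as np

def rollout_single_step(state, actions, step_model):
    """Autoregressive rollout: each predicted state becomes the next input,
    so per-step model errors accumulate over long horizons."""
    states = []
    for a in actions:
        state = step_model(state, a)        # hypothetical one-step dynamics model
        states.append(state)
    return np.stack(states)

def rollout_multi_step(state, actions, sequence_model):
    """Joint rollout: one call predicts the full state sequence conditioned on
    the current state and the whole action sequence, avoiding error feedback."""
    return sequence_model(state, actions)   # hypothetical multi-step diffusion model
```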
In summary, D-MPC substantially improves MPC by using diffusion models for multi-step action proposals and multi-step dynamics prediction. On the D4RL benchmark it surpasses existing model-based planning methods and competes with state-of-the-art reinforcement learning techniques, while adapting to new rewards and dynamics at runtime. Its main drawback is speed: because it replans at every step, it is slower than a reactive policy. Future work will focus on accelerating sampling and on latent representations that extend D-MPC to pixel observations.