Planning with Diffusion for Flexible Behavior Synthesis

TL;DR:

"The core contribution of this work is a denoising diffusion model designed for trajectory data and an associated probabilistic framework for behavior synthesis."

Available:


Planning in Reinforcement Learning

Planning in reinforcement learning is another way of saying trajectory generation: what sequence of steps can an agent take, under a model of the environment, while maximizing the accumulated reward?


In offline reinforcement learning, planning requires a learned model of the environment dynamics. Issues arise when the learned model is not well attuned to the task at hand: inter-task transfer is very poor, because policies become tuned to specific executions of tasks and generalize badly to new environments.
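As a point of reference for what diffusion planning replaces, here is a minimal sketch of conventional shooting-based planning against a learned model. Everything here is a toy stand-in: `dynamics_model` and `reward_model` are illustrative placeholders for learned networks, not anything from the paper.

```python
import numpy as np

def dynamics_model(state, action):
    # Stand-in for a learned f(s, a) -> s'; here a toy linear system.
    return state + 0.1 * action

def reward_model(state, action):
    # Stand-in for a learned reward: prefer small states and actions.
    return -float(np.sum(state**2) + 0.01 * np.sum(action**2))

def rollout_return(initial_state, actions):
    """Shooting-style evaluation: roll the model forward through a
    candidate action sequence and accumulate predicted reward."""
    state, total = initial_state, 0.0
    for a in actions:
        total += reward_model(state, a)
        state = dynamics_model(state, a)
    return total

def random_shooting(initial_state, horizon, n_candidates=256, dim=2, seed=0):
    """Pick the best of many sampled action sequences -- the baseline
    whose model errors compound over long horizons."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_candidates, horizon, dim))
    returns = [rollout_return(initial_state, acts) for acts in candidates]
    return candidates[int(np.argmax(returns))]
```

Because each rollout queries the learned model step by step, any model error feeds back into the next prediction, which is exactly the compounding-error failure mode the following sections address.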

Further performance loss is incurred due to a series of factors, as discussed in the primer on the Diffusion Models for Reinforcement Learning survey.


Planning with Diffusion


Architecture:


Learned long-horizon planning

Planning with diffusion removes single-step updates of the trajectory, denoising the entire plan jointly instead, and relies on "hindsight experience replay."
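The joint-refinement idea can be sketched in a few lines. This is a toy illustration, not the paper's model: `denoise_step` is a hypothetical stand-in for the trained noise-prediction network, here replaced by a simple pull toward a smooth interpolation.

```python
import numpy as np

def denoise_step(traj, t):
    # Hypothetical stand-in for a trained denoiser: nudge the whole
    # plan toward a smooth interpolation of its own endpoints.
    target = np.linspace(traj[0], traj[-1], num=len(traj))
    return traj + 0.5 * (target - traj)

def diffuse_plan(horizon, state_dim, n_steps=30, seed=0):
    """All `horizon` timesteps are refined together at every denoising
    step, instead of being predicted one at a time autoregressively."""
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, state_dim))  # start from pure noise
    for t in reversed(range(n_steps)):
        traj = denoise_step(traj, t)              # update the entire trajectory
    return traj
```

The key contrast with shooting methods: no single timestep is privileged, so errors cannot compound forward through the rollout.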


Its greatest strength is that it remains especially viable in "sparse reward" environments, where typical shooting-based trajectory optimization and planning methods lose performance.


Temporal compositionality (essentially locality)

Trajectories are optimized across every step at once, and a nice property that arises is local fidelity: paths generated by the denoising procedure remain locally consistent, so globally coherent plans can be stitched together from locally valid pieces.




Task compositionality

Further, beyond performing well in sparse-reward environments, Diffuser is independent of the reward function: it acts as a prior over feasible paths, so rewards can be swapped in at sampling time.
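This separation of prior and reward can be sketched as guided sampling: a fixed denoiser proposes feasible trajectories, and the gradient of a separately defined return steers them. All names below are illustrative toys, not the paper's implementation.

```python
import numpy as np

def prior_denoise(traj):
    # Toy "feasibility prior": smooth interior points toward their neighbors.
    out = traj.copy()
    out[1:-1] = 0.5 * traj[1:-1] + 0.25 * (traj[:-2] + traj[2:])
    return out

def return_gradient(traj, target):
    # Gradient of a toy return J(tau) = -||tau - target||^2; in guided
    # sampling this would come from a learned return predictor.
    return 2.0 * (target - traj)

def guided_sample(horizon, dim, target, alpha=0.05, n_steps=200, seed=0):
    """Same prior, new reward: only `return_gradient` changes per task."""
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, dim))
    for _ in range(n_steps):
        traj = prior_denoise(traj) + alpha * return_gradient(traj, target)
    return traj
```

Because the prior never changes, retargeting to a new task only requires a new return function, which is the task-compositionality claim above.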




Experiment: Planning as Inpainting
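Goal-conditioned planning can be framed as inpainting: observed states (here, the start and goal) are written back into the sample after every denoising step, and the model fills in everything between them. A minimal toy sketch, with `denoise_step` as a hypothetical stand-in for the trained model:

```python
import numpy as np

def denoise_step(traj):
    # Toy denoiser: move each interior point toward its neighbors' mean.
    out = traj.copy()
    out[1:-1] = 0.5 * (traj[:-2] + traj[2:])
    return out

def plan_by_inpainting(start, goal, horizon, n_steps=300, seed=0):
    """Condition by clamping: the first and last states are overwritten
    with the known values after every denoising step."""
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, start.shape[0]))
    for _ in range(n_steps):
        traj = denoise_step(traj)
        traj[0] = start    # inpainting constraint: observed start state
        traj[-1] = goal    # inpainting constraint: desired goal state
    return traj
```

In this toy, repeated neighbor-averaging with pinned endpoints converges to a straight-line path between start and goal; with a trained denoiser, the same clamping procedure yields a dynamically feasible trajectory connecting them.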




How do we improve? AdaptDiffuser