Planning with Diffusion for Flexible Behavior Synthesis
TL;DR:
"The core contribution of this work is a denoising diffusion model designed for trajectory data and an associated probabilistic framework for behavior synthesis."
Planning in Reinforcement Learning
Planning in Reinforcement Learning is another way of saying trajectory generation: what sequence of steps can an agent take, with respect to the environment model, that maximizes the accumulated reward?
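In the usual notation, which the paper also adopts ($r$ is the reward function and $T$ the planning horizon), planning is the trajectory optimization problem

$$
\tau^{*} = \operatorname*{arg\,max}_{\tau} \, \mathcal{J}(\tau) = \operatorname*{arg\,max}_{\tau} \sum_{t=0}^{T} r(s_t, a_t),
\qquad \tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T),
$$

with successor states given by the (learned) dynamics $s_{t+1} = f(s_t, a_t)$.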
In offline reinforcement learning, planning requires a learned model of the environment dynamics. Issues arise when the learned model is not well attuned to the task at hand: policies become tuned to specific executions of a task and generalize poorly to new environments, so inter-task transfer is very poor.
Further performance is lost due to a series of factors discussed in the primer for the Diffusion Models for Reinforcement Learning survey.
Planning with Diffusion
- "The denoising procedure can be seen as parameterizing the gradients of the data distribution;"
- Much finer control over planning (see the guided-sampling sketch after this list)
- By blurring the line between model and planner, the training procedure that improves the model’s predictions also has the effect of improving its planning capabilities.
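A minimal sketch of this idea, not the authors' implementation: `denoise_mean` and `reward_grad` are placeholder callables standing in for the trained denoising network and a learned return-gradient guide, and the noise scale is held fixed for brevity instead of following a proper variance schedule. The point is that the same reverse-diffusion loop acts as both model and planner once the mean is nudged by the reward gradient.

```python
import numpy as np

def guided_plan(denoise_mean, reward_grad, horizon, dim, n_steps=100, alpha=0.1, sigma=0.5):
    """Reward-guided reverse diffusion over an entire trajectory (sketch).

    denoise_mean(tau, t) -> predicted mean of the less-noisy trajectory (placeholder model)
    reward_grad(tau)     -> gradient of the estimated return w.r.t. the trajectory (placeholder guide)
    """
    # Start from pure noise over the full [horizon, state_dim + action_dim] array.
    tau = np.random.randn(horizon, dim)
    for t in reversed(range(n_steps)):
        mu = denoise_mean(tau, t)              # the model's prediction
        mu = mu + alpha * reward_grad(tau)     # the planning signal: bias toward higher return
        noise = np.random.randn(*tau.shape) if t > 0 else 0.0
        tau = mu + sigma * noise               # sample the next, less-noisy trajectory
    return tau                                 # execute the first action, then re-plan
```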
Architecture:
- an entire trajectory should be predicted non-autoregressively
- each step of the denoising process should be temporally local
- the trajectory representation should allow for equivariance along one dimension (the planning horizon) but not the other (the state and action features); a sketch of this representation follows the list.
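A minimal sketch of what these constraints could look like in practice (the array shapes and kernel size below are illustrative assumptions, not the paper's exact configuration): the trajectory is a 2D array with time along one axis, and each denoising layer convolves over the time axis only, so every output depends on a small temporal window (locality) and the same kernel slides along the horizon (equivariance in time) while mixing state and action features freely.

```python
import numpy as np

state_dim, action_dim, horizon = 11, 3, 32

# Trajectory as a 2D array: rows are timesteps, columns are [state | action] features.
tau = np.zeros((horizon, state_dim + action_dim))

def temporal_conv(tau, kernel):
    """Same-padded 1D convolution over the time axis only (sketch).

    kernel has shape [window, in_features, out_features]: each output timestep sees
    only a local temporal neighborhood, and sliding one kernel along the horizon is
    what gives equivariance in time but not across the state/action features.
    """
    window, _, out_features = kernel.shape
    pad = window // 2
    padded = np.pad(tau, ((pad, pad), (0, 0)))
    out = np.zeros((tau.shape[0], out_features))
    for t in range(tau.shape[0]):
        out[t] = np.einsum("wi,wio->o", padded[t:t + window], kernel)
    return out

features = temporal_conv(tau, kernel=np.random.randn(5, state_dim + action_dim, 64))
```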
Learned long-horizon planning
Planning with diffusion does away with single-step updates of the trajectory and instead relies on a form of "hindsight experience replay."
- Future goal states are included within a given planning step; as such, the denoising process is not Markovian: future states and actions influence earlier parts of the trajectory over several iterations of denoising (see the sketch below).
Its greatest strength is that this remains viable in sparse-reward environments, where typical shooting-based trajectory optimization and planning methods lose performance.
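A minimal sketch of the goal-conditioning mechanism described above (`denoise_step` is a placeholder for one reverse-diffusion update of the trained model): the current state and the goal state are written back into the trajectory array before every denoising iteration, so information about the future goal flows backward into earlier timesteps across iterations rather than through single-step rollouts.

```python
import numpy as np

def plan_to_goal(denoise_step, start, goal, horizon, n_steps=100):
    """Clamp the start and goal states each iteration; the rest of the plan adapts to them (sketch)."""
    dim = start.shape[0]
    tau = np.random.randn(horizon, dim)   # noisy trajectory of states
    for t in reversed(range(n_steps)):
        tau[0] = start                    # condition on the current state
        tau[-1] = goal                    # condition on the future goal state
        tau = denoise_step(tau, t)        # all timesteps are refined jointly (non-Markovian)
    tau[0], tau[-1] = start, goal
    return tau
```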
Temporal compositionality (basically locality)
Trajectories are optimized across every step at once, and a nice property that arises is that the paths generated by the denoising procedure maintain local fidelity.
- Locally consistent structure propagates throughout the trajectory, increasing accuracy; this lets the model produce trajectories that are not in the training data by composing different paths it has seen.
Task compositionality
Further, beyond doing well in sparse-reward environments, Diffuser is independent of the reward function: it acts as a prior over feasible paths.
- Changing tasks is equivalent to adding a perturbation function h(τ) to the denoising procedure (formalized below)
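Concretely, the paper frames a task as a perturbation of the learned trajectory prior,

$$
\tilde{p}_\theta(\tau) \;\propto\; p_\theta(\tau)\, h(\tau),
$$

where $p_\theta(\tau)$ is trained once on reward-free trajectories and only $h(\tau)$ changes between tasks, e.g. an exponentiated estimate of the return for reward maximization or a constraint on the final state for goal-reaching.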
Experiment: Planning as Inpainting
- In the multi-task setting, a new goal is resampled at the beginning of every episode (a sketch of this evaluation loop follows)
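A minimal sketch of that evaluation loop under assumed placeholders (this is not the paper's or D4RL's API): `env.reset()` is assumed to return a start state and a freshly sampled goal, `env.step()` to return `(state, reward, done)`, and `planner` to be something like the `plan_to_goal` sketch above with the denoising model bound in. The diffusion prior stays fixed; only the goal it is conditioned on changes.

```python
def run_multitask_episodes(env, planner, horizon=128, n_episodes=10):
    """Multi-task protocol sketch: a fresh goal every episode, the same trajectory prior throughout."""
    for _ in range(n_episodes):
        state, goal = env.reset()                 # a new goal is resampled at the start of each episode
        done = False
        while not done:
            plan = planner(state, goal, horizon)  # re-condition the fixed prior on the new goal
            action = plan[1] - state              # crude stand-in for a controller tracking the next waypoint
            state, reward, done = env.step(action)
```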
How do we improve? AdaptDiffuser