Diffusion Models for Reinforcement Learning Survey

TL;DR: Diffusion models are becoming ubiquitous in reinforcement learning, providing high sample efficiency thanks to the data augmentation they perform over the training data.


Table of contents:

  1. RL Overview
  2. Diffusion Overview
  3. What does RL + Diffusion look like in practice?
  4. What have been the applications?

Reinforcement Learning: training agents to solve sequential decision-making tasks while maximizing a given reward function; mainly used in high-dimensional settings with a dynamic environment.

Key terms: RL Planner, RL Policy, RL Data Synthesizer


Struggles in RL:


Expressiveness: conventional policy classes don't fully capture complex dynamics and action distributions

Off-policy RL faces severe extrapolation error, especially when evaluating out-of-distribution samples. Conservative constraints help maintain training stability, but they reduce the model's representational power.

Reinforcement Learning via Supervised Learning eliminates Q-learning and thus reduces extrapolation error; the issue is that you need to fit the entire dataset, which places heavy demands on the expressiveness of the policy in use.


Data Efficiency: data scarcity in high-dimensional State Spaces and Experience Replay

Collecting and processing large amounts of experience is limited by simulation speed, and huge state spaces slow policy convergence. Current state-of-the-art RL uses data augmentation, but the augmented data is low in fidelity and complexity.


Compounding Error: propagated by autoregressive RL planning

long-horizon errors build up when models trained for single-step prediction are rolled out to imitate multi-step state transitions;

the severity is often driven by data quality and by the nature of the environment's transitions in the subject space.
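
A tiny sketch of where the compounding comes from: a single-step dynamics model rolled out autoregressively consumes its own predictions, so each step's error feeds the next. The `dynamics_model` callable here is a stand-in, not any specific model.

```python
def rollout(dynamics_model, s0, actions):
    """Sketch: autoregressive rollout with a single-step dynamics model.
    Each predicted state is fed back in as the next input, so per-step
    prediction errors compound over the horizon instead of averaging out."""
    states = [s0]
    for a in actions:
        states.append(dynamics_model(states[-1], a))  # prediction consumes prediction
    return states
```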


Multitask Generalization: a single policy rarely performs well across many tasks


Diffusion Modeling: used for trajectory generation on an offline dataset, guided through the sampling process; it can succeed where RL struggles because conventional RL components are more rigidly structured and tend to be less expressive.

Prominent formulations: Denoising Diffusion Probabilistic Models (DDPM) and score-matching generative models
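
For reference, a minimal sketch of the DDPM training objective (regress the noise injected by the forward process). The linear beta schedule, `T`, and the `eps_model(x_t, t)` signature are illustrative assumptions, not values from any particular paper.

```python
import torch
import torch.nn.functional as F

# Illustrative linear noise schedule (assumed values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    """One DDPM training step: noise the clean sample x0 to a random timestep t,
    then have the denoising network predict the injected noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)              # simple DDPM objective
```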


Benefits of diffusion modeling:
Expressiveness: replaces the conventional Gaussian policies in RL that fail to fit complex, multimodal distributions. Diffusion models can represent any normalizable distribution, which directly improves the performance of an otherwise constrained, parametrized policy.
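
To make the contrast concrete, here is a minimal sketch of the conventional diagonal-Gaussian policy head that diffusion policies replace; per state it can only place one unimodal blob of probability mass, no matter how multimodal the dataset's actions are (layer sizes and names are illustrative).

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Conventional policy head: a state-conditioned diagonal Gaussian.
    Its expressiveness is capped at unimodal action distributions, which is
    exactly the limitation diffusion-based policies remove."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def forward(self, state):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
```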




What does RL + Diffusion look like in practice?


Three main integrations of diffusion in RL:
![](/img/user/Screenshot 2024-06-14 at 1.10.20 PM.png)


RL Planner: planning is the process of using a dynamics model to make decisions that maximize the reward function.
Guided Sampling Methods
at each denoising step the diffusion model refines the whole trajectory, rather than extending a partial plan and feeding its output into the next step, which increases the effective planning horizon.
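
A rough sketch of what reward-guided denoising over a full trajectory can look like, in the spirit of classifier-guided sampling; the `denoiser` and `reward_model` callables, the guidance scale, and the simplified reverse update are all assumptions for illustration rather than any specific planner's procedure.

```python
import torch

def guided_plan(denoiser, reward_model, traj_shape, T=100, guide_scale=0.1):
    """Sketch of reward-guided reverse diffusion over a whole trajectory.
    Every denoising step refines the entire (state, action) sequence at once,
    nudging it toward higher predicted return via the reward gradient."""
    traj = torch.randn(traj_shape)  # start from pure noise over the full horizon
    for t in reversed(range(T)):
        t_batch = torch.full((traj_shape[0],), t, dtype=torch.long)
        traj = denoiser(traj, t_batch).detach()      # one (simplified) denoising step
        traj.requires_grad_(True)
        ret = reward_model(traj).sum()               # predicted return of the plan
        grad = torch.autograd.grad(ret, traj)[0]
        traj = (traj + guide_scale * grad).detach()  # guide toward higher return
    return traj
```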


Fast Sampling Methods
reduce the number of denoising steps so that trajectory generation is fast enough for per-decision planning.


RL Policy: most implementations focus on improving existing model-free RL algorithms. The biggest example is Diffusion-QL, which addresses RL's issues with expressiveness and over-conservatism.

diffusion provides the ability to model arbitrary action distributions
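
In that spirit, a Diffusion-QL style objective couples a behavior-cloning diffusion loss with Q-value guidance; `diffusion_bc_loss`, `sample_action`, `critic`, and the weight `eta` below are hypothetical names used only to sketch the idea.

```python
def diffusion_policy_loss(policy, critic, states, actions, eta=1.0):
    """Sketch of a Diffusion-QL style policy objective:
    (1) a denoising / behavior-cloning term that fits the offline action
        distribution, plus (2) a Q-guidance term that pushes actions sampled
        from the diffusion policy toward high critic value."""
    bc_loss = policy.diffusion_bc_loss(states, actions)  # hypothetical method
    new_actions = policy.sample_action(states)           # hypothetical method
    q_loss = -critic(states, new_actions).mean()         # maximize Q under the critic
    return bc_loss + eta * q_loss
```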


RL Data Synthesizer: essentially creating guided synthetic data for use in training.

using diffusion to augment the dataset with samples drawn from the learned data distribution, which respect the data's context, as opposed to arbitrary perturbations that can push the model toward unwanted representations.
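
As a sketch of the synthesizer role, a diffusion model trained on the offline dataset can generate extra transitions that follow the learned data distribution and get mixed back into the training buffer; `diffusion_model.sample` and the flat transition layout are assumptions.

```python
import torch

def upsample_buffer(diffusion_model, real_transitions, n_synthetic):
    """Sketch: draw synthetic (state, action, reward, next_state) rows from a
    diffusion model fit to the offline data, then mix them with the real rows
    so downstream RL trains on an enlarged, in-distribution dataset."""
    synthetic = diffusion_model.sample(n_synthetic)   # assumed sampling API
    mixed = torch.cat([real_transitions, synthetic], dim=0)
    return mixed[torch.randperm(mixed.shape[0])]      # shuffled real + synthetic data
```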


What have been the applications?

There are five main tracks: Offline Reinforcement Learning, Online Reinforcement Learning, Imitation Learning, Trajectory Generation, and Data Augmentation.