Diffusion Models for Reinforcement Learning Survey
TL;DR: Diffusion models are becoming ubiquitous in Reinforcement Learning, providing high sample efficiency largely through data augmentation of the training data.
Table of contents:
- RL Overview
- Diffusion Overview
- What does RL + Diffusion look like in practice?
- What have been the applications?
Reinforcement Learning: training agents to solve sequential decision-making tasks while maximizing a given reward function; mainly used in high-dimensional settings with a dynamic environment.
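As a quick refresher, here is a minimal sketch of the agent-environment loop that RL optimizes; it assumes the gymnasium package and uses a random placeholder policy:

```python
import gymnasium as gym

# Minimal agent-environment loop: the agent acts, the environment returns a
# reward and next state, and the objective is to maximize cumulative reward.
env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder policy: random actions
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode return: {total_reward}")
```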
Key terms: RL Planner, RL Policy, RL Data Synthesizer
Struggles in RL:
Expressiveness: conventional policy classes don't fully capture complex dynamics and action distributions
off-policy RL faces severe extrapolation errors, especially when evaluating out-of-distribution samples; constraining the policy to stay near the behavior data helps maintain training stability, but reduces the model's representational power
Reinforcement Learning via Supervised Learning (RvS) eliminates Q-learning and thus reduces extrapolation error; the catch is that the policy must fit the entire dataset, which again points to the expressiveness of the policy in use.
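A minimal sketch of the RvS idea under placeholder dimensions and fake data: the policy is regressed on (state, return-to-go) pairs with a plain supervised loss, so there is no Q-function and no bootstrapped extrapolation error.

```python
import torch
import torch.nn as nn

# RvS sketch: condition the policy on the return-to-go and train it with a
# supervised regression loss instead of Q-learning.
state_dim, action_dim = 17, 6  # placeholder dimensions

policy = nn.Sequential(
    nn.Linear(state_dim + 1, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Fake offline batch: states, actions, and the return-to-go from each state.
states = torch.randn(128, state_dim)
actions = torch.randn(128, action_dim)
returns_to_go = torch.randn(128, 1)

pred = policy(torch.cat([states, returns_to_go], dim=-1))
loss = nn.functional.mse_loss(pred, actions)  # pure supervised fit of the dataset
opt.zero_grad(); loss.backward(); opt.step()
```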
Data Efficiency: data scarcity in high-dimensional State Spaces and Experience Replay
Processing large amounts of data is limited by simulation speed; huge state spaces lead to worse policy convergence. Current RL SOTA uses data augmentation, but the augmented samples have low fidelity and complexity
Compounding Error: propagated by autoregressive RL planning
long-term errors build up when models trained for single-step prediction are forced to mimic multi-step state transitions; severity is often driven by data quality and the nature of environmental transitions in the subject space.
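A toy numerical illustration of the compounding-error effect; the 1-D linear dynamics and the per-step model bias are made up purely to show how a small single-step error grows over the rollout horizon.

```python
import numpy as np

# A model trained for one-step prediction is only slightly wrong per step,
# but autoregressive rollout feeds predictions back in, so the error compounds.
def rollout(coef, s0, horizon):
    states = [s0]
    for _ in range(horizon):
        states.append(coef * states[-1])  # next state built on the previous prediction
    return np.array(states)

true_states = rollout(coef=1.01, s0=1.0, horizon=50)    # "true" dynamics
model_states = rollout(coef=1.02, s0=1.0, horizon=50)   # model with ~1% per-step bias

for h in (1, 10, 50):
    print(f"horizon {h:2d}: abs error = {abs(true_states[h] - model_states[h]):.3f}")
```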
Multitask Generalization: a single policy rarely performs well across many tasks
- changing the reward function for a different purpose requires retraining
- online multitask RL implementations struggle with conflicting gradients and poor sample efficiency
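A toy illustration of the conflicting-gradients problem mentioned above: when two tasks' gradients on shared parameters have negative cosine similarity, naively summing them can hurt both tasks (the gradient vectors here are invented for illustration).

```python
import torch
import torch.nn.functional as F

# Two made-up per-task gradients on the same shared parameters.
grad_task_a = torch.tensor([ 1.0, 0.5, 0.0, -0.2])
grad_task_b = torch.tensor([-0.9, 0.1, 0.3,  0.4])

cos = F.cosine_similarity(grad_task_a, grad_task_b, dim=0)
print(f"cosine similarity: {cos:.2f}")  # negative value => the task gradients conflict
```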
Diffusion Modeling: generates trajectories over an offline dataset via a guided sampling process; it can succeed where RL struggles, since conventional RL models are more structured and tend to be less expressive.
Prominent formulations: Denoising Diffusion Probabilistic Model, Score-matching Generative Models
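For reference, the core DDPM training step reduces to a denoising regression: noise a sample according to a random timestep, then train a network to predict the injected noise. A minimal sketch with a placeholder MLP and fake data:

```python
import torch
import torch.nn as nn

# DDPM sketch: eps_model learns to predict the noise added to x_0 at step t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

data_dim = 8  # placeholder dimensionality of whatever we are modeling
eps_model = nn.Sequential(nn.Linear(data_dim + 1, 128), nn.ReLU(),
                          nn.Linear(128, data_dim))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)

x0 = torch.randn(64, data_dim)                        # stand-in for a data batch
t = torch.randint(0, T, (64,))
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward (noising) process

pred = eps_model(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
loss = nn.functional.mse_loss(pred, noise)            # simplified DDPM objective
opt.zero_grad(); loss.backward(); opt.step()
```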
Benefits of diffusion modeling:
- Expressiveness: replaces conventional Gaussian RL policies that fail to fit complex distributions; diffusion models can represent any normalizable distribution, directly improving the performance of constrained, parametrized policies (see the sampling sketch after this list)
- Data Efficiency: data augmentation with high-fidelity synthetic samples
- Compounding Error: replaces the autoregressive planning step by generating whole trajectories at once
- Multitask Generalization: latent embeddings capture shared structure across tasks
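To illustrate the expressiveness point above, here is a rough sketch of sampling an action from a state-conditioned diffusion policy: start from Gaussian noise and run the DDPM reverse process to denoise it into an action. The eps_model is an untrained placeholder; in practice it would be trained on (state, action) pairs with the denoising objective shown earlier.

```python
import torch
import torch.nn as nn

# Diffusion policy sketch: actions are sampled by iteratively denoising noise,
# so the action distribution can be multimodal rather than a single Gaussian.
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

state_dim, action_dim = 17, 6  # placeholders
eps_model = nn.Sequential(nn.Linear(state_dim + action_dim + 1, 128), nn.ReLU(),
                          nn.Linear(128, action_dim))

@torch.no_grad()
def sample_action(state):
    a = torch.randn(action_dim)                       # a_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(torch.cat([state, a, torch.tensor([t / T])]))
        # DDPM reverse-step mean
        a = (a - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a

action = sample_action(torch.randn(state_dim))
print(action)  # one expressive (potentially multimodal) action sample
```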
What does RL + Diffusion look like in practice?
Three main integrations of diffusion in RL:
![](/img/user/Screenshot 2024-06-14 at 1.10.20 PM.png)
RL Planner: planning is the process of using a dynamics model to make decisions that maximize the reward function.
the diffusion model refines the whole trajectory at each denoising step, as opposed to extending partially planned subsequences and feeding the output into the following step, which increases the effective planning horizon.
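A schematic of the planner pattern (in the spirit of Diffuser, not its actual implementation): the model denoises the entire trajectory tensor at once, with the known current state pinned by inpainting at every step; the names and the simplified update rule below are placeholders.

```python
import torch
import torch.nn as nn

# Planner sketch: denoise a whole (horizon, state_dim + action_dim) trajectory
# in one shot instead of autoregressively extending a partial plan.
H, state_dim, action_dim = 32, 17, 6  # placeholder sizes
traj_dim = state_dim + action_dim
n_denoise_steps = 50

denoiser = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU(),
                         nn.Linear(128, traj_dim))  # untrained stand-in

@torch.no_grad()
def plan(current_state):
    traj = torch.randn(H, traj_dim)              # start from pure noise
    for _ in range(n_denoise_steps):
        traj[0, :state_dim] = current_state      # inpaint: pin the known first state
        traj = traj - 0.1 * denoiser(traj)       # schematic denoising update
    return traj[:, state_dim:]                   # planned action sequence

actions = plan(torch.randn(state_dim))
print(actions.shape)  # (H, action_dim): a full plan produced in one shot
```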
RL Policy: most implementations focus on improving existing model-free RL methods. The biggest example is Diffusion-QL, which has alleviated RL's issues with expressiveness and over-conservatism.
diffusion provides the ability to model arbitrary action distributions
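Roughly, Diffusion-QL combines two terms: the standard denoising (behavior-cloning) loss that keeps the policy on the data manifold, and a Q-maximization term on actions sampled from the policy. A schematic of how they combine; the helper functions below are placeholders, not the paper's code:

```python
import torch
import torch.nn as nn

# Schematic Diffusion-QL policy objective: denoising BC loss + eta * (-Q).
state_dim, action_dim = 17, 6
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1))           # critic, trained separately

def diffusion_bc_loss(states, actions):
    # placeholder for the DDPM noise-prediction loss on dataset actions
    return torch.tensor(0.0, requires_grad=True)

def sample_from_policy(states):
    # placeholder for (differentiable) reverse-diffusion action sampling
    return torch.randn(states.shape[0], action_dim, requires_grad=True)

states = torch.randn(128, state_dim)
actions = torch.randn(128, action_dim)
eta = 1.0  # trade-off between staying on the data and maximizing value

bc_loss = diffusion_bc_loss(states, actions)
sampled = sample_from_policy(states)
q_loss = -q_net(torch.cat([states, sampled], dim=-1)).mean()
(bc_loss + eta * q_loss).backward()                # combined policy loss
```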
RL Data Synthesizer: essentially creating guided synthetic data for use in training.
using diffusion to augment the dataset with generated samples drawn from the learned data distribution that respect context, as opposed to arbitrary perturbations that can introduce unwanted artifacts into the model.
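A sketch of the synthesizer pattern under invented shapes: a diffusion model trained on the offline transitions generates extra synthetic transitions, and training batches mix real and generated data; `sample_synthetic` stands in for reverse-diffusion sampling from that trained model.

```python
import torch

# Data-synthesizer sketch: enlarge the training buffer with transitions drawn
# from a diffusion model fit to the offline dataset, then train any offline RL
# algorithm on the mixed batches.
state_dim, action_dim = 17, 6
transition_dim = state_dim + action_dim + 1 + state_dim   # (s, a, r, s')

real_transitions = torch.randn(10_000, transition_dim)    # stand-in offline data

def sample_synthetic(n):
    # placeholder for reverse-diffusion sampling from a trained transition model
    return torch.randn(n, transition_dim)

def make_batch(batch_size=256, synthetic_ratio=0.5):
    n_syn = int(batch_size * synthetic_ratio)
    idx = torch.randint(0, real_transitions.shape[0], (batch_size - n_syn,))
    return torch.cat([real_transitions[idx], sample_synthetic(n_syn)], dim=0)

batch = make_batch()
print(batch.shape)  # (256, transition_dim): half real, half diffusion-generated
```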
What have been the applications?
There are five main tracks: Offline Reinforcement Learning, Online Reinforcement Learning, Imitation Learning, Trajectory Generation, and Data Augmentation