Diffusion Models for Reinforcement Learning Survey
TL;DR: Diffusion models are becoming ubiquitous in Reinforcement Learning, providing high sample efficiency largely through data augmentation of the training data.
Table of contents:
- RL Overview
- Diffusion Overview
- What does RL + Diffusion look like in practice?
- What have been the applications?
Reinforcement Learning: training agents to solve sequential decision-making tasks while maximizing a given reward function; mainly used in high-dimensional settings with a dynamic environment.
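As a quick refresher, here is a minimal sketch of the agent-environment loop that RL optimizes; it assumes the gymnasium package and uses a random placeholder policy:

```python
import gymnasium as gym

# Minimal agent-environment loop: the agent acts, the environment returns a
# reward and next state, and the objective is to maximize cumulative reward.
env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder policy: random actions
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode return: {total_reward}")
```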
Key terms: RL Planner, RL Policy, RL Data Synthesizer
Struggles in RL:
Expressiveness: conventional policy classes don't fully capture complex dynamics and action distributions
off-policy RL faces severe extrapolation errors, especially when evaluating out-of-distribution samples; constraining the policy to stay near the behavior data helps maintain training stability, but reduces the model's representational power
Reinforcement Learning via Supervised Learning (RvS) eliminates Q-learning and thus reduces extrapolation error; the catch is that the policy must fit the entire dataset, which again points to the expressiveness of the policy in use.
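A minimal sketch of the RvS idea under placeholder dimensions and fake data: the policy is regressed on (state, return-to-go) pairs with a plain supervised loss, so there is no Q-function and no bootstrapped extrapolation error.

```python
import torch
import torch.nn as nn

# RvS sketch: condition the policy on the return-to-go and train it with a
# supervised regression loss instead of Q-learning.
state_dim, action_dim = 17, 6  # placeholder dimensions

policy = nn.Sequential(
    nn.Linear(state_dim + 1, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Fake offline batch: states, actions, and the return-to-go from each state.
states = torch.randn(128, state_dim)
actions = torch.randn(128, action_dim)
returns_to_go = torch.randn(128, 1)

pred = policy(torch.cat([states, returns_to_go], dim=-1))
loss = nn.functional.mse_loss(pred, actions)  # pure supervised fit of the dataset
opt.zero_grad(); loss.backward(); opt.step()
```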
Data Efficiency: data scarcity in high-dimensional State Spaces and Experience Replay
Processing large amounts of data is limited by simulation speed; huge state spaces lead to worse policy convergence. Current RL SOTA uses data augmentation, but the augmented samples have low fidelity and complexity
Compounding Error: propagated by autoregressive RL planning
long-term errors build up when models trained for single-step prediction are forced to mimic multi-step state transitions; severity is often driven by data quality and the nature of environmental transitions in the subject space.
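A toy numerical illustration of the compounding-error effect; the 1-D linear dynamics and the per-step model bias are made up purely to show how a small single-step error grows over the rollout horizon.

```python
import numpy as np

# A model trained for one-step prediction is only slightly wrong per step,
# but autoregressive rollout feeds predictions back in, so the error compounds.
def rollout(coef, s0, horizon):
    states = [s0]
    for _ in range(horizon):
        states.append(coef * states[-1])  # next state built on the previous prediction
    return np.array(states)

true_states = rollout(coef=1.01, s0=1.0, horizon=50)    # "true" dynamics
model_states = rollout(coef=1.02, s0=1.0, horizon=50)   # model with ~1% per-step bias

for h in (1, 10, 50):
    print(f"horizon {h:2d}: abs error = {abs(true_states[h] - model_states[h]):.3f}")
```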
Multitask Generalization: a single policy rarely performs well across many tasks
- changing the reward function for a different purpose requires retraining
- online multitask RL implementations struggle with conflicting gradients and poor sample efficiency
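A toy illustration of the conflicting-gradients problem mentioned above: when two tasks' gradients on shared parameters have negative cosine similarity, naively summing them can hurt both tasks (the gradient vectors here are invented for illustration).

```python
import torch
import torch.nn.functional as F

# Two made-up per-task gradients on the same shared parameters.
grad_task_a = torch.tensor([ 1.0, 0.5, 0.0, -0.2])
grad_task_b = torch.tensor([-0.9, 0.1, 0.3,  0.4])

cos = F.cosine_similarity(grad_task_a, grad_task_b, dim=0)
print(f"cosine similarity: {cos:.2f}")  # negative value => the task gradients conflict
```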
Diffusion Modeling: generates trajectories over an offline dataset via a guided sampling process; it can succeed where RL struggles, since conventional RL models are more structured and tend to be less expressive.
Prominent formulations: Denoising Diffusion Probabilistic Model, Score-matching Generative Models
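For reference, the core DDPM training step reduces to a denoising regression: noise a sample according to a random timestep, then train a network to predict the injected noise. A minimal sketch with a placeholder MLP and fake data:

```python
import torch
import torch.nn as nn

# DDPM sketch: eps_model learns to predict the noise added to x_0 at step t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

data_dim = 8  # placeholder dimensionality of whatever we are modeling
eps_model = nn.Sequential(nn.Linear(data_dim + 1, 128), nn.ReLU(),
                          nn.Linear(128, data_dim))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)

x0 = torch.randn(64, data_dim)                        # stand-in for a data batch
t = torch.randint(0, T, (64,))
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward (noising) process

pred = eps_model(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
loss = nn.functional.mse_loss(pred, noise)            # simplified DDPM objective
opt.zero_grad(); loss.backward(); opt.step()
```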
Benefits of diffusion modeling:
- Expressiveness: replaces conventional Gaussian RL policies that fail to fit complex distributions; diffusion models can represent any normalizable distribution, directly improving the performance of constrained, parametrized policies (see the sampling sketch after this list)
- Data Efficiency: data augmentation with high-fidelity synthetic samples
- Compounding Error: replaces the autoregressive planning step by generating whole trajectories at once
- Multitask Generalization: latent embeddings capture shared structure across tasks
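To illustrate the expressiveness point above, here is a rough sketch of sampling an action from a state-conditioned diffusion policy: start from Gaussian noise and run the DDPM reverse process to denoise it into an action. The eps_model is an untrained placeholder; in practice it would be trained on (state, action) pairs with the denoising objective shown earlier.

```python
import torch
import torch.nn as nn

# Diffusion policy sketch: actions are sampled by iteratively denoising noise,
# so the action distribution can be multimodal rather than a single Gaussian.
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

state_dim, action_dim = 17, 6  # placeholders
eps_model = nn.Sequential(nn.Linear(state_dim + action_dim + 1, 128), nn.ReLU(),
                          nn.Linear(128, action_dim))

@torch.no_grad()
def sample_action(state):
    a = torch.randn(action_dim)                       # a_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(torch.cat([state, a, torch.tensor([t / T])]))
        # DDPM reverse-step mean
        a = (a - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a

action = sample_action(torch.randn(state_dim))
print(action)  # one expressive (potentially multimodal) action sample
```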
What does RL + Diffusion look like in practice?
Three main integrations of diffusion in RL:
![](/img/user/Screenshot 2024-06-14 at 1.10.20 PM.png)
RL Planner: planning is the process of using a dynamics model to make decisions that maximize the reward function.
the diffusion model refines the whole trajectory at each denoising step, as opposed to extending partially planned subsequences and feeding the output into the following step, which increases the effective planning horizon.
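A schematic of the planner pattern (in the spirit of Diffuser, not its actual implementation): the model denoises the entire trajectory tensor at once, with the known current state pinned by inpainting at every step; the names and the simplified update rule below are placeholders.

```python
import torch
import torch.nn as nn

# Planner sketch: denoise a whole (horizon, state_dim + action_dim) trajectory
# in one shot instead of autoregressively extending a partial plan.
H, state_dim, action_dim = 32, 17, 6  # placeholder sizes
traj_dim = state_dim + action_dim
n_denoise_steps = 50

denoiser = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU(),
                         nn.Linear(128, traj_dim))  # untrained stand-in

@torch.no_grad()
def plan(current_state):
    traj = torch.randn(H, traj_dim)              # start from pure noise
    for _ in range(n_denoise_steps):
        traj[0, :state_dim] = current_state      # inpaint: pin the known first state
        traj = traj - 0.1 * denoiser(traj)       # schematic denoising update
    return traj[:, state_dim:]                   # planned action sequence

actions = plan(torch.randn(state_dim))
print(actions.shape)  # (H, action_dim): a full plan produced in one shot
```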
RL Policy: most implementations focus on improving existing model-free RL methods. The biggest example is Diffusion-QL, which has alleviated RL's issues with expressiveness and over-conservatism.
diffusion provides the ability to model arbitrary action distributions
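Roughly, Diffusion-QL combines two terms: the standard denoising (behavior-cloning) loss that keeps the policy on the data manifold, and a Q-maximization term on actions sampled from the policy. A schematic of how they combine; the helper functions below are placeholders, not the paper's code:

```python
import torch
import torch.nn as nn

# Schematic Diffusion-QL policy objective: denoising BC loss + eta * (-Q).
state_dim, action_dim = 17, 6
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, 1))           # critic, trained separately

def diffusion_bc_loss(states, actions):
    # placeholder for the DDPM noise-prediction loss on dataset actions
    return torch.tensor(0.0, requires_grad=True)

def sample_from_policy(states):
    # placeholder for (differentiable) reverse-diffusion action sampling
    return torch.randn(states.shape[0], action_dim, requires_grad=True)

states = torch.randn(128, state_dim)
actions = torch.randn(128, action_dim)
eta = 1.0  # trade-off between staying on the data and maximizing value

bc_loss = diffusion_bc_loss(states, actions)
sampled = sample_from_policy(states)
q_loss = -q_net(torch.cat([states, sampled], dim=-1)).mean()
(bc_loss + eta * q_loss).backward()                # combined policy loss
```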
RL Data Synthesizer: essentially creating guided synthetic data for use in training.
using diffusion to augment the dataset with generated samples drawn from the learned data distribution that respect context, as opposed to arbitrary perturbations that can introduce unwanted artifacts into the model.
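A sketch of the synthesizer pattern under invented shapes: a diffusion model trained on the offline transitions generates extra synthetic transitions, and training batches mix real and generated data; `sample_synthetic` stands in for reverse-diffusion sampling from that trained model.

```python
import torch

# Data-synthesizer sketch: enlarge the training buffer with transitions drawn
# from a diffusion model fit to the offline dataset, then train any offline RL
# algorithm on the mixed batches.
state_dim, action_dim = 17, 6
transition_dim = state_dim + action_dim + 1 + state_dim   # (s, a, r, s')

real_transitions = torch.randn(10_000, transition_dim)    # stand-in offline data

def sample_synthetic(n):
    # placeholder for reverse-diffusion sampling from a trained transition model
    return torch.randn(n, transition_dim)

def make_batch(batch_size=256, synthetic_ratio=0.5):
    n_syn = int(batch_size * synthetic_ratio)
    idx = torch.randint(0, real_transitions.shape[0], (batch_size - n_syn,))
    return torch.cat([real_transitions[idx], sample_synthetic(n_syn)], dim=0)

batch = make_batch()
print(batch.shape)  # (256, transition_dim): half real, half diffusion-generated
```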
What have been the applications?
There are five main tracks: Offline Reinforcement Learning, Online Reinforcement Learning, Imitation Learning, Trajectory Generation, and Data Augmentation