Demystifying the Many Deep Reinforcement Learning Algorithms


In recent years there has been an explosion in Deep Reinforcement learning research resulting in the creation of many different RL algorithms that work with deep networks. In DeepRL and RL in general the goal is to optimize a policy \(\pi(a|s,\theta)\) with parameters \(\theta\) with respect to the future discounted reward.

$$ J(\pi) = \mathbb{E} [\sum_\limits{t=0}^{T} \gamma^{t} r_{t}] $$

It can be difficult to keep track of the many algorithms let alone their properties and when it is best to use which one.

In this post I am make an effort to organize a number of RL methods into a few groups. This organization helps clear up some misconceptions of different algorithms and demystifies what these properties really mean, for example, on-policy vs off-policy.

Algorithms and Properties

Here a table of popular RL algorithms is given.

Name on/off policy action space support exploration method
TRPO on-policy discrete and continuous e-greedy and Gaussian noise
DDPG off-policy continuous only Ornstein–Uhlenbeck process
A3C on-policy discrete and continuous e-greedy and Gaussian noise
CACLA on-policy continuous e-greedy and Gaussian noise
DQN off-policy discrete e-greedy
A2C on-policy discrete and continuous e-greedy and Gaussian noise
Q-Prop on-policy discrete and continuous e-greedy and Gaussian noise
PPO on-policy discrete and continuous e-greedy and Gaussian noise
CEM on-policy discrete and continuous e-greedy and Guassian noise


Exploration Method:

  • e-greedy: Is primarily used for discrete action domains. The main idea being that a random action should be chosen with a non-zero probability.

Original action selection probability

with epsilon greedy
- Gaussian: Is typically for continuous action domains and one of a few methods used. Many method using the neural network to model the mean of a number of Gaussian distributions that may have a fixed std.
- Ornstein–Uhlenbeck This is a different kind of exploration noise that is based on "the velocity of a massive Brownian particle under the influence of friction". Often used for Q-learning with DDPG.

One property of RL algorithms I still rack my brain is on-policy vs off-policy. I often find the line between these two very fuzzy. While it is known that any method that uses an action-valued function is off policy, some methods that claim to be on-policy are used in ways that seem very off-policy. In some papers the amount of data collected before performing training updates is in the 100,000s, far more than anything I have ever used for off-policy learning. It is has also been shown that keeping around data from the last few data collection cycles is beneficial. We also seem to gloss over the fact that we do not train the policy and value function to convergence between data collection cycles. Therefore, in practise we are always using an off-policy critic.

Learning algorithms

Continuous Actor Critic Learning Automaton (CACLA)

CACLA can be thought of as a simpler policy learning algorithm that was developed before some of the recent formal mathematical methods. A critic (value function) and policy are learned at the same time (aka actor-critic method). CACLA uses the critic as a score function to decide what actions the policy should learning from. In this case it is used as an indicator function, an action that performs better than the current estimate of the policies future discounted reward is given to the policy to learn from. The policy uses MSE loss to fit the policy network to these better than expected actions.


  1. CACLA seems to work very well with neural networks. Being able to use a simple MSE loss for the policy makes SGD very stable.


  1. Can't addjust the amount of exploration the policy should perform.
  2. Original method only uses single step returns to determine if reward for action is above average.

Asynchronous Actor Critic (A3C)

A3C can be thought of as a modification of CACLA in at least two ways. The MSE loss is traded with a log likelihood loss and the score function is exchanged for a multi-step return estimation. Switching to a log likelihood loss allows us to model the distribution of action better, including the std of the policy. In the case of continuous actions the policy is modelled as a vector of Gaussian distributions. This in combination with the new score function that allows for a better trade-off between bias of the value function and the variance in Monte-Carlo samples from the simulation allows us to include a loss for the standard deviations in the policy. Now we can update the amount of exploration we think the policy should be performing in a on-policy state dependant manner.


  1. Using on-policy method with log likelihood loss allows the exploration to be learned.
  2. Method can be parallelized well.


  1. In practise stable exploration is still a challenge. An Entropy term is added to the loss to help increase exploration.
  2. The policy mean can still change dramatically between training updates, making it challengine to converge on a good policy.

Deep Deterministic Policy Gradient (DDPG)

DDPG is a slightly different framework more like DQN. Instead of using a score function to push the policy in the direction of actions with higher reward, instead a action-valued Q-function is used. This action-valued function \(q(s,a)\) can be used to compute the direction to change the current action in order to increase the overall future discounted reward. Simply, the method maximizes \(q(s,\pi(a|s))\). However, this method does not account for how exploration should be performed. Approximating a Q-function is done off-policy allowing the Q-function to learn the value of taking actions in states the policy might not explore and in doing so better estimate the value over all actions. This property should theoretically allow for the learning of a Q-function that encodes the best action to take in any state. In practise this model learns estimates over the current policy mean in order to improve the policy.


  1. Can work very well for high dimensional continuous action problems.
  2. Learns an action-valued function instead of a state-valued function.


  1. Does not explicitly provide a formalism for the exploration to be used.
  2. Needs a target network to increase stability while learning an action-valued function.
  3. The policy is only as good as the Q-function is accurate...

Trust Region Policy Optimization (TRPO)

TRPO is a great algorithm with some very good theoretical properties. Stochastic policy optimization can be challenging because the policy distribution can change quickly after network updates. This in combination with the properties of deep neural networks, in particular that deep nets like to receive nice small gradients, make learning challenging. When the stochasticity grows so do the gradients, when the gradients grow so does the stochasticity of the policy. TRPO uses conjugate gradient decent with line search to optimize the policy parameters \(\theta\) with respect to a constraint on the KL divergence. The KL divergence constraint assist in keeping the stochastic distribution the policy models from changing to much between updates.


  1. TRPO can work very well and is known to even exhibit monotonic policy improvements.


  1. Challenging (impossible?) to use TRPO for things like recurrent neural networks.
  2. Lots of memory is needed to compute the Fisher vector product (AKA gradient vector product) of the parameters \(\theta\)
  3. Not easily parallelizable.

Generalize Advantage Estimation (GAE)

While not an RL algorithm GAE is used to reduce the variance in the advantage estimates for the policy. TRPO uses a KL divergence penalty to reduce the amount the policy distribution can shift between updates, however there can still be a significant variance in the advantage estimate.


Here we can see a set of states labelled with their future discounted reward;advantage (from samples);baseline. The True policy gradient is visualized with the ellipses. One of the challenges with estimating the Advantage is that it is based on the future discounted reward. The agent could have randomly found some good reward for a few timesteps and then wondered into a very low reward area of the state space. This will result in a very low overall future discounted reward and a high Advantage for the states the agent explored in the high reward area. This advantage is good but for practise reasons having widely different advantage values can lead to large and noisy gradients.

GAE reduces the variance by mixing the advantage estimates from just the data with estimates computed from the returns using the value function.

Proximal Policy Optimization (PPO)

PPO was born from work that was done to combat the weaknesses of TRPO while keeping the monotonic behaviour. PPO uses stochastic gradient descent instead of conjugate gradient descent to optimize the policy parameters. Instead of a hard constraint on the KL divergence a soft constraint is used. This constraint is implemented in the form of a boundary violation penalty or a clipping term (the official version). PPO has been shown to have close to, or better performance than TRPO.


  1. Parallelizable.
  2. Does not use as much memory as TRPO


  1. KL can still change significantly between updates. Some network tuning could be necessary

Q-Prop (Q-Prop)

It would be really nice if we could combine the good parts of off-policy learning with those of on-policy learning. Q-Prop is one of a few works in this area. The goal is to combine the gradients from DDPG with unbiased on-policy data. Methods like DDPG can never be extremely accurate because the gradients passed to the actor come from an imperfect deep network with its own error. To get around this it is better to directly use data samples to reduce this error. Q-Prop does this by using an additional estimate of the advantage function (as the Q-function from DDPG) as a control variate to reduce the advantage variance from GAE.

$$\Delta_{\theta} J(\theta) = E_{\rho \pi}[\Delta_{\theta}~log~\pi(a|s,\theta)(\hat{A}(s_t,a_t) - \bar{A}_w(s_t, a_t)] + E_{\rho \pi} [\Delta_a Q_w(s_t, a)|_a=\mu(s_t,\theta)\Delta_\theta \mu(s_t, \theta)]$$

This estimate works to further reduce the advantage variance, However, it might not be so clear that this method helps.


There has been a great amount of RL research that shows RL has promise and can find high value solutions. However, RL still suffers from a tenancy to get trapped in local minima. Some believe this is related to basic models we use for policy distributions. In this area some people have used parameter space noise exploration in stead of action space (Parameter Space Noise for Exploration,Noisy Networks for Exploration). Others are doing work on better frameworks to exploit the data better. Still more work using Stochastic Value Gradients learns a model to optimize the stochastic bellman equation. A Distributional Perspective on Reinforcement Learning focuses on learning an approximate value distribution. A similar method has been applied to DDPG as well Distributed Distributional Deterministic Policy Gradients

There are still many more RL algorithms than the ones mentioned here. These are the ones that are popular and I am more familiar with. This post has followed a road map of how RL research has progressed, and why. There are many other aspects to consider when using an RL algorithm, such as: how difficult is the problem to be solved, how many dimensions is the action space, does the state include images, etc. If you are interested in using an RL algorithm I suggest picking one off this list and possibly research some of the related papers that may have made small modifications to the algorithm. Also, checking to papers/works that have solved a problem similar to part of what you are working on is a good start.