# Introduction:

Less than three years after DeepMind published ‘Playing Atari with Deep Reinforcement Learning’ [2], the practical impact of this method on the RL literature has been profound, as evidenced by the above graphic. However, the theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis clarifies the evolution of DQN and highlights which research directions are worth prioritising.

# Background on DQN:

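The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions $a \in \mathcal{A}$ and states $x \in X$ in a Markov Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return under a policy $\pi$ with discount factor $\gamma \in [0,1]$:

$$Q_{\pi}(x,a) = \mathbb{E}\left[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \mid x_0 = x, a_0 = a, \pi\right] \tag{1}$$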

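In particular, the aim is to approximate a parametrised value function $Q(x,a;\theta_t)$, where estimation is shifted towards the target $Y_t^Q$, built from the observed reward $R_{t+1}$ and the next state $x_{t+1}$:

$$Y_t^Q = R_{t+1} + \gamma \max_{a} Q(x_{t+1}, a; \theta_t) \tag{2}$$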

In addition, epsilon-greedy action selection is used for exploration. To avoid estimates that merely reflect recent experience, the authors of DQN also have the network regularly perform experience replay: batch updates based on less recent experience.
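The three ingredients just described (the one-step target, epsilon-greedy selection and experience replay) can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: Q-values are passed in as plain lists rather than produced by a network, and the names `epsilon_greedy`, `dqn_target` and `ReplayBuffer` are hypothetical and not taken from [2].

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def dqn_target(reward, next_q_values, gamma, done):
    """One-step target: r + gamma * max_a Q(x', a), or just r at the end of an episode."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly for batch updates."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Sampling without replacement breaks up the ordering of recent experience.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Sampling uniformly from the buffer is the simplest choice; it is what makes a batch reflect older experience as well as the most recent transitions.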

Given the above description of DQN, we may note the following:

1. Selection and evaluation in DQN are done with respect to the same parameters $\theta_t$.
2. Assuming that variance is unavoidable, the $\max$ operator in (2) leads to over-optimistic estimates.
3. The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.

These issues shall be addressed in the sections that follow.

# Asymptotic nonsense or the data-inefficiency of DQN:

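In the simple case of i.i.d. data $X_i$, if $S_n = \sum_{i=1}^{n} X_i$, $\mathbb{E}[X_i] = \mu$ and $Var[X_i] = \sigma^2$, a simple application of Chebyshev’s inequality gives, for any $\epsilon > 0$:

$$P\left(\left|\frac{S_n}{n} - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{n\epsilon^2}$$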

Essentially, this inequality shows that even in simple scenarios the sample mean converges in probability only slowly: the bound decays as $1/n$, so a tight estimate requires a lot of data, and the rate of convergence depends on the variance $\sigma^2$. Furthermore, we must note that this inequality ignores the following facts:

1. For fixed $(x,a)$, $Q_{\pi}(x,a)$ is rarely unimodal in practice.
2. $Q_{\pi}(x,a)$ rarely has negligible variance.
3. Our data is sequential and hardly ever i.i.d.

From these points it follows that significant estimation errors are unavoidable but, as I will show, this isn’t the main problem.
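To put rough numbers on the Chebyshev argument, the bound $\sigma^2 / (n\epsilon^2)$ can be inverted to ask how many i.i.d. samples are needed before the probability of an error larger than $\epsilon$ is guaranteed to be at most some $\delta$. The sketch below does exactly that; the function names and the parameter `delta` are my own notation, and the i.i.d. assumption is the optimistic one discussed above.

```python
import math


def chebyshev_bound(variance, n, eps):
    """Upper bound on P(|S_n/n - mu| >= eps) for i.i.d. data, capped at 1."""
    return min(1.0, variance / (n * eps ** 2))


def samples_needed(variance, eps, delta):
    """Smallest n for which the Chebyshev bound is at most delta."""
    return math.ceil(variance / (delta * eps ** 2))


# Unit variance, tolerance 1/8, failure probability 1/16.
print(samples_needed(1.0, 0.125, 0.0625))   # 1024
# Halving the tolerance quadruples the requirement.
print(samples_needed(1.0, 0.0625, 0.0625))  # 4096
```

The quadratic dependence on $1/\epsilon$ is the point: each halving of the tolerated error multiplies the required data by four, and that is before accounting for multimodality or sequential correlation.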

# The unreasonable optimism of DQN:

1. Over-optimism with respect to estimation errors:

The authors of [3] highlight that in (2), evaluation of the target $Y_t^Q$ and action selection are done with respect to the same parameters $\theta_t$, which makes over-optimistic value estimates more likely because of the $\max$ operator. This suggests that estimation errors of any kind are more likely to result in overly-optimistic policies.
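The effect of the $\max$ operator on noisy estimates is easy to reproduce. In the sketch below every action has a true value of exactly zero and each estimate is corrupted by independent Gaussian noise; these are simplifying assumptions of mine, not the setting analysed in [3], and `max_bias` is a hypothetical helper.

```python
import random


def max_bias(num_actions, noise_std, trials, seed=0):
    """Average of max_a(estimate_a) when every true action value is 0.

    Any positive result is pure overestimation caused by taking the max
    over noisy estimates.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


if __name__ == "__main__":
    for k in (1, 2, 10):
        print(k, round(max_bias(k, 1.0, 2000), 2))
```

With a single action the average stays close to zero, while with ten actions and unit noise it should come out at roughly 1.5, even though no action is actually better than any other.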

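While this is problematic, the authors of [3] discovered the following elegant solution, in which the greedy action is selected with $\theta_t$ but evaluated with a second set of weights $\theta'_t$:

$$Y_t^{\text{DoubleQ}} = R_{t+1} + \gamma\, Q\left(x_{t+1}, \arg\max_{a} Q(x_{t+1}, a; \theta_t); \theta'_t\right)$$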

The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights $\theta$ and $\theta'$.
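The difference between the two targets comes down to a single line. The following sketch treats the two sets of weights as two precomputed lists of next-state action values, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda a: values[a])


def q_learning_target(reward, next_q_online, gamma):
    """Standard target: the same estimates both select and evaluate the action."""
    return reward + gamma * max(next_q_online)


def double_q_target(reward, next_q_online, next_q_other, gamma):
    """Double Q target: select with one set of estimates, evaluate with the other."""
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_other[best_action]


# The online estimates overrate action 1; the second set is more conservative.
online = [1.0, 3.0]
other = [1.5, 2.0]
print(q_learning_target(0.0, online, 0.5))       # 1.5
print(double_q_target(0.0, online, other, 0.5))  # 1.0
```

Because the noise in the two sets of estimates is less correlated than the noise within one set, an action that is selected because of an upward error is less likely to also be evaluated with an upward error.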

2. Over-optimism with respect to risk regardless of estimation error:

Consider the classic problem in decision theory of having to choose between an envelope $A$ which contains \$90.00 and an envelope $B$ which contains \$200.00 or \$0.00 with equal probability. Although $Var[A] \ll Var[B]$, our agent’s ignorance of the bimodality of $B$ would lead it to act in an over-optimistic fashion. Due to the $\max$ operator it would make a decision solely based on the fact that $\mathbb{E}[B] = 100 > \mathbb{E}[A] = 90$.

The above problem clearly requires a very different perspective.
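The envelope example can be checked numerically. The sketch below computes the mean and variance of each envelope and then compares them under a simple mean-minus-standard-deviation score. That score is a hypothetical risk-sensitive criterion I am using only to illustrate the point; it is not the method of [5] or [7].

```python
import math


def mean(outcomes):
    """Expected value of a list of (value, probability) pairs."""
    return sum(p * v for v, p in outcomes)


def variance(outcomes):
    """Variance of a list of (value, probability) pairs."""
    m = mean(outcomes)
    return sum(p * (v - m) ** 2 for v, p in outcomes)


def risk_adjusted(outcomes, risk_aversion):
    """Illustrative score: mean minus risk_aversion times standard deviation."""
    return mean(outcomes) - risk_aversion * math.sqrt(variance(outcomes))


envelope_a = [(90.0, 1.0)]
envelope_b = [(200.0, 0.5), (0.0, 0.5)]

print(mean(envelope_a), variance(envelope_a))  # 90.0 0.0
print(mean(envelope_b), variance(envelope_b))  # 100.0 10000.0
# An agent that only compares means picks B; a mildly risk-averse one picks A.
print(risk_adjusted(envelope_a, 0.5), risk_adjusted(envelope_b, 0.5))  # 90.0 50.0
```

An agent that only sees the two means has no way of expressing this preference, which is why information about the shape of the return distribution matters.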

Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail, I would recommend that the reader start with [5], which provides an elegant and scalable solution with what can be thought of as a data-dependent version of dropout [8]. Considering value distributions rather than point estimates helps the agent represent its uncertainty and improves inference.

# The latent value of hierarchical models:

Perhaps the most important question when considering the evolution of DQN is how these agents will develop rich conceptual abstractions that allow scientific induction or generalisation. Although one can argue that a DQN learns good statistical representations of environmental states $x$, it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.

In particular, I would like to mention the promising work of Tejas Kulkarni et al. [6], who investigated the use of hierarchical DQN, which has the following architecture:

1. Controller: which learns policies in order to satisfy particular goals
2. Meta-Controller: which chooses goals
3. Critic: which evaluates whether a goal has been achieved

Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a grammar over policies. I think this might be a necessary component for the emergence of language in machines.
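To make the division of labour between the three components concrete, here is a toy sketch of the control flow only. It assumes a one-dimensional corridor with integer positions, and both policies are fixed stand-in rules rather than learned Q-networks, so nothing here reproduces the training procedure of [6]; all names are hypothetical.

```python
import random


def critic(state, goal):
    """Intrinsic reward: 1.0 once the goal state has been reached, else 0.0."""
    return 1.0 if state == goal else 0.0


def controller(state, goal):
    """Stand-in low-level policy over atomic actions: step -1 or +1 towards the goal."""
    return 1 if goal > state else -1


def meta_controller(goals, rng):
    """Stand-in high-level policy: pick one of the available goals at random."""
    return rng.choice(goals)


def run_episode(start, goals, num_goals=3, max_steps_per_goal=20, seed=0):
    """Alternate between choosing a goal and acting until the critic is satisfied."""
    rng = random.Random(seed)
    state, trace = start, []
    for _ in range(num_goals):
        goal = meta_controller(goals, rng)
        steps = 0
        # The controller keeps acting until the critic signals success
        # or the step budget for this goal runs out.
        while critic(state, goal) == 0.0 and steps < max_steps_per_goal:
            state += controller(state, goal)
            steps += 1
        trace.append((goal, state, critic(state, goal)))
    return trace


if __name__ == "__main__":
    for goal, final_state, intrinsic_reward in run_episode(0, [3, 7, 5]):
        print(goal, final_state, intrinsic_reward)
```

In a learned version, the meta-controller would be trained on the extrinsic reward collected while a goal is active, and the controller on the intrinsic reward issued by the critic.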

The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of introspection.

# Conclusion:

I find it remarkable that a simple method such as DQN should inspire so many new approaches. Perhaps it’s not so much the brilliance of the method but rather its generality which allowed it to adapt and evolve. In particular, I think the coupling of Distributional RL with Hierarchical Deep RL has a very bright future. Together, these should lead to significant improvements in terms of inference and generalisation.

Note: The graphic is taken from [9].

# References:

1. C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.
2. V. Mnih, K. Kavukcuoglu, D. Silver et al. Playing Atari with Deep Reinforcement Learning. 2015.
3. H. van Hasselt, A. Guez and D. Silver. Deep Reinforcement Learning with Double Q-learning. 2015.
4. Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. 2017.
5. Ian Osband, Charles Blundell, Alexander Pritzel and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. 2016.
6. Tejas Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. 2016.
7. Marc G. Bellemare, Will Dabney and Rémi Munos. A Distributional Perspective on Reinforcement Learning. 2017.
8. Yarin Gal & Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 2016.
9. Niels Justesen, Philip Bontrager, Julian Togelius, Sebastian Risi. Deep Learning for Video Game Playing. 2017.