<p>Pauli Space: Investigations on the foundations for intelligence.</p>
<p>Controlling a unicycle with Policy Gradients (2018-05-09)</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/unicycle_image.png" align="middle" /></center>
<h2 id="motivation">Motivation:</h2>
<p>A few weeks ago I spent some time reflecting on McGeer’s passive dynamic walkers, which walk smoothly down an inclined plane without any digital computation, and wondered whether a reinforcement learning algorithm could discover similar gaits for flat terrain [1]. From a reinforcement learning perspective, Policy Gradients may be the simplest approach to bipedal dynamics, yet these methods have proven effective on continuous control tasks [2]. For this reason, I decided to start investigating this problem with Vanilla Policy Gradients.</p>
<p>A secondary motivation came from <a href="http://www.argmin.net/2018/02/20/reinforce/"><em>The Policy of Truth</em></a>, an article where Ben Recht, a professor of optimisation and machine learning at Berkeley, presented a harsh critique of Policy Gradient methods and the usage of stochastic policies in particular. While Policy Gradient methods definitely have important limitations, I demonstrate that the points raised by Ben Recht are non-issues. On the other hand, it must be noted that bipedal walkers can’t be reduced to reinforcement learning. If anything, considering that McGeer’s cleverly-designed walkers don’t do any computations, the embodiment of the bipedal walker must be taken seriously.</p>
<p>In order to get started on a simple problem<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, I decided to <a href="https://github.com/pauli-space/RL_unicycle_control">apply Policy Gradients to unicycles</a> which, like passive dynamic walkers, are dynamically similar to inverted pendulums.</p>
<h2 id="the-policy-gradients-formalism">The Policy Gradients formalism:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/rl.png" align="middle" /></center>
<h3 id="trajectories-policies-and-rewards">Trajectories, Policies and Rewards:</h3>
<p>A trajectory (aka rollout) is a sequence of states <script type="math/tex">s_t</script> and actions <script type="math/tex">u_t</script> generated by a dynamical system:</p>
<script type="math/tex; mode=display">\begin{equation}
\tau_t = (u_0,...,u_t,s_0,...,s_t)
\end{equation}</script>
<p>and a policy <script type="math/tex">\pi_{\vartheta} (\cdot \lvert s)</script> is a conditional distribution on an agent’s actions <script type="math/tex">u_t \in A</script> given states <script type="math/tex">s_t \in S</script>. This conditional
distribution is typically parametrised by a function approximator (e.g. a neural network) with parameters <script type="math/tex">\vartheta</script> and serves as a stochastic policy from which we
can sample actions: <script type="math/tex">u_t \sim \pi_{\vartheta} (\cdot \lvert s_{t})</script>.</p>
<p>Now, the objective of Policy Gradients is to find a policy <script type="math/tex">\pi_{\vartheta}</script> that maximises the total reward after <script type="math/tex">H</script> time steps:</p>
<script type="math/tex; mode=display">\begin{equation}
R(\tau) = \sum_{t=0}^H R(s_t,u_t)
\end{equation}</script>
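<p>To make the notation concrete, here is a minimal NumPy sketch of sampling one rollout and accumulating its total reward. The linear-Gaussian policy, the function names and the toy dynamics are all assumptions chosen for illustration; they are not the unicycle model used later in this post.</p>

```python
import numpy as np

def sample_action(state, weights, sigma, rng):
    """Draw u ~ N(weights . state, sigma^2): a toy Gaussian policy (assumed form)."""
    mean = float(np.dot(weights, state))
    return rng.normal(mean, sigma)

def rollout(step_fn, reward_fn, s0, weights, sigma, horizon, rng):
    """Generate one trajectory and its total reward R(tau) for t = 0..H."""
    states = [np.asarray(s0, dtype=float)]
    actions = []
    total_reward = 0.0
    for _ in range(horizon + 1):
        s = states[-1]
        u = sample_action(s, weights, sigma, rng)
        total_reward += reward_fn(s, u)
        actions.append(u)
        states.append(step_fn(s, u))
    return states, actions, total_reward

# Toy usage (hypothetical dynamics): the action nudges a 2-d state,
# and the agent receives a reward of 1 for every time step.
rng = np.random.default_rng(0)
states, actions, total = rollout(
    step_fn=lambda s, u: s + 0.1 * u,
    reward_fn=lambda s, u: 1.0,
    s0=[0.05, 0.0],
    weights=np.array([-1.0, -0.5]),
    sigma=0.1,
    horizon=4,
    rng=rng,
)
```

<p>With a horizon of <script type="math/tex">H=4</script> the loop runs for five steps, matching the sum from <script type="math/tex">t=0</script> to <script type="math/tex">H</script> in the total reward.</p>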
<h3 id="derivation-of-policy-gradients">Derivation of Policy Gradients:</h3>
<p>Given the reward function <script type="math/tex">(2)</script>, our objective function is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
U(\vartheta) & = \mathbb{E}\big[\sum_{t=0}^H R(s_t,u_t);\pi_{\vartheta}\big] \\
& = \sum_{\tau} P(\tau;\vartheta) R(\tau)
\end{split}
\end{equation} %]]></script>
<p>where <script type="math/tex">P(\tau;\vartheta)</script> is the probability distribution over trajectories induced by <script type="math/tex">\pi_{\vartheta}</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
P(\tau;\vartheta) = \prod_{t=0}^H \underbrace{P(s_{t+1} \lvert s_t,u_t)}_\text{dynamics} \underbrace{\pi_{\vartheta}(u_t \lvert s_t)}_\text{policy}
\end{equation}</script>
<p>Having defined <script type="math/tex">(3)</script>, the Policy Gradients update may be derived as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} U(\vartheta) & = \nabla_{\vartheta} \sum_{\tau} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} \nabla_{\vartheta} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \frac{\nabla_{\vartheta} P(\tau;\vartheta)}{P(\tau;\vartheta)}R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) R(\tau)
\end{split}
\end{equation} %]]></script>
<p>A reasonable Monte Carlo approximation to <script type="math/tex">(5)</script> would be the sample average:</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla_{\vartheta} U(\vartheta) \approx \widehat{g} = \frac{1}{N} \sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) R(\tau^i)
\end{equation}</script>
<p>where each <script type="math/tex">\tau^i</script> denotes a distinct trajectory generated by running a simulator with policy <script type="math/tex">\pi_{\vartheta}</script>.
Crucially, this estimate can be computed without knowing the dynamics of the environment, because the dynamics terms do not depend on <script type="math/tex">\vartheta</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} \ln P(\tau^i;\vartheta) & = \nabla_{\vartheta} \ln \big[\prod_{t=0}^H P(s_{t+1}^i \lvert s_t^i,u_t^i) \pi_{\vartheta}(u_t^i \lvert s_t^i)\big]\\
& = \nabla_{\vartheta} \big[\sum_{t=0}^H \ln P(s_{t+1}^i \lvert s_t^i,u_t^i) + \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i)\big] \\
& = \nabla_{\vartheta} \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i)
\end{split}
\end{equation} %]]></script>
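<p>As a rough illustration of the last identity, the sketch below computes the score <script type="math/tex">\nabla_{\vartheta} \ln \pi_{\vartheta}(u \lvert s)</script> in closed form for a Gaussian policy with a linear mean and fixed standard deviation, and averages score-times-return over trajectories. The linear-Gaussian form and all function names are assumptions made for this example. Averaging rather than summing over trajectories only rescales the estimate.</p>

```python
import numpy as np

def log_gaussian(w, sigma, s, u):
    """ln pi(u | s) for a Gaussian policy with mean w . s and std sigma."""
    mean = w @ s
    return -0.5 * ((u - mean) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def grad_log_gaussian(w, sigma, s, u):
    """Closed-form gradient of ln pi(u | s) with respect to w."""
    return (u - w @ s) * s / sigma ** 2

def policy_gradient_estimate(trajectories, w, sigma):
    """Average of (sum_t grad ln pi(u_t | s_t)) * R(tau) over trajectories.

    Each trajectory is a tuple (states, actions, total_reward); no model
    of the environment dynamics is needed.
    """
    per_trajectory = []
    for states, actions, total_reward in trajectories:
        score = np.zeros_like(w)
        for s, u in zip(states, actions):
            score += grad_log_gaussian(w, sigma, s, u)
        per_trajectory.append(score * total_reward)
    return np.mean(per_trajectory, axis=0)
```

<p>Note that only the policy appears in the computation: the transition probabilities never need to be evaluated.</p>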
<p>What is Policy Gradients doing?</p>
<ol>
<li>Increasing the probability of paths with positive reward.</li>
<li>Decreasing the probability of paths with negative reward.</li>
</ol>
<p>But, what if rewards are strictly non-negative?</p>
<h3 id="reducing-the-variance-of-widehatg-with-baselines">Reducing the variance of <script type="math/tex">\widehat{g}</script> with baselines:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/actor-critic.png" align="middle" /></center>
<p>In environments where there are no negative rewards we may extract more signal from our data by using baselines. For concreteness,
we may use a constant baseline:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} U(\vartheta) \approx \widehat{g} & = \frac{1}{N} \sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) (R(\tau^i)-b) \\
b & = \frac{1}{N} \sum_{i=1}^N R(\tau^i)
\end{split}
\end{equation} %]]></script>
<p>or even better, we may use a state-dependent baseline <script type="math/tex">b(s_t)</script> which estimates the expected return from the current state
by minimising <script type="math/tex">\lVert b(s_t)-R_t \rVert^2</script> over all trajectories during training, where <script type="math/tex">R_t</script> is the return collected from time step <script type="math/tex">t</script> onwards. In this case <script type="math/tex">b</script> is a value estimator,
more often than not parametrised by a neural network, that allows the Policy Gradients model to do temporal difference learning
and therefore exploit the sequential structure of the decision problem. Moreover, using the advantage estimate <script type="math/tex">\widehat{A_t} = R_t-b(s_t)</script> rather
than the raw return reduces the variance of the gradient estimate, as it extracts more signal from the observations by telling the model
how much better the current action is than what is normally done in state <script type="math/tex">s_t</script>.</p>
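<p>A state-dependent baseline does not have to be a neural network. As a minimal sketch, assuming a baseline that is linear in the state (a simplification that differs from the implementation used in this post), the regression <script type="math/tex">\lVert b(s_t)-R_t \rVert^2</script> can be solved directly with least squares:</p>

```python
import numpy as np

def fit_linear_baseline(states, returns):
    """Least-squares fit of b(s) = [s, 1] . coef to the observed returns R_t.

    states: array of shape (n_samples, state_dim)
    returns: array of shape (n_samples,)
    """
    features = np.hstack([states, np.ones((len(states), 1))])
    coef, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return coef

def predict_baseline(coef, states):
    """Evaluate the fitted baseline b(s) on a batch of states."""
    features = np.hstack([states, np.ones((len(states), 1))])
    return features @ coef
```

<p>The advantage estimates are then simply <code>returns - predict_baseline(coef, states)</code>.</p>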
<p>Furthermore, it’s very useful to note that subtracting a constant baseline doesn’t introduce bias into the gradient estimate:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathbb{E}[\nabla_{\vartheta} \ln P(\tau;\vartheta) b] & = \int P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) b d\tau \\
& = \int \nabla_{\vartheta} P(\tau;\vartheta) b d\tau \\
& = b \nabla_{\vartheta} \int P(\tau;\vartheta) d\tau \\
& = b \nabla_{\vartheta} 1 = 0
\end{split}
\end{equation} %]]></script>
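<p>The identity above can also be checked numerically. The following sketch uses a one-dimensional Gaussian with mean <code>mu</code> as a stand-in for the trajectory distribution and a toy reward of <code>10 + x</code>; these choices, and the constant baseline <code>b = 10</code>, are assumptions made purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, b = 0.5, 1.0, 10.0

x = rng.normal(mu, sigma, size=200_000)
score = (x - mu) / sigma ** 2           # d/dmu of ln N(x; mu, sigma^2)

# The baseline term itself averages out to (approximately) zero.
baseline_term = np.mean(score * b)

# Both estimators target d/dmu E[10 + x] = 1, but their variances differ.
reward = 10.0 + x
plain = score * reward
with_baseline = score * (reward - b)

grad_plain, grad_baseline = plain.mean(), with_baseline.mean()
var_plain, var_baseline = plain.var(), with_baseline.var()
```

<p>Both <code>grad_plain</code> and <code>grad_baseline</code> should be close to 1, while <code>var_baseline</code> should be much smaller than <code>var_plain</code>, which is the practical benefit of the baseline when rewards are all positive.</p>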
<h3 id="vanilla-policy-gradients">Vanilla Policy Gradients:</h3>
<p>By putting the above ideas together, we end up with a variant of the Vanilla Policy Gradients algorithm:</p>
<ol>
<li>Initialise the policy parameter <script type="math/tex">\vartheta</script> and baseline <script type="math/tex">b</script></li>
<li>For <script type="math/tex">\text{iter}=1,..,\text{maxiter}</script> do: <br />
a. Collect a set of trajectories <script type="math/tex">\{\tau^i\}_{i=1}^N</script> by executing the current policy <script type="math/tex">\pi_{\vartheta}</script> <br />
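b. At each time step in the trajectory <script type="math/tex">\tau^i</script> compute the discounted return <script type="math/tex">R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t}r_{t'}</script>, where <script type="math/tex">\gamma \in (0,1]</script> is a discount factor and <script type="math/tex">T</script> is the length of the trajectory, and the advantage estimate <script type="math/tex">\widehat{A_t} = R_t-b(s_t)</script>. <br />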
c. Re-fit the baseline, by minimising <script type="math/tex">\lVert b(s_t)-R_t \rVert^2</script> summed over all trajectories and time steps. <br />
d. Update the policy <script type="math/tex">\pi_{\vartheta}</script> using a policy gradient estimate <script type="math/tex">\widehat{g}</script> which is a sum of terms <script type="math/tex">\nabla_{\vartheta} \ln \pi_{\vartheta}(u_t \lvert s_t) \widehat{A_t}</script></li>
<li>end for</li>
</ol>
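<p>Step 2b of the algorithm above can be sketched in a few lines of NumPy. The function names are hypothetical, and the baseline values are assumed to be supplied by whatever value estimator is being used:</p>

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum_{t'=t}^{T-1} gamma^(t'-t) * r_{t'}, computed backwards in O(T)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, baseline_values, gamma):
    """A_t = R_t - b(s_t) for a single trajectory."""
    return discounted_returns(rewards, gamma) - np.asarray(baseline_values, dtype=float)
```

<p>For example, with rewards <code>[1, 1, 1]</code> and <code>gamma = 0.5</code> the returns are <code>[1.75, 1.5, 1.0]</code>. Accumulating from the end of the trajectory avoids recomputing the inner sum at every time step.</p>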
<p>This is the algorithm I used to train the unicycle controller. But, before describing my implementation I’d like to address Ben Recht’s claim that continuous control researchers have no good reason to use stochastic policies.</p>
<h2 id="why-do-we-use-stochastic-policies">Why do we use stochastic policies?:</h2>
<p>In <a href="http://www.argmin.net/2018/02/20/reinforce/"><em>The Policy of Truth</em></a>, Ben Recht is rather dismissive of Policy Gradient methods and argues that stochastic policies are a modelling choice that is never better than deterministic policies and optimal control methods in general. In particular, Ben argues that if the correct policy is deterministic then the probabilistic model class must satisfy several show-stopping constraints:</p>
<ol>
<li>rich enough to approximate delta functions</li>
<li>easy to search by gradient methods</li>
<li>easy to sample from</li>
</ol>
<p>Why then do reinforcement learning researchers use Policy Gradient methods? There are several good reasons:</p>
<ol>
<li>In high-dimensional state spaces (e.g. images) where we might use convolutional neural networks to learn a low-dimensional state representation <script type="math/tex">\widehat{s_t} \approx s_t</script>, state aliasing is practically inevitable and so it makes sense to use stochastic policies to reflect the model’s uncertainty.</li>
<li>Sampling from a stochastic policy is a convenient method of exploring the state-space.</li>
<li>Stochastic policies effectively smooth out rough/discontinuous reward landscapes and allow the agent to obtain reward signals it couldn’t obtain otherwise.</li>
</ol>
<p>Moreover, Ben’s three points are actually non-issues:</p>
<ol>
<li>Neural networks, by universal approximation results such as Cybenko’s theorem, can approximate the parameters of a very broad class of conditional distributions and are easy to search using gradient methods. This is why Variational Inference has found many practical applications and <a href="http://edwardlib.org/tutorials/variational-inference">Edward</a> has gained a lot of traction among statistical machine learning researchers
and engineers.</li>
<li>Neural networks are easy to sample from. In fact, inference with neural networks is computationally cheap.</li>
<li>A corollary of my first point is that we may easily approximate delta functions. If your policy is a conditional Gaussian, which is what I used for controlling the unicycle, then as the variance shrinks your distribution becomes a good approximation of a delta function centred on the mean.</li>
</ol>
<p>To be honest, I am surprised that Ben Recht didn’t bring up any real issues such as curriculum learning<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and transfer learning, where a lot of active research is currently ongoing [5]. Moreover, it’s important to note that Policy Gradient methods were developed for problems where optimal control theory, Ben Recht’s proposed alternative, doesn’t work at all.</p>
<h2 id="controlling-a-unicycle-with-policy-gradients">Controlling a unicycle with Policy Gradients:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/unicycle_image.png" align="middle" /></center>
<h3 id="modelling-assumptions">Modelling assumptions:</h3>
<p>As shown in [4], if we suppose that our unicyclist is riding on a level surface without turning, motion in the wheel plane may be modelled by a planar inverted pendulum of length <script type="math/tex">l</script> with a horizontally moving support. After cancelling out the mass term, an analysis of the force diagrams shows that we have the following Newtonian equation
of motion:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
l \ddot{\theta} & = g \sin(\theta)-\ddot{z}\cos(\theta) \\
\ddot{z} & = gh(\theta,\dot{\theta})
\end{split}
\end{equation} %]]></script>
<p>where <script type="math/tex">h</script> may be identified with a unicycle controller <script type="math/tex">\pi_{\vartheta}</script> and represents the rider’s acceleration of the wheel, in units of <script type="math/tex">g</script>, as a reaction to the instantaneous angle and angular velocity. By defining <script type="math/tex">\alpha = \frac{g}{l}</script> we have the simpler
equation:</p>
<script type="math/tex; mode=display">\begin{equation}
\ddot{\theta} = \alpha(\sin(\theta)-h\cos(\theta))
\end{equation}</script>
<p>Now, assuming that the rider starts approximately upright (i.e. <script type="math/tex">\theta \approx 0</script>) we may linearise <script type="math/tex">(11)</script> so we have:</p>
<script type="math/tex; mode=display">\begin{equation}
\ddot{\theta} = \alpha(\theta-h)
\end{equation}</script>
<p>and there exist relatively simple stable solutions for this linearised equation such as:</p>
<script type="math/tex; mode=display">\begin{equation}
h(\theta) = a\theta, a > 1
\end{equation}</script>
<p>but a more sophisticated unicyclist would anticipate the future consequences of the rate <script type="math/tex">\dot{\theta}</script> as well as react to the angle of the fall <script type="math/tex">\theta</script>. This way, they may be able to overcome the effects of a finite reaction time. For this reason, I defined the state of the unicycle system to be:</p>
<script type="math/tex; mode=display">\begin{equation}
s_t = \begin{bmatrix}
\theta_t \\
\dot{\theta}_t \\
\end{bmatrix}
\end{equation}</script>
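<p>As a short worked check of the stability claim for the proportional controller, substituting <script type="math/tex">h(\theta) = a\theta</script> into the linearised equation <script type="math/tex">(12)</script> gives a harmonic oscillator:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\ddot{\theta} & = \alpha(\theta - a\theta) = -\alpha(a-1)\theta \\
\theta(t) & = \theta_0 \cos(\omega t) + \frac{\dot{\theta}_0}{\omega} \sin(\omega t), \quad \omega = \sqrt{\alpha(a-1)}
\end{aligned} %]]></script>
<p>So for <script type="math/tex">a > 1</script> the angle oscillates with bounded amplitude (it does not decay, so the solution is only marginally stable), whereas for <script type="math/tex">a < 1</script> the coefficient of <script type="math/tex">\theta</script> is positive and solutions grow exponentially. A controller that also reacts to <script type="math/tex">\dot{\theta}</script>, for instance <script type="math/tex">h = a\theta + c\dot{\theta}</script> with <script type="math/tex">c > 0</script>, adds a damping term <script type="math/tex">-\alpha c \dot{\theta}</script>, which is another reason to include <script type="math/tex">\dot{\theta}</script> in the state.</p>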
<h3 id="the-simulator">The simulator:</h3>
<p>In order to generate rollouts with a policy <script type="math/tex">\pi_{\vartheta}</script> we must choose a method for numerically integrating <script type="math/tex">(11)</script>, and after some reflection I opted for
Velocity Verlet due to its simplicity and numerical stability properties. Assuming that we have evaluated <script type="math/tex">\ddot{\theta}</script> and fixed a reasonable value for <script type="math/tex">\Delta t</script> we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\theta_{i+1} & = \theta_i + \dot{\theta}_i\Delta t + \frac{1}{2}\ddot{\theta}_i \Delta t^2\\
\dot{\theta}_{i+1} & = \dot{\theta}_i + \frac{1}{2}(\ddot{\theta}_i + \ddot{\theta}_{i+1}) \Delta t
\end{split}
\end{equation} %]]></script>
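<p>Before turning to TensorFlow, here is a framework-free NumPy sketch of the same update rule applied to the nonlinear pendulum equation. It assumes a controller that depends on the angle only, so that the acceleration at step <script type="math/tex">i+1</script> can be evaluated from <script type="math/tex">\theta_{i+1}</script> alone; the function names and parameter values are illustrative and not taken from the repository.</p>

```python
import numpy as np

def angular_acceleration(theta, alpha, controller):
    """Right-hand side alpha * (sin(theta) - h(theta) * cos(theta))."""
    h = controller(theta)
    return alpha * (np.sin(theta) - h * np.cos(theta))

def velocity_verlet_step(theta, d_theta, dt, alpha, controller):
    """One Velocity Verlet step for the angle and angular velocity."""
    acc = angular_acceleration(theta, alpha, controller)
    theta_next = theta + d_theta * dt + 0.5 * acc * dt ** 2
    acc_next = angular_acceleration(theta_next, alpha, controller)
    d_theta_next = d_theta + 0.5 * (acc + acc_next) * dt
    return theta_next, d_theta_next

def simulate(theta0, d_theta0, dt, steps, alpha, controller):
    """Integrate for a number of steps and return the angle history."""
    thetas = [theta0]
    theta, d_theta = theta0, d_theta0
    for _ in range(steps):
        theta, d_theta = velocity_verlet_step(theta, d_theta, dt, alpha, controller)
        thetas.append(theta)
    return np.array(thetas)

# Proportional controller h = 2 * theta versus no control at all,
# both starting slightly off vertical (alpha = g / l with l = 1 m assumed).
controlled = simulate(0.05, 0.0, 0.01, 1000, alpha=9.81, controller=lambda th: 2.0 * th)
uncontrolled = simulate(0.05, 0.0, 0.01, 1000, alpha=9.81, controller=lambda th: 0.0)
```

<p>With the proportional controller the angle stays within a small band around zero, while the uncontrolled pendulum falls over within a couple of seconds of simulated time.</p>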
<p>Doing this for both <script type="math/tex">\theta</script> and <script type="math/tex">z</script> in TensorFlow, I defined the following Velocity Verlet function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">velocity_verlet</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""
    Update the unicycle system using Velocity Verlet integration.
    """</span>
    <span class="c">## update rules for theta:</span>
    <span class="n">dd_theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span><span class="o">*</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="p">)</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">action</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="p">))</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">theta</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
    <span class="n">d_theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">dd_theta</span><span class="o">+</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span>
    <span class="c">## update rules for z:</span>
    <span class="n">dd_z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">g</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">action</span>
    <span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">z</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
    <span class="n">d_z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_z</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">dd_z</span><span class="o">+</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span>
    <span class="n">step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">dd_theta</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">theta</span><span class="p">),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">d_theta</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">dd_z</span><span class="p">),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">z</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">d_z</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">step</span>
</code></pre></div></div>
<p>Now, in order to encourage the policy network to discover controllers in the linear domain of the unicycle system, i.e. <script type="math/tex">(12)</script>,
I initialised the variables at the beginning of each rollout in the following manner:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">restart</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""
    A simple method for restarting the system where the angle is taken to be a slight
    deviation from the ideal value of theta = 0.0 and the system has an important
    initial horizontal acceleration.
    """</span>
    <span class="n">step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">stddev</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">mean</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span><span class="n">stddev</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)))</span>
    <span class="k">return</span> <span class="n">step</span>
</code></pre></div></div>
<p>The key part in this snippet of code is that <script type="math/tex">\theta_0 \sim \mathcal{N}(0,0.1^2)</script>, i.e. the initial angle is drawn with a standard deviation of 0.1 radians.</p>
<h3 id="the-policy-gradients-model">The Policy Gradients model:</h3>
<p>For the unicycle controller (i.e. the policy <script type="math/tex">\pi_{\vartheta}</script>) I defined a conditional Gaussian as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">two_layer_net</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">w_h</span><span class="p">,</span> <span class="n">w_h2</span><span class="p">,</span> <span class="n">w_o</span><span class="p">,</span><span class="n">bias_1</span><span class="p">,</span> <span class="n">bias_2</span><span class="p">):</span>
    <span class="s">"""
    A generic method for creating two-layer networks
    input: weights
    output: neural network
    """</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">elu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">w_h</span><span class="p">),</span><span class="n">bias_1</span><span class="p">))</span>
    <span class="n">h2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">elu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w_h2</span><span class="p">),</span><span class="n">bias_2</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h2</span><span class="p">,</span> <span class="n">w_o</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">controller</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""
    The policy gradient model is a neural network that
    parametrises a conditional Gaussian.
    input: state(i.e. angular momenta)
    output: action to be taken i.e. appropriate horizontal acceleration
    """</span>
    <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">"policy_net"</span><span class="p">):</span>
        <span class="n">tf</span><span class="o">.</span><span class="n">set_random_seed</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">seed</span><span class="p">)</span>
        <span class="n">W_h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span><span class="s">"W_h"</span><span class="p">)</span>
        <span class="n">W_h2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span><span class="mi">50</span><span class="p">],</span><span class="s">"W_h2"</span><span class="p">)</span>
        <span class="n">W_o</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">50</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span><span class="s">"W_o"</span><span class="p">)</span>
        <span class="c"># define bias terms:</span>
        <span class="n">bias_1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">100</span><span class="p">],</span><span class="s">"bias_1"</span><span class="p">)</span>
        <span class="n">bias_2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">50</span><span class="p">],</span><span class="s">"bias_2"</span><span class="p">)</span>
        <span class="n">eta_net</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">two_layer_net</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pv</span><span class="p">,</span><span class="n">W_h</span><span class="p">,</span> <span class="n">W_h2</span><span class="p">,</span> <span class="n">W_o</span><span class="p">,</span><span class="n">bias_1</span><span class="p">,</span><span class="n">bias_2</span><span class="p">)</span>
        <span class="n">W_mu</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span><span class="s">"W_mu"</span><span class="p">)</span>
        <span class="n">W_sigma</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span><span class="s">"W_sigma"</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">mu</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">eta_net</span><span class="p">,</span><span class="n">W_mu</span><span class="p">)),</span><span class="bp">self</span><span class="o">.</span><span class="n">action_bound</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">eta_net</span><span class="p">,</span><span class="n">W_sigma</span><span class="p">)),</span><span class="bp">self</span><span class="o">.</span><span class="n">variance_bound</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span>
</code></pre></div></div>
<p>and sampling from this Gaussian is as simple as doing:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sample_action</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""
    Samples an action from the stochastic controller which happens
    to be a conditional Gaussian.
    """</span>
    <span class="n">dist</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">distributions</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span><span class="n">tf</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">dist</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
</code></pre></div></div>
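<p>For readers without a TensorFlow 1.x installation, here is a minimal, framework-free sketch of the same idea: a tanh-bounded mean and a tanh-bounded log standard deviation, followed by a Gaussian draw. The names <code>policy_outputs</code> and <code>log_sigma_bound</code>, the weights and the feature vector are all hypothetical stand-ins and are not taken from the repository.</p>

```python
import math
import random


def policy_outputs(features, w_mu, w_sigma, action_bound, log_sigma_bound):
    """Map a feature vector to a bounded mean and a bounded log standard deviation."""
    pre_mu = sum(f * w for f, w in zip(features, w_mu))
    pre_sigma = sum(f * w for f, w in zip(features, w_sigma))
    mu = action_bound * math.tanh(pre_mu)                 # mean lies in [-action_bound, action_bound]
    log_sigma = log_sigma_bound * math.tanh(pre_sigma)    # log std lies in [-bound, bound]
    return mu, log_sigma


def sample_action(mu, log_sigma, rng):
    """Draw one action from a Gaussian with mean mu and std exp(log_sigma)."""
    return rng.gauss(mu, math.exp(log_sigma))


rng = random.Random(0)
features = [0.1 * i for i in range(10)]   # hypothetical stand-in for the 10 hidden units
w_mu = [0.5] * 10                         # hypothetical weights
w_sigma = [-0.2] * 10                     # hypothetical weights
mu, log_sigma = policy_outputs(features, w_mu, w_sigma,
                               action_bound=2.0, log_sigma_bound=1.0)
action = sample_action(mu, log_sigma, rng)
print(mu, log_sigma, action)
```

<p>Parameterising the log of the standard deviation keeps the standard deviation positive without any explicit constraint, which is presumably why the network outputs <code>log_sigma</code> rather than <code>sigma</code>.</p>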
<p>Likewise, you may consult my <a href="https://github.com/pauli-space/RL_unicycle_control/blob/master/vanilla/vanilla_pg.py">Policy Gradients class</a> to see
how I calculated the baseline.</p>
<h3 id="defining-rewards">Defining rewards:</h3>
<p>Defining the reward might be the trickiest part of this experiment as it’s not obvious how the unicycle controller should be rewarded and, from a dynamical
systems perspective, a reward doesn’t really make sense: either the unicycle has a stable controller or it doesn’t. But, in order to adhere to the Policy Gradients
formalism, I opted for the instantaneous height of the unicycle as the reward. This is as simple as:</p>
<script type="math/tex; mode=display">\begin{equation}
\text{height} = l\sin(\theta)
\end{equation}</script>
<p>and as a result the REINFORCE loss (i.e. with no baseline subtracted) is simply:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reinforce_loss</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""
    The REINFORCE loss without subtracting a baseline.
    """</span>
    <span class="n">dist</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">distributions</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">dist</span><span class="o">.</span><span class="n">log_prob</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">height</span>
</code></pre></div></div>
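<p>As a cross-check on the formula, here is a small framework-free sketch of the score-function (REINFORCE) surrogate with the height reward. In the textbook estimator the Gaussian log-density is evaluated at the action that was actually sampled, so this sketch takes the sampled actions as an explicit argument; it also negates the mean so that a minimiser performs gradient ascent on the reward. Both choices, and all function names, are assumptions of this sketch rather than a description of the repository code.</p>

```python
import math


def gaussian_log_prob(action, mu, log_sigma):
    """Log-density of a Gaussian with mean mu and std exp(log_sigma), evaluated at action."""
    sigma = math.exp(log_sigma)
    return -0.5 * ((action - mu) / sigma) ** 2 - log_sigma - 0.5 * math.log(2.0 * math.pi)


def height(theta, length=1.0):
    """Instantaneous reward: height = l * sin(theta)."""
    return length * math.sin(theta)


def reinforce_loss(actions, mus, log_sigmas, rewards):
    """Negative mean of log pi(a_t | s_t) * r_t (no baseline subtracted)."""
    terms = [gaussian_log_prob(a, m, s) * r
             for a, m, s, r in zip(actions, mus, log_sigmas, rewards)]
    return -sum(terms) / len(terms)


# A tiny hypothetical rollout: three time steps with made-up angles and actions.
thetas = [0.2, 0.6, 1.0]
rewards = [height(t) for t in thetas]
loss = reinforce_loss(actions=[0.1, -0.3, 0.4],
                      mus=[0.0, 0.0, 0.2],
                      log_sigmas=[0.0, 0.0, -0.5],
                      rewards=rewards)
print(loss)
```

<p>Whether the sign flip is needed depends on whether the optimiser is asked to minimise or maximise the returned quantity.</p>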
<h3 id="results">Results:</h3>
<p>After choosing reasonable hyperparameters (<script type="math/tex">\Delta t = 0.01</script>, rollouts per batch = 10, horizon = 30, total epochs = 100,…) I ran a few tests in the <a href="https://github.com/pauli-space/RL_unicycle_control/blob/master/vanilla/does_it_work.ipynb">following
notebook</a> and found that the learned controller was
very good at bringing the unicycle to maximum height but terrible at keeping the unicycle at that height once it got there. In a word, it was very unstable.</p>
<p>Actually, if we evaluate the model <script type="math/tex">\pi_{\vartheta}</script> with <script type="math/tex">\dot{\theta}=0</script> we find that the unicycle exhibits highly nonlinear behaviour in the neighbourhood
of <script type="math/tex">\theta = -\frac{\pi}{2}</script>:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/phase_transition.png" align="middle" /></center>
<p>So the force is negative for angles greater than <script type="math/tex">-\frac{\pi}{2}</script> and positive for angles less than <script type="math/tex">-\frac{\pi}{2}</script>. Given that this is many standard deviations away
from <script type="math/tex">\theta = 0</script>, this is most likely due to my choice of tanh activation for the final layer of the policy network.</p>
<p>Furthermore, if we analyse the learned controller’s behaviour as a function of the full state (i.e. <script type="math/tex">\theta</script> and <script type="math/tex">\dot{\theta}</script>) we observe the following:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/nonlinear_model.png" align="middle" /></center>
<p>The combination of the short-sighted policy and the tanh nonlinearity makes me wonder whether there are small tweaks to my TensorFlow functions which may lead to much
better results.</p>
<h2 id="discussion">Discussion:</h2>
<p>While I think the results are interesting and may be improved from a technical perspective, I don’t think any particular reward function makes sense for learning re-usable locomotion behaviours. Ideally, the agent would be able to propose and learn curricula of locomotion behaviours in an unsupervised manner in order to learn models of its affordances and its intrinsic locomotory options. One promising approach to this was articulated by Sébastien Forestier, Yoan Mollard and Pierre-Yves Oudeyer in [6] but I’ll have to think about how Intrinsically Motivated Goal Exploration can be assimilated within the Policy Gradients framework.</p>
<h1 id="references">References:</h1>
<ol>
<li>Passive Dynamic Walking. T. McGeer. 1990.</li>
<li>Emergence of Locomotion Behaviours in Rich Environments. Nicolas Heess et al. 2017.</li>
<li>Policy Gradient Methods for Reinforcement Learning with Function Approximation. Richard S. Sutton, David McAllester, Satinder Singh & Yishay Mansour. 1999.</li>
<li>Unicycles and Bifurcations. R. C. Johnson. 2002.</li>
<li>Modular Multitask Reinforcement Learning with Policy Sketches. Jacob Andreas, Dan Klein, and Sergey Levine. 2017.</li>
<li>Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. 2017.</li>
</ol>
<h2 id="footnotes">Footnotes:</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>In my experience it always helps to work on tractable and conceptually interesting variants of complex problems (e.g. bipedal locomotion) as these variants often provide deep insights into the complex problem of interest. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I’d be interested to see how optimal control theory may be used for curriculum learning. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Aidan Rocke</p>
<p>Understanding the free energy principle<br />
2018-04-12T00:00:00+00:00<br />
/active/inference/2018/04/12/free_energy</p>
<h2 id="motivation">Motivation:</h2>
<p>My general interest in single-motivation theories stems from the belief that a common ancestor for all multi-cellular organisms might imply
common principles of intelligent behaviour. It’s a somewhat reductive hypothesis and as I argued last week, <a href="http://paulispace.com/statistics/2018/04/07/causal_path_entropy.html">some of these theories might be
too reductive</a>, but I think it’s a useful working hypothesis that can take
behavioural scientists very far<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. However, until recently I wasn’t properly acquainted with the free energy principle which, from a distance,
appears to be one of the more plausible single-motivation theories.</p>
<p>The free energy principle is a theory developed by Karl Friston and others to explain how biological systems tend to avoid disorder by limiting themselves
to a small number of favorable states. It comes across as a rather abstract mathematical theory but thanks to a <a href="http://romainbrette.fr/what-is-computational-neuroscience-xxix-the-free-energy-principle/">critical thought experiment</a> proposed by <a href="https://twitter.com/RomainBrette">Romain Brette</a> I found an opportunity to take
a closer look at this theory. In fact, I promised Brette that I would run a computer simulation demonstrating that his thought experiment rests upon flawed assumptions (<a href="https://github.com/pauli-space/Free_Energy_experiments">code here</a>).</p>
<p>In this context, the goal of this blog post is to explain the main idea of the free energy principle and dissect Romain Brette’s thought experiment
in order to develop a practical understanding of this theory.</p>
<h2 id="the-free-energy-principle">The Free Energy Principle:</h2>
<p>In [1], Karl Friston proposes that the Free Energy principle may be a rough guide to the brain and makes the following points:</p>
<ol>
<li>The free energy principle basically applies to any biological system that resists a tendency to disorder.</li>
<li>The free energy principle rests upon the fact that self-organising biological systems resist a tendency to disorder and therefore minimise entropy
of their sensory states.</li>
<li>Assuming that <script type="math/tex">m</script> corresponds to a generative model describing the biological system and <script type="math/tex">y</script> refers to the system’s sensory states, under ergodic assumptions, the entropy is:</li>
</ol>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation*}
\begin{split}
H(y) & = -\int P(y|m) \ln P(y|m) \,dy \\
& = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} - \ln P(y(t)|m) \,dt
\end{split}
\end{equation*} %]]></script>
<p>Now, given that entropy is the long-term average of surprise (think of a Monte Carlo simulation), agents must avoid surprising states, where surprise is defined
relative to the homeostatic conditions of that particular organism<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>The three points above are sufficient to understand Romain Brette’s thought experiment though I must emphasise that surprisal here is defined in terms of the
agent’s homeostatic conditions, so minimisation of surprisal corresponds to both minimisation of epistemic uncertainty (i.e. unknown unknowns) as well as
statistical uncertainty (i.e. known unknowns).</p>
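<p>To make the “long-term average of surprise” reading concrete, here is a small Monte Carlo sketch. It assumes, purely for illustration, that the sensory states are drawn from a unit Gaussian, and compares the running average of <script type="math/tex">-\ln P(y|m)</script> with the closed-form differential entropy <script type="math/tex">\frac{1}{2}\ln(2 \pi e)</script>. The function names are hypothetical.</p>

```python
import math
import random


def surprisal(y, mu=0.0, sigma=1.0):
    """-ln p(y) for a Gaussian density with mean mu and std sigma."""
    return 0.5 * ((y - mu) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)


def average_surprisal(n_samples, rng, mu=0.0, sigma=1.0):
    """Sample average of the surprisal, i.e. a Monte Carlo estimate of the entropy."""
    total = 0.0
    for _ in range(n_samples):
        total += surprisal(rng.gauss(mu, sigma), mu, sigma)
    return total / n_samples


rng = random.Random(0)
estimate = average_surprisal(20000, rng)
exact = 0.5 * math.log(2.0 * math.pi * math.e)   # entropy of a unit Gaussian, in nats
print(estimate, exact)
```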
<h2 id="romain-brettes-thought-experiment">Romain Brette’s thought experiment:</h2>
<p>In Brette’s article, he summarises the free energy principle in the following manner:</p>
<blockquote>
<p>The free energy principle is the theory that the brain manipulates a probabilistic generative model of its sensory inputs,
which it tries to optimise by either changing the model(learning) or changing the inputs(by acting).</p>
</blockquote>
<p>Although I haven’t mentioned anything about the human brain so far, this is a relatively good summary, and Brette proceeds with
the following food vs. no food thought experiment:</p>
<ol>
<li>An agent has two kinds of observations/stimuli: food and the absence of food.</li>
<li>This agent has two possible actions: seek food or don’t seek food.</li>
<li>When the agent seeks food there’s a 20% probability of getting food.</li>
<li>When the agent doesn’t seek food there’s a 100% probability of getting no food.</li>
</ol>
<p>What should a surprise-minimising agent do? Romain presents the following argument:</p>
<blockquote>
<p>What does the free energy principle tell us? To minimize surprise, it seems clear that I should sit: I am certain to not see food. No surprise at all. The proposed solution is that you have a prior expectation to see food. So to minimize the surprise, you should put yourself into a situation where you might see food, ie to seek food. This seems to work. However, if there is any learning at all, then you will quickly observe that the probability of seeing food is actually 20%, and your expectations should be adjusted accordingly. Also, I will also observe that between two food expeditions, the probability to see food is 0%. Once this has been observed, surprise is minimal when I do not seek food. So, I die of hunger. It follows that the free energy principle does not survive Darwinian competition.</p>
</blockquote>
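<p>Under a purely statistical reading of surprise, the argument can be put in numbers. The sketch below computes the expected surprisal (the entropy, in nats) of the food/no-food outcome for each action, assuming the agent has already learned the true outcome probabilities; <code>outcome_entropy</code> is a hypothetical helper introduced only for this illustration.</p>

```python
import math


def outcome_entropy(p_food):
    """Expected surprisal (in nats) of the binary food / no-food outcome."""
    total = 0.0
    for p in (p_food, 1.0 - p_food):
        if p > 0.0:                      # the term 0 * ln(0) is taken to be 0
            total -= p * math.log(p)
    return total


seek = outcome_entropy(0.2)   # seeking food: food with probability 0.2
stay = outcome_entropy(0.0)   # not seeking food: no food with certainty
print(seek, stay)
```

<p>On this reading, not seeking food has zero expected surprisal while seeking food costs roughly half a nat, which is the core of Brette’s objection.</p>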
<p>Basically, Romain argues that surprise is minimal when the organism doesn’t seek food assuming that Friston’s definition of surprisal corresponds to minimisation of
statistical uncertainty. Given that Friston’s surprisal is defined in terms of the agent’s homeostatic conditions, this assumption is precisely where Romain’s analysis
breaks down. It also helps to simulate such toy problems on a computer, if possible, because in a simulation you have to make every modelling assumption clear.</p>
<h2 id="a-reasonable-model-of-brettes-problem">A reasonable model of Brette’s problem:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/Free_Energy_experiments/master/diagram.png" align="middle" /></center>
<p>To simulate Romain’s problem, I made the following assumptions:</p>
<ol>
<li>We have an organism which has to eat <script type="math/tex">k</script> times on average every 24 hours and can eat at most once per hour.</li>
<li>The homeostatic conditions of our organism are given by a Gaussian distribution centered at <script type="math/tex">k</script> with unit variance, a Gaussian food critic if you will. This specifies that our organism shouldn’t eat much less than <script type="math/tex">k</script> times a day and shouldn’t eat a lot more than <script type="math/tex">k</script> times a day. In fact, this explains why living organisms tend to
have masses that are normally distributed during adulthood.</li>
<li>A food policy consists of a 24-dimensional vector where the values range from 0.0 to 1.0 and we want to minimise the negative log probability (i.e. the surprisal) of the total consumption under the Gaussian food critic.</li>
<li>Food policies are the output of a generative neural network (set up using TensorFlow) whose inputs are either one or zero to indicate a survival prior, with one indicating a preference for survival.</li>
<li>The optimiser, in this case Adagrad [5] applied to gradients computed by backpropagation, functions as a homeostatic regulator by updating the network weights in proportion to the gradient of the negative log probability loss (i.e. surprisal).</li>
</ol>
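<p>The assumptions above can be sketched without a neural network. The snippet below is a simplification, not the notebook’s code: it assumes that entry <script type="math/tex">h</script> of the policy is the probability of a meal in hour <script type="math/tex">h</script> and that hours are independent, so the expected surprisal under the unit-variance Gaussian food critic has a closed form. All names are hypothetical.</p>

```python
import math


def expected_surprisal(policy, k):
    """Expected -ln N(total; k, 1) when hour h yields a meal with probability policy[h]."""
    mean_total = sum(policy)
    var_total = sum(p * (1.0 - p) for p in policy)        # independent Bernoulli meals
    expected_sq_error = var_total + (mean_total - k) ** 2
    return 0.5 * expected_sq_error + 0.5 * math.log(2.0 * math.pi)


k = 3
never_eat = [0.0] * 24
spread_out = [k / 24.0] * 24
three_fixed_meals = [1.0 if h in (8, 13, 19) else 0.0 for h in range(24)]

for name, policy in [("never eat", never_eat),
                     ("spread out", spread_out),
                     ("three fixed meals", three_fixed_meals)]:
    print(name, expected_surprisal(policy, k))
```

<p>Under these assumptions the policy that never eats is the most surprising of the three, because surprisal is measured against the homeostatic target <script type="math/tex">k</script> rather than against the predictability of the outcome.</p>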
<p>Assuming <script type="math/tex">k=3</script>, I ran a simulation in the <a href="https://github.com/pauli-space/Free_Energy_experiments/blob/master/simulation.ipynb">following notebook</a> and found that the discovered food policy differs significantly from Romain’s expectation that the agent would choose to not look for food in order to minimise surprisal. In fact, our simple agent manages to get three meals per day on average so it survives.</p>
<p>Overall, this is a relatively simple problem with a fixed prior (i.e. a fixed belief) as the organism doesn’t have to do more than eat, so I can minimise surprise directly. In general, if we have adjustable beliefs (e.g. models of physics and their physical parameters/constants) then we have a much harder problem and that’s where I would need to use the KL-divergence and invoke free energy minimisation, rather than directly minimising surprisal. However, these models and their parameters would still be evaluated with respect to homeostatic constraints. This guarantees that the organism isn’t simply trying to minimise statistical uncertainty.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Until recently, the Free Energy Principle has been a constant source of mockery from neuroscientists who misunderstood it and so I hope that by growing a collection
of <a href="https://github.com/pauli-space/Free_Energy_experiments">free-energy motivated reinforcement learning examples on GitHub</a> we may finally have a constructive discussion
between scientists. Moreover, I have been asked whether it’s not immodest for Karl Friston to suggest that his theory might be a model for human behaviour. Well, my answer
to that question is the same answer I would give to the critics of Empowerment [6].</p>
<p>Let’s see how far ingenious implementations (i.e. experiments) using these formalisms can take us. That’s the only way we’ll know what the limitations of these
theories are.</p>
<h1 id="references">References:</h1>
<ol>
<li>The free-energy principle: a rough guide to the brain? (K. Friston. 2009.)</li>
<li>The Markov blankets of life: autonomy, active inference and the free energy principle (M. Kirchhoff, T. Parr, E. Palacios, K. Friston and J. Kiverstein. 2018.)</li>
<li>Free-Energy Minimization and the Dark-Room Problem (K. Friston, C. Thornton and A. Clark. 2012.)</li>
<li>What is computational neuroscience? (XXIX) The free energy principle (R. Brette. 2018.)</li>
<li>Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (J. Duchi et al.)</li>
<li>Empowerment — An Introduction. C. Salge et al. 2013.</li>
<li>Reward, Motivation, and Reinforcement Learning (P. Dayan and B. Balleine. 2002.)</li>
</ol>
<h1 id="footnotes">Footnotes:</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The notion of utility maximisation in economics, though limited, has been very useful for example. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In [2], homeostatic conditions of an organism are defined in terms of Markov Blankets which are equivalent to the boundaries of a system in a statistical sense. I would encourage the reader to go into that paper after going through this blog post but this concept isn’t essential for understanding Romain’s thought experiment, so we’ll ignore this formalism for now. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Aidan Rocke</p>
<p>Approximating Causal Path Entropy in Euclidean spaces<br />
2018-04-07T00:00:00+00:00<br />
/statistics/2018/04/07/causal_path_entropy</p>
<h2 id="motivation">Motivation:</h2>
<p>Last week I spoke to Alex Gomez-Marin, a behavioural neuroscientist with a passing interest in the theory of Causal Entropic Forces,
about determining the Causal Entropic Force on a dimensionless particle contained in a 2-D heat reservoir. I promised to try and work
out an approximation of the Causal Entropic Force on the particle. Meanwhile, it has been almost a year since I last wrote an
<a href="http://paulispace.com/intelligence/2017/07/06/maxent.html">article on the matter</a> and since then I have developed a better understanding
of this theory which Dr. Wissner-Gross calls an ‘<script type="math/tex">E=mc^2</script> for intelligence’.</p>
<p>You might guess, from my slightly reticent tone, that I’m no longer the biggest fan of this theory. While I won’t lambast the theory
as Gary Marcus has done in the following <a href="https://www.newyorker.com/tech/elements/a-grand-unified-theory-of-everything">New Yorker article</a>,
I now think that on balance his criticism was spot on. To understand why, I shall present a constructive dissection of the theory by going
through its principles and simulating the toy problem of a particle in a heat reservoir (<a href="https://github.com/pauli-space/Causal_Path_Entropy">code here</a>).</p>
<h2 id="causal-entropic-forces">Causal entropic forces:</h2>
<p>In the following summary of Wissner-Gross’s meta-heuristic, it’s assumed that the agent has access to an approximate or exact simulator. A close reading of
the original paper [1] will show that this assumption is actually necessary.</p>
<h3 id="macrostates">Macrostates:</h3>
<p>For any open thermodynamic system, we treat the phase-space paths taken by the system <script type="math/tex">x(t)</script> over the time interval <script type="math/tex">[0,\tau]</script> as microstates
and partition them into macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script> using the equivalence relation[1]:</p>
<script type="math/tex; mode=display">\begin{equation}
x(t) \sim x'(t) \iff x(0) = x'(0)
\end{equation}</script>
<p>As a result, we can identify each macrostate <script type="math/tex">X_i</script> with a unique present system state <script type="math/tex">x(0)</script>. This defines a notion of causality over a time interval.</p>
<h3 id="causal-path-entropy">Causal path entropy:</h3>
<p>We can define the causal path entropy <script type="math/tex">S_c</script> of a macrostate <script type="math/tex">X_i</script> with the associated present system state <script type="math/tex">x(0)</script> as the path integral:</p>
<script type="math/tex; mode=display">\begin{equation}
S_c (X_i, \tau) = -k_B \int_{x(t)} P(x(t)|x(0)) \ln P(x(t)|x(0)) \,D x(t)
\end{equation}</script>
<p>where we have:</p>
<script type="math/tex; mode=display">\begin{equation}
P(x(t)| x(0)) = \int_{x^*(t)} P(x(t),x^*(t) |x(0)) \,D x^*(t)
\end{equation}</script>
<p>In (3) we basically integrate over all possible paths <script type="math/tex">x^*(t)</script> taken by the open/closed system’s environment. In practice, this integral is intractable
and we must resort to approximations which we shall discuss shortly.</p>
<h3 id="causal-entropic-force">Causal entropic force:</h3>
<p>A path-based causal entropic force <script type="math/tex">F</script> may be expressed as:</p>
<script type="math/tex; mode=display">\begin{equation}
F(X_0, \tau) = T_c \nabla_X S_c (X, \tau) |_{X_0}
\end{equation}</script>
<p>where <script type="math/tex">T_c</script> and <script type="math/tex">\tau</script> are two free parameters. This force basically brings us closer to macrostates <script type="math/tex">X_j</script> that
maximize <script type="math/tex">S_c (X_j, \tau)</script>. In essence, the combination of equations (2), (3) and (4) maximizes the number of future options
of our agent. This isn’t very different from what many people try to do in life but this meta-heuristic does have very important
limitations.</p>
<p>The main limitation is that the agent actually needs to have access to the true state-transition probabilities of its environment
and if such a model is to be learned, the authors of the original paper[1] don’t say how.</p>
<h2 id="a-toy-problem">A toy problem:</h2>
<p>When simulating the toy problem of a dimensionless particle in a square heat reservoir, I made the following assumptions:</p>
<ol>
<li>The room is a 10x10 square and the walls are inelastic.</li>
<li>Given that state is represented by the particle’s position and the room is convex, the euclidean distance is a good metric for measuring the difference between states.</li>
<li>Assuming that the Causal Path Entropy varies continuously over states, we have a second argument for discretisation and may use the max operator rather than the
nabla operator to discover local maxima.</li>
<li>Assuming that the Causal Path Entropy is proportional to a propensity for mixing, we may approximate variations in Causal Path Entropy with Euclidean proxy measures for diffusion such as average nearest neighbours and the radius of gyration.</li>
<li>The particle isn’t quite dimensionless though it’s relatively small with respect to the room which allows us to approximate the Causal Path Entropy with the Boltzmann Entropy.</li>
</ol>
<p>Considering these five assumptions, I tried using two proxy measures. I first tried using the average nearest neighbour measure as a proxy for dispersion, though this wasn’t
quite as reliable as I hoped, so I experimented with the radius of gyration of an ensemble of terminal states as a proxy for diffusion as suggested in [2]. Below is a figure demonstrating convergence to the centre of the room using the radius of gyration as a proxy measure:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/Causal_Path_Entropy/master/images/distance_from_centre_of_room.png" align="middle" /></center>
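<p>For concreteness, here is a rough sketch of the radius-of-gyration proxy under assumptions of my choosing: Gaussian random-walk steps, walls that simply clip the position, and a greedy max operator over axis-aligned neighbours. The step size, horizon, ensemble size and function names are all hypothetical and are not taken from the linked repository.</p>

```python
import math
import random

ROOM = 10.0  # side length of the square room


def terminal_state(start, n_steps, step_std, rng):
    """Run one random walk from start; the walls clip the position."""
    x, y = start
    for _ in range(n_steps):
        x = min(max(x + rng.gauss(0.0, step_std), 0.0), ROOM)
        y = min(max(y + rng.gauss(0.0, step_std), 0.0), ROOM)
    return x, y


def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return math.sqrt(sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n)


def entropy_proxy(start, rng, n_walks=500, n_steps=20, step_std=0.5):
    """Radius of gyration of an ensemble of terminal states reached from start."""
    ends = [terminal_state(start, n_steps, step_std, rng) for _ in range(n_walks)]
    return radius_of_gyration(ends)


def best_neighbour(state, rng, stride=1.0):
    """Max operator over the current state and its four axis-aligned neighbours."""
    x, y = state
    candidates = [(x, y), (x + stride, y), (x - stride, y), (x, y + stride), (x, y - stride)]
    inside = [c for c in candidates if 0.0 <= c[0] <= ROOM and 0.0 <= c[1] <= ROOM]
    return max(inside, key=lambda c: entropy_proxy(c, rng))


rng = random.Random(0)
print(entropy_proxy((5.0, 5.0), rng))   # proxy at the centre of the room
print(entropy_proxy((0.5, 0.5), rng))   # proxy near a corner
print(best_neighbour((2.0, 5.0), rng))  # one greedy step of the max operator
```

<p>Because the proxy is a noisy estimate, a single greedy step can point the wrong way when neighbouring states have similar values; larger ensembles reduce that noise at a proportional cost in simulation time.</p>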
<p>Interestingly, the second measure performed much better than the first and I suspect that this is because the radius of gyration implicitly exploits the fact that
the square is convex and therefore the centre of the square may be identified with the centre of the largest inscribed circle. This raises the question of how general these
proxy measures actually are and whether we can hope to efficiently calculate path entropy for non-trivial systems even if we assume that a simulator is in
fact available.</p>
<h2 id="an-e--mc2-for-intelligence">An <script type="math/tex">E = mc^2</script> for intelligence?</h2>
<p>To be fair to the Causal Entropic Forces theory, I think it’s necessary to compare it with other prominent single-motivation theories such as the Free Energy
Principle, which aims to minimise prediction error, and the theory of Empowerment, which encourages agents to maximise their number of intrinsic options [3,4]. Unlike these other theories, which are frameworks for learning, inference and decision-making, the theory of Causal Entropic Forces is mainly a framework for decision-making and simulation, assuming that a simulator of the environment is known to the agent. Moreover, given that an Empowerment-maximising agent maximises its number of intrinsic options, the Causal
Entropic Force is merely a third-rate Empowerment variant.</p>
<p>Finally, even in the event that such a simulator is available (e.g. Chess/Go) you would actually need to design a clever search algorithm for that particular environment.
In non-trivial environments, you can’t actually use the nabla operator as proposed by Wissner-Gross to move the agent towards more promising states. For these
reasons, I think it’s completely silly to compare this five-page theory of ‘intelligence’ with Einstein’s labours on the theory of relativity.</p>
<h1 id="references">References:</h1>
<ol>
<li>Causal Entropic Forces (A. D. Wissner-Gross & C.E. Freer. 2013. Physical Review Letters.)</li>
<li>Causal Entropic Forces: Intelligent Behaviour, Dynamics and Pattern Formation (Hannes Hornischer. 2015. Masters Thesis.)</li>
<li>The free-energy principle: a rough guide to the brain? Friston. 2009.</li>
<li>Empowerment — An Introduction. C. Salge et al. 2013.</li>
</ol>
<p>Aidan Rocke</p>
<p>Fractals with TensorFlow<br />
2018-03-25T00:00:00+00:00<br />
/tensorflow/2018/03/25/tensorflow_fractals</p>
<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Last week, it occurred to me to experiment with Mandelbrot sequences with variable exponents and after
a few experiments using <a href="https://github.com/AidanRocke/TensorFlow-Fractals">TensorFlow-Fractals</a> I made
a couple of <a href="https://math.stackexchange.com/questions/2705107/symmetries-of-mandelbrot-sets-with-integer-exponents">mathematical observations</a>
which surprised me a little. My principal interest in fractals besides mathematical beauty is that their
massively parallel nature makes them a good benchmark for GPUs. In fact, one of my projects in the near
future will be to simulate Quaternion fractals on GPUs with TensorFlow [2].</p>
<p>Before continuing, I must say that from a mathematical perspective everything here is rather naive
but my philosophy is that it’s always better to get started and add more layers of sophistication later.</p>
<h2 id="the-mandelbrot-sequence">The Mandelbrot sequence:</h2>
<p>Mandelbrot sets are defined in terms of the following quadratic sequence in the complex plane:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^2 + c \\
c = z_0 \in \mathbb{C}
\end{cases}
\end{equation}</script>
<p>Using this sequence, the Mandelbrot set is normally defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
M = \{z_0 \in \mathbb{C}: \limsup_{n \to \infty} |z_n| < \infty \}
\end{equation} %]]></script>
<p>Now, given that <script type="math/tex">z_n</script> might be an oscillating sequence we need to resort to a few approximations in order to simulate Mandelbrot sets
on a computer. Here’s a short list:</p>
<ol>
<li>Finite precision.</li>
<li>Stopping criteria for divergence.</li>
<li>Stopping criteria for the number of iterates.</li>
</ol>
<p>To address these issues we use 32-bit floating point numbers, a pre-defined upper-bound on the modulus of <script type="math/tex">z_n</script> and a limit on the number of iterations.
With an upper-bound of 7.0 and a limit of 500 iterations, the reader should obtain an image similar to the following figure:</p>
<center><img src="https://i.stack.imgur.com/MRDvL.png" align="middle" style="width:600px;height:600px;" /></center>
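<p>The three approximations fit in a few lines of plain Python. This is only a scalar sketch of the escape-time idea, not the vectorised TensorFlow code from the repository, and it uses Python’s double-precision complex numbers rather than 32-bit floats; <code>escape_time</code> is a hypothetical name.</p>

```python
def escape_time(c, bound=7.0, max_iter=500):
    """Index of the first iterate with |z_n| > bound, or max_iter if none is found."""
    z = c                      # z_0 = c
    for n in range(max_iter):
        if abs(z) > bound:     # stopping criterion for divergence
            return n
        z = z * z + c          # z_{n+1} = z_n^2 + c
    return max_iter            # stopping criterion for the number of iterates


for c in (0j, -1 + 0j, 1 + 0j, 0.3 + 0.5j):
    print(c, escape_time(c))
```

<p>Points that reach <code>max_iter</code> are treated as members of the set; colouring the remaining points by their escape time is one common way to produce images like the one above.</p>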
<p>This is as much as I will say about Mandelbrot sets although if the reader is interested in learning more, I highly recommend the
<a href="http://mathworld.wolfram.com/MandelbrotSet.html">primer on Wolfram MathWorld</a>.</p>
<h2 id="generalised-mandelbrot-sequences">Generalised Mandelbrot sequences:</h2>
<p>Things became interesting when I experimented with recursive equations of the form:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = \overline{z_n}^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<p>Using equation <script type="math/tex">(3)</script> I obtained the following images for exponents of <script type="math/tex">-2.0</script> and <script type="math/tex">-4.0</script>:</p>
<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" style="width:600px;height:600px;" /></center>
<p><br /></p>
<center><img src="https://i.stack.imgur.com/8FSK0.png" align="middle" style="width:600px;height:600px;" /></center>
<p>In fact, I made the following observations:</p>
<ol>
<li>Using equation <script type="math/tex">(3)</script>, the resulting structure has <script type="math/tex">\alpha-1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert +1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
<li>Using equation <script type="math/tex">(4)</script>, the resulting structure has <script type="math/tex">\alpha+1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert-1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
</ol>
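<p>These observations can be probed numerically. The sketch below generalises a scalar escape-time function to integer exponents and optional conjugation, then compares the escape time of a point with that of its rotation by <script type="math/tex">2\pi/m</script>, where <script type="math/tex">m</script> is the conjectured number of symmetries. It is an illustration in plain Python rather than the TensorFlow implementation, the names are hypothetical, and agreement on a handful of points is evidence rather than proof, since rounding can change the result for points near the boundary.</p>

```python
import cmath


def escape_time(c, alpha, conjugate=False, bound=7.0, max_iter=200):
    """Escape time for z_{n+1} = z_n^alpha + c (or conj(z_n)^alpha + c), with z_0 = c."""
    z = c
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        base = z.conjugate() if conjugate else z
        try:
            z = base ** alpha + c
        except (ZeroDivisionError, OverflowError):
            return n           # zero (or a tiny value) raised to a negative power: treat as diverged
    return max_iter


def rotate(c, order):
    """Rotate c about the origin by 2*pi/order."""
    return c * cmath.exp(2j * cmath.pi / order)


alpha = 3
order = alpha - 1              # conjectured number of symmetries for equation (3), alpha >= 2
points = [1 + 1j, 0.1 + 0j, 0.4 - 0.7j, -0.9 + 0.2j]
agree = sum(escape_time(c, alpha) == escape_time(rotate(c, order), alpha) for c in points)
print(agree, "of", len(points), "points agree under rotation")
```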
<p>So far I don’t have a good explanation for these results but I hope to discover the reason behind the symmetries of these fractal structures
before the end of next week.</p>
<h2 id="whats-next">What’s next:</h2>
<p>Before investigating Quaternion Mandelbrot sets on GPUs, I would like to take a closer look at the following questions:</p>
<ol>
<li>Numerical stability as a function of <script type="math/tex">\alpha</script> and <script type="math/tex">z_0</script>.</li>
<li>Might there be better stopping criteria than hard-coded bounds on the modulus of <script type="math/tex">z_n</script> and the maximum number of iterates?</li>
<li>Is the Mandelbrot set computable? (Note: this has been <a href="https://cs.stackexchange.com/questions/42685/in-what-sense-is-the-mandelbrot-set-computable">discussed on the CS stackexchange</a>.)</li>
</ol>
<p>These questions don’t quite fall under the category of intelligent behaviour but who knows? On the one hand, the Universe might just be a set of simple rules which are applied in a recursive manner. On the other hand, fractals provide researchers with an effective (and beautiful) way of benchmarking hardware and software performance.</p>
<p>Either way, the moral of the story is that playing with Mandelbrot sets is always an opportunity to learn something new about computation.</p>
<h1 id="references">References:</h1>
<ol>
<li>Fractal Art Generation using GPUs. Mayfield et al. 2016.</li>
<li>Ray Tracing Quaternion Julia Sets on the GPU. Keenan Crane. 2005.</li>
<li>Non-computable Julia Sets. M. Braverman, M. Yampolsky. 2005.</li>
</ol>Aidan RockeNormal approximation to uniform distribution2018-03-13T00:00:00+00:002018-03-13T00:00:00+00:00/statistics/2018/03/13/normal_approximation<h2 id="motivation">Motivation:</h2>
<p>Earlier today I was talking to a researcher about how well a normal distribution could approximate a uniform distribution
over an interval <script type="math/tex">[a,b] \subset \mathbb{R}</script>. I gave a few arguments for why I thought a normal distribution wouldn’t be good
but I didn’t have the exact answer off the top of my head so I decided to find out. Although the following analysis involves
nothing fancy I consider it useful as it’s easily generalised to higher dimensions (i.e. multivariate uniform distributions)
and we arrive at a result which I wouldn’t consider intuitive.</p>
<p>For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.</p>
<h2 id="statement-of-the-problem">Statement of the problem:</h2>
<p>We would like to minimise the KL-Divergence:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{D}_{KL}(P\|Q) = \int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}dx
\end{equation}</script>
<p>where <script type="math/tex">P</script> is the target uniform distribution and <script type="math/tex">Q</script> is the approximating Gaussian:</p>
<script type="math/tex; mode=display">\begin{equation}
p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]}(x) \implies p(x) = 0 \quad \forall x \notin [a,b]
\end{equation}</script>
<p>and</p>
<script type="math/tex; mode=display">\begin{equation}
q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}
\end{equation}</script>
<p>Now, given that <script type="math/tex">\lim_{x \to 0} x\ln(x) = 0</script>, the integrand vanishes outside <script type="math/tex">[a,b]</script>. If we assume that <script type="math/tex">(a,b)</script> is fixed, the negative of the divergence, <script type="math/tex">\mathcal{L} = -\mathcal{D}_{KL}(P\|Q)</script>, may be expressed in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>, so that minimising the divergence amounts to maximising:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathcal{L}(\mu,\sigma) & = -\int_{a}^b p(x) \ln \frac{p(x)}{q(x)}dx \\
& = \ln(b-a) - \frac{1}{2}\ln(2\pi\sigma^2)-\frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split}
\end{equation} %]]></script>
<h2 id="minimising-with-respect-to-mu-and-sigma">Minimising with respect to <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</h2>
<p>We can easily show that the mean and variance of the Gaussian at the stationary point of <script type="math/tex">\mathcal{L}(\mu,\sigma)</script>, which is where the divergence is minimised, correspond to the
mean and variance of a uniform distribution over <script type="math/tex">[a,b]</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = \frac{(b+a)}{2\sigma^2} - \frac{2\mu}{2\sigma^2}= 0 \implies \mu = \frac{a+b}{2}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) = -\frac{1}{\sigma}+\frac{\frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2}{\sigma^3} =0 \implies \sigma^2 = \frac{(b-a)^2}{12}
\end{equation}</script>
<p>Although I wouldn’t have guessed this result, the careful reader will notice that it readily generalises to higher dimensions.</p>
<h2 id="analysing-the-loss-with-respect-to-optimal-gaussians">Analysing the loss with respect to optimal Gaussians:</h2>
<p>After entering the optimal values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> into <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> and simplifying the resulting expression we have
the following residual loss:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L}^* = -\frac{1}{2}\Big(\ln \big(\frac{\pi}{6}\big)+1\Big) \approx -0.18
\end{equation}</script>
<p>I find this result surprising because I didn’t expect the dependence on <script type="math/tex">\Delta = b-a</script> to vanish. That said, my current intuition for this result
is that if we tried fitting <script type="math/tex">\mathcal{U}(a,b)</script> to <script type="math/tex">\mathcal{N}(\mu,\sigma)</script> we would obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
a = \mu - \sqrt{3}\sigma \\
b = \mu + \sqrt{3}\sigma
\end{cases}
\end{equation}</script>
<p>so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>.</p>
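<p>The closed-form results above can be checked numerically. The following is a minimal Python sketch, independent of the TensorFlow script mentioned in this post: it evaluates the closed-form expression for <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> at the moment-matched parameters for several intervals. The function names are illustrative assumptions, not taken from any accompanying code.</p>

```python
import math

def neg_kl_uniform_gaussian(a, b, mu, sigma):
    """Closed form of L(mu, sigma) for a uniform target on [a, b] and a Gaussian N(mu, sigma^2)."""
    # Mean of (x - mu)^2 under the uniform distribution on [a, b].
    mean_sq_dev = ((b**3 - a**3) / 3 - mu * (b**2 - a**2) + mu**2 * (b - a)) / (b - a)
    return math.log(b - a) - 0.5 * math.log(2 * math.pi * sigma**2) - mean_sq_dev / (2 * sigma**2)

def moment_matched_gaussian(a, b):
    """Mean and standard deviation of the uniform distribution on [a, b]."""
    return (a + b) / 2, (b - a) / math.sqrt(12)

# The value at the stationary point should not depend on the interval.
for a, b in [(0.0, 1.0), (-3.0, 5.0), (10.0, 1000.0)]:
    mu, sigma = moment_matched_gaussian(a, b)
    print(a, b, neg_kl_uniform_gaussian(a, b, mu, sigma))
```

<p>Each interval should print the same value, roughly -0.176, which illustrates that the dependence on <script type="math/tex">\Delta = b-a</script> cancels at the optimum.</p>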
<h2 id="remark">Remark:</h2>
<p>The reader may experiment with <a href="https://gist.github.com/AidanRocke/0a3ff41c8421a974640742d57bee8b71">the following TensorFlow function</a> which outputs
the approximating mean and variance of a Gaussian given a uniform distribution on the interval <script type="math/tex">[a,b]</script>.</p>Aidan RockeMotivation:What is the role of logic in Mathematics?2017-11-27T00:00:00+00:002017-11-27T00:00:00+00:00/mathematics/2017/11/27/platonic_math<h2 id="introduction">Introduction:</h2>
<p>The orthodox belief among pure mathematicians is that the foundations of mathematics are grounded in a few sacred axioms
and set theory where logic naturally has a central role in its development. However, by means of a simple thought experiment
I show that curiosity, more than logic, is essential for the development of mathematics. Moreover, I argue that
curiosity is firmly grounded in both our sensorimotor experience and the tools we use for doing mathematics.</p>
<p>This leads to a holistic account of the foundations of mathematics which challenges the Platonic notion that
‘pure’ mathematics is discovered and makes the case that the envelope of potential mathematical
discoveries is parametrised by both human morphology and technologies for doing mathematics. Crucially, this ‘Cyborg’ view
of mathematics has important implications for investigations on the foundations of mathematics as well as the manner
in which mathematics is taught at the university level.</p>
<h2 id="the-role-of-logic-in-mathematics">The role of logic in mathematics:</h2>
<p>While the importance of axiomatics and set theory in structuring mathematics is undeniable, I think we should not lose sight
of what logic actually provides:</p>
<ol>
<li>A system for verifying our discoveries to an axiomatic level of detail.</li>
<li>A method for communicating our mathematical discoveries in a convincing manner.</li>
</ol>
<p>In truth, the second argument has much greater weight than the first since an important consequence of Gödel’s incompleteness
theorems is that logic doesn’t guarantee the permanence of our mathematical discoveries. Furthermore, very few mathematicians
use formal proof assistants like Coq or Isabelle to write their mathematical proofs although proof assistants are practically
essential for verification at an axiomatic level of detail. How can we explain this?</p>
<p>Like all humans, mathematicians pursue rigor only to the extent that the reward justifies its cost. That said, if logical verification
isn’t essential to mathematics what could possibly be the vital force behind its development?</p>
<h2 id="the-importance-of-curiosity">The importance of curiosity:</h2>
<p>While I would grant that logical verification is important for problem solving in mathematics, if mathematics were reducible to
problem solving we would have no more than one mathematical question to answer (ex. 2+2=?) and there wouldn’t have been a field
of mathematics. In other words, there has to be some intrinsic motivation in all mathematicians which drives them to not only
solve problems but also seek out problems to solve. From this it follows that intrinsic motivation (or curiosity) has a much greater
role than logic in explaining why there are multiple branches of mathematics. In fact, this implies that curiosity, not logic, has to
be the vital force which guides its development.</p>
<p>Such a line of reasoning is especially relevant to investigations on the foundations of mathematics as it immediately raises doubts
on the platonic account of mathematics. This however raises important epistemological questions concerning the nature of curiosity.</p>
<h2 id="the-origin-and-development-of-mathematics">The origin and development of mathematics:</h2>
<p>In [2], Poincaré famously argues that primitive mathematical notions like size, continuity and number have imprecise perceptual origins. A child can learn to tell the difference in size between a big dog and a small dog without having to first learn about the greater than relation. Such perceptual faculties effectively serve as good priors for learning mathematics, a task which would be considerably harder otherwise. In addition, there is a wide range of scientific evidence presented in [1] demonstrating that-besides being the origin of our mathematical knowledge-our sensorimotor experience is an essential guide in our mathematical development. This means that our curiosity is constrained by both our morphology and the tools we use for doing mathematics.</p>
<p>While mathematical reasoning often conforms to mathematical principles, it is typically implemented in a sensorimotor loop which includes a device for data-input (ex. pen/pencil) and material for data-storage (ex. paper). In this context, the authors of [1] advance a Cyborg view of mathematics:</p>
<blockquote>
<p>…the active manipulation of physical notations plays the role of ‘guiding’ the biological machinery through an abstract mathematical problem space-one that may exceed the space of otherwise solveable problems.</p>
</blockquote>
<p>Although many mathematicians might contest this, I wonder whether any mathematician can do advanced mathematics without pen and paper, or a functional substitute. We must also acknowledge the increasingly important role of the computer for doing research-level mathematics.</p>
<p>In addition, we must note a more subtle but equally significant technology; mathematical notation has evolved over time by a process which isn’t arbitrary. While the space of satisfactory mathematical notations might be large, most randomly generated notations are bad for doing mathematics which is why mathematicians define <a href="https://mathoverflow.net/questions/42929/suggestions-for-good-notation">rules of thumb for good notation</a>. The triumph of Leibniz notation over Newton’s notation is a concrete example of this. Moreover, Terence Tao once wrote a full <a href="https://terrytao.wordpress.com/advice-on-writing-papers/use-good-notation/">blog post</a> on this issue which includes the following quote due to Alfred North Whitehead:</p>
<blockquote>
<p>By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race.</p>
</blockquote>
<p>Yet, this statement flies in the face of Cognitive Science orthodoxy as stated in [1]:</p>
<blockquote>
<p>Cognitive scientists have traditionally viewed this capacity-the capacity for symbolic reasoning-as grounded in the ability to internally represent numbers, logical relationships, and mathematical rules in an abstract, amodal fashion.</p>
</blockquote>
<p>Clearly, this line of reasoning is absurd. If anything both scientific and empirical evidence strongly indicates that our sensorimotor experience is an essential substrate for mathematical thought and not merely a translational medium. When combined with the importance of curiosity it follows that we
have to encourage individual experimentation with technologies aiding mathematical activity in order to maximise the collective human potential for
mathematical discovery.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Having laid out these arguments, I think it’s clear that the Cyborg view of mathematics provides more stable foundations for mathematics than the orthodox view which is not only scientifically and empirically baseless, but also diminishes our collective potential for mathematical discovery. In particular, I would like to point out a few key innovations in the Cyborg tradition which have yet to be fully appreciated at the university level.</p>
<p>The first is the use of online blogs for communicating mathematical ideas as written homework/projects can be very isolating rather than engaging. You generally get very little feedback even if you do get a good mark which trivialises the activity. Second, is the creation of <a href="https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/">Polymath projects</a> for exploring the role of large-scale self-organizing collaboration among students. Finally, I think mathematicians of all levels of ability can benefit from using <a href="http://jupyter.org/">Jupyter notebooks</a> for interactive experimental mathematics as I have whenever investigating problems in combinatorics or probability.</p>
<p>In my opinion, these innovations indicate yet-unrealised potential. Indeed, I believe that if the majority of mathematicians transition towards a Cyborg perspective of mathematical foundations, we shall witness a much more creative period of mathematics.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>A perceptual account of symbolic reasoning (David Landy, Colin Allen & Carlos Zednik. 2014. Frontiers in Psychology.)</p>
</li>
<li>
<p>La Science et L’Hypothèse (Henri Poincaré. 2014. Champs Sciences.)</p>
</li>
</ol>Aidan RockeIntroduction:The theoretical limitations of DQN2017-08-29T00:00:00+00:002017-08-29T00:00:00+00:00/inference/2017/08/29/dqn<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/dqn.jpg" align="middle" /></center>
<h1 id="introduction">Introduction:</h1>
<p>Less than three years after DeepMind published ‘Playing Atari with Deep Reinforcement Learning’,
the practical impact of this method on the RL literature has been profound, as evidenced by the above graphic. However, the
theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis
actually clarifies the evolution of DQN and highlights which research directions are worth prioritising.</p>
<h1 id="background-on-dqn">Background on DQN:</h1>
<p>The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions <script type="math/tex">a \in \mathcal{A}</script> and states <script type="math/tex">x \in X</script> in a Markov
Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return:</p>
<script type="math/tex; mode=display">\begin{equation}
Q_{\pi}(x,a) = \mathbb{E} [\sum_{t=0}^{\infty} \gamma^t R(x_t,a_t)], \gamma \in (0,1)
\end{equation}</script>
<p>In particular the aim is to learn a parametrised value function <script type="math/tex">Q(x,a;\theta_t)</script> whose estimates are shifted towards the target:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\arg\max\limits_{a} Q(S_{t+1},a;\theta_{t});\theta_t)
\end{equation}</script>
<p>and gradient descent updates are done as follows:</p>
<script type="math/tex; mode=display">\begin{equation}
\theta_{t+1} = \theta_t + \alpha(Y_t^Q-Q(S_t,A_t;\theta_t)) \nabla_{\theta} Q(S_t,A_t;\theta_t)
\end{equation}</script>
<p>In addition, epsilon-greedy action selection is used for exploration. To avoid estimates that merely reflect
recent experience, the authors of DQN regularly allow the network to perform experience replay: batch updates
based on less recent experience.</p>
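<p>The target (2) and the update (3) can be sketched in a few lines of Python. This is a minimal illustration which assumes a linear model with one weight vector per action instead of a deep network; the names <code>q_values</code>, <code>dqn_target</code> and <code>q_update</code> are hypothetical and exploration and replay are left out.</p>

```python
def q_values(theta, features):
    """Q(s, a; theta) for every action a, using one weight vector per action (an assumption)."""
    return [sum(w * f for w, f in zip(weights, features)) for weights in theta]

def dqn_target(theta, reward, next_features, gamma):
    """Y = R + gamma * max_a Q(S', a; theta): selection and evaluation share the same theta."""
    return reward + gamma * max(q_values(theta, next_features))

def q_update(theta, features, action, target, alpha):
    """theta <- theta + alpha * (Y - Q(S, A; theta)) * grad_theta Q(S, A; theta)."""
    td_error = target - q_values(theta, features)[action]
    new_theta = [list(weights) for weights in theta]
    # For a linear model the gradient for the chosen action's weights is the feature vector.
    new_theta[action] = [w + alpha * td_error * f for w, f in zip(theta[action], features)]
    return new_theta

theta = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two features
target = dqn_target(theta, reward=1.0, next_features=[0.0, 1.0], gamma=0.9)
theta = q_update(theta, features=[1.0, 0.0], action=0, target=target, alpha=0.5)
print(target, theta)
```

<p>Only the weights of the action that was actually taken move, and they move in proportion to the temporal-difference error <script type="math/tex">Y_t^Q-Q(S_t,A_t;\theta_t)</script>.</p>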
<p>Given the above description of DQN, we may note the following:</p>
<ol>
<li>Selection and evaluation in DQN is done with respect to the same parameters <script type="math/tex">\theta_t</script>.</li>
<li>Assuming that variance is unavoidable, the <script type="math/tex">\max</script> operator in (2) leads to over-optimistic estimates.</li>
<li>The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.</li>
</ol>
<p>These issues shall be addressed in the sections that follow.</p>
<h1 id="asymptotic-nonsense-or-the-data-inefficiency-of-dqn">Asymptotic nonsense or the data-inefficiency of DQN:</h1>
<p>In the simple case of i.i.d. data <script type="math/tex">X_i</script> with <script type="math/tex">\mathbb{E}[X_i] = \mu</script> and <script type="math/tex">Var[X_i] = \sigma^2</script>, if <script type="math/tex">S_n = \sum_{i=1}^{n} X_i</script> then a simple application of Chebyshev’s inequality gives:</p>
<script type="math/tex; mode=display">\begin{equation}
\forall \epsilon > 0, P(|\frac{S_n}{n}-\mu| > \epsilon) \leq \frac{\sigma^2}{n \epsilon^2}
\end{equation}</script>
<p>Essentially, this inequality shows that even in simple scenarios convergence of the sample mean requires a lot of data
and the rate of convergence depends on the variance <script type="math/tex">\sigma^2</script>. Furthermore, we must note that this inequality ignores
the following facts:</p>
<ol>
<li>For fixed <script type="math/tex">(x,a)</script>, <script type="math/tex">Q_{\pi}(x,a)</script> is rarely unimodal in practice.</li>
<li><script type="math/tex">Q_{\pi}(x,a)</script> rarely has negligible variance.</li>
<li>Our data is sequential and hardly ever i.i.d.</li>
</ol>
<p>From these points it follows that significant estimation errors are unavoidable but as I will show, this isn’t the main
problem.</p>
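<p>To get a feel for the Chebyshev bound in the i.i.d. case, the sketch below compares it with a simulated deviation rate for the mean of uniform random numbers on the unit interval, whose variance is 1/12. The setup (uniform samples, 2000 trials, a fixed seed) is an assumption chosen purely for illustration.</p>

```python
import random

def chebyshev_bound(variance, n, eps):
    """Upper bound on P(|S_n / n - mu| > eps) for i.i.d. samples with the given variance."""
    return min(1.0, variance / (n * eps**2))

def empirical_deviation_rate(n, eps, trials=2000, seed=0):
    """Fraction of trials in which the mean of n Uniform(0, 1) samples misses 0.5 by more than eps."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        if abs(sample_mean - 0.5) > eps:
            exceed += 1
    return exceed / trials

for n in (10, 100, 1000):
    print(n, chebyshev_bound(1 / 12, n, 0.05), empirical_deviation_rate(n, 0.05))
```

<p>The bound only shrinks like <script type="math/tex">1/n</script>, and it is distribution-free, so it is typically loose for well-behaved data. The sequential, non-i.i.d. data of an RL agent offers no such guarantee.</p>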
<h1 id="the-unreasonable-optimism-of-dqn">The unreasonable optimism of DQN:</h1>
<ol>
<li>
<p>Over-optimism with respect to estimation errors:</p>
<p>The authors in [3] highlight that in (2), evaluation of the target <script type="math/tex">Y_t^Q</script> and action selection are done with respect to
the same parameters <script type="math/tex">\theta_t</script>, which makes over-optimistic value estimates more likely because of the <script type="math/tex">\max</script> operator.
This suggests that estimation errors of any kind are more likely to result in overly-optimistic policies.</p>
<p>While this is problematic, the authors of [3] discovered the following elegant solution:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\arg\max\limits_{a} Q(S_{t+1},a;\theta_{t});\theta'_{t})
\end{equation}</script>
<p>The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights <script type="math/tex">\theta</script>
and <script type="math/tex">\theta'</script>.</p>
</li>
<li>
<p>Over-optimism with respect to risk regardless of estimation error:</p>
<p>Consider the classic problem in decision theory of having to choose between an envelope <script type="math/tex">A</script> which contains $90.00 and envelope
<script type="math/tex">B</script> which contains $200.00 or $0.00 with equal probability. Although <script type="math/tex">Var[A] \ll Var[B]</script>, our agent’s
ignorance of the bimodality of <script type="math/tex">B</script> would lead it to act in an over-optimistic fashion. Due to the <script type="math/tex">\max</script> operator
it would make a decision solely based on the fact that <script type="math/tex">\mathbb{E}[B] > \mathbb{E}[A]</script>.</p>
<p>The above problem clearly requires a very different perspective.</p>
</li>
</ol>
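<p>Both kinds of over-optimism can be made concrete with a small Python sketch. The first part contrasts the DQN target with the Double DQN target for hand-picked next-state value estimates; the second computes the mean and variance of the two envelopes. All numbers other than the envelope payoffs are made up for illustration, and the function names are hypothetical.</p>

```python
def argmax(values):
    """Index of the largest entry (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def dqn_target(reward, gamma, q_next_online):
    """Select and evaluate the next action with the same estimates."""
    return reward + gamma * q_next_online[argmax(q_next_online)]

def double_dqn_target(reward, gamma, q_next_online, q_next_target):
    """Select the next action with the online estimates, evaluate it with the second set."""
    return reward + gamma * q_next_target[argmax(q_next_online)]

def mean_and_variance(outcomes):
    """outcomes is a list of (probability, payoff) pairs."""
    mean = sum(p * x for p, x in outcomes)
    variance = sum(p * (x - mean) ** 2 for p, x in outcomes)
    return mean, variance

q_online = [1.0, 2.5, 2.0]  # noisy estimates under theta (illustrative)
q_second = [1.2, 1.5, 2.2]  # estimates under theta' (illustrative)
print(dqn_target(0.0, 1.0, q_online), double_dqn_target(0.0, 1.0, q_online, q_second))

envelope_a = [(1.0, 90.0)]
envelope_b = [(0.5, 200.0), (0.5, 0.0)]
print(mean_and_variance(envelope_a), mean_and_variance(envelope_b))
```

<p>In the first part, the action whose value was overestimated by the online weights is still selected, but its value is read from the second set of weights, which damps the upward bias. In the second part, an agent that compares expectations alone prefers envelope <script type="math/tex">B</script> and never sees the difference in variance.</p>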
<p>Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail I would recommend that the
reader start with [5] which provides an elegant and scalable solution with what can be thought of as a data-dependent
version of dropout [8]. The consideration of value distributions helps reduce uncertainty and improve inference.</p>
<h1 id="the-latent-value-of-hierarchical-models">The latent value of hierarchical models:</h1>
<p>Perhaps the most important question when considering the evolution of DQN is how will these agents develop rich conceptual abstractions
that will allow scientific induction or generalisation. Although one can argue that a DQN learns good statistical representations of
environmental states <script type="math/tex">x</script> it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive
and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.</p>
<p>In particular, I would like to mention the promising work of Tejas Kulkarni who investigated the use of hierarchical DQN, which has the following architecture:</p>
<ol>
<li>Controller: which learns policies in order to satisfy particular goals</li>
<li>Meta-Controller: which chooses goals</li>
<li>Critic: which evaluates whether a goal has been achieved</li>
</ol>
<p>Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned
over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting
research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a
grammar over policies. I think this might be a necessary component for the emergence of language in machines.</p>
<p>The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of
introspection.</p>
<h1 id="conclusion">Conclusion:</h1>
<p>I find it remarkable that a simple method such as DQN should inspire many new approaches. Perhaps it’s not so much the brilliance
of the method but rather its generality which allowed this method to adapt and evolve. In particular, I think the coupling
of Distributional RL with Hierarchical Deep RL has a very bright future. Together, these will lead to significant improvements in terms of inference and generalisation.</p>
<p><strong>Note:</strong> The graphic is taken from [9].</p>
<h1 id="references">References:</h1>
<ol>
<li>C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.</li>
<li>V. Mnih, K. Kavukcuoglu, D. Silver et al. Playing Atari with Deep Reinforcement Learning. 2015.</li>
<li>H. van Hasselt, A. Guez and D. Silver. Deep Reinforcement Learning with Double Q-learning. 2015.</li>
<li>Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. 2017.</li>
<li>Ian Osband, Charles Blundell, Alexander Pritzel and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. 2016.</li>
<li>Tejas Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. 2016.</li>
<li>Marc G. Bellemare, Will Dabney and Rémi Munos. A Distributional Perspective on Reinforcement Learning. 2017.</li>
<li>Yarin Gal & Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 2016.</li>
<li>Niels Justesen, Philip Bontrager, Julian Togelius, Sebastian Risi. Deep Learning for Video Game Playing. 2017.</li>
</ol>Aidan RockeEntropy Maximization and intelligent behaviour2017-07-06T00:00:00+00:002017-07-06T00:00:00+00:00/intelligence/2017/07/06/maxent<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/forking_paths.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Sergio Hernandez, a Spanish mathematician, recently shared some <a href="http://entropicai.blogspot.fr/2017/06/solved-atari-games.html">very interesting results</a> on the OpenAI gym environment which are based on a <a href="http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf">relatively unknown paper</a>
published by Dr. Wissner-Gross, a physicist trained at MIT. What is impressive about Wissner-Gross’s meta-heuristic is that it is succinctly described by three equations which try to maximize the future freedom of your agent. In this analysis, I summarize the method, present its strengths and weaknesses, and attempt to improve it by making an important modification to one of the equations.</p>
<h2 id="causal-entropic-forces">Causal entropic forces:</h2>
<p>In the following summary of Wissner-Gross’s meta-heuristic, it’s assumed that the agent has access to an approximate or exact simulator. A close reading of
the original paper [1] will show that this assumption is actually necessary.</p>
<h3 id="macrostates">Macrostates:</h3>
<p>For any open thermodynamic system, we treat the phase-space paths taken by the system <script type="math/tex">x(t)</script> over the time interval <script type="math/tex">[0,\tau]</script> as microstates
and partition them into macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script> using the equivalence relation[1]:</p>
<script type="math/tex; mode=display">\begin{equation}
x(t) \sim x'(t) \iff x(0) = x'(0)
\end{equation}</script>
<p>As a result, we can identify each macrostate <script type="math/tex">X_i</script> with a unique present system state <script type="math/tex">x(0)</script>. This defines a notion of causality over a time interval.</p>
<h3 id="causal-path-entropy">Causal path entropy:</h3>
<p>We can define the causal path entropy <script type="math/tex">S_c</script> of a macrostate <script type="math/tex">X_i</script> with the associated present system state <script type="math/tex">x(0)</script> as the path integral:</p>
<script type="math/tex; mode=display">\begin{equation}
S_c (X_i, \tau) = -k_B \int_{x(t)} P(x(t)|x(0)) \ln P(x(t)|x(0)) \,D x(t)
\end{equation}</script>
<p>where we have:</p>
<script type="math/tex; mode=display">\begin{equation}
P(x(t)| x(0)) = \int_{x^*(t)} P(x(t),x^*(t) |x(0)) \,D x^*(t)
\end{equation}</script>
<p>In (3) we basically integrate over all possible paths <script type="math/tex">x^*(t)</script> taken by the open system’s environment. In practice, this integral is intractable
and we must resort to approximations and the use of a sampling algorithm like Hamiltonian Monte Carlo [3].</p>
<h3 id="causal-entropic-force">Causal entropic force:</h3>
<p>A path-based causal entropic force <script type="math/tex">F</script> may be expressed as:</p>
<script type="math/tex; mode=display">\begin{equation}
F(X_0, \tau) = T_c \nabla_X S_c (X, \tau) |_{X_0}
\end{equation}</script>
<p>where <script type="math/tex">T_c</script> and <script type="math/tex">\tau</script> are two free parameters. This force pushes the system towards macrostates <script type="math/tex">X_j</script> with
higher causal path entropy <script type="math/tex">S_c (X_j, \tau)</script>. In essence, the combination of equations (2), (3) and (4) maximizes the number of future options
of our agent. This isn’t very different from what most people try to do in life but this meta-heuristic does have very important
limitations.</p>
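<p>A discrete toy analogue may help fix ideas. The sketch below is an assumption-laden illustration, not the method of [1]: it replaces the continuous system with a random walker in a short corridor, sets <script type="math/tex">k_B = 1</script>, ignores the environment integral in (3), and computes the path entropy exactly through the chain rule for Markov chains. The "force" is a finite-difference version of (4).</p>

```python
import math
from functools import lru_cache

N = 11  # corridor cells 0..N-1 (hypothetical toy environment)

def transition(x):
    """Next-position distribution for a walker stepping -1, 0 or +1 uniformly, blocked by the walls."""
    probs = {}
    for step in (-1, 0, 1):
        nxt = min(max(x + step, 0), N - 1)
        probs[nxt] = probs.get(nxt, 0.0) + 1.0 / 3.0
    return probs

@lru_cache(maxsize=None)
def path_entropy(x, tau):
    """Entropy of all length-tau paths starting at x (chain rule for Markov chains)."""
    if tau == 0:
        return 0.0
    probs = transition(x)
    step_entropy = -sum(p * math.log(p) for p in probs.values())
    return step_entropy + sum(p * path_entropy(nxt, tau - 1) for nxt, p in probs.items())

def entropic_force(x, tau, temperature=1.0):
    """Finite-difference analogue of F = T_c * dS_c/dX, one-sided at the walls."""
    left, right = max(x - 1, 0), min(x + 1, N - 1)
    return temperature * (path_entropy(right, tau) - path_entropy(left, tau)) / (right - left)

for x in (0, 2, 5, 8, 10):
    print(x, round(path_entropy(x, 5), 4), round(entropic_force(x, 5), 4))
```

<p>In this toy setting the force points away from both walls and vanishes, up to rounding, at the centre of the corridor, which is the position that keeps the largest number of future paths open.</p>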
<h2 id="limitations-of-the-causal-entropic-approach">Limitations of the Causal Entropic approach:</h2>
<ol>
<li>
<p>The Causal Entropic paper makes the implicit assumption that we have access to a reliable simulator of future states. In the
case of the OpenAI environments this isn’t a problem because environment simulators are provided but in general it’s a hard problem. Two useful approaches to this problem
are suggested by [4] and [5] using recurrent neural networks.</p>
</li>
<li>
<p>Maximizing your number of future options is not always a good idea. Sometimes fewer options are better provided that these are
more useful options. This is why for example, football players don’t always rush to the center of a football pitch, although from
that position they would maximize their number of future states i.e. possible positions on the pitch.</p>
</li>
</ol>
<p>In the next section I would like to show that it’s possible to find a practical solution to the second limitation by modifying
(2).</p>
<h2 id="causal-path-utility">Causal Path Utility:</h2>
<p>Assuming that a recurrent neural network is used to define potential macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script>, it’s reasonable to assume
that our agent’s understanding of the future evolves with time and therefore macrostates are a function of time. So we have <script type="math/tex">\{ X_i(t) \}_{i \in I}</script>
rather than <script type="math/tex">\{ X_i \}_{i \in I}</script>. In other words, our simulator which might be an RNN, will probably change its parameters and
even its topology over time.</p>
<p>In order to resolve the second limitation and encourage the agent to make confident decisions,
I propose that we replace <script type="math/tex">S_c(X, \tau)</script> with <script type="math/tex">U_c(X, \tau)</script> where:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
U_c (X_i, \tau) & = -\int_{x(t)} P(x(t)|x(0)) \ln \big(U(x(t)|x(0))\, e^{-Var[U(x(t)\mid x(0))]}\big) \,D x(t) \\
& = \mathbb{E}[-\ln U(x(t)|x(0))]+\mathbb{E}[Var[U(x(t)\mid x(0))]] \geq 0\end{split}
\end{equation} %]]></script>
<p>This not only has the added value of simplifying calculations but also allows us to disentangle the relative contributions of utility and uncertainty.
It must also be noted that the two expressions in (5) can be calculated in parallel although the uncertainty calculation is more computationally
expensive.</p>
<h2 id="discussion">Discussion:</h2>
<p>If we assume that the agent’s perception of the future doesn’t change much, it might perceive some future states to be ideal. This is
consistent with the empirical observation that many people believe certain accomplishments would bring them ‘genuine happiness’. In other
words, if the state space is compact and approximately time-invariant the agent’s optimal future macrostate converges to a fixed point [6].</p>
<p>While the notion of Causal Path Utility just occurred to me today, I believe that this is a very promising approach which I shall follow up with concrete implementations very soon.</p>
<h1 id="references">References:</h1>
<ol>
<li>
<p>Causal Entropic Forces (A. D. Wissner-Gross & C.E. Freer. 2013. Physical Review Letters.)</p>
</li>
<li>
<p>Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (Yarin Gal & Zoubin Ghahramani. 2016. ICML. )</p>
</li>
<li>
<p>Stochastic Gradient Hamiltonian Monte Carlo ( Tianqi Chen, Emily Fox & Carlos Guestrin. 2014. ICML.)</p>
</li>
<li>
<p>Recurrent Environment Simulators (Silvia Chiappa et al. 2017. ICLR.)</p>
</li>
<li>
<p>On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)</p>
</li>
<li>
<p>Fixed Point Theorems with Applications to Economics and
Game Theory (Border, Kim C. 1985. Cambridge University Press.)</p>
</li>
</ol>Aidan RockeThe weight transport problem2017-06-30T00:00:00+00:002017-06-30T00:00:00+00:00/deep/learning/2017/06/30/weight-transport<h2 id="introduction">Introduction:</h2>
<p>In an excellent paper published less than two years ago, Timothy Lillicrap, a theoretical neuroscientist at DeepMind, found
a simple yet reasonable solution to the weight transport problem. Essentially, Timothy and his co-authors showed that it’s
possible to backpropagate errors through fixed random feedback weights and still obtain very competitive results on various benchmarks [2]. The
reason why this is really significant is that it marks an important step towards biologically plausible deep learning.</p>
<h2 id="the-weight-transport-problem">The weight transport problem:</h2>
<p>While backpropagation is a very effective approach for training deep neural networks, at present it’s not at all clear whether
the brain might actually use this method for learning. In fact, backprop has three biologically implausible requirements [1]:</p>
<ol>
<li>feedback weights must be the same as feedforward weights</li>
<li>forward and backward passes require different computations</li>
<li>error gradients must be stored separately from activations</li>
</ol>
<p>A biologically plausible solution to the second and third problems is to use an error propagation network with the same topology
as the feedforward network but used only for backpropagation of error signals. However, there is no known biological mechanism
for this error network to know the weights of the feedforward network. This makes the first requirement, weight symmetry, a
serious obstacle.</p>
<p>This is also known as the weight transport problem [3].</p>
<h2 id="random-synaptic-feedback">Random synaptic feedback:</h2>
<p>The solution proposed by Lillicrap et al. is based on two good observations:</p>
<ol>
<li>
<p>Any fixed random matrix <script type="math/tex">B</script> may serve as a substitute
for the original matrix <script type="math/tex">W</script> in backpropagation provided that on average we have:</p>
<script type="math/tex; mode=display">\begin{equation}
e^\top WB e > 0
\end{equation}</script>
<p>where <script type="math/tex">e</script> is the error in the network’s output. Geometrically, this is equivalent to requiring that the backpropagated signal <script type="math/tex">W^\top e</script> and the random feedback signal <script type="math/tex">Be</script> are within
<script type="math/tex">90^{\circ}</script> of each other.</p>
</li>
<li>
<p>Over time we get better alignment between <script type="math/tex">W</script> and <script type="math/tex">B</script> due to the modified update rules which means that the first requirement becomes
easier to satisfy with more iterations.</p>
</li>
</ol>
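<p>As a rough illustration of the first observation, the sketch below computes the angle between the backpropagated signal and the random feedback signal for a single error vector. The matrices, the error vector and the helper names are made up for this example and are not taken from the paper or from my repository.</p>

```python
import math

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def alignment_angle(W, B, e):
    """Angle in degrees between W^T e (backprop) and B e (random feedback)."""
    bp = matvec(transpose(W), e)
    fa = matvec(B, e)
    dot = sum(p * q for p, q in zip(bp, fa))  # this is e^T W B e
    norm = math.sqrt(sum(p * p for p in bp)) * math.sqrt(sum(q * q for q in fa))
    # Clamp to guard against rounding just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Toy shapes (assumed): 2 outputs and 3 hidden units, so W is 2x3 and B is 3x2.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
e = [0.5, -1.0]

B_aligned = transpose(W)                                  # B = W^T is plain backprop
B_opposed = [[-w for w in row] for row in transpose(W)]   # worst case, B = -W^T

angle_aligned = alignment_angle(W, B_aligned, e)
angle_opposed = alignment_angle(W, B_opposed, e)
```

<p>Any angle below <script type="math/tex">90^{\circ}</script> corresponds to <script type="math/tex">e^\top WB e > 0</script>, so the update still points, on average, in a direction that reduces the error. The function assumes neither signal is the zero vector.</p>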
<h2 id="a-simple-example">A simple example:</h2>
<p>Let’s consider a simple three-layer linear neural network that is intended to approximate a linear mapping <script type="math/tex">T</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
h = W_0 x \\
y = W h \\
e = Tx -y
\end{cases}
\end{equation}</script>
<p>The loss is given by:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L} = \frac{1}{2} e^\top e
\end{equation}</script>
<p>From this we may derive the following backpropagation update equations, where each update follows the negative gradient of the loss:</p>
<script type="math/tex; mode=display">\begin{equation}
\Delta W \propto -\frac{\partial \mathcal{L}}{\partial W} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial W} = -\left(e \cdot (-1) \cdot h^\top\right) = e h^\top
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\Delta W_0 \propto -\frac{\partial \mathcal{L}}{\partial W_0} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial h} \frac{\partial h}{\partial W_0} = -\left(W^\top e \cdot (-1) \cdot x^\top\right) = W^\top e x^\top
\end{equation}</script>
<p>Now the random synaptic feedback innovation is essentially to replace step <script type="math/tex">(5)</script> with:</p>
<script type="math/tex; mode=display">\begin{equation} \Delta W_0 \propto B e x^\top
\end{equation}</script>
<p>where <script type="math/tex">B</script> is a fixed random matrix. As a result, we no longer need explicit knowledge of the original weights in our update equations.
I actually implemented this method for a three-layer sigmoid (i.e. nonlinear) neural network and obtained <a href="https://github.com/pauli-space/weight_symmetry/blob/master/experiments/random_synaptic_feedback/three_layer.py">89.5% accuracy on the MNIST dataset
after 10 iterations</a>, a result
that is competitive with backpropagation.</p>
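<p>To make the update rules concrete, here is a minimal sketch of random synaptic feedback on the linear network above, written in plain Python rather than taken from the linked repository. The layer sizes, learning rate, zero initialisation and random seed are all assumptions chosen for illustration.</p>

```python
import random

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def train_feedback_alignment(n_in=4, n_hidden=8, n_out=2,
                             lr=0.01, steps=5000, seed=0):
    rng = random.Random(seed)

    def rand_matrix(rows, cols, scale):
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    T = rand_matrix(n_out, n_in, 1.0)        # target linear map
    B = rand_matrix(n_hidden, n_out, 0.5)    # fixed random feedback matrix
    W0 = [[0.0] * n_in for _ in range(n_hidden)]
    W = [[0.0] * n_hidden for _ in range(n_out)]
    held_out = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(200)]

    def mean_loss():
        total = 0.0
        for x in held_out:
            y = matvec(W, matvec(W0, x))
            t = matvec(T, x)
            total += 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))
        return total / len(held_out)

    loss_before = mean_loss()
    for _ in range(steps):
        x = [rng.gauss(0, 1) for _ in range(n_in)]
        h = matvec(W0, x)
        y = matvec(W, h)
        e = [ti - yi for ti, yi in zip(matvec(T, x), y)]
        feedback = matvec(B, e)              # B e stands in for W^T e
        for i in range(n_out):               # Delta W = lr * e h^T
            for j in range(n_hidden):
                W[i][j] += lr * e[i] * h[j]
        for j in range(n_hidden):            # Delta W0 = lr * (B e) x^T
            for k in range(n_in):
                W0[j][k] += lr * feedback[j] * x[k]
    return loss_before, mean_loss()

loss_before, loss_after = train_feedback_alignment()
```

<p>Note that the forward weights <script type="math/tex">W</script> never appear in the update for <script type="math/tex">W_0</script>. Comparing the two returned losses gives a quick check that the network is still learning under random feedback.</p>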
<h2 id="discussion">Discussion:</h2>
<p>In spite of its remarkable simplicity, Timothy Lillicrap’s solution to the weight transport problem is very effective and so I think it
deserves further investigation. In the near future I plan to implement random synaptic feedback for much larger sigmoid and ReLU networks
as well as recurrent neural networks in order to build upon the work of [1].</p>
<p>Considering all the approaches to biologically plausible deep learning attempted so far, I believe this work represents a very important step forward.</p>
<h2 id="references">References:</h2>
<ol>
<li>How Important Is Weight Symmetry in Backpropagation? (Qianli Liao, Joel Z. Leibo, Tomaso A. Poggio. 2016. AAAI.)</li>
<li>Random synaptic feedback weights support error backpropagation for deep learning (T. Lillicrap et al. 2016. Nature Communications.)</li>
<li>Competitive learning: From interactive activation to adaptive resonance (S. Grossberg. 1987. Cognitive Science 11(1):23–63.)</li>
</ol>Aidan RockeIntroduction:deep rectifier networks: preliminary observations2017-06-21T00:00:00+00:002017-06-21T00:00:00+00:00/deep/learning/2017/06/21/observations_1a<p>Approximately one week ago, I defined a <a href="http://paulispace.com/deep/learning/2017/06/15/experiment_1.html">set of experiments</a> in order to model the effects of dropout and unsupervised pre-training on deep rectifier networks. However, prior to running through the experiments I realised that this was an opportunity to develop my own personal research workflow. After more reflection I decided to follow this particular process:</p>
<ol>
<li>Define experiments: including methodology, experimental setup and working hypotheses</li>
<li>Share preliminary observations: in order for readers to understand where scientific intuitions come from and overcome writer’s block</li>
<li>Experimental analysis: detailed statistical analysis of experimental results including hypothesis testing</li>
<li>Theoretical analysis: theoretical analysis of experimental results</li>
<li>Further discussion: discuss phenomena that are worth investigating further</li>
</ol>
<p>The present blog post aims to go through a part of stage 2. In particular, today I aim to share interesting observations concerning vanilla
three-layer rectifier networks with 500 nodes per layer trained on the MNIST dataset without dropout or unsupervised pre-training.</p>
<h2 id="visualizing-binary--in-activation-space">Visualizing binary in activation space:</h2>
<div class="image-wrapper">
<img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/activation_space.png" alt="" />
<p class="image-caption">two dimensional embedding of binary activations</p>
</div>
<p>Above we have a two-dimensional linear embedding of binary representations, which was obtained by applying PCA to the concatenated output of the hidden layers after a binary mask was applied to the output of each layer. This method is inspired by [5], where the authors used a similar method to study local competition among subnetworks within deep rectifier networks. Although I didn’t manage to get clusters that are as well-separated as those of
R. Srivastava et al., we have clear evidence of emergent organisation among subnetworks within deep rectifier networks.</p>
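<p>For readers who want to see what a binary representation looks like in practice, here is a small sketch of the masking and concatenation step. The function name and the toy activations are hypothetical, and treating a unit as active when its output is strictly positive is my assumption for this example.</p>

```python
def binary_code(layer_activations):
    """Concatenate the hidden layers of one example into a single 0/1 vector.

    A unit contributes 1 if its rectifier output is strictly positive, else 0.
    """
    return [1 if a > 0 else 0 for layer in layer_activations for a in layer]

# Hypothetical activations of one example in a tiny network
# with three hidden layers of four units each.
example = [
    [0.0, 1.3, 0.0, 0.2],
    [2.1, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.4, 0.0],
]
code = binary_code(example)
```

<p>For the network studied here, the same step would produce one vector of length 1500 per example, and PCA is then applied to the matrix formed by stacking these vectors.</p>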
<p>In particular, we may note that 1 is very near 2, 7 is near 9, and 0 blends with 4. A Canadian AI researcher might argue that 0 is entangled with 4 [6]. However, the variance explained by PCA(n=2) was only around 40%, which means that roughly 60% of the variance was lost in going from 1500 dimensions to 2 dimensions. This suggests that we might need a more reliable method for analysing variable disentangling.</p>
<h2 id="variable-disentangling">Variable disentangling:</h2>
<h3 id="the-average-euclidean-distance-between-representations-per-class">The average Euclidean distance between representations per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/average_euclidean_distance.png" align="middle" /></center>
<p>The above heatmap shows the average Euclidean distance between binary representations for each pair of class labels. This is useful
because the average value gives an indication of the relative contribution of each node when predicting a particular class. In particular, we note
that 7 appears to be quite close to 9 but 0 doesn’t appear to be particularly close to 4. This is why I always use low-dimensional visualizations
with caution.</p>
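<p>One plausible way to compute such a heatmap entry is sketched below: average the binary codes of each class, then take the Euclidean distance between the two class averages. This is my reading of the description above rather than a copy of the analysis code, and the toy codes and labels are made up.</p>

```python
import math

def class_means(codes, labels):
    """Average the binary codes of each class into one real-valued vector."""
    sums, counts = {}, {}
    for code, label in zip(codes, labels):
        acc = sums.setdefault(label, [0.0] * len(code))
        for i, bit in enumerate(code):
            acc[i] += bit
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical four-node codes for two examples of each of two classes.
codes = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
labels = [7, 7, 9, 9]

means = class_means(codes, labels)
distance_7_9 = euclidean(means[7], means[9])
```

<p>Each entry of a class mean is the fraction of examples in that class for which the node was active, which is why the averaged vector can be read as the relative contribution of each node.</p>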
<p>I also tried a different approach for analysing variable disentangling which gave very interesting and unexpected results.</p>
<h3 id="fraction-of-nodes-shared-per-class">Fraction of nodes shared per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/nodes_shared.png" align="middle" /></center>
<p>The above heatmap shows that the fraction of nodes shared between any pair of classes is always above 90%, which is quite surprising. In other words,
subnetworks that are tasked with predicting different classes share at least 90% of their nodes. This suggests that there is
a core representation that is frequently reused with some small variations between examples, and that these small variations are very important. In some sense the deep rectifier network is very efficient at sharing resources and I believe this relates well to the notion of local competition described by R. Srivastava in [5]. I also think it merits further study.</p>
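<p>The post does not spell out how the shared fraction is defined, so the sketch below uses one reasonable choice: the overlap between the sets of nodes that are active for each class, measured as intersection over union. Both the definition and the toy codes are assumptions made for illustration.</p>

```python
def active_nodes(codes):
    """Indices of nodes that are active for at least one example in `codes`."""
    return {i for code in codes for i, bit in enumerate(code) if bit}

def shared_fraction(codes_a, codes_b):
    """Intersection over union of the node sets used by two classes."""
    a, b = active_nodes(codes_a), active_nodes(codes_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical five-node codes for two classes.
class_a = [[1, 1, 1, 0, 1], [1, 1, 0, 0, 1]]   # uses nodes {0, 1, 2, 4}
class_b = [[1, 1, 0, 1, 1]]                    # uses nodes {0, 1, 3, 4}

fraction = shared_fraction(class_a, class_b)
```

<p>With this definition a value above 0.9 means that fewer than one node in ten is exclusive to either class.</p>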
<p>Prior to studying the fraction of shared nodes between subnetworks, I imagined that the relative sparsity of activity in deep rectifier networks implied
that the above observation would be quite improbable. In fact, the mean activations per hidden layer are something I looked into as well.</p>
<h2 id="mean-activity-per-hidden-layer-per-epoch">Mean activity per hidden layer per epoch:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/mean_activations.png" align="middle" /></center>
<p><br /></p>
<p>If it’s not clear, the above set of histograms shows the mean activations for each of the three layers for each of the five epochs. What I find interesting is that we observe:</p>
<ol>
<li>Convergence in distribution which was quantified using the conditional entropy.</li>
<li>The mean activation for the first hidden layer has a mode around 0.5 whereas the mean activations
for the second and third hidden layers have a mode around 0.7.</li>
<li>Since all three layers have the same width, this indicates that on average (0.7+0.7+0.5)/3 ≈ 63% of the nodes are used at any given time. Based on what
I’ve read in [1] I would expect this fraction to decrease if we fix the width while we increase the depth of
the network, but it appears that we don’t yet have a good mathematical model to predict the number of active
nodes given a dataset with a particular sample complexity.</li>
</ol>
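<p>The arithmetic in the last point can be checked with a few lines of Python. The helper below is a hypothetical way of measuring the active fraction of a single layer from binary codes, and the three modes are simply the values read off the histograms.</p>

```python
def active_fraction(layer_codes):
    """Fraction of units in one layer that are active, averaged over examples."""
    per_example = [sum(code) / len(code) for code in layer_codes]
    return sum(per_example) / len(per_example)

# Toy layer with four units and two examples (made-up values).
toy_fraction = active_fraction([[1, 0, 1, 1], [0, 0, 1, 1]])

# Modes read off the histograms, one per hidden layer.
layer_modes = [0.5, 0.7, 0.7]

# A plain average is only valid because every layer has the same width.
overall = sum(layer_modes) / len(layer_modes)
```

<p>If the layers had different widths, each mode would have to be weighted by the number of nodes in its layer.</p>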
<p>Now, although it wasn’t suggested in any of the papers I’ve read so far I figured that I could probably use
the mean activations per hidden layer to study variable-size representation as well as sparsity. My reasoning
was that if a particular class required subnetworks with more nodes than another class on average then this
would probably capture the notion of variable-size representation as described in [6]:</p>
<blockquote>
<p>Varying the number of active neurons allows a model to control the effective dimensionality of the representation
for a given input and the required precision. - X. Glorot, Y. Bengio & A. Bordes</p>
</blockquote>
<h2 id="variable-size-representation">Variable-size representation:</h2>
<div align="center">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>class</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<th>variable-size rank</th>
<td>1</td>
<td>9</td>
<td>3</td>
<td>7</td>
<td>6</td>
<td>5</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>10</td>
</tr>
</tbody>
</table>
</div>
<p><br /></p>
<p>This table effectively shows how the neural network represents the relative dimensionality of each class. The way I obtained this was by calculating the
average number of nodes used to predict each class and then ranking these values by size.</p>
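<p>The ranking step can be sketched as follows. The node counts are invented, and I assume here that rank 1 goes to the class that uses the most nodes on average, which is a convention chosen for this example.</p>

```python
def rank_by_size(mean_active_nodes):
    """Map each class to its rank, where rank 1 uses the most nodes on average."""
    ordered = sorted(mean_active_nodes, key=mean_active_nodes.get, reverse=True)
    return {label: rank for rank, label in enumerate(ordered, start=1)}

# Hypothetical average number of active nodes for three classes.
mean_active_nodes = {0: 980.2, 1: 870.5, 2: 955.0}
ranks = rank_by_size(mean_active_nodes)
```

<p>Ranks discard the size of the gaps between classes, so two classes with nearly identical averages can still end up far apart if the ordering is unstable.</p>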
<p>An interesting and essential follow-up question is whether this relative order is respected when we train a rectifier network with the same architecture on a subset of the 10 original classes. If we ran pair-wise experiments, for example, we would have to run 45 of them, one for each way of choosing 2 classes out of 10. If the relative order is difficult
to reproduce then we have a problem with the notion of variable size. Right now I am not sure whether there’s a simple theory that would explain how a neural network controls variable size. The only way to find out is to do the experiments.</p>
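<p>The count of 45 comes from choosing 2 classes out of 10 without regard to order, which is easy to enumerate:</p>

```python
from itertools import combinations

digits = range(10)
# Unordered pairs of distinct classes, one per pair-wise experiment.
pairs = list(combinations(digits, 2))
n_experiments = len(pairs)
```

<p>The same list can double as the experiment schedule, since each tuple names the two digits to keep in the training set.</p>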
<h2 id="the-example-ordering-problem">The example ordering problem:</h2>
<p>Finally, I also took a look at the example ordering problem, a limitation of gradient descent for training neural networks that was noted in [3].
As the authors observed, the relative contribution of each epoch to the model that emerges within a neural network isn’t representative of the information available
per epoch. In fact, we observe that the weights change much more during the earlier epochs than during later epochs:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/weight_norms.png" align="middle" /></center>
<p>The above plot shows that the change in the weight norm is larger during the earlier epochs than during the later epochs. This is consistent with the model of gradient descent obtained in [4], which emphasises an approximate isomorphism between gradient descent and high-dimensional
damped oscillators, but it isn’t good news. Since examples seen early shape the weights far more than examples seen late, gradient descent is not a data-efficient method for learning signals from
data.</p>
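<p>A simple way to quantify this effect is to save the weights after every epoch and measure how far they move between consecutive snapshots. The sketch below uses the Frobenius norm of the difference, which is my choice of measure for this example, and the snapshots are made-up values.</p>

```python
import math

def frobenius_norm(M):
    return math.sqrt(sum(w * w for row in M for w in row))

def norm_of_change(W_prev, W_next):
    """Frobenius norm of the difference between two weight snapshots."""
    diff = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(W_prev, W_next)]
    return frobenius_norm(diff)

# Hypothetical snapshots of one 2x2 weight matrix after epochs 0, 1 and 2.
snapshots = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
    [[3.3, 4.4], [0.0, 0.0]],
]
changes = [norm_of_change(a, b) for a, b in zip(snapshots, snapshots[1:])]
```

<p>A sequence of changes that shrinks quickly, as in this toy example, is the signature of the example ordering problem.</p>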
<p>I think that the only way to avoid this is to add temporal memory to networks so they may perform inference forwards and backwards in time. If I analysed
this problem further I would probably rediscover one of the many recurrent neural network architectures or I might discover my own architecture. Very often it’s useful to approach problems as if they haven’t been investigated before as that’s the only way to become a good theoretician.</p>
<h2 id="discussion">Discussion:</h2>
<p>This marks the end of my first observational study and I think you’ll agree with me that it has highlighted many questions that are worth further
investigation. So you might ask, what next? I plan to do more detailed observational studies on the following questions in the following order:</p>
<ol>
<li>Sparsity of representations as we increase the depth of a rectifier network while keeping the width constant</li>
<li>Stability of relative variable size for randomly chosen subclasses</li>
<li>Solutions to the example ordering problem</li>
</ol>
<p>I will continue to use the MNIST dataset but I will try to find models that take into account sample complexity for each of the above questions
so these models should generalise well. Once I’ve gone through these semi-formal observational studies which are useful for developing intuitions
I’ll proceed with the experiments I’ve defined earlier.</p>
<p><strong>Note:</strong> If you’d like to repeat this analysis, the code I used is available <a href="https://github.com/pauli-space/deep_rectifiers">here</a>,
but I would wait until the weekend because I’m going to make important changes. It’s a bit of a mess at the moment.</p>
<h2 id="references">References:</h2>
<ol>
<li>Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)</li>
<li>Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)</li>
<li>Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)</li>
<li>The Physical Systems behind Optimization (L. Yang et al. 2017.)</li>
<li>Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)</li>
<li>Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)</li>
</ol>Aidan Rocke