Pauli Space (feed generated by Jekyll, 2018-04-24)
Investigations on the foundations for intelligence.
<h1>Understanding the free energy principle (2018-04-12)</h1>
<h2 id="motivation">Motivation:</h2>
<p>My general interest in single-motivation theories stems from the belief that a common ancestor for all multi-cellular organisms might imply
common principles of intelligent behaviour. It’s a somewhat reductive hypothesis and as I argued last week, <a href="http://paulispace.com/statistics/2018/04/07/causal_path_entropy.html">some of these theories might be
too reductive</a>, but I think it’s a useful working hypothesis that can take
behavioural scientists very far<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. However, until recently I wasn’t properly acquainted with the free energy principle which, from a distance,
appears to be one of the more plausible single-motivation theories.</p>
<p>The free energy principle is a theory developed by Karl Friston and others to explain how biological systems tend to avoid disorder by limiting themselves
to a small number of favorable states. It comes across as a rather abstract mathematical theory but thanks to a <a href="http://romainbrette.fr/what-is-computational-neuroscience-xxix-the-free-energy-principle/">critical thought experiment</a> proposed by <a href="https://twitter.com/RomainBrette">Romain Brette</a> I found an opportunity to take
a closer look at this theory. In fact, I promised Brette that I would run a computer simulation demonstrating that his thought experiment rests upon flawed assumptions (<a href="https://github.com/pauli-space/Free_Energy_experiments">code here</a>).</p>
<p>In this context, the goal of this blog post is to explain the main idea of the free energy principle and dissect Romain Brette’s thought experiment
in order to develop a practical understanding of this theory.</p>
<h2 id="the-free-energy-principle">The Free Energy Principle:</h2>
<p>In [1], Karl Friston proposes that the Free Energy principle may be a rough guide to the brain and makes the following points:</p>
<ol>
<li>The free energy principle applies to any biological system that resists a tendency to disorder.</li>
<li>Because self-organising biological systems resist this tendency to disorder, they must minimise the entropy
of their sensory states.</li>
<li>Assuming that <script type="math/tex">m</script> corresponds to a generative model describing the biological system and <script type="math/tex">y</script> refers to the system’s sensory states, under ergodic assumptions, the entropy is:</li>
</ol>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation*}
\begin{split}
H(y) & = -\int P(y|m) \ln P(y|m) \,dy \\
& = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} - \ln P(y(t)|m) \,dt
\end{split}
\end{equation*} %]]></script>
<p>Now, given that entropy is the long-term average of surprise (think of a Monte Carlo simulation), agents must avoid surprising states, where surprise is defined
relative to the homeostatic conditions of that particular organism<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
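<p>The "long-term average of surprise" reading can be checked with a quick Monte Carlo sketch (mine, not from the original post; the Gaussian model and its parameters are purely illustrative): average the surprisal of a long run of samples and compare against the closed-form entropy.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0  # an illustrative Gaussian generative model

# Draw a long run of "sensory states" y(t) from the model.
y = rng.normal(mu, sigma, size=1_000_000)

# Surprisal of each state: -ln P(y|m) for a Gaussian.
surprisal = 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Entropy as the long-term average of surprise (Monte Carlo estimate) ...
H_mc = surprisal.mean()

# ... against the closed-form differential entropy of a Gaussian.
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(H_mc, H_exact)  # the two estimates agree closely
```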
<p>The three points above are sufficient to understand Romain Brette’s thought experiment, though I must emphasise that surprisal here is defined in terms of the
agent’s homeostatic conditions, so minimisation of surprisal corresponds to minimisation of both epistemic uncertainty (i.e. unknown unknowns) and
statistical uncertainty (i.e. known unknowns).</p>
<h2 id="romain-brettes-thought-experiment">Romain Brette’s thought experiment:</h2>
<p>In Brette’s article, he summarises the free energy principle in the following manner:</p>
<blockquote>
<p>The free energy principle is the theory that the brain manipulates a probabilistic generative model of its sensory inputs,
which it tries to optimise by either changing the model (learning) or changing the inputs (by acting).</p>
</blockquote>
<p>Although I haven’t mentioned anything about the human brain so far, this is a relatively good summary, and Brette proceeds with
the following food vs. no food thought experiment:</p>
<ol>
<li>An agent has two kinds of observations/stimuli: food and the absence of food.</li>
<li>This agent has two possible actions: seek food or don’t seek food.</li>
<li>When the agent seeks food there’s a 20% probability of getting food.</li>
<li>When the agent doesn’t seek food there’s a 100% probability of getting no food.</li>
</ol>
<p>What should a surprise minimising agent do? Romain presents the following argument:</p>
<blockquote>
<p>What does the free energy principle tell us? To minimize surprise, it seems clear that I should sit: I am certain to not see food. No surprise at all. The proposed solution is that you have a prior expectation to see food. So to minimize the surprise, you should put yourself into a situation where you might see food, ie to seek food. This seems to work. However, if there is any learning at all, then you will quickly observe that the probability of seeing food is actually 20%, and your expectations should be adjusted accordingly. Also, I will also observe that between two food expeditions, the probability to see food is 0%. Once this has been observed, surprise is minimal when I do not seek food. So, I die of hunger. It follows that the free energy principle does not survive Darwinian competition.</p>
</blockquote>
<p>Basically, Romain argues that surprise is minimal when the organism doesn’t seek food assuming that Friston’s definition of surprisal corresponds to minimisation of
statistical uncertainty. Given that Friston’s surprisal is defined in terms of the agent’s homeostatic conditions, this assumption is precisely where Romain’s analysis
breaks down. It also helps to simulate such toy problems on a computer, if possible, because in a simulation you have to make every modelling assumption clear.</p>
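<p>The distinction can be made concrete with a toy calculation (my own illustrative numbers, assuming the organism needs about 3 meals per day and a unit-variance "food critic", as in the model of the next section): score each policy by its expected surprisal under the homeostatic Gaussian over the daily meal count, rather than under the statistics of the sensory stream.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3.0  # homeostatic set point: required meals per day

def homeostatic_surprisal(meals):
    # -ln N(meals; k, 1): surprise relative to the organism's needs,
    # not relative to the statistics of its observations.
    return 0.5 * np.log(2 * np.pi) + 0.5 * (meals - k)**2

# "Seek food": 24 hourly attempts, each succeeding with probability 0.2.
meals_seek = rng.binomial(24, 0.2, size=100_000)
surprise_seek = homeostatic_surprisal(meals_seek).mean()

# "Don't seek food": the daily total is always 0, perfectly predictable.
surprise_sit = homeostatic_surprisal(0.0)

print(surprise_seek, surprise_sit)
```

Even though sitting still makes the sensory stream perfectly predictable, the food-seeking policy has the lower homeostatic surprisal.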
<h2 id="a-reasonable-model-of-brettes-problem">A reasonable model of Brette’s problem:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/Free_Energy_experiments/master/diagram.png" align="middle" /></center>
<p>To simulate Romain’s problem, I made the following assumptions:</p>
<ol>
<li>We have an organism which has to eat <script type="math/tex">k</script> times on average over each 24-hour period and can eat at most once per hour.</li>
<li>The homeostatic conditions of our organism are given by a Gaussian distribution centered at <script type="math/tex">k</script> with unit variance, a Gaussian food critic if you will. This specifies that our organism shouldn’t eat much less than <script type="math/tex">k</script> times a day and shouldn’t eat a lot more than <script type="math/tex">k</script> times a day. This may also help explain why living organisms tend to
have masses that are approximately normally distributed during adulthood.</li>
<li>A food policy consists of a 24-dimensional vector whose values range from 0.0 to 1.0, and we want to minimise the negative log probability of the total consumption under the Gaussian food critic.</li>
<li>Food policies are the output of a generative neural network (set up using TensorFlow) whose inputs are either one or zero to indicate a survival prior, with one indicating a preference for survival.</li>
<li>Gradient descent via backpropagation, here using the Adagrad optimiser [5], functions as a homeostatic regulator: the network weights are updated in proportion to the gradient of the negative logarithmic loss (i.e. surprisal).</li>
</ol>
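<p>The optimisation at the heart of the simulation can be reduced to a few lines of plain numpy (a deliberately stripped-down sketch, not the notebook's code, which uses a TensorFlow network and Adagrad): parameterise the 24-dimensional policy through a sigmoid, and descend the gradient of the surprisal assigned by the Gaussian food critic.</p>

```python
import numpy as np

k = 3.0  # required average meals per day
rng = np.random.default_rng(2)
logits = rng.normal(size=24)  # unconstrained policy parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    policy = sigmoid(logits)            # hourly eating probabilities in (0, 1)
    total = policy.sum()                # expected meals per day
    # Surprisal under the Gaussian food critic N(k, 1):
    # -ln q(total) = 0.5*ln(2*pi) + 0.5*(total - k)^2
    grad_total = total - k              # d surprisal / d total
    grad_logits = grad_total * policy * (1 - policy)  # chain rule through sigmoid
    logits -= 0.1 * grad_logits         # plain gradient descent step

print(sigmoid(logits).sum())  # converges to ~3 expected meals per day
```

The discovered policy eats roughly <code>k</code> times per day on average, which is precisely the survival behaviour Brette's argument predicts a surprise-minimising agent would avoid.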
<p>Assuming <script type="math/tex">k=3</script>, I ran a simulation in the <a href="https://github.com/pauli-space/Free_Energy_experiments/blob/master/simulation.ipynb">following notebook</a> and found that the discovered food policy differs significantly from Romain’s expectation that the agent would choose to not look for food in order to minimise surprisal. In fact, our simple agent manages to get three meals per day on average so it survives.</p>
<p>Overall, this is a relatively simple problem with a fixed prior (i.e. fixed belief) as the organism doesn’t have to do more than eat, so I can minimise surprise directly. In general, if we have adjustable beliefs (e.g. models of physics and their physical parameters/constants) then we have a much harder problem, and that’s where I would need to use the KL-divergence and invoke free energy minimisation rather than directly minimising surprisal. However, these models and their parameters would still be evaluated with respect to homeostatic constraints. This guarantees that the organism isn’t simply trying to minimise statistical uncertainty.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Until recently, the Free Energy Principle has been a constant source of mockery from neuroscientists who misunderstood it and so I hope that by growing a collection
of <a href="https://github.com/pauli-space/Free_Energy_experiments">free-energy motivated reinforcement learning examples on Github</a> we may finally have a constructive discussion
between scientists. Moreover, I have been asked whether it’s not immodest for Karl Friston to suggest that his theory might be a model for human behaviour. Well, my answer
to that question is the same answer I would give to the critics of Empowerment [6].</p>
<p>Let’s see how far ingenious implementations(i.e. experiments) using these formalisms can take us. That’s the only way we’ll know what the limitations of these
theories are.</p>
<h1 id="references">References:</h1>
<ol>
<li>The free-energy principle: a rough guide to the brain? (K. Friston. 2009.)</li>
<li>The Markov blankets of life: autonomy, active inference and the free energy principle (M. Kirchhoff, T. Parr, E. Palacios, K. Friston and J. Kiverstein. 2018.)</li>
<li>Free-Energy Minimization and the Dark-Room Problem (K. Friston, C. Thornton and A. Clark. 2012.)</li>
<li>What is computational neuroscience? (XXIX) The free energy principle (R. Brette. 2018.)</li>
<li>Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (J. Duchi, E. Hazan and Y. Singer. 2011.)</li>
<li>Empowerment — An Introduction (C. Salge et al. 2013.)</li>
<li>Reward, Motivation, and Reinforcement Learning (P. Dayan and B. Balleine. 2002.)</li>
</ol>
<h1 id="footnotes">Footnotes:</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The notion of utility maximisation in economics, though limited, has been very useful for example. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In [2], homeostatic conditions of an organism are defined in terms of Markov Blankets which are equivalent to the boundaries of a system in a statistical sense. I would encourage the reader to go into that paper after going through this blog post but this concept isn’t essential for understanding Romain’s thought experiment, so we’ll ignore this formalism for now. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Aidan Rocke</p>
<h1>Approximating Causal Path Entropy in Euclidean spaces (2018-04-07)</h1>
<h2 id="motivation">Motivation:</h2>
<p>Last week I spoke to Alex Gomez-Marin, a behavioural neuroscientist with a passing interest in the theory of Causal Entropic Forces,
about determining the Causal Entropic Force on a dimensionless particle contained in a 2-D heat reservoir. I promised to try and work
out an approximation of the Causal Entropic Force on the particle. Meanwhile, it has been almost a year since I last wrote an
<a href="http://paulispace.com/intelligence/2017/07/06/maxent.html">article on the matter</a> and since then I have developed a better understanding
of this theory which Dr. Wissner-Gross calls an ‘<script type="math/tex">E=mc^2</script> for intelligence’.</p>
<p>You might guess, from my slightly reticent tone, that I’m no longer the biggest fan of this theory. While I won’t lambast the theory
as Gary Marcus has done in the following <a href="https://www.newyorker.com/tech/elements/a-grand-unified-theory-of-everything">New Yorker article</a>,
I now think that on balance his criticism was spot on. To understand why, I shall present a constructive dissection of the theory by going
through its principles and simulating the toy problem of a particle in a heat reservoir (<a href="https://github.com/pauli-space/Causal_Path_Entropy">code here</a>).</p>
<h2 id="causal-entropic-forces">Causal entropic forces:</h2>
<p>In the following summary of Wissner-Gross’s meta-heuristic, it’s assumed that the agent has access to an approximate or exact simulator. A close reading of
the original paper [1] will show that this assumption is actually necessary.</p>
<h3 id="macrostates">Macrostates:</h3>
<p>For any open thermodynamic system, we treat the phase-space paths taken by the system <script type="math/tex">x(t)</script> over the time interval <script type="math/tex">[0,\tau]</script> as microstates
and partition them into macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script> using the equivalence relation [1]:
<script type="math/tex; mode=display">\begin{equation}
x(t) \sim x'(t) \iff x(0) = x'(0)
\end{equation}</script>
<p>As a result, we can identify each macrostate <script type="math/tex">X_i</script> with a unique present system state <script type="math/tex">x(0)</script>. This defines a notion of causality over a time interval.</p>
<h3 id="causal-path-entropy">Causal path entropy:</h3>
<p>We can define the causal path entropy <script type="math/tex">S_c</script> of a macrostate <script type="math/tex">X_i</script> with the associated present system state <script type="math/tex">x(0)</script> as the path integral:</p>
<script type="math/tex; mode=display">\begin{equation}
S_c (X_i, \tau) = -k_B \int_{x(t)} P(x(t)|x(0)) \ln P(x(t)|x(0)) \,D x(t)
\end{equation}</script>
<p>where we have:</p>
<script type="math/tex; mode=display">\begin{equation}
P(x(t)| x(0)) = \int_{x^*(t)} P(x(t),x^*(t) |x(0)) \,D x^*(t)
\end{equation}</script>
<p>In (3) we basically integrate over all possible paths <script type="math/tex">x^*(t)</script> taken by the open/closed system’s environment. In practice, this integral is intractable
and we must resort to approximations which we shall discuss shortly.</p>
<h3 id="causal-entropic-force">Causal entropic force:</h3>
<p>A path-based causal entropic force <script type="math/tex">F</script> may be expressed as:</p>
<script type="math/tex; mode=display">\begin{equation}
F(X_0, \tau) = T_c \nabla_X S_c (X, \tau) |_{X_0}
\end{equation}</script>
<p>where <script type="math/tex">T_c</script> and <script type="math/tex">\tau</script> are two free parameters. This force basically brings us closer to macrostates <script type="math/tex">X_j</script> that
maximize <script type="math/tex">S_c (X_i, \tau)</script>. In essence the combination of equations (2), (3) and (4) maximize the number of future options
of our agent. This isn’t very different from what many people try to do in life but this meta-heuristic does have very important
limitations.</p>
<p>The main limitation is that the agent actually needs access to the true state-transition probabilities of its environment,
and if such a model is to be learned, the authors of the original paper [1] don’t say how.</p>
<h2 id="a-toy-problem">A toy problem:</h2>
<p>When simulating the toy problem of a dimensionless particle in a square heat reservoir, I made the following assumptions:</p>
<ol>
<li>The room is a 10x10 square and the walls are inelastic.</li>
<li>Given that state is represented by the particle’s position and the room is convex, the Euclidean distance is a good metric for measuring the difference between states.</li>
<li>Assuming that the Causal Path Entropy varies continuously over states, we have a second argument for discretisation and may use the max operator rather than the
nabla operator to discover local maxima.</li>
<li>Assuming that the Causal Path Entropy is proportional to a propensity for mixing, we may approximate variations in Causal Path Entropy with Euclidean proxy measures for diffusion such as average nearest neighbours and the radius of gyration.</li>
<li>The particle isn’t quite dimensionless though it’s relatively small with respect to the room which allows us to approximate the Causal Path Entropy with the Boltzmann Entropy.</li>
</ol>
<p>Considering these five assumptions, I tried using two proxy measures. I first tried using the average nearest neighbour measure as a proxy for dispersion, though this wasn’t
quite as reliable as I hoped, so I experimented with the radius of gyration of an ensemble of terminal states as a proxy for diffusion, as suggested in [2]. Below is a figure demonstrating convergence to the centre of the room using the radius of gyration as a proxy measure:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/Causal_Path_Entropy/master/images/distance_from_centre_of_room.png" align="middle" /></center>
<p>Interestingly, the second measure performed much better than the first and I suspect that this is because the radius of gyration implicitly exploits the fact that
the square is convex and therefore the centre of the square may be identified with the largest inscribed circle. This raises the question of how general these
proxy measures actually are and whether we can hope to efficiently calculate path entropy for non-trivial systems even if we assume that a simulator is in
fact available.</p>
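<p>A minimal version of this radius-of-gyration heuristic (an illustrative reimplementation, not the repository code; ensemble sizes and step parameters are my own choices) looks like the following: sample an ensemble of clamped random walks from a candidate position and measure the spread of the terminal states. Starts near the centre leave more room to diffuse, so they score higher than starts near a wall.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
SIDE = 10.0  # the room is a 10x10 square with inelastic walls

def radius_of_gyration(start, n_paths=2000, n_steps=50, step_std=0.5):
    """Proxy for causal path entropy: spread of the terminal states of an
    ensemble of random walks started at `start`, clamped at the walls."""
    pos = np.tile(np.asarray(start, dtype=float), (n_paths, 1))
    for _ in range(n_steps):
        pos += rng.normal(0.0, step_std, size=pos.shape)
        np.clip(pos, 0.0, SIDE, out=pos)  # inelastic walls
    centroid = pos.mean(axis=0)
    return np.sqrt(((pos - centroid)**2).sum(axis=1).mean())

rg_centre = radius_of_gyration([5.0, 5.0])
rg_corner = radius_of_gyration([0.5, 0.5])
print(rg_centre, rg_corner)  # the centre scores higher than the corner
```

Greedily moving toward the neighbouring position with the highest score therefore drives the particle toward the centre of the room, as in the figure above.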
<h2 id="an-e--mc2-for-intelligence">An <script type="math/tex">E = mc^2</script> for intelligence?</h2>
<p>To be fair to the Causal Entropic Forces theory, I think it’s necessary to compare it with other prominent single-motivation theories such as the Free Energy
Principle, which aims to minimise prediction error, and the theory of Empowerment, which encourages agents to maximise their number of intrinsic options [3,4]. Unlike these other theories, which are frameworks for learning, inference and decision-making, the theory of Causal Entropic Forces is mainly a framework for decision-making and simulation, assuming that a simulator of the environment is known to the agent. Moreover, given that an Empowerment-maximising agent already maximises its number of intrinsic options, the Causal
Entropic Force is merely a third-rate Empowerment variant.</p>
<p>Finally, even in the event that such a simulator is available(ex. Chess/Go) you would actually need to design a clever search algorithm for that particular environment.
In non-trivial environments, you can’t actually use the nabla operator as proposed by Wissner-Gross to move the agent towards more promising states. For these
reasons, I think it’s completely silly to compare this five-page theory of ‘intelligence’ with Einstein’s labours on the theory of relativity.</p>
<h1 id="references">References:</h1>
<ol>
<li>Causal Entropic Forces (A. D. Wissner-Gross & C.E. Freer. 2013. Physical Review Letters.)</li>
<li>Causal Entropic Forces: Intelligent Behaviour, Dynamics and Pattern Formation (Hannes Hornischer. 2015. Masters Thesis.)</li>
<li>The free-energy principle: a rough guide to the brain? (K. Friston. 2009.)</li>
<li>Empowerment — An Introduction (C. Salge et al. 2013.)</li>
</ol>
<h1>Fractals with TensorFlow (2018-03-25)</h1>
<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Last week, it occurred to me to experiment with Mandelbrot sequences with variable exponents and after
a few experiments using <a href="https://github.com/AidanRocke/TensorFlow-Fractals">TensorFlow-Fractals</a> I made
a couple of <a href="https://math.stackexchange.com/questions/2705107/symmetries-of-mandelbrot-sets-with-integer-exponents">mathematical observations</a>
which surprised me a little. My principal interest in fractals besides mathematical beauty is that their
massively parallel nature makes them a good benchmark for GPUs. In fact, one of my projects in the near
future will be to simulate Quaternion fractals on GPUs with TensorFlow [2].</p>
<p>Before continuing, I must say that from a mathematical perspective everything here is rather naive
but my philosophy is that it’s always better to get started and add more layers of sophistication later.</p>
<h2 id="the-mandelbrot-sequence">The Mandelbrot sequence:</h2>
<p>Mandelbrot sets are defined in terms of the following quadratic sequence in the complex plane:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^2 + c \\
c = z_0 \in \mathbb{C}
\end{cases}
\end{equation}</script>
<p>Using this sequence, the Mandelbrot set is normally defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
M = \{z_0 \in \mathbb{C}: \limsup_{n \to \infty} |z_n| < \infty \}
\end{equation} %]]></script>
<p>Now, given that <script type="math/tex">z_n</script> might be an oscillating sequence we need to resort to a few approximations in order to simulate Mandelbrot sets
on a computer. Here’s a short list:</p>
<ol>
<li>Finite precision.</li>
<li>Stopping criteria for divergence.</li>
<li>Stopping criteria for the number of iterates.</li>
</ol>
<p>To address these issues we use 32-bit floating point numbers, a pre-defined upper-bound on the modulus of <script type="math/tex">z_n</script> and a limit on the number of iterations.
With an upper-bound of 7.0 and a limit of 500 iterations, the reader should obtain an image similar to the following figure:</p>
<center><img src="https://i.stack.imgur.com/MRDvL.png" align="middle" style="width:600px;height:600px;" /></center>
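<p>An escape-time loop implementing the three approximations above can be sketched in vectorised numpy (the post's own code uses TensorFlow; the grid resolution and viewport here are illustrative choices of mine), using the same bound of 7.0 on the modulus and the same cap of 500 iterations:</p>

```python
import numpy as np

def mandelbrot(width=400, height=400, bound=7.0, max_iter=500):
    # Grid of candidate c = z_0 values in the complex plane.
    xs = np.linspace(-2.0, 1.0, width)
    ys = np.linspace(-1.5, 1.5, height)
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]
    z = np.zeros_like(c)
    # Iteration at which |z_n| first exceeds the bound (max_iter = never).
    escape = np.full(c.shape, max_iter)
    for n in range(max_iter):
        alive = escape == max_iter       # points that have not escaped yet
        z[alive] = z[alive]**2 + c[alive]
        escaped = alive & (np.abs(z) > bound)
        escape[escaped] = n
    return escape  # points still at max_iter are treated as members of M

img = mandelbrot()
```

Plotting `img` (for example with `matplotlib.pyplot.imshow`) reproduces the familiar picture; the points that never escape approximate the set <code>M</code>.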
<p>This is as much as I will say about Mandelbrot sets although if the reader is interested in learning more, I highly recommend the
<a href="http://mathworld.wolfram.com/MandelbrotSet.html">primer on Wolfram MathWorld</a>.</p>
<h2 id="generalised-mandelbrot-sequences">Generalised Mandelbrot sequences:</h2>
<p>Things became interesting when I experimented with recursive equations of the form:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = \overline{z_n}^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<p>Using equation <script type="math/tex">(3)</script> I obtained the following images for exponents of <script type="math/tex">-2.0</script> and <script type="math/tex">-4.0</script>:</p>
<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" style="width:600px;height:600px;" /></center>
<p><br /></p>
<center><img src="https://i.stack.imgur.com/8FSK0.png" align="middle" style="width:600px;height:600px;" /></center>
<p>In fact, I made the following observations:</p>
<ol>
<li>Using equation <script type="math/tex">(3)</script>, the resulting structure has <script type="math/tex">\alpha-1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert +1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
<li>Using equation <script type="math/tex">(4)</script>, the resulting structure has <script type="math/tex">\alpha+1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert-1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
</ol>
<p>So far I don’t have a good explanation for these results but I hope to discover the reason behind the symmetries of these fractal structures
before the end of next week.</p>
<h2 id="whats-next">What’s next:</h2>
<p>Before investigating Quaternion Mandelbrot sets on GPUs, I would like to take a closer look at the following questions:</p>
<ol>
<li>Numerical stability as a function of <script type="math/tex">\alpha</script> and <script type="math/tex">z_0</script>.</li>
<li>Might there be better stopping criteria besides hard-coded bounds on the modulus of <script type="math/tex">z_n</script> and the maximum number of iterates?</li>
<li>Is the Mandelbrot set computable? (Note: this has been <a href="https://cs.stackexchange.com/questions/42685/in-what-sense-is-the-mandelbrot-set-computable">discussed on the CS stackexchange</a>.)</li>
</ol>
<p>These questions don’t quite fall under the category of intelligent behaviour but who knows? On the one hand, the Universe might just be a set of simple rules which are applied in a recursive manner. On the other hand, fractals provide researchers with an effective(and beautiful) way of benchmarking hardware and software performance.</p>
<p>Either way, the moral of the story is that playing with Mandelbrot sets is always an opportunity to learn something new about computation.</p>
<h1 id="references">References:</h1>
<ol>
<li>Fractal Art Generation using GPUs. Mayfield et al. 2016.</li>
<li>Ray Tracing Quaternion Julia Sets on the GPU. Keenan Crane. 2005.</li>
<li>Non-computable Julia Sets. M. Braverman, M. Yampolsky. 2005.</li>
</ol>
<h1>Normal approximation to uniform distribution (2018-03-13)</h1>
<h2 id="motivation">Motivation:</h2>
<p>Earlier today I was talking to a researcher about how well a normal distribution could approximate a uniform distribution
over an interval <script type="math/tex">[a,b] \subset \mathbb{R}</script>. I gave a few arguments for why I thought a normal distribution wouldn’t be good
but I didn’t have the exact answer off the top of my head, so I decided to find out. Although the following analysis involves
nothing fancy, I consider it useful as it’s easily generalised to higher dimensions (i.e. multivariate uniform distributions)
and we arrive at a result which I wouldn’t consider intuitive.</p>
<p>For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.</p>
<h2 id="statement-of-the-problem">Statement of the problem:</h2>
<p>We would like to minimise the KL-Divergence:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{D}_{KL}(P\|Q) = \int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}\,dx
\end{equation}</script>
<p>where <script type="math/tex">P</script> is the target uniform distribution and <script type="math/tex">Q</script> is the approximating Gaussian:</p>
<script type="math/tex; mode=display">\begin{equation}
p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]}(x) \implies p(x) = 0 \, \text{ for } \, x \notin [a,b]
\end{equation}</script>
<p>and</p>
<script type="math/tex; mode=display">\begin{equation}
q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}
\end{equation}</script>
<p>Now, given that <script type="math/tex">\lim_{x \to 0^+} x\ln(x) = 0</script>, if we assume that <script type="math/tex">(a,b)</script> is fixed our loss may be expressed in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathcal{L}(\mu,\sigma) & = \int_{a}^b p(x) \ln \frac{p(x)}{q(x)}dx \\
& = -\ln(b-a) + \frac{1}{2}\ln(2\pi\sigma^2)+\frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split}
\end{equation} %]]></script>
<h2 id="minimising-with-respect-to-mu-and-sigma">Minimising with respect to <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</h2>
<p>We can easily show that the mean and variance of the Gaussian which minimises <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> correspond to the
mean and variance of a uniform distribution over <script type="math/tex">[a,b]</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = \frac{2\mu}{2\sigma^2} - \frac{(b+a)}{2\sigma^2} = 0 \implies \mu = \frac{a+b}{2}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) = \frac{1}{\sigma}-\frac{\frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2}{\sigma^3} =0 \implies \sigma^2 = \frac{(b-a)^2}{12}
\end{equation}</script>
<p>Although I wouldn’t have guessed this result the careful reader will notice that this result readily generalises to higher dimensions.</p>
<h2 id="analysing-the-loss-with-respect-to-optimal-gaussians">Analysing the loss with respect to optimal Gaussians:</h2>
<p>After entering the optimal values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> into <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> and simplifying the resulting expression we have
the following residual loss:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L}^* = \frac{1}{2}\Big(\ln \big(\frac{\pi}{6}\big)+1\Big) \approx 0.18
\end{equation}</script>
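<p>The interval-independence of this residual is easy to verify numerically (a quick quadrature check I have added; it is not part of the original script): the divergence between <script type="math/tex">\mathcal{U}(a,b)</script> and its best-fitting Gaussian comes out the same for any choice of interval.</p>

```python
import numpy as np

def residual_kl(a, b, n=200_001):
    """KL divergence between U(a, b) and the best-fitting Gaussian
    (mu = (a+b)/2, sigma^2 = (b-a)^2/12), by trapezoidal quadrature."""
    mu = (a + b) / 2.0
    var = (b - a)**2 / 12.0
    x = np.linspace(a, b, n)
    p = 1.0 / (b - a)
    q = np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    f = p * np.log(p / q)
    dx = x[1] - x[0]
    return (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1]) * dx

closed_form = 0.5 * (1 + np.log(np.pi / 6))
print(residual_kl(0, 1), residual_kl(-2, 5), closed_form)
# the values agree: the dependence on b - a cancels
```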
<p>I find this result surprising because I didn’t expect the dependence on <script type="math/tex">\Delta = b-a</script> to vanish. That said, my current intuition for this result
is that if we tried fitting <script type="math/tex">\mathcal{U}(a,b)</script> to <script type="math/tex">\mathcal{N}(\mu,\sigma)</script> we would obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
a = \mu - \sqrt{3}\sigma \\
b = \mu + \sqrt{3}\sigma
\end{cases}
\end{equation}</script>
<p>so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>.</p>
<h2 id="remark">Remark:</h2>
<p>The reader may experiment with <a href="https://gist.github.com/AidanRocke/0a3ff41c8421a974640742d57bee8b71">the following TensorFlow function</a> which outputs
the approximating mean and variance of a Gaussian given a uniform distribution on the interval <script type="math/tex">[a,b]</script>.</p>
<h1>Laplace’s Demon and the Singularity (2018-03-10)</h1>
<h2 id="introduction">Introduction:</h2>
<p>In my first year as a math undergraduate I remember a fast-talking physics
student who stated that if you folded a piece of rectangular paper enough
times its length would cover the distance between the Earth and the Moon.
Disbelieving, I thought for a few minutes and countered that given that paper
is inelastic the maximum length can’t exceed the length of the diagonal.
Nevertheless, that year I discovered that this belief was remarkably popular among
physics students whose chief interest was to imitate Richard Feynman.</p>
<p>A similar belief, known as the technological singularity, possesses the minds of
many AI researchers although its flaws are numerous and would be obvious to an
outsider with a basic understanding of science. If I may sum it up succinctly,
the singularity is the idea that at some point in the future AI engineers will
create a superhuman intelligence which will engineer exponentially smarter
versions of itself, so that humans would no longer have to do science.</p>
<p>In the following article I argue that an irremediable flaw in the singularity notion
is equivalent to the problem of Laplace’s Demon which has been addressed
by statistical physicists in the past.</p>
<h2 id="main-points">Main points:</h2>
<ol>
<li>
<p>Entities whose ‘intelligence’ is monotonically increasing due to recursive self-improvement converge to an all-knowing ‘super-intelligence’:</p>
<p>If I take ‘intelligence’ to mean a proxy measure of an agent’s degree of control over its environment,
aka Empowerment [1], then the above statement is equivalent to a theory of everything we
know and <em>everything we don’t know</em>. You would be literally extrapolating beyond what could
be reasonably justified. From this it follows that the above statement is plausible with
vanishing probability. I shall clarify this statement below.</p>
</li>
<li>
<p>Assuming discrete time steps, learning time is an exponential function of the median planning horizon:</p>
<p>In a stationary environment with sparsely distributed spatio-temporal reward signals, the
median planning horizon will tend to be quite large. Let’s suppose that by large we mean
10 time steps into the future and that the agent has a discrete action space of size 4.
In this case, the agent would need to evaluate <script type="math/tex">\sim 4^{10}</script> action sequences in order to discover an
optimal policy. We haven’t even considered survivor bias: action sequences where
the agent obtains a reward due to chance but can’t reliably reproduce that reward in the future.
Furthermore, the case of a continuous action space isn’t very different
because you can construct an equivalence relation over actions which produce the same outcome or
goal within <script type="math/tex">n</script> time steps.</p>
<p>What this means is that without a good model of the environment, the higher-level abstractions which
would allow the agent to reduce the required learning time are non-existent. My next point reiterates this.</p>
</li>
<li>
<p>Assuming the existence of atomic actions, learning time is an exponential function of the degree of hierarchy/abstraction of the action space:</p>
<p>In 1814, none other than Pierre Simon Laplace argued that we could theoretically model all behaviour with some large
Newtonian many-body system. The exact quote is the following:</p>
<blockquote>
<p>We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes. (Laplace)</p>
</blockquote>
<p>Assuming Laplace is correct, if we make minimal assumptions about the environment, how long would it take for complex
locomotion behaviours to emerge from this system? Closer scrutiny of this question reveals that it’s essentially
equivalent to the previous question. Assuming actions at the microscopic scale (ex. muscle activations), complex locomotion
behaviours require very long planning horizons. Without good prior knowledge (i.e. a model) of the
hierarchical/compositional behaviour of muscle tissues, you might as well be running a simulation for ages
in order to obtain the appropriate samples.</p>
<p>The situation is actually worse than we suppose, as Laplace is incorrect in the case of the observable Universe. Physicists since Boltzmann have demonstrated that information is lost over time, so besides the combinatorial complexity of simulation there’s the issue of uncertainty propagating through the simulation with
every calculation.</p>
</li>
</ol>
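The exponential growth in point 2 is easy to make concrete. A minimal sketch, using the numbers from the text (4 discrete actions, planning horizon of 10 steps):

```python
# The blow-up described in point 2: with |A| discrete actions and a
# planning horizon of h steps, an agent with no model of the environment
# must consider |A|^h distinct action sequences. Illustrative sketch.

def num_action_sequences(num_actions: int, horizon: int) -> int:
    """Count distinct action sequences of length `horizon`."""
    return num_actions ** horizon

# Horizon 10 with 4 actions already gives ~10^6 sequences:
sequences = num_action_sequences(4, 10)

# Doubling the horizon squares the size of the search space:
ratio = num_action_sequences(4, 20) // sequences
```

Doubling the horizon squares the search space, which is why long planning horizons are intractable without higher-level abstractions over actions.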
<h2 id="summary">Summary:</h2>
<p>For the reasons I gave above, I think it’s clear that in complex environments of which nothing is known, any reasonable interpretation of super-intelligence
is essentially equivalent to Laplace’s Demon. It follows that the asymptotic limit of an all-knowing super-intelligence is not only highly improbable;
given our current scientific knowledge, it’s practically impossible.</p>
<p>Singularitarians forget that science is about quantifying what we don’t know via experiments and not attaining ‘human-level’ understanding. Barring omniscient robots, I don’t see a point in the future when human scientists will be out of a job.</p>
<h1 id="references">References:</h1>
<ol>
<li>Salge, Calkin & Polani. Empowerment – an Introduction. 2013.</li>
<li>Laplace. A philosophical essay on probabilities. 1814.</li>
</ol>Aidan RockeIntroduction:What is the role of logic in Mathematics?2017-11-27T00:00:00+00:002017-11-27T00:00:00+00:00/mathematics/2017/11/27/platonic_math<h2 id="introduction">Introduction:</h2>
<p>The orthodox belief among pure mathematicians is that the foundations of mathematics are grounded in a few sacred axioms
and set theory where logic naturally has a central role in its development. However, by means of a simple thought experiment
I show that curiosity, more than logic, is essential for the development of mathematics. Moreover, I argue that
curiosity is firmly grounded in both our sensorimotor experience and the tools we use for doing mathematics.</p>
<p>This leads to a holistic account of the foundations of mathematics which challenges the Platonic notion that
‘pure’ mathematics is discovered and makes the case that the envelope of potential mathematical
discoveries is parametrised by both human morphology and technologies for doing mathematics. Crucially, this ‘Cyborg’ view
of mathematics has important implications for investigations on the foundations of mathematics as well as the manner
mathematics is taught at the university level.</p>
<h2 id="the-role-of-logic-in-mathematics">The role of logic in mathematics:</h2>
<p>While the importance of axiomatics and set theory in structuring mathematics is undeniable, I think we should not lose sight
of what logic actually provides:</p>
<ol>
<li>A system for verifying our discoveries to an axiomatic level of detail.</li>
<li>A method for communicating our mathematical discoveries in a convincing manner.</li>
</ol>
<p>In truth, the second argument has much greater weight than the first since an important consequence of Gödel’s incompleteness
theorems is that logic doesn’t guarantee the permanence of our mathematical discoveries. Furthermore, very few mathematicians
use formal proof assistants like Coq or Isabelle to write their mathematical proofs although proof assistants are practically
essential for verification at an axiomatic level of detail. How can we explain this?</p>
<p>Like all humans, mathematicians pursue rigor only to the extent that its cost justifies the reward. That said, if logical verification
isn’t essential to mathematics what could possibly be the vital force behind its development?</p>
<h2 id="the-importance-of-curiosity">The importance of curiosity:</h2>
<p>While I would grant that logical verification is important for problem solving in mathematics, if mathematics were reducible to
problem solving we would have no more than one mathematical question to answer (ex. 2+2=?) and there wouldn’t have been a field
of mathematics. In other words, there has to be some intrinsic motivation in all mathematicians which drives them not only to
solve problems but also to seek out problems to solve. From this it follows that intrinsic motivation (or curiosity) has a much greater
role than logic in explaining why there are multiple branches of mathematics. In fact, this implies that curiosity, not logic, has to
be the vital force which guides the development of mathematics.</p>
<p>Such a line of reasoning is especially relevant to investigations on the foundations of mathematics as it immediately raises doubts
about the Platonic account of mathematics. It also raises important epistemological questions concerning the nature of curiosity.</p>
<h2 id="the-origin-and-development-of-mathematics">The origin and development of mathematics:</h2>
<p>In [2], Poincaré famously argues that primitive mathematical notions like size, continuity and number have imprecise perceptual origins. A child can learn to tell the difference in size between a big dog and a small dog without having to first learn about the greater than relation. Such perceptual faculties effectively serve as good priors for learning mathematics, a task which would be considerably harder otherwise. In addition, there is a wide range of scientific evidence presented in [1] demonstrating that, besides being the origin of our mathematical knowledge, our sensorimotor experience is an essential guide in our mathematical development. This means that our curiosity is constrained by both our morphology and the tools we use for doing mathematics.</p>
<p>While mathematical reasoning often conforms to mathematical principles, it is typically implemented in a sensorimotor loop which includes a device for data-input (ex. pen/pencil) and material for data-storage (ex. paper). In this context, the authors of [1] advance a Cyborg view of mathematics:</p>
<blockquote>
<p>…the active manipulation of physical notations plays the role of ‘guiding’ the biological machinery through an abstract mathematical problem space-one that may exceed the space of otherwise solveable problems.</p>
</blockquote>
<p>Although many mathematicians might contest this, I wonder whether any mathematician can do advanced mathematics without pen and paper, or a functional substitute. We must also acknowledge the increasingly important role of the computer for doing research-level mathematics.</p>
<p>In addition, we must note a more subtle but equally significant technology: mathematical notation has evolved over time by a process which isn’t arbitrary. While the space of satisfactory mathematical notations might be large, most randomly generated notations are bad for doing mathematics, which is why mathematicians define <a href="https://mathoverflow.net/questions/42929/suggestions-for-good-notation">rules of thumb for good notation</a>. The triumph of Leibniz’s notation over Newton’s notation is a concrete example of this. Moreover, Terence Tao once wrote a full <a href="https://terrytao.wordpress.com/advice-on-writing-papers/use-good-notation/">blog post</a> on this issue, which includes the following quote due to Alfred North Whitehead:</p>
<blockquote>
<p>By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race.</p>
</blockquote>
<p>Yet, this statement flies in the face of Cognitive Science orthodoxy as stated in [1]:</p>
<blockquote>
<p>Cognitive scientists have traditionally viewed this capacity-the capacity for symbolic reasoning-as grounded in the ability to internally represent numbers, logical relationships, and mathematical rules in an abstract, amodal fashion.</p>
</blockquote>
<p>Clearly, this line of reasoning is absurd. If anything, both scientific and empirical evidence strongly indicates that our sensorimotor experience is an essential substrate for mathematical thought and not merely a translational medium. When combined with the importance of curiosity, it follows that we
have to encourage individual experimentation with technologies aiding mathematical activity in order to maximise the collective human potential for
mathematical discovery.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Having laid out these arguments, I think it’s clear that the Cyborg view of mathematics provides more stable foundations for mathematics than the orthodox view which is not only scientifically and empirically baseless, but also diminishes our collective potential for mathematical discovery. In particular, I would like to point out a few key innovations in the Cyborg tradition which have yet to be fully appreciated at the university level.</p>
<p>The first is the use of online blogs for communicating mathematical ideas, as written homework/projects can be very isolating rather than engaging. You generally get very little feedback even if you do get a good mark, which trivialises the activity. Second is the creation of <a href="https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/">Polymath projects</a> for exploring the role of large-scale self-organizing collaboration among students. Finally, I think mathematicians of all levels of ability can benefit from using <a href="http://jupyter.org/">Jupyter notebooks</a> for interactive experimental mathematics, as I have whenever investigating problems in combinatorics or probability.</p>
<p>In my opinion, these innovations indicate yet-unrealised potential. Indeed, I believe that if the majority of mathematicians transition towards a Cyborg perspective of mathematical foundations, we shall witness a much more creative period of mathematics.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>A perceptual account of symbolic reasoning (David Landy, Colin Allen & Carlos Zednik. 2014. Frontiers in Psychology.)</p>
</li>
<li>
<p>La Science et L’Hypothèse (Henri Poincaré. 2014. Champs Sciences.)</p>
</li>
</ol>Aidan RockeIntroduction:The theoretical limitations of DQN2017-08-29T00:00:00+00:002017-08-29T00:00:00+00:00/inference/2017/08/29/dqn<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/dqn.jpg" align="middle" /></center>
<h1 id="introduction">Introduction:</h1>
<p>Less than three years after DeepMind published ‘Playing Atari with Deep Reinforcement Learning’,
the practical impact of this method on the RL literature has been profound, as evidenced by the above graphic. However, the
theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis
actually clarifies the evolution of DQN and highlights which research directions are worth prioritising.</p>
<h1 id="background-on-dqn">Background on DQN:</h1>
<p>The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions <script type="math/tex">a \in \mathcal{A}</script> and states <script type="math/tex">x \in X</script> in a Markov
Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return:</p>
<script type="math/tex; mode=display">\begin{equation}
Q_{\pi}(x,a) = \mathbb{E} [\sum_{t=0}^{\infty} \gamma^t R(x_t,a_t)], \gamma \in (0,1)
\end{equation}</script>
<p>In particular the aim is to approximate a parametrised value function <script type="math/tex">Q(x,a;\theta_t)</script> where estimation is shifted towards the target:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{\mathrm{argmax}}\, Q(S_{t+1},a;\theta_{t});\theta_t)
\end{equation}</script>
<p>and gradient descent updates are done as follows:</p>
<script type="math/tex; mode=display">\begin{equation}
\theta_{t+1} = \theta_t + \alpha(Y_t^Q-Q(S_t,A_t;\theta_t)) \nabla_{\theta} Q(S_t,A_t;\theta_t)
\end{equation}</script>
<p>In addition, epsilon-greedy action selection is used for exploration. To avoid estimates that merely reflect
recent experience, the authors of DQN regularly allow the network to perform experience replay: batch updates
based on less recent experience.</p>
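The target (2) and update (3) can be sketched in a few lines. A minimal sketch for a linear approximator; the feature dimensions and the single hand-made transition are illustrative assumptions, not the convolutional network of the original paper:

```python
import numpy as np

# Sketch of the DQN target and semi-gradient update for a linear value
# function Q(x, a; theta) = theta[a] . phi(x). All sizes and the toy
# transition below are illustrative assumptions.

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
gamma, alpha = 0.99, 0.01
theta = rng.normal(size=(n_actions, n_features))  # one weight row per action

def q_values(phi):
    """Q(x, a; theta) for every action, given features phi = phi(x)."""
    return theta @ phi

# One observed transition (phi_t, a_t, r, phi_next):
phi_t = rng.normal(size=n_features)
phi_next = rng.normal(size=n_features)
a_t, r = 2, 1.0

# Target: bootstrap off the greedy value at the next state.
y = r + gamma * float(q_values(phi_next).max())

# Semi-gradient update: move theta[a_t] towards the target.
td_error = y - float(q_values(phi_t)[a_t])
theta[a_t] += alpha * td_error * phi_t  # grad of linear Q w.r.t. theta[a_t]
```

Note that the same weights select the greedy action and evaluate it, which is exactly the coupling discussed below.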
<p>Given the above description of DQN, we may note the following:</p>
<ol>
<li>Selection and evaluation in DQN is done with respect to the same parameters <script type="math/tex">\theta_t</script>.</li>
<li>Assuming that variance is unavoidable, the <script type="math/tex">\max</script> operator in (2) leads to over-optimistic estimates.</li>
<li>The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.</li>
</ol>
<p>These issues shall be addressed in the sections that follow.</p>
<h1 id="asymptotic-nonsense-or-the-data-inefficiency-of-dqn">Asymptotic nonsense or the data-inefficiency of DQN:</h1>
<p>In the simple case of i.i.d. data <script type="math/tex">X_i</script> if <script type="math/tex">S_n = \sum_{i=1}^{n} X_i</script>, <script type="math/tex">\mathbb{E}[X_i] = \mu</script> and <script type="math/tex">Var[X_i] = \sigma^2</script>, a simple application of Chebyshev’s inequality gives:</p>
<script type="math/tex; mode=display">\begin{equation}
\forall \epsilon > 0, P(|\frac{S_n}{n}-\mu| > \epsilon) \leq \frac{\sigma^2}{n \epsilon^2}
\end{equation}</script>
<p>Essentially, this inequality shows that even in simple scenarios convergence of the sample mean requires a lot of data
and the rate of convergence depends on the variance <script type="math/tex">\sigma^2</script>. Furthermore, we must note that this inequality ignores
the following facts:</p>
<ol>
<li>For fixed <script type="math/tex">(x,a)</script>, the return distribution underlying <script type="math/tex">Q_{\pi}(x,a)</script> is rarely unimodal in practice.</li>
<li>The return underlying <script type="math/tex">Q_{\pi}(x,a)</script> rarely has negligible variance.</li>
<li>Our data is sequential and hardly ever i.i.d.</li>
</ol>
<p>From these points it follows that important estimation errors are unavoidable but as I will show, this isn’t the main
problem.</p>
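The sample-size cost implied by Chebyshev's inequality can be sketched numerically; the variance and tolerance values below are illustrative:

```python
import math

# Chebyshev's inequality turned into a sample-size requirement: to make
# P(|S_n/n - mu| > eps) at most delta, it suffices to have
# n >= sigma^2 / (delta * eps^2). Illustrative numbers only.

def chebyshev_n(sigma2: float, eps: float, delta: float) -> int:
    """Smallest n for which sigma^2 / (n * eps^2) <= delta."""
    return math.ceil(sigma2 / (delta * eps ** 2))

# Halving the tolerance eps quadruples the required data:
n_coarse = chebyshev_n(sigma2=1.0, eps=0.1, delta=0.05)
n_fine = chebyshev_n(sigma2=1.0, eps=0.05, delta=0.05)
```

Even this bound is optimistic for RL, since it assumes i.i.d. data with known, finite variance.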
<h1 id="the-unreasonable-optimism-of-dqn">The unreasonable optimism of DQN:</h1>
<ol>
<li>
<p>Over-optimism with respect to estimation errors:</p>
<p>The authors in [3] highlight that in (2), evaluation of the target <script type="math/tex">Y_t^Q</script> and action selection are done with respect to
the same parameters <script type="math/tex">\theta_t</script>, which makes over-optimistic value estimates more likely because of the <script type="math/tex">\max</script> operator.
This suggests that estimation errors of any kind are more likely to result in overly-optimistic policies.</p>
<p>While this is problematic, the authors of [3] discovered the following elegant solution:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{\mathrm{argmax}}\, Q(S_{t+1},a;\theta_{t});\theta'_{t})
\end{equation}</script>
<p>The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights <script type="math/tex">\theta</script>
and <script type="math/tex">\theta'</script>.</p>
</li>
<li>
<p>Over-optimism with respect to risk regardless of estimation error:</p>
<p>Consider the classic problem in decision theory of having to choose between an envelope <script type="math/tex">A</script> which contains $90.00 and envelope
<script type="math/tex">B</script> which contains $200.00 or $0.00 with equal probability. Although <script type="math/tex">Var[A] \ll Var[B]</script>, our agent’s
ignorance of the bimodality of <script type="math/tex">B</script> would lead it to act in an over-optimistic fashion. Due to the <script type="math/tex">\max</script> operator
it would make a decision solely based on the fact that <script type="math/tex">\mathbb{E}[B] > \mathbb{E}[A]</script>.</p>
<p>The above problem clearly requires a very different perspective.</p>
</li>
</ol>
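The decoupling in Double DQN can be shown in a few lines. A linear Q-function and the toy weight matrices below are illustrative assumptions, not the networks of [3]:

```python
import numpy as np

# Sketch of the selection/evaluation decoupling in Double DQN: the
# online weights theta select the greedy action, a second set of
# weights theta_prime evaluates it. Linear Q(x, a) = theta[a] . phi(x)
# is an illustrative assumption.

def dqn_target(r, phi_next, theta, gamma):
    """Standard DQN target: theta both selects and evaluates."""
    return r + gamma * float((theta @ phi_next).max())

def double_dqn_target(r, phi_next, theta, theta_prime, gamma):
    """Double DQN target: theta selects, theta_prime evaluates."""
    a_star = int(np.argmax(theta @ phi_next))
    return r + gamma * float((theta_prime @ phi_next)[a_star])
```

When `theta` over-estimates an action's value, `theta_prime` is unlikely to repeat the same error, which damps the optimism of the `max` operator.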
<p>Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail, I would recommend that the
reader start with [5], which provides an elegant and scalable solution with what can be thought of as a data-dependent
version of dropout [8]. The consideration of value distributions helps reduce uncertainty and improve inference.</p>
<h1 id="the-latent-value-of-hierarchical-models">The latent value of hierarchical models:</h1>
<p>Perhaps the most important question when considering the evolution of DQN is how these agents will develop rich conceptual abstractions
that allow scientific induction or generalisation. Although one can argue that a DQN learns good statistical representations of
environmental states <script type="math/tex">x</script>, it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive
and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.</p>
<p>In particular, I would like to mention the promising work of Tejas Kulkarni who investigated the use of hierarchical DQN, which has the following architecture:</p>
<ol>
<li>Controller: which learns policies in order to satisfy particular goals</li>
<li>Meta-Controller: which chooses goals</li>
<li>Critic: which evaluates whether a goal has been achieved</li>
</ol>
<p>Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned
over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting
research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a
grammar over policies. I think this might be a necessary component for the emergence of language in machines.</p>
<p>The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of
introspection.</p>
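The three-component loop described above can be sketched structurally. The toy one-dimensional environment, the hard-coded policies and the goal choice below are all illustrative assumptions; in the actual work each component is learned:

```python
# Structural sketch of the h-DQN loop: a meta-controller picks goals, a
# controller picks atomic actions to reach the current goal, and a
# critic decides when the goal is met. Everything here is a toy
# stand-in for learned components.

class Critic:
    def goal_reached(self, state, goal):
        return state == goal

class Controller:
    def act(self, state, goal):
        # toy low-level policy: step towards the goal on a 1-D line
        return 1 if goal > state else -1

class MetaController:
    def choose_goal(self, state):
        return state + 3  # toy intrinsic goal: move 3 steps right

def run_episode(start=0, steps=10):
    meta, ctrl, critic = MetaController(), Controller(), Critic()
    state, goal = start, None
    for _ in range(steps):
        if goal is None or critic.goal_reached(state, goal):
            goal = meta.choose_goal(state)   # high-level policy over goals
        state += ctrl.act(state, goal)       # low-level policy over actions
    return state
```

The point of the sketch is only the division of labour: the meta-controller never sees atomic actions, and the controller never reasons beyond the current goal.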
<h1 id="conclusion">Conclusion:</h1>
<p>I find it remarkable that a method as simple as DQN should inspire so many new approaches. Perhaps it’s not so much the brilliance
of the method but rather its generality which allowed it to adapt and evolve. In particular, I think the coupling
of Distributional RL with Hierarchical Deep RL has a very bright future. Together, this will lead to significant improvements in terms of inference and generalisation.</p>
<p><strong>Note:</strong> The graphic is taken from [9].</p>
<h1 id="references">References:</h1>
<ol>
<li>C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.</li>
<li>V. Mnih, K. Kavukcuoglu, D. Silver et al. Playing Atari with Deep Reinforcement Learning. 2015.</li>
<li>H. van Hasselt, A. Guez and D. Silver. Deep Reinforcement Learning with Double Q-learning. 2015.</li>
<li>Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. 2017.</li>
<li>Ian Osband, Charles Blundell, Alexander Pritzel and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. 2016.</li>
<li>Tejas Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. 2016.</li>
<li>Marc G. Bellemare, Will Dabney and Rémi Munos. A Distributional Perspective on Reinforcement Learning. 2017.</li>
<li>Yarin Gal & Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 2016.</li>
<li>Niels Justesen, Philip Bontrager, Julian Togelius, Sebastian Risi. Deep Learning for Video Game Playing. 2017.</li>
</ol>Aidan RockeEntropy Maximization and intelligent behaviour2017-07-06T00:00:00+00:002017-07-06T00:00:00+00:00/intelligence/2017/07/06/maxent<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/forking_paths.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Sergio Hernandez, a Spanish mathematician, recently shared some <a href="http://entropicai.blogspot.fr/2017/06/solved-atari-games.html">very interesting results</a> on the OpenAI gym environment which are based on a <a href="http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf">relatively unknown paper</a>
published by Dr. Wissner-Gross, a physicist trained at MIT. What is impressive about Wissner’s meta-heuristic is that it is succinctly described by three equations which try to maximize the future freedom of your agent. In this analysis, I summarize the method, present its strengths and weaknesses, and attempt to improve it by making an important modification to one of the equations.</p>
<h2 id="causal-entropic-forces">Causal entropic forces:</h2>
<p>In the following summary of Wissner’s meta-heuristic, it’s assumed that the agent has access to an approximate or exact simulator. A close reading of
the original paper [1] will show that this assumption is actually necessary.</p>
<h3 id="macrostates">Macrostates:</h3>
<p>For any open thermodynamic system, we treat the phase-space paths taken by the system <script type="math/tex">x(t)</script> over the time interval <script type="math/tex">[0,\tau]</script> as microstates
and partition them into macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script> using the equivalence relation [1]:</p>
<script type="math/tex; mode=display">\begin{equation}
x(t) \sim x'(t) \iff x(0) = x'(0)
\end{equation}</script>
<p>As a result, we can identify each macrostate <script type="math/tex">X_i</script> with a unique present system state <script type="math/tex">x(0)</script>. This defines a notion of causality over a time interval.</p>
<h3 id="causal-path-entropy">Causal path entropy:</h3>
<p>We can define the causal path entropy <script type="math/tex">S_c</script> of a macrostate <script type="math/tex">X_i</script> with the associated present system state <script type="math/tex">x(0)</script> as the path integral:</p>
<script type="math/tex; mode=display">\begin{equation}
S_c (X_i, \tau) = -k_B \int_{x(t)} P(x(t)|x(0)) \ln P(x(t)|x(0)) \,D x(t)
\end{equation}</script>
<p>where we have:</p>
<script type="math/tex; mode=display">\begin{equation}
P(x(t)| x(0)) = \int_{x^*(t)} P(x(t),x^*(t) |x(0)) \,D x^*(t)
\end{equation}</script>
<p>In (3) we basically integrate over all possible paths <script type="math/tex">x^*(t)</script> taken by the open system’s environment. In practice, this integral is intractable
and we must resort to approximations and the use of a sampling algorithm like Hamiltonian Monte Carlo [3].</p>
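For intuition, the path entropy (2) can be computed exactly for a toy discrete system. The bounded one-dimensional random walk below (with <script type="math/tex">k_B = 1</script>) is an illustrative assumption, not the continuous path integral of the paper:

```python
import math
from itertools import product

# Exact causal path entropy for a 1-D random walk confined to
# [-wall, wall]: microstates are length-tau paths from the present
# state x0; paths that leave the box are infeasible. Toy model only.

def path_entropy(x0: int, tau: int, wall: int = 3) -> float:
    """Entropy of the distribution over feasible length-tau paths from x0."""
    probs = []
    for steps in product((-1, 1), repeat=tau):
        x, alive = x0, True
        for s in steps:
            x += s
            if abs(x) > wall:      # path leaves the box: infeasible
                alive = False
                break
        if alive:
            probs.append(0.5 ** tau)
    z = sum(probs)                  # renormalise over feasible paths
    return -sum((p / z) * math.log(p / z) for p in probs)

# More feasible futures from the centre than from next to a wall:
centre, edge = path_entropy(0, 4), path_entropy(3, 4)
```

States near the centre have more feasible future paths, hence higher path entropy, which is exactly the quantity the causal entropic force climbs.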
<h3 id="causal-entropic-force">Causal entropic force:</h3>
<p>A path-based causal entropic force <script type="math/tex">F</script> may be expressed as:</p>
<script type="math/tex; mode=display">\begin{equation}
F(X_0, \tau) = T_c \nabla_X S_c (X, \tau) |_{X_0}
\end{equation}</script>
<p>where <script type="math/tex">T_c</script> and <script type="math/tex">\tau</script> are two free parameters. This force basically drives the system toward macrostates <script type="math/tex">X_j</script> that
maximize <script type="math/tex">S_c (X_j, \tau)</script>. In essence, the combination of equations (2), (3) and (4) maximizes the number of future options
of our agent. This isn’t very different from what most people try to do in life, but this meta-heuristic does have very important
limitations.</p>
<h2 id="limitations-of-the-causal-entropic-approach">Limitations of the Causal Entropic approach:</h2>
<ol>
<li>
<p>The Causal Entropic paper makes the implicit assumption that we have access to a reliable simulator of future states. In the
case of the OpenAI environments this isn’t a problem because environment simulators are provided, but in general it’s a hard problem. Two useful approaches to this problem
are suggested by [4] and [5] using recurrent neural networks.</p>
</li>
<li>
<p>Maximizing your number of future options is not always a good idea. Sometimes fewer options are better provided that these are
more useful options. This is why for example, football players don’t always rush to the center of a football pitch, although from
that position they would maximize their number of future states i.e. possible positions on the pitch.</p>
</li>
</ol>
<p>In the next section I would like to show that it’s possible to find a practical solution to the second limitation by modifying
(3).</p>
<h2 id="causal-path-utility">Causal Path Utility:</h2>
<p>Assuming that a recurrent neural network is used to define potential macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script>, it’s reasonable to assume
that our agent’s understanding of the future evolves with time and therefore macrostates are a function of time. So we have <script type="math/tex">\{ X_i(t) \}_{i \in I}</script>
rather than <script type="math/tex">\{ X_i \}_{i \in I}</script>. In other words, our simulator which might be an RNN, will probably change its parameters and
even its topology over time.</p>
<p>In order to resolve the second limitation and encourage the agent to make confident decisions,
I propose that we replace <script type="math/tex">S_c(X, \tau)</script> with <script type="math/tex">U_c(X, \tau)</script> where:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
U_c (X_i, \tau) & = -\int_{x(t)} P(x(t)|x(0)) \ln \big(U(x(t)|x(0)) e^{-Var[U(x(t)\mid x(0))]}\big) \,D x(t) \\
& = \mathbb{E}[-\ln U(x(t)|x(0))]+\mathbb{E}[Var[U(x(t)\mid x(0))]] \geq 0\end{split}
\end{equation} %]]></script>
<p>This not only has the added value of simplifying calculations but also allows us to disentangle the relative contributions of utility and uncertainty.
It must also be noted that the two expressions in (5) can be calculated in parallel although the uncertainty calculation is more computationally
expensive.</p>
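Given sampled future paths, the two expectations in (5) can be estimated independently, which is what makes the parallel computation mentioned above possible. The utilities and per-path variance estimates below are illustrative stand-ins for quantities that would come from a learned model:

```python
import math

# Sketch of the decomposition in (5): the expected negative log-utility
# and the expected uncertainty over sampled future paths are computed
# as two separate averages. Sample values are illustrative only.

def causal_path_utility(utilities, variances):
    """E[-ln U] + E[Var[U]] over equally weighted sampled paths."""
    n = len(utilities)
    neg_log_utility = sum(-math.log(u) for u in utilities) / n
    mean_uncertainty = sum(variances) / n   # computable in parallel
    return neg_log_utility + mean_uncertainty

# Illustrative samples: utilities in (0, 1] with uncertainty estimates.
u_samples = [0.9, 0.5, 0.25]
var_samples = [0.01, 0.04, 0.09]
score = causal_path_utility(u_samples, var_samples)
```

Low-utility or high-uncertainty futures both raise the score, so minimising it favours confident, useful options rather than merely numerous ones.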
<h2 id="discussion">Discussion:</h2>
<p>If we assume that the agent’s perception of the future doesn’t change much, it might perceive some future states to be ideal. This is
consistent with the empirical observation that many people believe certain accomplishments would bring them ‘genuine happiness’. In other
words, if the state space is compact and approximately time-invariant the agent’s optimal future macrostate converges to a fixed point [6].</p>
<p>While the notion of Causal Path Utility just occurred to me today, I believe that this is a very promising approach which I shall follow-up with concrete implementations very soon.</p>
<h1 id="references">References:</h1>
<ol>
<li>
<p>Causal Entropic Forces (A. D. Wissner-Gross & C.E. Freer. 2013. Physical Review Letters.)</p>
</li>
<li>
<p>Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (Yarin Gal & Zoubin Ghahramani. 2016. ICML. )</p>
</li>
<li>
<p>Stochastic Gradient Hamiltonian Monte Carlo ( Tianqi Chen, Emily Fox & Carlos Guestrin. 2014. ICML.)</p>
</li>
<li>
<p>Recurrent Environment Simulators (Silvia Chiappa et al. 2017. ICLR.)</p>
</li>
<li>
<p>On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)</p>
</li>
<li>
<p>Fixed Point Theorems with Applications to Economics and
Game Theory (Border, Kim C. 1985. Cambridge University Press.)</p>
</li>
</ol>Aidan RockeThe weight transport problem2017-06-30T00:00:00+00:002017-06-30T00:00:00+00:00/deep/learning/2017/06/30/weight-transport<h2 id="introduction">Introduction:</h2>
<p>In an excellent paper published less than two years ago, Timothy Lillicrap, a theoretical neuroscientist at DeepMind, found
a simple yet reasonable solution to the weight transport problem. Essentially, Lillicrap and his co-authors showed that it’s
possible to do backpropagation with random feedback weights and still obtain very competitive results on various benchmarks [2]. This
is really significant because it marks an important step towards biologically plausible deep learning.</p>
<h2 id="the-weight-transport-problem">The weight transport problem:</h2>
<p>While backpropagation is a very effective approach for training deep neural networks, at present it’s not at all clear whether
the brain might actually use this method for learning. In fact, backprop has three biologically implausible requirements [1]:</p>
<ol>
<li>feedback weights must be the same as feedforward weights</li>
<li>forward and backward passes require different computations</li>
<li>error gradients must be stored separately from activations</li>
</ol>
<p>A biologically plausible solution to the second and third problems is to use an error propagation network with the same topology
as the feedforward network but used only for backpropagation of error signals. However, there is no known biological mechanism
for this error network to know the weights of the feedforward network. This makes the first requirement, weight symmetry, a
serious obstacle.</p>
<p>This is also known as the weight transport problem [3].</p>
<h2 id="random-synaptic-feedback">Random synaptic feedback:</h2>
<p>The solution proposed by Lillicrap et al. is based on two good observations:</p>
<ol>
<li>
<p>Any fixed random matrix <script type="math/tex">B</script> may serve as a substitute
for the original matrix <script type="math/tex">W</script> in backpropagation provided that on average we have:</p>
<script type="math/tex; mode=display">\begin{equation}
e^\top WB e > 0
\end{equation}</script>
<p>where <script type="math/tex">e</script> is the error in the network’s output. Geometrically, this is equivalent to requiring that <script type="math/tex">e^\top W</script> and <script type="math/tex">Be</script> are within
<script type="math/tex">90^{\circ}</script> of each other.</p>
</li>
<li>
<p>Over time, the modified update rules drive <script type="math/tex">W</script> into better alignment with <script type="math/tex">B^\top</script>, which means that the condition in the first observation becomes
easier to satisfy as training proceeds.</p>
</li>
</ol>
<h2 id="a-simple-example">A simple example:</h2>
<p>Let’s consider a simple three layer linear neural network that is intended to approximate a linear mapping:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
h = W_0 x \\
y = W h \\
e = Tx -y
\end{cases}
\end{equation}</script>
<p>The loss is given by:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L} = \frac{1}{2} e^\top e
\end{equation}</script>
<p>From this we may derive the following backpropagation update equations (note the minus sign: the weights move against the gradient of the loss):</p>
<script type="math/tex; mode=display">\begin{equation}
\Delta W \propto -\frac{\partial \mathcal{L}}{\partial W} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial W} = -e \cdot (-1) \cdot h = e h^\top
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\Delta W_0 \propto -\frac{\partial \mathcal{L}}{\partial W_0} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial h} \frac{\partial h}{\partial W_0} = -e \cdot (-1) \cdot W \cdot x = W^\top e x^\top
\end{equation}</script>
<p>Now the random synaptic feedback innovation is essentially to replace the update for <script type="math/tex">W_0</script> in equation <script type="math/tex">(5)</script> with:</p>
<script type="math/tex; mode=display">\begin{equation} \Delta W_0 \propto B e x^\top
\end{equation}</script>
<p>where <script type="math/tex">B</script> is a fixed random matrix. As a result, we no longer need explicit knowledge of the original weights in our update equations.
I actually implemented this method for a three-layer sigmoid (i.e. nonlinear) neural network and obtained <a href="https://github.com/pauli-space/weight_symmetry/blob/master/experiments/random_synaptic_feedback/three_layer.py">89.5% accuracy on the MNIST dataset
after 10 iterations</a>, a result
that is competitive with backpropagation.</p>
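<p>To make this concrete, here is a minimal NumPy sketch of random synaptic feedback on the toy linear network above. This is my own illustrative code, not the linked experiment; the dimensions, weight scalings and learning rate are arbitrary choices.</p>

```python
import numpy as np

def feedback_alignment_demo(n_in=20, n_hidden=15, n_out=10,
                            n_samples=200, lr=1e-2, n_epochs=500, seed=0):
    """Train the linear network h = W_0 x, y = W h to match targets Tx,
    using a fixed random matrix B in place of W^T in the W_0 update."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_in, n_samples))
    T = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)       # target mapping
    Y = T @ X

    W0 = rng.normal(size=(n_hidden, n_in)) * 0.1
    W = rng.normal(size=(n_out, n_hidden)) * 0.1
    B = rng.normal(size=(n_hidden, n_out)) / np.sqrt(n_out)  # fixed feedback

    losses = []
    for _ in range(n_epochs):
        H = W0 @ X
        E = Y - W @ H                       # error e = Tx - y, per sample
        losses.append(0.5 * np.mean(np.sum(E ** 2, axis=0)))
        # The W update is the usual gradient step (e h^T); the W_0 update
        # substitutes the fixed random B for W^T, as in equation (6).
        W += lr * (E @ H.T) / n_samples
        W0 += lr * (B @ E @ X.T) / n_samples
    return losses
```

<p>On this toy problem the loss falls even though the update for <script type="math/tex">W_0</script> never sees <script type="math/tex">W</script>, which is the essence of the result in [2].</p>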
<h2 id="discussion">Discussion:</h2>
<p>In spite of its remarkable simplicity, Timothy Lillicrap’s solution to the weight transport problem is very effective and so I think it
deserves further investigation. In the near future I plan to implement random synaptic feedback for much larger sigmoid and ReLU networks
as well as recurrent neural networks in order to build upon the work of [1].</p>
<p>Considering all the approaches to biologically plausible deep learning attempted so far, I believe this work represents a very important step forward.</p>
<h2 id="references">References:</h2>
<ol>
<li>How Important Is Weight Symmetry in Backpropagation? (Q. Liao, J. Z. Leibo & T. A. Poggio. 2016. AAAI.)</li>
<li>Random synaptic feedback weights support error backpropagation for deep learning (T. P. Lillicrap et al. 2016. Nature Communications.)</li>
<li>Competitive learning: From interactive activation to adaptive resonance (S. Grossberg. 1987. Cognitive Science 11(1):23–63.)</li>
</ol>
<p>Aidan Rocke</p>
<h2 id="deep-rectifier-networks-preliminary-observations">deep rectifier networks: preliminary observations (2017-06-21)</h2>
<p>Approximately one week ago, I defined a <a href="http://paulispace.com/deep/learning/2017/06/15/experiment_1.html">set of experiments</a> in order to model the effects of dropout and unsupervised pre-training on deep rectifier networks. However, prior to running through the experiments I realised that this was an opportunity to develop my own personal research workflow. After more reflection I decided to follow this particular process:</p>
<ol>
<li>Define experiments: including methodology, experimental setup and working hypotheses</li>
<li>Share preliminary observations: in order for readers to understand where scientific intuitions come from and overcome writer’s block</li>
<li>Experimental analysis: detailed statistical analysis of experimental results including hypothesis testing</li>
<li>Theoretical analysis: theoretical analysis of experimental results</li>
<li>Further discussion: discuss phenomena that are worth investigating further</li>
</ol>
<p>The present blog post aims to go through a part of stage 2. In particular, today I aim to share interesting observations concerning vanilla
three-layer rectifier networks with 500 nodes per layer trained on the MNIST dataset without dropout or unsupervised pre-training.</p>
<h2 id="visualizing-binary--in-activation-space">Visualizing binary representations in activation space:</h2>
<div class="image-wrapper">
<img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/activation_space.png" alt="" />
<p class="image-caption">two dimensional embedding of binary activations</p>
</div>
<p>Above we have a two-dimensional linear embedding of binary representations, obtained by applying PCA to the concatenated output of the hidden layers after applying a binary mask to the output of each layer. This method is inspired by [5], where the authors used a similar approach to study local competition among subnetworks within deep rectifier networks. Although I didn't manage to get clusters that are as well-separated as
those of R. Srivastava, we have clear evidence of emergent organisation among subnetworks within deep rectifier networks.</p>
<p>In particular, we may note that 1 is very near 2, 7 is near 9, and 0 blends with 4. A Canadian AI researcher might argue that 0 is entangled with 4 [6]. However, the explained variance under PCA(n=2) was only around 40%, which means that a lot of information was lost in going from 1500 dimensions to 2. This suggests that we need a more reliable method for analysing variable disentangling.</p>
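<p>For readers who want to reproduce this kind of embedding, here is a sketch of the procedure, assuming the hidden-layer activations are available as NumPy arrays. The function name and interface are my own; this is not the code from [5].</p>

```python
import numpy as np

def embed_binary_activations(hidden_layers):
    """2-D PCA embedding of concatenated binary activation masks.

    hidden_layers: list of (n_samples, n_units) activation arrays,
    one per hidden layer of the rectifier network."""
    # A rectifier unit is 'on' for an example iff its activation is positive.
    masks = np.concatenate([(h > 0).astype(float) for h in hidden_layers],
                           axis=1)
    centred = masks - masks.mean(axis=0)       # centre before PCA
    # PCA via SVD: the top two right singular vectors span the plane.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    coords = centred @ vt[:2].T                # (n_samples, 2) embedding
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    return coords, explained
```

<p>The second return value is the fraction of variance explained by the two components, which is how the 40% figure above can be checked.</p>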
<h2 id="variable-disentangling">Variable disentangling:</h2>
<h3 id="the-average-euclidean-distance-between-representations-per-class">The average Euclidean distance between representations per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/average_euclidean_distance.png" align="middle" /></center>
<p>What the above heatmap shows is the average Euclidean distance between binary representations for each pair of class labels, which gives an indication
of how similar the representations recruited for different classes are. In particular, we note
that 7 appears to be quite close to 9 but 0 doesn't appear to be particularly close to 4. This is why I always use low-dimensional visualizations
with caution.</p>
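<p>The heatmap itself is straightforward to compute; a sketch, assuming <code>masks</code> is an array of binary codes and <code>labels</code> gives the class of each example (both names are mine):</p>

```python
import numpy as np

def average_distance_heatmap(masks, labels, n_classes=10):
    """Average Euclidean distance between binary codes for each class pair."""
    D = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            a = masks[labels == i]
            b = masks[labels == j]
            # mean distance over all cross-pairs of examples
            diffs = a[:, None, :] - b[None, :, :]
            D[i, j] = np.sqrt((diffs ** 2).sum(axis=-1)).mean()
    return D
```
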
<p>I also tried a different approach for analysing variable disentangling which gave very interesting and unexpected results.</p>
<h3 id="fraction-of-nodes-shared-per-class">Fraction of nodes shared per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/nodes_shared.png" align="middle" /></center>
<p>The above heatmap shows that the fraction of nodes shared between each pair of classes is always above 90%, which is quite surprising: subnetworks
tasked with predicting different things share at least 90% of their nodes. In other words, there is
a core representation that is frequently reused, with small variations between each example, and these small variations are very important. In some sense the deep rectifier network is very efficient at sharing resources, and I believe this relates well to the notion of local competition described by R. Srivastava in [5]. I also think it merits further study.</p>
<p>Prior to studying the fraction of shared nodes between subnetworks, I imagined that the relative sparsity of activity in deep rectifier networks made
the above observation quite improbable. In fact, the mean activations per hidden layer are something I looked into as well.</p>
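<p>One way to compute this statistic is sketched below. The majority-vote definition of a class's node set is an assumption on my part, chosen for illustration:</p>

```python
import numpy as np

def shared_node_fractions(masks, labels, n_classes=10):
    """Fraction of nodes shared between the subnetworks for each class pair.

    A node is assigned to a class if it is active for more than half of
    that class's examples (an illustrative definition)."""
    node_sets = np.stack([masks[labels == c].mean(axis=0) > 0.5
                          for c in range(n_classes)])
    F = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            inter = np.logical_and(node_sets[i], node_sets[j]).sum()
            union = np.logical_or(node_sets[i], node_sets[j]).sum()
            F[i, j] = inter / union if union else 1.0
    return F
```
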
<h2 id="mean-activity-per-hidden-layer-per-epoch">Mean activity per hidden layer per epoch:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/mean_activations.png" align="middle" /></center>
<p><br /></p>
<p>The above set of histograms shows the mean activations for each of the three hidden layers over each of the five epochs. What I find interesting is that we observe:</p>
<ol>
<li>Convergence in distribution, which was quantified using the conditional entropy.</li>
<li>The mean activation for the first hidden layer has a mode around 0.5, whereas the mean activations
for the second and third hidden layers have a mode around 0.7.</li>
<li>This indicates that on average (0.5+0.7+0.7)/3 ≈ 63% of the nodes are active at any given time. Based on what
I've read in [1], I would expect this fraction to decrease if we fix the width while increasing the depth of
the network, but it appears that we don't yet have a good mathematical model to predict the number of active
nodes given a dataset with a particular sample complexity.</li>
</ol>
<p>Now, although it wasn’t suggested in any of the papers I’ve read so far I figured that I could probably use
the mean activations per hidden layer to study variable-size representation as well as sparsity. My reasoning
was that if a particular class required subnetworks with more nodes than another class on average then this
would probably capture the notion of variable-size representation as described in [6]:</p>
<blockquote>
<p>Varying the number of active neurons allows a model to control the effective dimensionality of the representation
for a given input and the required precision. - X. Glorot, A. Bordes & Y. Bengio</p>
</blockquote>
<h2 id="variable-size-representation">Variable-size representation:</h2>
<div align="center">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>class</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<th>variable-size rank</th>
<td>1</td>
<td>9</td>
<td>3</td>
<td>7</td>
<td>6</td>
<td>5</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>10</td>
</tr>
</tbody>
</table>
</div>
<p><br /></p>
<p>This table shows how the neural network represents the relative dimensionality of each class. I obtained it by calculating the
average number of nodes used to predict each class and then ranking these values by size.</p>
<p>An interesting and essential follow-up question is whether this relative order is preserved when we train a rectifier network with the same architecture on a subset of the 10 original classes. If we did pair-wise experiments, for example, we would have to run 45 of them. If the relative order is difficult
to reproduce then we have a problem with the notion of variable size. Right now I am not sure whether there's a simple theory that explains how a neural network controls variable size. The only way to find out is to do the experiments.</p>
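<p>The ranking in the table can be computed directly from the binary masks; a sketch, with hypothetical variable names of my own:</p>

```python
import numpy as np

def variable_size_ranks(masks, labels, n_classes=10):
    """Rank classes by the average number of active nodes used to
    represent them (1 = smallest effective dimensionality)."""
    avg_active = np.array([masks[labels == c].sum(axis=1).mean()
                           for c in range(n_classes)])
    # argsort of argsort converts values into 0-based ranks; shift to 1-based
    return avg_active.argsort().argsort() + 1
```
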
<h2 id="the-example-ordering-problem">The example ordering problem:</h2>
<p>Finally, I also took a look at the example ordering problem, a limitation of gradient descent for training neural networks that was noted by [3].
As they observed, the relative contribution of each epoch to the model that emerges within a neural network isn't representative of the information available
per epoch. In fact, the weights change far more during the earlier epochs than during the later ones:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/weight_norms.png" align="middle" /></center>
<p>The above plot shows that the change in the weight norm is much larger during the earlier epochs than during the later epochs. This is consistent with the model of gradient descent obtained in [4], which emphasises an approximate isomorphism between gradient descent and high-dimensional
damped oscillators, but it isn't good news: it means that gradient descent is not a data-efficient method for learning signals from
data.</p>
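<p>The plot was produced from snapshots of the weights taken at the end of each epoch; the computation itself is just the following (a hypothetical helper, not part of my linked repository):</p>

```python
import numpy as np

def weight_change_norms(snapshots):
    """Frobenius norm of the weight change between consecutive epochs.

    snapshots: list of weight matrices, one per epoch."""
    return [float(np.linalg.norm(b - a))
            for a, b in zip(snapshots, snapshots[1:])]
```
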
<p>I think that the only way to avoid this is to add temporal memory to networks so they may perform inference forwards and backwards in time. If I analysed
this problem further I would probably rediscover one of the many recurrent neural network architectures or I might discover my own architecture. Very often it’s useful to approach problems as if they haven’t been investigated before as that’s the only way to become a good theoretician.</p>
<h2 id="discussion">Discussion:</h2>
<p>This marks the end of my first observational study, and I think you'll agree that it has highlighted many questions that merit further
investigation. So what next? I plan to do more detailed observational studies on the following questions, in order:</p>
<ol>
<li>Sparsity of representations as we increase the depth of a rectifier network while keeping the width constant</li>
<li>Stability of relative variable size for randomly chosen subclasses</li>
<li>Solutions to the example ordering problem</li>
</ol>
<p>I will continue to use the MNIST dataset but I will try to find models that take into account sample complexity for each of the above questions
so these models should generalise well. Once I’ve gone through these semi-formal observational studies which are useful for developing intuitions
I’ll proceed with the experiments I’ve defined earlier.</p>
<p><strong>Note:</strong> If you’d like to repeat this analysis, the code I used to perform this analysis is available <a href="https://github.com/pauli-space/deep_rectifiers">here</a>
but I would wait until the weekend because I’m going to make important changes. It’s a bit of a mess at the moment.</p>
<h2 id="references">References:</h2>
<ol>
<li>Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)</li>
<li>Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)</li>
<li>Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)</li>
<li>The Physical Systems behind Optimization (L. Yang et al. 2017.)</li>
<li>Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)</li>
<li>Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)</li>
</ol>