Jekyll2018-03-16T09:52:24+00:00/Pauli Space
Investigations on the foundations for intelligence.
Normal approximation to uniform distribution2018-03-13T00:00:00+00:002018-03-13T00:00:00+00:00/statistics/2018/03/13/normal_approximation<h2 id="motivation">Motivation:</h2>
<p>Earlier today I was talking to a researcher about how well a normal distribution could approximate a uniform distribution
over an interval <script type="math/tex">[a,b] \subset \mathbb{R}</script>. I gave a few arguments for why I thought a normal distribution wouldn’t be a good approximation,
but I didn’t have the exact answer off the top of my head, so I decided to find out. Although the following analysis involves
nothing fancy, I consider it useful as it’s easily generalised to higher dimensions (i.e. multivariate uniform distributions)
and we arrive at a result which I wouldn’t consider intuitive.</p>
<p>For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.</p>
<h2 id="statement-of-the-problem">Statement of the problem:</h2>
<p>We would like to minimise the KL-Divergence or, equivalently, maximise its negative:</p>
<script type="math/tex; mode=display">\begin{equation}
-\mathcal{D}_{KL}(P|Q) = -\int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}dx
\end{equation}</script>
<p>where <script type="math/tex">P</script> is the target uniform distribution and <script type="math/tex">Q</script> is the approximating Gaussian:</p>
<script type="math/tex; mode=display">\begin{equation}
p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]}(x) \implies p(x) = 0 \quad \forall x \notin [a,b]
\end{equation}</script>
<p>and</p>
<script type="math/tex; mode=display">\begin{equation}
q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}
\end{equation}</script>
<p>Now, given that <script type="math/tex">\lim_{x \to 0} x\ln(x) = 0</script>, the region where <script type="math/tex">p(x) = 0</script> contributes nothing to the integral, so if we assume that <script type="math/tex">(a,b)</script> is fixed our objective may be expressed in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathcal{L}(\mu,\sigma) & = -\int_{a}^b p(x) \ln \frac{p(x)}{q(x)}dx \\
& = \ln(b-a) - \frac{1}{2}\ln(2\pi\sigma^2)-\frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split}
\end{equation} %]]></script>
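<p>For readers who want the intermediate step, the quadratic term above is the second moment of the uniform density about <script type="math/tex">\mu</script>. Under the definitions already given, it can be written out as:</p>
<script type="math/tex; mode=display">
\int_a^b \frac{(x-\mu)^2}{b-a}dx = \frac{1}{b-a}\Big[\frac{x^3}{3}-\mu x^2+\mu^2 x\Big]_a^b = \frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{b-a}
</script>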
<h2 id="minimising-with-respect-to-mu-and-sigma">Minimising with respect to <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</h2>
<p>We can easily show that the mean and variance of the Gaussian which maximises <script type="math/tex">\mathcal{L}(\mu,\sigma)</script>, and therefore minimises the KL-Divergence, correspond to the
mean and variance of a uniform distribution over <script type="math/tex">[a,b]</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = \frac{(b+a)}{2\sigma^2} - \frac{2\mu}{2\sigma^2}= 0 \implies \mu = \frac{a+b}{2}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) = -\frac{1}{\sigma}+\frac{\frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2}{\sigma^3} =0 \implies \sigma^2 = \frac{(b-a)^2}{12}
\end{equation}</script>
<p>Although I wouldn’t have guessed this result, the careful reader will notice that it readily generalises to higher dimensions.</p>
<h2 id="analysing-the-loss-with-respect-to-optimal-gaussians">Analysing the loss with respect to optimal Gaussians:</h2>
<p>After entering the optimal values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> into <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> and simplifying the resulting expression we have
the following residual loss:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L}^* = -\frac{1}{2}\Big(\ln \big(\frac{\pi}{6}\big)+1\Big) \approx -0.18
\end{equation}</script>
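<p>A quick numerical check of this value can be sketched in plain Python. The function names <code>objective</code> and <code>optimal_gaussian</code> are hypothetical helpers introduced here for illustration (they are not taken from the accompanying TensorFlow script), and the closed form is the expression for <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> as written earlier in this post:</p>

```python
import math


def objective(mu, sigma, a, b):
    """Closed-form L(mu, sigma) for a uniform target on [a, b]."""
    # Second moment of U(a, b) about mu.
    second_moment = ((b**3 - a**3) / 3 - mu * (b**2 - a**2) + mu**2 * (b - a)) / (b - a)
    return math.log(b - a) - 0.5 * math.log(2 * math.pi * sigma**2) - second_moment / (2 * sigma**2)


def optimal_gaussian(a, b):
    """Mean and standard deviation of U(a, b)."""
    return (a + b) / 2, (b - a) / math.sqrt(12)


# Residual value claimed in the text; it should not depend on the interval.
residual = -0.5 * (math.log(math.pi / 6) + 1)

for a, b in [(0.0, 1.0), (-3.0, 5.0), (10.0, 10.5)]:
    mu, sigma = optimal_gaussian(a, b)
    print(a, b, objective(mu, sigma, a, b), residual)
```

<p>Each interval should give the same value up to floating-point error, which is the vanishing dependence on <script type="math/tex">\Delta = b-a</script> discussed here.</p>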
<p>I find this result surprising because I didn’t expect the dependence on <script type="math/tex">\Delta = b-a</script> to vanish. That said, my current intuition for this result
is that if we tried fitting <script type="math/tex">\mathcal{U}(a,b)</script> to <script type="math/tex">\mathcal{N}(\mu,\sigma)</script> we would obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
a = \mu - \sqrt{3}\sigma \\
b = \mu + \sqrt{3}\sigma
\end{cases}
\end{equation}</script>
<p>so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>.</p>
<h2 id="remark">Remark:</h2>
<p>The reader may experiment with <a href="https://gist.github.com/AidanRocke/0a3ff41c8421a974640742d57bee8b71">the following TensorFlow function</a> which outputs
the approximating mean and variance of a Gaussian given a uniform distribution on the interval <script type="math/tex">[a,b]</script>.</p>Aidan RockeMotivation:Laplace’s Demon and the Singularity2018-03-10T00:00:00+00:002018-03-10T00:00:00+00:00/myths/2018/03/10/singularity_myth<h2 id="introduction">Introduction:</h2>
<p>In my first year as a math undergraduate I remember a fast-talking physics
student who stated that if you folded a piece of rectangular paper enough
times its length would cover the distance between the Earth and the Moon.
Disbelieving, I thought for a few minutes and countered that given that paper
is inelastic the maximum length can’t exceed the length of the diagonal.
Nevertheless, that year I discovered that this belief was remarkably popular among
physics students whose chief interest was to imitate Richard Feynman.</p>
<p>A similar belief, known as the technological singularity, possesses the minds of
many AI researchers although its flaws are numerous and would be obvious to an
outsider with a basic understanding of science. If I may sum it up succinctly,
the singularity is the idea that at some point in the future AI engineers will
create a superhuman intelligence which will engineer exponentially smarter
versions of itself so that humans will no longer have to do science.</p>
<p>In the following article I argue that an irremediable flaw in the singularity notion
is equivalent to the problem of Laplace’s Demon which has been addressed
by statistical physicists in the past.</p>
<h2 id="main-points">Main points:</h2>
<ol>
<li>
<p>Entities whose ‘intelligence’ is monotonically increasing due to recursive self-evaluation converge to an all-knowing ‘super-intelligence’:</p>
<p>If I take ‘intelligence’ to mean a proxy measure of an agent’s degree of control over its environment,
aka Empowerment [1], then the above statement is equivalent to a theory of everything we
know and <em>everything we don’t know</em>. You would be literally extrapolating beyond what could
be reasonably justified. From this it follows that the above statement is plausible with
vanishing probability. I shall clarify this statement below.</p>
</li>
<li>
<p>Assuming discrete time steps, learning time is an exponential function of the median planning horizon:</p>
<p>In a stationary environment which has sparsely distributed spatio-temporal reward signals the
median planning horizon will tend to be quite large. Let’s suppose that by large we mean
10 time steps into the future and that the agent has a discrete action space of size 4.
In this case, the agent would need to consider <script type="math/tex">\sim 4^{10} \approx 10^6</script> action sequences in order to discover an
optimal policy. We haven’t even considered the case of survivorship bias: adversarial action sequences where
the agent regularly obtains a reward due to chance but can’t reliably obtain this reward in the future
by repeating a sequence of actions. Furthermore, the case of a continuous action space isn’t very different
because you can construct an equivalence relation over actions which produce the same outcome or
goal within <script type="math/tex">n</script> time steps.</p>
<p>What this means is that without a good model of the environment, the higher-level abstractions which
would allow the agent to reduce the learning time required are non-existent. My next point reiterates this.</p>
</li>
<li>
<p>Assuming the existence of atomic actions, learning time is an exponential function of the degree of hierarchy/abstraction of the action space:</p>
<p>In 1814, none other than Pierre-Simon Laplace argued that we could theoretically model all behaviour with some large
Newtonian many-body system. The exact quote is the following:</p>
<blockquote>
<p>We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.-Laplace</p>
</blockquote>
<p>Assuming Laplace is correct, if we make minimal assumptions about the environment, how long would it take for complex
locomotion behaviours to emerge from this system? Closer scrutiny of this question would reveal that it’s essentially
equivalent to the previous question. Assuming actions at the microscopic scale (e.g. muscle activations), complex locomotion
behaviours are equivalent to very long planning horizons. Without good prior knowledge (i.e. a model) of the
hierarchical/compositional behaviour of muscle tissues you might as well be running a simulation for ages
in order to obtain the appropriate samples.</p>
<p>The situation is actually worse than we suppose as Laplace is incorrect in the case of the observable Universe. Physicists since Boltzmann have demonstrated that information is lost over time so besides the combinatorial complexity of simulation there’s the issue of uncertainty propagation across the simulation with
every calculation.</p>
</li>
</ol>
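<p>The counting argument in the second point can be sketched in a few lines of Python. The helper name <code>num_action_sequences</code> is hypothetical, and the sketch assumes a brute-force search in which every action is available at every time step:</p>

```python
def num_action_sequences(num_actions, horizon):
    """Number of distinct action sequences of length `horizon`."""
    return num_actions ** horizon


# With 4 actions the count grows exponentially with the planning horizon.
for horizon in (1, 5, 10, 20):
    print(horizon, num_action_sequences(4, horizon))
```

<p>Doubling the horizon from 10 to 20 steps squares the number of sequences, which is why a model that prunes or abstracts over sequences matters so much.</p>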
<h2 id="summary">Summary:</h2>
<p>For the reasons I gave above I think it’s clear that in complex environments of which nothing is known, any reasonable interpretation of super-intelligence (e.g. maximal empowerment) is essentially equivalent to Laplace’s Demon. It follows that the asymptotic limit of an all-knowing super-intelligence is not only highly improbable;
given our current scientific knowledge, it’s practically impossible.</p>
<p>Singularitarians forget that science is about quantifying what we don’t know via experiments and not attaining ‘human-level’ understanding. Barring omniscient robots, I don’t see a point in the future when human scientists will be out of a job.</p>
<h1 id="references">References:</h1>
<ol>
<li>Salge, Glackin & Polani. Empowerment – an Introduction. 2013.</li>
<li>Laplace. A philosophical essay on probabilities. 1814.</li>
</ol>Aidan RockeIntroduction:What is the role of logic in Mathematics?2017-11-27T00:00:00+00:002017-11-27T00:00:00+00:00/mathematics/2017/11/27/platonic_math<h2 id="introduction">Introduction:</h2>
<p>The orthodox belief among pure mathematicians is that the foundations of mathematics are grounded in a few sacred axioms
and set theory where logic naturally has a central role in its development. However, by means of a simple thought experiment
I show that curiosity, more than logic, is essential for the development of mathematics. Moreover, I argue that
curiosity is firmly grounded in both our sensorimotor experience and the tools we use for doing mathematics.</p>
<p>This leads to a holistic account of the foundations of mathematics which challenges the Platonic notion that
‘pure’ mathematics is discovered and makes the case that the envelope of potential mathematical
discoveries is parametrised by both human morphology and technologies for doing mathematics. Crucially, this ‘Cyborg’ view
of mathematics has important implications for investigations on the foundations of mathematics as well as the manner
in which mathematics is taught at the university level.</p>
<h2 id="the-role-of-logic-in-mathematics">The role of logic in mathematics:</h2>
<p>While the importance of axiomatics and set theory in structuring mathematics is undeniable, I think we should not lose sight
of what logic actually provides:</p>
<ol>
<li>A system for verifying our discoveries to an axiomatic level of detail.</li>
<li>A method for communicating our mathematical discoveries in a convincing manner.</li>
</ol>
<p>In truth, the second argument has much greater weight than the first since an important consequence of Gödel’s incompleteness
theorems is that logic doesn’t guarantee the permanence of our mathematical discoveries. Furthermore, very few mathematicians
use formal proof assistants like Coq or Isabelle to write their mathematical proofs although proof assistants are practically
essential for verification at an axiomatic level of detail. How can we explain this?</p>
<p>Like all humans, mathematicians pursue rigor only to the extent that the reward justifies its cost. That said, if logical verification
isn’t essential to mathematics, what could possibly be the vital force behind its development?</p>
<h2 id="the-importance-of-curiosity">The importance of curiosity:</h2>
<p>While I would grant that logical verification is important for problem solving in mathematics, if mathematics were reducible to
problem solving we would have no more than one mathematical question to answer (e.g. 2+2=?) and there wouldn’t have been a field
of mathematics. In other words, there has to be some intrinsic motivation in all mathematicians which drives them to not only
solve problems but also seek out problems to solve. From this it follows that intrinsic motivation (or curiosity) has a much greater
role than logic in explaining why there are multiple branches of mathematics. In fact, this implies that curiosity, not logic, has to
be the vital force which guides its development.</p>
<p>Such a line of reasoning is especially relevant to investigations on the foundations of mathematics as it immediately raises doubts
about the Platonic account of mathematics. This however raises important epistemological questions concerning the nature of curiosity.</p>
<h2 id="the-origin-and-development-of-mathematics">The origin and development of mathematics:</h2>
<p>In [2], Poincaré famously argues that primitive mathematical notions like size, continuity and number have imprecise perceptual origins. A child can learn to tell the difference in size between a big dog and a small dog without having to first learn about the greater-than relation. Such perceptual faculties effectively serve as good priors for learning mathematics, a task which would be considerably harder otherwise. In addition, there is a wide range of scientific evidence presented in [1] demonstrating that, besides being the origin of our mathematical knowledge, our sensorimotor experience is an essential guide in our mathematical development. This means that our curiosity is constrained by both our morphology and the tools we use for doing mathematics.</p>
<p>While mathematical reasoning often conforms to mathematical principles, it is typically implemented in a sensorimotor loop which includes a device for data-input (e.g. pen/pencil) and material for data-storage (e.g. paper). In this context, the authors of [1] advance a Cyborg view of mathematics:</p>
<blockquote>
<p>…the active manipulation of physical notations plays the role of ‘guiding’ the biological machinery through an abstract mathematical problem space-one that may exceed the space of otherwise solveable problems.</p>
</blockquote>
<p>Although many mathematicians might contest this, I wonder whether any mathematician can do advanced mathematics without pen and paper, or a functional substitute. We must also acknowledge the increasingly important role of the computer for doing research-level mathematics.</p>
<p>In addition, we must note a more subtle but equally significant technology: mathematical notation has evolved over time by a process which isn’t arbitrary. While the space of satisfactory mathematical notations might be large, most randomly generated notations are bad for doing mathematics, which is why mathematicians define <a href="https://mathoverflow.net/questions/42929/suggestions-for-good-notation">rules of thumb for good notation</a>. The triumph of Leibniz’s notation over Newton’s notation is a concrete example of this. Moreover, Terence Tao once wrote a full <a href="https://terrytao.wordpress.com/advice-on-writing-papers/use-good-notation/">blog post</a> on this issue which includes the following quote due to Alfred North Whitehead:</p>
<blockquote>
<p>By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race.</p>
</blockquote>
<p>Yet, this statement flies in the face of Cognitive Science orthodoxy as stated in [1]:</p>
<blockquote>
<p>Cognitive scientists have traditionally viewed this capacity-the capacity for symbolic reasoning-as grounded in the ability to internally represent numbers, logical relationships, and mathematical rules in an abstract, amodal fashion.</p>
</blockquote>
<p>Clearly, this line of reasoning is absurd. If anything both scientific and empirical evidence strongly indicates that our sensorimotor experience is an essential substrate for mathematical thought and not merely a translational medium. When combined with the importance of curiosity it follows that we
have to encourage individual experimentation with technologies aiding mathematical activity in order to maximise the collective human potential for
mathematical discovery.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Having laid out these arguments, I think it’s clear that the Cyborg view of mathematics provides more stable foundations for mathematics than the orthodox view which is not only scientifically and empirically baseless, but also diminishes our collective potential for mathematical discovery. In particular, I would like to point out a few key innovations in the Cyborg tradition which have yet to be fully appreciated at the university level.</p>
<p>The first is the use of online blogs for communicating mathematical ideas, as written homework/projects can be very isolating rather than engaging. You generally get very little feedback even if you do get a good mark, which trivialises the activity. The second is the creation of <a href="https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/">Polymath projects</a> for exploring the role of large-scale self-organizing collaboration among students. Finally, I think mathematicians of all levels of ability can benefit from using <a href="http://jupyter.org/">Jupyter notebooks</a> for interactive experimental mathematics, as I have whenever investigating problems in combinatorics or probability.</p>
<p>In my opinion, these innovations indicate yet-unrealised potential. Indeed, I believe that if the majority of mathematicians transition towards a Cyborg perspective of mathematical foundations, we shall witness a much more creative period of mathematics.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>A perceptual account of symbolic reasoning (David Landy, Colin Allen & Carlos Zednik. 2014. Frontiers in Psychology.)</p>
</li>
<li>
<p>La Science et L’Hypothèse (Henri Poincaré. 2014. Champs Sciences.)</p>
</li>
</ol>Aidan RockeIntroduction:The theoretical limitations of DQN2017-08-29T00:00:00+00:002017-08-29T00:00:00+00:00/inference/2017/08/29/dqn<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/dqn.jpg" align="middle" /></center>
<h1 id="introduction">Introduction:</h1>
<p>Less than three years after DeepMind’s publication of ‘Playing Atari with Deep Reinforcement Learning’,
the practical impact of this method on RL literature has been profound, as evidenced by the above graphic. However, the
theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis
actually clarifies the evolution of DQN and highlights which research directions are worth prioritising.</p>
<h1 id="background-on-dqn">Background on DQN:</h1>
<p>The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions <script type="math/tex">a \in \mathcal{A}</script> and states <script type="math/tex">x \in X</script> in a Markov
Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return:</p>
<script type="math/tex; mode=display">\begin{equation}
Q_{\pi}(x,a) = \mathbb{E}_{\pi} \Big[\sum_{t=0}^{\infty} \gamma^t R(x_t,a_t) \,\Big|\, x_0 = x, a_0 = a\Big], \quad \gamma \in (0,1)
\end{equation}</script>
<p>In particular the aim is to approximate a parametrised value function <script type="math/tex">Q(x,a;\theta_t)</script> where estimation is shifted towards the target:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\arg\max\limits_{a} Q(S_{t+1},a;\theta_{t});\theta_t)
\end{equation}</script>
<p>and gradient descent updates are done as follows:</p>
<script type="math/tex; mode=display">\begin{equation}
\theta_{t+1} = \theta_t + \alpha(Y_t^Q-Q(S_t,A_t;\theta_t)) \nabla_{\theta} Q(S_t,A_t;\theta_t)
\end{equation}</script>
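<p>As a minimal sketch of the update above, here is one step written in plain Python. To keep it self-contained, a linear approximator <script type="math/tex">Q(s,a;\theta) = \theta^\top \phi(s,a)</script> is swapped in for the deep network, so this is not DQN itself; the function names and the feature vectors are hypothetical:</p>

```python
def q_value(theta, features):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(theta, features))


def q_learning_update(theta, phi_sa, reward, next_action_features, gamma=0.99, alpha=0.1):
    """One step of theta <- theta + alpha * (Y - Q(S, A)) * grad Q(S, A).

    `next_action_features` holds one feature vector per action available in S'.
    """
    # Target Y: reward plus the discounted value of the greedy next action.
    target = reward + gamma * max(q_value(theta, phi) for phi in next_action_features)
    td_error = target - q_value(theta, phi_sa)
    # For a linear model the gradient of Q with respect to theta is phi(S, A).
    return [w + alpha * td_error * f for w, f in zip(theta, phi_sa)]


theta = [0.0, 0.0]
theta = q_learning_update(theta, [1.0, 0.0], reward=1.0,
                          next_action_features=[[0.0, 1.0], [1.0, 0.0]],
                          gamma=0.5, alpha=0.1)
print(theta)
```

<p>Note that the same weights are used both to pick the greedy next action and to evaluate it.</p>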
<p>In addition, epsilon-greedy approaches are used for exploration and, to avoid estimates that merely reflect
recent experience, the authors of DQN regularly allow the network to perform experience replay: batch updates
based on less recent experience.</p>
<p>Given the above description of DQN, we may note the following:</p>
<ol>
<li>Selection and evaluation in DQN are done with respect to the same parameters <script type="math/tex">\theta_t</script>.</li>
<li>Assuming that variance is unavoidable, the <script type="math/tex">\max</script> operator in (2) leads to over-optimistic estimates.</li>
<li>The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.</li>
</ol>
<p>These issues shall be addressed in the sections that follow.</p>
<h1 id="asymptotic-nonsense-or-the-data-inefficiency-of-dqn">Asymptotic nonsense or the data-inefficiency of DQN:</h1>
<p>In the simple case of i.i.d. data <script type="math/tex">X_i</script> if <script type="math/tex">S_n = \sum_{i=1}^{n} X_i</script>, <script type="math/tex">\mathbb{E}[X_i] = \mu</script> and <script type="math/tex">\mathrm{Var}[X_i] = \sigma^2</script>, a simple application of Chebyshev’s inequality gives:</p>
<script type="math/tex; mode=display">\begin{equation}
\forall \epsilon > 0, P\Big(\Big|\frac{S_n}{n}-\mu\Big| > \epsilon\Big) \leq \frac{\sigma^2}{n \epsilon^2}
\end{equation}</script>
<p>Essentially, this inequality shows that even in simple scenarios convergence of the sample mean requires a lot of data
and the rate of convergence depends on the variance <script type="math/tex">\sigma^2</script>. Furthermore, we must note that this inequality ignores
the following facts:</p>
<ol>
<li>For fixed <script type="math/tex">(x,a)</script>, the distribution of returns whose mean is <script type="math/tex">Q_{\pi}(x,a)</script> is rarely unimodal in practice.</li>
<li>The returns underlying <script type="math/tex">Q_{\pi}(x,a)</script> rarely have negligible variance.</li>
<li>Our data is sequential and hardly ever i.i.d.</li>
</ol>
<p>From these points it follows that significant estimation errors are unavoidable but, as I will show, this isn’t the main
problem.</p>
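<p>To make the data requirement concrete, the Chebyshev bound can be inverted to give a sufficient sample size. The sketch below assumes the standard form of the inequality with variance <script type="math/tex">\sigma^2</script> and i.i.d. data; the helper name <code>chebyshev_sample_size</code> and the tolerance <code>delta</code> on the failure probability are introduced here for illustration:</p>

```python
import math


def chebyshev_sample_size(variance, epsilon, delta):
    """Smallest n such that variance / (n * epsilon**2) <= delta."""
    return math.ceil(variance / (delta * epsilon ** 2))


# Halving the tolerated error epsilon quadruples the sample size the bound asks for.
for epsilon in (0.5, 0.25, 0.125):
    print(epsilon, chebyshev_sample_size(variance=1.0, epsilon=epsilon, delta=0.25))
```

<p>The bound is loose, but the quadratic dependence on <script type="math/tex">1/\epsilon</script> and the linear dependence on the variance are the points that matter for the argument.</p>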
<h1 id="the-unreasonable-optimism-of-dqn">The unreasonable optimism of DQN:</h1>
<ol>
<li>
<p>Over-optimism with respect to estimation errors:</p>
<p>The authors in [3] highlight that in (2), evaluation of the target <script type="math/tex">Y_t^Q</script> and action selection are done with respect to
the same parameters <script type="math/tex">\theta_t</script>, which makes over-optimistic value estimates more likely because of the <script type="math/tex">\max</script> operator.
This suggests that estimation errors of any kind are more likely to result in overly-optimistic policies.</p>
<p>While this is problematic, the authors of [3] discovered the following elegant solution:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\arg\max\limits_{a} Q(S_{t+1},a;\theta_{t});\theta'_{t})
\end{equation}</script>
<p>The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights <script type="math/tex">\theta</script>
and <script type="math/tex">\theta'</script>.</p>
</li>
<li>
<p>Over-optimism with respect to risk regardless of estimation error:</p>
<p>Consider the classic problem in decision theory of having to choose between an envelope <script type="math/tex">A</script> which contains $90.00 and envelope
<script type="math/tex">B</script> which contains $200.00 or $0.00 with equal probability. Although <script type="math/tex">Var[A] \ll Var[B]</script>, our agent’s
ignorance of the bimodality of <script type="math/tex">B</script> would lead it to act in an over-optimistic fashion. Due to the <script type="math/tex">\max</script> operator
it would make a decision solely based on the fact that <script type="math/tex">\mathbb{E}[B] > \mathbb{E}[A]</script>.</p>
<p>The above problem clearly requires a very different perspective.</p>
</li>
</ol>
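<p>The difference between the two targets in the first point can be sketched as follows. The action values are passed in as plain lists, one entry per action, standing in for the outputs of the networks with weights <script type="math/tex">\theta</script> and <script type="math/tex">\theta'</script>; the function names are hypothetical:</p>

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, q_next_online, gamma=0.99):
    """Selection and evaluation both use the same estimates (theta)."""
    return reward + gamma * max(q_next_online)


def double_dqn_target(reward, q_next_online, q_next_second, gamma=0.99):
    """Selection uses theta, evaluation uses the second set of weights theta'."""
    best_action = argmax(q_next_online)
    return reward + gamma * q_next_second[best_action]


q_online = [1.0, 3.0, 2.0]   # Q(S', a; theta), with action 1 over-estimated
q_second = [0.5, 1.5, 4.0]   # Q(S', a; theta')
print(dqn_target(1.0, q_online, gamma=0.5))
print(double_dqn_target(1.0, q_online, q_second, gamma=0.5))
```

<p>When the first network over-estimates an action, the second network’s independent estimate of that same action tempers the target.</p>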
<p>Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail I would recommend that the
reader start with [5] which provides an elegant and scalable solution with what can be thought of as a data-dependent
version of dropout [8]. The consideration of value distributions helps reduce uncertainty and improve inference.</p>
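<p>The envelope example from the second point shows why the distribution of values matters and not only its mean. A small sketch, with the hypothetical helper <code>mean_and_variance</code> operating on (value, probability) pairs:</p>

```python
def mean_and_variance(outcomes):
    """Mean and variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(value * prob for value, prob in outcomes)
    variance = sum(prob * (value - mean) ** 2 for value, prob in outcomes)
    return mean, variance


envelope_a = [(90.0, 1.0)]
envelope_b = [(200.0, 0.5), (0.0, 0.5)]

mean_a, var_a = mean_and_variance(envelope_a)
mean_b, var_b = mean_and_variance(envelope_b)

# An agent that only compares expectations picks B and never sees the risk.
greedy_choice = "B" if mean_b > mean_a else "A"
print(mean_a, var_a, mean_b, var_b, greedy_choice)
```

<p>A purely expectation-based agent picks B for a gain of 10 in the mean while accepting a standard deviation of 100, a trade-off it cannot even represent.</p>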
<h1 id="the-latent-value-of-hierarchical-models">The latent value of hierarchical models:</h1>
<p>Perhaps the most important question when considering the evolution of DQN is how these agents will develop rich conceptual abstractions
that will allow scientific induction or generalisation. Although one can argue that a DQN learns good statistical representations of
environmental states <script type="math/tex">x</script>, it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive
and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.</p>
<p>In particular, I would like to mention the promising work of Tejas Kulkarni who investigated the use of hierarchical DQN, which has the following architecture:</p>
<ol>
<li>Controller: which learns policies in order to satisfy particular goals</li>
<li>Meta-Controller: which chooses goals</li>
<li>Critic: which evaluates whether a goal has been achieved</li>
</ol>
<p>Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned
over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting
research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a
grammar over policies. I think this might be a necessary component for the emergence of language in machines.</p>
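<p>To illustrate how the three components interact, here is a toy sketch on a one-dimensional chain of integer states. Everything in it is an assumption made for illustration: in [6] the controller and meta-controller are learned networks, whereas here they are hand-coded placeholders, and the function names are hypothetical:</p>

```python
def critic(state, goal):
    """Intrinsic reward: 1 when the goal has been achieved, 0 otherwise."""
    return 1.0 if state == goal else 0.0


def controller(state, goal):
    """Placeholder low-level policy over the atomic actions -1, 0 and +1."""
    return (goal > state) - (goal < state)


def run_episode(start, goals, max_steps=50):
    """The meta-controller (here a fixed list) proposes goals one at a time."""
    state, intrinsic_return, steps = start, 0.0, 0
    for goal in goals:
        # The controller acts until the critic reports success or time runs out.
        while steps < max_steps and critic(state, goal) == 0.0:
            state += controller(state, goal)
            steps += 1
        intrinsic_return += critic(state, goal)
    return state, intrinsic_return, steps


print(run_episode(start=0, goals=[3, 1]))
```

<p>The point of the structure is that the high-level policy only ever reasons about goals, while the low-level policy only ever reasons about atomic actions.</p>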
<p>The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of
introspection.</p>
<h1 id="conclusion">Conclusion:</h1>
<p>I find it remarkable that a simple method such as DQN should inspire many new approaches. Perhaps it’s not so much the brilliance
of the method but rather its generality which allowed this method to adapt and evolve. In particular, I think the coupling
of Distributional RL with Hierarchical Deep RL has a very bright future. Together, these will lead to significant improvements in terms of inference and generalisation.</p>
<p><strong>Note:</strong> The graphic is taken from [9].</p>
<h1 id="references">References:</h1>
<ol>
<li>C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.</li>
<li>V. Mnih, K. Kavukcuoglu, D. Silver et al. Playing Atari with Deep Reinforcement Learning. 2015.</li>
<li>H. van Hasselt, A. Guez and D. Silver. Deep Reinforcement Learning with Double Q-learning. 2015.</li>
<li>Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. 2017.</li>
<li>Ian Osband, Charles Blundell, Alexander Pritzel and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. 2016.</li>
<li>Tejas Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. 2016.</li>
<li>Marc G. Bellemare, Will Dabney and Rémi Munos. A Distributional Perspective on Reinforcement Learning. 2017.</li>
<li>Yarin Gal & Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 2016.</li>
<li>Niels Justesen, Philip Bontrager, Julian Togelius, Sebastian Risi. Deep Learning for Video Game Playing. 2017.</li>
</ol>Aidan RockeEntropy Maximization and intelligent behaviour2017-07-06T00:00:00+00:002017-07-06T00:00:00+00:00/intelligence/2017/07/06/maxent<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/forking_paths.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Sergio Hernandez, a Spanish mathematician, recently shared some <a href="http://entropicai.blogspot.fr/2017/06/solved-atari-games.html">very interesting results</a> on the OpenAI gym environment which are based on a <a href="http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf">relatively unknown paper</a>
published by Dr. Wissner-Gross, a physicist trained at MIT. What is impressive about Wissner-Gross’s meta-heuristic is that it is succinctly described by three equations which try to maximize the future freedom of your agent. In this analysis, I summarize the method, present its strengths and weaknesses, and attempt to improve it by making an important modification to one of the equations.</p>
<h2 id="causal-entropic-forces">Causal entropic forces:</h2>
<p>In the following summary of Wissner-Gross’s meta-heuristic, it’s assumed that the agent has access to an approximate or exact simulator. A close reading of
the original paper [1] will show that this assumption is actually necessary.</p>
<h3 id="macrostates">Macrostates:</h3>
<p>For any open thermodynamic system, we treat the phase-space paths taken by the system <script type="math/tex">x(t)</script> over the time interval <script type="math/tex">[0,\tau]</script> as microstates
and partition them into macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script> using the equivalence relation [1]:</p>
<script type="math/tex; mode=display">\begin{equation}
x(t) \sim x'(t) \iff x(0) = x'(0)
\end{equation}</script>
<p>As a result, we can identify each macrostate <script type="math/tex">X_i</script> with a unique present system state <script type="math/tex">x(0)</script>. This defines a notion of causality over a time interval.</p>
<h3 id="causal-path-entropy">Causal path entropy:</h3>
<p>We can define the causal path entropy <script type="math/tex">S_c</script> of a macrostate <script type="math/tex">X_i</script> with the associated present system state <script type="math/tex">x(0)</script> as the path integral:</p>
<script type="math/tex; mode=display">\begin{equation}
S_c (X_i, \tau) = -k_B \int_{x(t)} P(x(t)|x(0)) \ln P(x(t)|x(0)) \,D x(t)
\end{equation}</script>
<p>where we have:</p>
<script type="math/tex; mode=display">\begin{equation}
P(x(t)| x(0)) = \int_{x^*(t)} P(x(t),x^*(t) |x(0)) \,D x^*(t)
\end{equation}</script>
<p>In (3) we basically integrate over all possible paths <script type="math/tex">x^*(t)</script> taken by the open system’s environment. In practice, this integral is intractable
and we must resort to approximations and the use of a sampling algorithm like Hamiltonian Monte Carlo [3].</p>
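<p>In a small discrete setting the causal path entropy can be computed exactly, which helps build intuition before resorting to sampling. The sketch below is a toy under stated assumptions: a walker on the positions <code>0..n-1</code> that picks uniformly among its valid neighbouring positions at each step, <script type="math/tex">k_B = 1</script>, and no separate environment to marginalise over. The function names are hypothetical:</p>

```python
import math
from functools import lru_cache


def neighbours(x, n):
    """Positions reachable in one step from x on the chain 0..n-1 (requires n >= 2)."""
    return [y for y in (x - 1, x + 1) if 0 <= y < n]


def causal_path_entropy(x0, tau, n):
    """Entropy of the distribution over tau-step paths starting at x0."""

    @lru_cache(maxsize=None)
    def entropy(x, steps):
        if steps == 0:
            return 0.0
        nxt = neighbours(x, n)
        # Chain rule: entropy of the first move plus the expected entropy of the rest.
        return math.log(len(nxt)) + sum(entropy(y, steps - 1) for y in nxt) / len(nxt)

    return entropy(x0, tau)


for x0 in range(5):
    print(x0, causal_path_entropy(x0, tau=2, n=5))
```

<p>Positions near the walls have fewer distinct futures and therefore a lower path entropy than positions near the centre.</p>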
<h3 id="causal-entropic-force">Causal entropic force:</h3>
<p>A path-based causal entropic force <script type="math/tex">F</script> may be expressed as:</p>
<script type="math/tex; mode=display">\begin{equation}
F(X_0, \tau) = T_c \nabla_X S_c (X, \tau) |_{X_0}
\end{equation}</script>
<p>where <script type="math/tex">T_c</script> and <script type="math/tex">\tau</script> are two free parameters. This force basically brings us closer to macrostates <script type="math/tex">X_j</script> that
maximize <script type="math/tex">S_c (X_j, \tau)</script>. In essence the combination of equations (2), (3) and (4) maximizes the number of future options
of our agent. This isn’t very different from what most people try to do in life but this meta-heuristic does have very important
limitations.</p>
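<p>The intuition that a causal entropic force pushes towards states with more reachable futures can be checked with a toy simulation. The sketch below is not the estimator used in [1]: it assumes a one-dimensional random walker confined to the box [0, 1] and uses the entropy of the histogram of end positions as a crude stand-in for the causal path entropy. The function name <code>endpoint_entropy</code> and all parameter values are illustrative assumptions.</p>

```python
import math
import random


def endpoint_entropy(x0, n_paths=5000, n_steps=20, step_std=0.05, n_bins=20, seed=0):
    """Entropy (in nats) of the histogram of end positions of random walks in [0, 1].

    This is only a proxy for the causal path entropy: it looks at where the
    sampled paths end, not at the full distribution over paths.
    """
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _path in range(n_paths):
        x = x0
        for _step in range(n_steps):
            # Gaussian step, clipped so that the walker stays inside the box.
            x = min(1.0, max(0.0, x + rng.gauss(0.0, step_std)))
        counts[min(n_bins - 1, int(x * n_bins))] += 1
    probs = [c / n_paths for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


entropy_centre = endpoint_entropy(0.5)   # walker starts in the middle of the box
entropy_wall = endpoint_entropy(0.05)    # walker starts next to a wall
print(entropy_centre, entropy_wall)
```

<p>Under these assumptions the walker that starts in the middle of the box spreads over more of the state space than the one that starts next to a wall, so an agent following the gradient of this quantity would drift towards the centre.</p>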
<h2 id="limitations-of-the-causal-entropic-approach">Limitations of the Causal Entropic approach:</h2>
<ol>
<li>
<p>The Causal Entropic paper makes the implicit assumption that we have access to a reliable simulator of future states. In the
case of the OpenAI environments this isn’t a problem because environment simulators are provided but in general it’s a hard problem. Two useful approaches to this problem
are suggested by [4] and [5] using recurrent neural networks.</p>
</li>
<li>
<p>Maximizing your number of future options is not always a good idea. Sometimes fewer options are better provided that these are
more useful options. This is why for example, football players don’t always rush to the center of a football pitch, although from
that position they would maximize their number of future states i.e. possible positions on the pitch.</p>
</li>
</ol>
<p>In the next section I would like to show that it’s possible to find a practical solution to the second limitation by modifying
(2).</p>
<h2 id="causal-path-utility">Causal Path Utility:</h2>
<p>Assuming that a recurrent neural network is used to define potential macrostates <script type="math/tex">\{ X_i \}_{i \in I}</script>, it’s reasonable to assume
that our agent’s understanding of the future evolves with time and therefore macrostates are a function of time. So we have <script type="math/tex">\{ X_i(t) \}_{i \in I}</script>
rather than <script type="math/tex">\{ X_i \}_{i \in I}</script>. In other words, our simulator which might be an RNN, will probably change its parameters and
even its topology over time.</p>
<p>In order to resolve the second limitation and encourage the agent to make confident decisions,
I propose that we replace <script type="math/tex">S_c(X, \tau)</script> with <script type="math/tex">U_c(X, \tau)</script> where:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
U_c (X_i, \tau) & = -\int_{x(t)} P(x(t)|x(0)) \ln \left( U(x(t)|x(0)) \, e^{-Var[U(x(t)\mid x(0))]} \right) \,D x(t) \\
& = \mathbb{E}[-\ln U(x(t)|x(0))]+\mathbb{E}[Var[U(x(t)\mid x(0))]] \geq 0\end{split}
\end{equation} %]]></script>
<p>This not only has the added value of simplifying calculations but also allows us to disentangle the relative contributions of utility and uncertainty.
It must also be noted that the two terms in (5) can be calculated in parallel, although the uncertainty term is more computationally
expensive.</p>
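<p>To make the two terms in (5) concrete, here is a minimal Monte Carlo sketch. It rests on assumptions that are not stated above: paths are sampled from <script type="math/tex">P(x(t)|x(0))</script>, each sampled path comes with several stochastic utility estimates (for example from dropout at test time, as in [2]), and utilities lie in (0, 1]. The function name <code>causal_path_utility</code> and the sample values are hypothetical.</p>

```python
import math
import statistics


def causal_path_utility(utility_samples):
    """Monte Carlo estimate of the two terms in (5).

    utility_samples[i][k] is the k-th stochastic estimate of the utility of
    sampled path i, assumed to lie in (0, 1].
    Returns (E[-ln U], E[Var[U]]), each averaged over the sampled paths.
    """
    neg_log_terms = []
    variance_terms = []
    for estimates in utility_samples:
        mean_utility = statistics.mean(estimates)
        neg_log_terms.append(-math.log(mean_utility))
        variance_terms.append(statistics.pvariance(estimates))
    return statistics.mean(neg_log_terms), statistics.mean(variance_terms)


# Made-up utility estimates for two sampled paths.
samples = [
    [0.9, 0.8, 0.85],  # high utility, low uncertainty
    [0.5, 0.1, 0.9],   # moderate utility, high uncertainty
]
neg_log_u, var_u = causal_path_utility(samples)
u_c = neg_log_u + var_u
print(neg_log_u, var_u, u_c)
```

<p>Keeping the two averages separate is what allows the contributions of utility and uncertainty to be inspected independently.</p>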
<h2 id="discussion">Discussion:</h2>
<p>If we assume that the agent’s perception of the future doesn’t change much, it might perceive some future states to be ideal. This is
consistent with the empirical observation that many people believe certain accomplishments would bring them ‘genuine happiness’. In other
words, if the state space is compact and approximately time-invariant the agent’s optimal future macrostate converges to a fixed point [6].</p>
<p>While the notion of Causal Path Utility just occurred to me today, I believe that this is a very promising approach which I shall follow up with concrete implementations very soon.</p>
<h1 id="references">References:</h1>
<ol>
<li>
<p>Causal Entropic Forces (A. D. Wissner-Gross & C.E. Freer. 2013. Physical Review Letters.)</p>
</li>
<li>
<p>Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (Yarin Gal & Zoubin Ghahramani. 2016. ICML. )</p>
</li>
<li>
<p>Stochastic Gradient Hamiltonian Monte Carlo ( Tianqi Chen, Emily Fox & Carlos Guestrin. 2014. ICML.)</p>
</li>
<li>
<p>Recurrent Environment Simulators (Silvia Chiappa et al. 2017. ICLR.)</p>
</li>
<li>
<p>On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)</p>
</li>
<li>
<p>Fixed Point Theorems with Applications to Economics and
Game Theory (Border, Kim C. 1985. Cambridge University Press.)</p>
</li>
</ol>Aidan RockeThe weight transport problem2017-06-30T00:00:00+00:002017-06-30T00:00:00+00:00/deep/learning/2017/06/30/weight-transport<h2 id="introduction">Introduction:</h2>
<p>In an excellent paper published less than two years ago, Timothy Lillicrap, a theoretical neuroscientist at DeepMind, found
a simple yet reasonable solution to the weight transport problem. Essentially, Timothy and his co-authors showed that it’s
possible to do backpropagation with fixed random feedback weights and still obtain very competitive results on various benchmarks [2]. The
reason why this is really significant is that it marks an important step towards biologically plausible deep learning.</p>
<h2 id="the-weight-transport-problem">The weight transport problem:</h2>
<p>While backpropagation is a very effective approach for training deep neural networks, at present it’s not at all clear whether
the brain might actually use this method for learning. In fact, backprop has three biologically implausible requirements [1]:</p>
<ol>
<li>feedback weights must be the same as feedforward weights</li>
<li>forward and backward passes require different computations</li>
<li>error gradients must be stored separately from activations</li>
</ol>
<p>A biologically plausible solution to the second and third problems is to use an error propagation network with the same topology
as the feedforward network but used only for backpropagation of error signals. However, there is no known biological mechanism
for this error network to know the weights of the feedforward network. This makes the first requirement, weight symmetry, a
serious obstacle.</p>
<p>This is also known as the weight transport problem [3].</p>
<h2 id="random-synaptic-feedback">Random synaptic feedback:</h2>
<p>The solution proposed by Lillicrap et al. is based on two good observations:</p>
<ol>
<li>
<p>Any fixed random matrix <script type="math/tex">B</script> may serve as a substitute
for the original matrix <script type="math/tex">W</script> in backpropagation provided that on average we have:</p>
<script type="math/tex; mode=display">\begin{equation}
e^\top WB e > 0
\end{equation}</script>
<p>where <script type="math/tex">e</script> is the error in the network’s output. Geometrically, this is equivalent to requiring that <script type="math/tex">W^\top e</script> and <script type="math/tex">Be</script> are within
<script type="math/tex">90^{\circ}</script> of each other.</p>
</li>
<li>
<p>Over time we get better alignment between <script type="math/tex">W</script> and <script type="math/tex">B</script> due to the modified update rules which means that the first requirement becomes
easier to satisfy with more iterations.</p>
</li>
</ol>
<h2 id="a-simple-example">A simple example:</h2>
<p>Let’s consider a simple three-layer linear neural network that is intended to approximate a linear mapping:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
h = W_0 x \\
y = W h \\
e = Tx -y
\end{cases}
\end{equation}</script>
<p>The loss is given by:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L} = \frac{1}{2} e^\top e
\end{equation}</script>
<p>From this we may derive the following backpropagation update equations:</p>
<script type="math/tex; mode=display">\begin{equation}
\Delta W \propto -\frac{\partial \mathcal{L}}{\partial W} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial W} = -e \cdot (-1) \cdot h = e h^\top
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\Delta W_0 \propto -\frac{\partial \mathcal{L}}{\partial W_0} = -\frac{\partial \mathcal{L}}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial h} \frac{\partial h}{\partial W_0} = -e \cdot (-1) \cdot W \cdot x = W^\top e x^\top
\end{equation}</script>
<p>Now the random synaptic feedback innovation is essentially to replace step <script type="math/tex">(5)</script> with:</p>
<script type="math/tex; mode=display">\begin{equation} \Delta W_0 \propto B e x^\top
\end{equation}</script>
<p>where <script type="math/tex">B</script> is a fixed random matrix. As a result, we no longer need explicit knowledge of the original weights in our update equations.
I actually implemented this method for a three-layer sigmoid (i.e. nonlinear) neural network and obtained <a href="https://github.com/pauli-space/weight_symmetry/blob/master/experiments/random_synaptic_feedback/three_layer.py">89.5% accuracy on the MNIST dataset
after 10 iterations</a>, a result
that is competitive with backpropagation.</p>
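<p>The update rules above can be sketched in a few lines of NumPy for the linear case. This is not the linked MNIST script: it is a minimal illustration that assumes a random target map <code>T</code>, Gaussian inputs, zero-initialised weights and a uniformly distributed feedback matrix <code>B</code>. The sizes, learning rate and number of steps are arbitrary choices.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_samples = 10, 20, 5, 200

T = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)   # target linear map (assumed)
X = rng.normal(size=(n_in, n_samples))               # inputs, one column per example
W0 = np.zeros((n_hidden, n_in))                      # first-layer weights
W = np.zeros((n_out, n_hidden))                      # second-layer weights
B = rng.uniform(-0.5, 0.5, size=(n_hidden, n_out))   # fixed random feedback matrix

learning_rate = 0.02
losses = []
for step in range(2000):
    H = W0 @ X                  # h = W_0 x
    Y = W @ H                   # y = W h
    E = T @ X - Y               # e = T x - y
    losses.append(0.5 * np.mean(np.sum(E ** 2, axis=0)))
    # Output layer: the usual gradient step, Delta W ~ e h^T.
    W += learning_rate * (E @ H.T) / n_samples
    # Hidden layer: B replaces W^T, so Delta W_0 ~ B e x^T.
    W0 += learning_rate * (B @ E @ X.T) / n_samples

print(losses[0], losses[-1])
```

<p>Note that <code>W</code> never appears in the update for <code>W0</code>, which is the whole point: the backward pass needs no knowledge of the forward weights.</p>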
<h2 id="discussion">Discussion:</h2>
<p>In spite of its remarkable simplicity, Timothy Lillicrap’s solution to the weight transport problem is very effective and so I think it
deserves further investigation. In the near future I plan to implement random synaptic feedback for much larger sigmoid and ReLU networks
as well as recurrent neural networks in order to build upon the work of [1].</p>
<p>Considering all the approaches to biologically plausible deep learning attempted so far, I believe this work represents a very important step forward.</p>
<h2 id="references">References:</h2>
<ol>
<li>How Important Is Weight Symmetry in Backpropagation? (Qianli Liao, Joel Z. Leibo, Tomaso A. Poggio. 2016. AAAI.)</li>
<li>Random synaptic feedback weights support error backpropagation for deep learning (T. P. Lillicrap et al. 2016. Nature Communications.)</li>
<li>Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive science 11(1):23–63.</li>
</ol>Aidan Rockedeep rectifier networks: preliminary observations2017-06-21T00:00:00+00:002017-06-21T00:00:00+00:00/deep/learning/2017/06/21/observations_1a<p>Approximately one week ago, I defined a <a href="http://paulispace.com/deep/learning/2017/06/15/experiment_1.html">set of experiments</a> in order to model the effects of dropout and unsupervised pre-training on deep rectifier networks. However, prior to running through the experiments I realised that this was an opportunity to develop my own personal research workflow. After more reflection I decided to follow this particular process:</p>
<ol>
<li>Define experiments: including methodology, experimental setup and working hypotheses</li>
<li>Share preliminary observations: in order for readers to understand where scientific intuitions come from and overcome writer’s block</li>
<li>Experimental analysis: detailed statistical analysis of experimental results including hypothesis testing</li>
<li>Theoretical analysis: theoretical analysis of experimental results</li>
<li>Further discussion: discuss phenomena that are worth investigating further</li>
</ol>
<p>The present blog post aims to go through a part of stage 2. In particular, today I aim to share interesting observations concerning vanilla
three-layer rectifier networks with 500 nodes per layer trained on the MNIST dataset without dropout or unsupervised pre-training.</p>
<h2 id="visualizing-binary--in-activation-space">Visualizing binary in activation space:</h2>
<div class="image-wrapper">
<img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/activation_space.png" alt="" />
<p class="image-caption">two dimensional embedding of binary activations</p>
</div>
<p>Above we have a two dimensional linear embedding of binary representations which was obtained by applying PCA to the concatenated output of hidden layers, where a binary mask was applied to the output of each layer. This method is inspired by [5] where the authors used a similar method to study local competition among subnetworks within deep rectifier networks. Although I didn’t manage to get clusters that are as well-separated as
those of R. Srivastava et al., we have clear evidence of emergent organisation among subnetworks within deep rectifier networks.</p>
<p>In particular, we may note that 1 is very near 2, 7 is near 9, 0 blends with 4. A Canadian AI researcher might argue that 0 is entangled with 4 [6]. However, the explained variance due to PCA(n=2) was around 40% which means that a lot of information was lost in the process of going from 1500 dimensions to 2 dimensions. This suggests that we might need a more reliable method for analysing variable disentangling.</p>
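<p>The embedding procedure described above can be sketched as follows. This is a small stand-in rather than the code behind the figure: it assumes random inputs instead of MNIST, an untrained network with random weights, and 50 units per layer instead of 500. PCA is done with a plain SVD so that the explained variance of the first two components can be read off directly.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_inputs, width = 300, 784, 50    # stand-in sizes (assumed)

X = rng.random((n_examples, n_inputs))        # placeholder for MNIST pixels
activations, masks = X, []
for _layer in range(3):
    fan_in = activations.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
    activations = np.maximum(activations @ W, 0.0)    # ReLU layer
    masks.append((activations > 0).astype(float))     # 1 where the unit is active
binary_codes = np.concatenate(masks, axis=1)          # one binary vector per example

# PCA with two components via the SVD of the centred binary codes.
centred = binary_codes - binary_codes.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
embedding = centred @ components[:2].T
explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()
print(embedding.shape, explained)
```

<p>Here <code>explained</code> is the fraction of variance retained by the two-dimensional embedding, which is the quantity that should be checked before reading too much into the picture.</p>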
<h2 id="variable-disentangling">Variable disentangling:</h2>
<h3 id="the-average-euclidean-distance-between-representations-per-class">The average Euclidean distance between representations per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/average_euclidean_distance.png" align="middle" /></center>
<p>What the above heatmap shows is the average Euclidean distance between binary representations for a particular class label, which is useful
as the average value gives an indication of the relative contribution of each node when predicting a particular class. In particular, we note
that 7 appears to be quite close to 9 but 0 doesn’t appear to be particularly close to 4. This is why I always use low dimensional visualizations
with caution.</p>
<p>I also tried a different approach for analysing variable disentangling which gave very interesting and unexpected results.</p>
<h3 id="fraction-of-nodes-shared-per-class">Fraction of nodes shared per class:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/nodes_shared.png" align="middle" /></center>
<p>The above heatmap shows that the fraction of nodes shared between any pair of classes is always above 90%, which is quite surprising. In other words,
subnetworks that are tasked with predicting different classes share at least 90% of their nodes. This suggests that there is
a core representation that is frequently reused, with small variations between examples, and that these small variations are very important. In some sense the deep rectifier network is very efficient at sharing resources and I believe this relates well to the notion of local competition described by R. Srivastava in [5]. I also think it merits further study.</p>
<p>Prior to studying the fraction of shared nodes between subnetworks, I imagined that the relative sparsity of activity in deep rectifier networks implied
that the above observation would be quite improbable. In fact, the mean activations per hidden layer is something I looked into as well.</p>
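<p>One way to compute such a fraction is sketched below. The definition is an assumption, since the exact one behind the heatmap is not spelled out here: a node belongs to a class’s subnetwork if it is active for more than half of that class’s examples, and the shared fraction is the Jaccard overlap of two such node sets. The binary codes are toy values.</p>

```python
def active_nodes(binary_codes, threshold=0.5):
    """Indices of nodes that are active in more than `threshold` of the examples."""
    n_examples = len(binary_codes)
    n_nodes = len(binary_codes[0])
    return {
        j for j in range(n_nodes)
        if sum(code[j] for code in binary_codes) / n_examples > threshold
    }


def shared_fraction(nodes_a, nodes_b):
    """Jaccard overlap between two sets of active nodes."""
    union = nodes_a | nodes_b
    return len(nodes_a & nodes_b) / len(union) if union else 1.0


# Toy binary codes (five nodes, three examples per class), not real data.
codes_by_class = {
    "7": [[1, 1, 0, 1, 1], [1, 1, 0, 1, 0], [1, 1, 0, 1, 1]],
    "9": [[1, 1, 1, 1, 0], [1, 1, 0, 1, 0], [1, 1, 1, 1, 0]],
}
nodes = {label: active_nodes(codes) for label, codes in codes_by_class.items()}
overlap = shared_fraction(nodes["7"], nodes["9"])
print(nodes, overlap)
```

<p>Other definitions, such as dividing by the size of the smaller set, would give larger values, so the choice should be stated alongside any reported percentage.</p>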
<h2 id="mean-activity-per-hidden-layer-per-epoch">Mean activity per hidden layer per epoch:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/mean_activations.png" align="middle" /></center>
<p><br /></p>
<p>If it’s not clear, the above set of histograms shows the mean activations for each of the three layers for each of the five epochs. What I find interesting is that we observe:</p>
<ol>
<li>Convergence in distribution which was quantified using the conditional entropy.</li>
<li>The mean activation for the first hidden layer has a mode around 0.5 whereas the mean activations
for the second and third hidden layers have a mode around 0.7.</li>
<li>This indicates that on average (0.7+0.7+0.5)/3=63% of the nodes are used at any given time. Based on what
I’ve read in [1] I would expect this fraction to decrease if we fix the width while we increase the depth of
the network but it appears that we don’t yet have a good mathematical model to predict the number of active
nodes given a dataset with a particular sample complexity.</li>
</ol>
<p>Now, although it wasn’t suggested in any of the papers I’ve read so far I figured that I could probably use
the mean activations per hidden layer to study variable-size representation as well as sparsity. My reasoning
was that if a particular class required subnetworks with more nodes than another class on average then this
would probably capture the notion of variable-size representation as described in [6]:</p>
<blockquote>
<p>Varying the number of active neurons allows a model to control the effective dimensionality of the representation
for a given input and the required precision. - X. Glorot, A. Bordes & Y. Bengio</p>
</blockquote>
<h2 id="variable-size-representation">Variable-size representation:</h2>
<div align="center">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>class</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<th>rank by variable size</th>
<td>1</td>
<td>9</td>
<td>3</td>
<td>7</td>
<td>6</td>
<td>5</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>10</td>
</tr>
</tbody>
</table>
</div>
<p><br /></p>
<p>This table effectively shows how the neural network represents the relative dimensionality of each class. The way I obtained this was by calculating the
average number of nodes used to predict each class and then ranking these values by size.</p>
<p>An interesting and essential follow-up question is whether this relative order is respected when we train a rectifier network with the same architecture on a subset of the 10 original classes. If we ran pair-wise experiments, for example, we would need 45 of them, one per pair of classes. If the relative order is difficult
to reproduce then we have a problem with the notion of variable size. Right now I am not sure whether there’s a simple theory that would explain how a neural network controls variable size. The only way to find out is to do the experiments.</p>
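<p>The ranking step and the count of pair-wise experiments can be sketched as below. The per-class averages are made-up placeholders, not measurements, and the convention that rank 1 goes to the class using the most nodes is an assumption.</p>

```python
from itertools import combinations

# Hypothetical average number of active nodes per class (made-up values).
mean_active_nodes = {0: 912.0, 1: 801.5, 2: 890.2, 3: 845.0}

# Rank 1 = class that uses the most nodes on average (assumed convention).
ordered = sorted(mean_active_nodes, key=mean_active_nodes.get, reverse=True)
rank = {label: position + 1 for position, label in enumerate(ordered)}

# One experiment per unordered pair of the 10 MNIST classes.
pairs = list(combinations(range(10), 2))
print(rank, len(pairs))
```

<p>For each pair, the check would be whether the two classes keep the same relative order as in the full ten-class ranking.</p>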
<h2 id="the-example-ordering-problem">The example ordering problem:</h2>
<p>Finally, I also tried to take a look at the example ordering problem, a limitation of gradient descent for training neural networks that was noted by [3].
As they noted, the relative contribution of each epoch to the model that emerges within a neural network isn’t representative of the information available
per epoch. In fact, we observe that the weights change much more during the earlier epochs than during later epochs:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/weight_norms.png" align="middle" /></center>
<p>What the above plot shows is that the change in the weight norm is larger during the earlier epochs than during the later epochs. This is consistent with the model of gradient descent obtained in [4], which emphasises an approximate isomorphism between gradient descent and high-dimensional
damped oscillators but this isn’t good news. Basically, this means that gradient descent is not a data efficient method for learning signals from
data.</p>
<p>I think that the only way to avoid this is to add temporal memory to networks so they may perform inference forwards and backwards in time. If I analysed
this problem further I would probably rediscover one of the many recurrent neural network architectures or I might discover my own architecture. Very often it’s useful to approach problems as if they haven’t been investigated before as that’s the only way to become a good theoretician.</p>
<h2 id="discussion">Discussion:</h2>
<p>This marks the end of my first observational study and I think you’ll agree with me that it has highlighted many questions that are worth further
investigation. So you might ask, what next? I plan to do more detailed observational studies on the following questions in the following order:</p>
<ol>
<li>Sparsity of representations as we increase the depth of a rectifier network while keeping the width constant</li>
<li>Stability of relative variable size for randomly chosen subclasses</li>
<li>Solutions to the example ordering problem</li>
</ol>
<p>I will continue to use the MNIST dataset but I will try to find models that take into account sample complexity for each of the above questions
so these models should generalise well. Once I’ve gone through these semi-formal observational studies which are useful for developing intuitions
I’ll proceed with the experiments I’ve defined earlier.</p>
<p><strong>Note:</strong> If you’d like to repeat this analysis, the code I used to perform this analysis is available <a href="https://github.com/pauli-space/deep_rectifiers">here</a>
but I would wait until the weekend because I’m going to make important changes. It’s a bit of a mess at the moment.</p>
<h2 id="references">References:</h2>
<ol>
<li>Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)</li>
<li>Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)</li>
<li>Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)</li>
<li>The Physical Systems behind Optimization (L. Yang et al. 2017.)</li>
<li>Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)</li>
<li>Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)</li>
</ol>Aidan Rockederivation of common activation functions2017-06-18T00:00:00+00:002017-06-18T00:00:00+00:00/deep/learning/2017/06/18/activations<p>In this blog post I’d like to show how commonly used activation functions can be derived from the sigmoid
activation function. As a result, we can show that these functions have a shared mathematical lineage with
the sigmoid.</p>
<ol>
<li>
<p>sigmoid:</p>
<script type="math/tex; mode=display">\begin{equation}
\sigma (x) = \frac{1}{1+e^{-x}}
\end{equation}</script>
</li>
<li>
<p>hyperbolic tangent:</p>
<script type="math/tex; mode=display">\begin{equation}
\tanh (x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}
\end{equation}</script>
<p>Now, we note that:</p>
<script type="math/tex; mode=display">\begin{equation*}
\sigma (2x) = \frac{1}{1+e^{-2x}} = \frac{e^{x}}{e^{x}+e^{-x}}
\end{equation*}</script>
<p>From this it follows that we have:</p>
<script type="math/tex; mode=display">\begin{equation*}
\tanh(x) = \sigma (2x)-\sigma (-2x)
\end{equation*}</script>
</li>
<li>
<p>softplus:</p>
<script type="math/tex; mode=display">\begin{equation}
f (x) = \ln(1+e^{x})
\end{equation}</script>
<p>Now, if we compute the integral of the sigmoid:</p>
<script type="math/tex; mode=display">\begin{equation*}
\int \sigma(x) dx = \int \frac{e^x}{e^x+1} dx = \ln (1+e^x)+C
\end{equation*}</script>
<p>where <script type="math/tex">C</script> is an arbitrary constant. Taking <script type="math/tex">C = 0</script> gives the softplus.</p>
</li>
<li>
<p>ReLU:</p>
<p>Note that in <script type="math/tex">(3)</script> when <script type="math/tex">\lvert x \rvert > 5</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation*}
f(x) \approx \begin{cases}
x, x > 0\\
0, x < 0
\end{cases}
\end{equation*} %]]></script>
<p>From this we may deduce the much more computationally efficient ReLU activation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
g(x) = \begin{cases}
x, x > 0\\
0, x \leq 0
\end{cases}
\end{equation} %]]></script>
</li>
</ol>
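<p>The three relationships in the list above are easy to check numerically. The sketch below uses only the standard library; the test points, the step size of the finite difference and the tolerances are arbitrary choices.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


def relu(x):
    return max(x, 0.0)


for x in (-3.0, -0.5, 0.0, 0.7, 2.0):
    # tanh(x) = sigma(2x) - sigma(-2x)
    assert abs(math.tanh(x) - (sigmoid(2 * x) - sigmoid(-2 * x))) < 1e-12
    # The softplus is an antiderivative of the sigmoid: check its slope
    # against sigma(x) with a central difference.
    h = 1e-5
    slope = (softplus(x + h) - softplus(x - h)) / (2 * h)
    assert abs(slope - sigmoid(x)) < 1e-6

# Largest gap between softplus and ReLU at a few points with |x| > 5.
gap = max(abs(softplus(x) - relu(x)) for x in (-8.0, -5.5, 5.5, 8.0))
print(gap)
```

<p>The gap between the softplus and the ReLU equals <script type="math/tex">\ln(1+e^{-\lvert x \rvert})</script>, which shrinks exponentially as <script type="math/tex">\lvert x \rvert</script> grows.</p>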
<p>What I find very interesting is that although these activation functions can all be derived
from the sigmoid they have very different properties from the sigmoid. I’m not sure we can
derive all the emergent properties of a neural network with a particular function using the
tools of real analysis but this is an interesting question that I shall certainly revisit
in the near future.</p>
<p><strong>References</strong>:</p>
<ol>
<li>Understanding the difficulty of training deep feedforward neural networks (X. Glorot & Y. Bengio. 2010. AISTATS.)</li>
<li>Rectified Linear Units Improve Restricted Boltzmann Machines (V. Nair & G. Hinton. 2010. ICML.)</li>
</ol>Aidan Rockelab 1: unsupervised pre-training, dropout and representation learning2017-06-15T00:00:00+00:002017-06-15T00:00:00+00:00/deep/learning/2017/06/15/experiment_1<div class="image-wrapper">
<img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/labs/lab_1/activation_space.png" alt="" />
<p class="image-caption">two dimensional embedding of binary activations</p>
</div>
<h2 id="motivation">Motivation:</h2>
<p>For my first set of Pauli Space experiments, I thought I would start by attempting to answer elementary questions which might lead
to more data efficient deep models and algorithms. First, I shall explore the connection between unsupervised learning and representation
learning which is what leads to better generalisation [1]. Second, I shall focus on questions that investigate the example ordering
problem as I believe this represents a fundamental limitation of gradient-based algorithms [3]. One of these questions is whether
Bernoulli dropout counteracts the example ordering problem by encouraging a globally weaker-than-exponential rate of convergence
as we have a new model that encounters new batches of examples at each iteration, effectively allowing exponentially many models
to discover exponentially many local minima.</p>
<p>Unlike most scientists, I publicly announce my hypotheses prior to performing my experiments which avoids the revisionist tendency
that is prevalent in many academic circles including machine learning. As my previous blog post made important remarks on the necessity
for better scientific methodology within the field of machine learning, I plan to uphold this standard and improve upon it as I
believe this shall naturally lead to better science.</p>
<p>Finally, the ultimate goal of this series of experiments shall be the same for all future experiments. That is to find powerful
mathematical abstractions of deep models that make verifiable predictions.</p>
<h2 id="experimental-setup">Experimental setup:</h2>
<ol>
<li>Computing:
<ol>
<li>Device: MacBook Air 13”</li>
<li>Processor: 1.6 GHz Intel Core i5</li>
<li>Memory: 8 GB 1600 MHz DDR3</li>
</ol>
</li>
<li>Data:
<ol>
<li>MNIST: 28x28 handwritten digits with 10 classes // train: 60k examples // test: 10k examples</li>
<li>CIFAR-10: 32x32 color images with 10 classes // train: 50k examples // test: 10k examples</li>
</ol>
</li>
<li>Baseline models:
<ul>
<li>fully-connected ReLU network: [784/1024,500,500,500,10]</li>
</ul>
</li>
<li>Infrastructure:
<ol>
<li>Keras</li>
<li>TensorFlow</li>
</ol>
</li>
<li>Timeline:
<ol>
<li>Start: 2017-06-15</li>
<li>End: 2017-06-19</li>
</ol>
</li>
</ol>
<h2 id="experiments">Experiments:</h2>
<p>The goal of this experiment is to perform the following analyses for deep rectifier networks trained with unsupervised
pre-training vs without unsupervised pre-training and to develop theoretical interpretations of the observed results.</p>
<ol>
<li>Visualize the activation space:
<ol>
<li>Apply a binary mask to ReLU activations then concatenate binary activations so we have a binary vector per example</li>
<li>Apply PCA(n=2) to the binary vectors from the training set and visualise the resulting clusters</li>
<li>Do we observe nice clustering in the activation space?</li>
<li>Do we observe quantitative indications of variable disentangling? <br /><br /></li>
</ol>
</li>
<li>Analyse the total number of distinct binary vectors (i.e. representations) as a function of the number of classes:
<ol>
<li>What happens when we train a model on only <script type="math/tex">% <![CDATA[
n < 10 %]]></script> classes where we choose a subset of the 10 original classes?</li>
<li>Is there an observable relationship between the number of representations per class and the number of classes?</li>
<li>In particular, for fixed network size does the relative sparsity of subnetwork nodes increase as we increase the number of classes? <br /><br /></li>
</ol>
</li>
<li>Analyse the fraction of active units per class (i.e. variable-size representation):
<ol>
<li>Is the order of the variable-size representation respected when I choose sub-classes?</li>
<li>Can the empirical variable-size representation be assigned a theoretical interpretation? <br /><br /></li>
</ol>
</li>
<li>Analyse the example ordering problem:
<ol>
<li>Is the rate of change of the weight norms much more important during the early epochs compared to later epochs?</li>
<li>Does this problem exist to the same degree for all gradient-based optimizers and can we describe this relationship mathematically?</li>
<li>To what extent is this problem anticipated by models of optimizers?</li>
<li>Can we find training algorithms that are both efficient and don’t suffer from the example ordering problem?</li>
<li>Does Bernoulli dropout make this problem almost non-existent for gradient-based optimizers?</li>
<li>How does weight normalisation alleviate this problem? <br /><br /></li>
</ol>
</li>
<li>How do the above analyses generalise:
<ol>
<li>To ReLU networks with really wide layers?</li>
<li>To ReLU networks with much greater depth?</li>
<li>Can we find concise mathematical descriptions for these generalisations?</li>
</ol>
</li>
</ol>
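<p>Steps 1.1 and 2 of the list above both rely on turning each example into a binary activation vector and counting how many distinct vectors occur. A minimal pure-Python sketch follows. It assumes a tiny untrained network with random weights and random inputs as a stand-in for the [784,500,500,500] network, and the helper name <code>binary_code</code> is hypothetical.</p>

```python
import random


def binary_code(x, layers):
    """Concatenated on/off pattern of every ReLU unit for input x."""
    code, activation = [], x
    for weights in layers:
        pre = [sum(w * a for w, a in zip(row, activation)) for row in weights]
        activation = [max(p, 0.0) for p in pre]       # ReLU
        code.extend(1 if p > 0 else 0 for p in pre)   # binary mask
    return tuple(code)


rng = random.Random(0)
sizes = [4, 6, 6]  # toy layer sizes (assumed), input dimension first
layers = [
    [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
inputs = [[rng.random() for _ in range(sizes[0])] for _ in range(200)]

# Each distinct tuple corresponds to one distinct subnetwork (representation).
distinct_codes = {binary_code(x, layers) for x in inputs}
print(len(distinct_codes))
```

<p>Grouping the inputs by class label before building the set would give the number of representations per class asked about in step 2.2.</p>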
<h2 id="hypotheses">Hypotheses:</h2>
<p>Here are my conjectures for a handful of questions:</p>
<p>1.4 We will observe clear indications of variable disentangling that can be determined by the fraction of shared nodes per class, and as
the training set increases the variance of the fraction of shared nodes per class will decrease accordingly.</p>
<p>2.3 For fixed network size the relative sparsity of subnetwork nodes increases as we increase the number of classes. My argument for this
is that sparsity is a result of local competition between subnetworks and competition (i.e. complex co-adaptation) becomes a bigger
issue as we increase the number of classes [5]. This is my own interpretation of the paper by R. Srivastava.</p>
<p>3.1 The order of the variable-size representation (as measured by the fraction of active units per class) will be respected when I choose
subclasses. My argument is that if this turns out to be false then the notion of variable-size representation will have to be redefined.
At present it’s meant to capture the notion of an adaptively efficient encoding.</p>
<p>4.4 I believe we can find training algorithms that are both efficient and don’t suffer from the example ordering problem but we will need to
augment networks with memory in order to do inference forwards and backwards in time. As a first approximation, we can probably use Monte Carlo dropout to perform inference but this will have to be investigated further.</p>
<p>5.1 I believe that if I can find quantitative theoretical justifications for experimental analyses 1-4, these analyses will generalise
to large ReLU networks.</p>
<p>Note: I’ve listed the minimum number of papers which I think I must reference for this experiment but this list may expand. In particular,
if you believe there’s a paper I ought to reference please let me know.</p>
<h2 id="references">References:</h2>
<ol>
<li>Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)</li>
<li>Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)</li>
<li>Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)</li>
<li>The Physical Systems behind Optimization (L. Yang et al. 2017.)</li>
<li>Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)</li>
<li>Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)</li>
</ol>Aidan Rocke