Pauli Space
Investigations on the foundations for intelligence.
<h1>Review: Froude and the contribution of naval architecture to our understanding of bipedal locomotion (2004)</h1>
<p>Aidan Rocke, 2018-10-26</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/froude.png" /></center>
<center>figure 1: obtained from the original paper</center>
<h2 id="overview">Overview:</h2>
<ol>
<li>
<p>When comparing two morphologically similar (i.e. geometrically similar) species that may differ in size,
it’s natural to ask under what circumstances their gaits might be similar. Given that differences in size may be important (consider,
for example, the difference in size between a cat and a rhinoceros), a dimensionless analysis is necessary.</p>
</li>
<li>
<p>130 years ago, it was William Froude who introduced a dimensionless parameter that proved to be an important criterion for dynamic similarity
when comparing boats of different hull lengths. In particular, dimensionless analysis using <script type="math/tex">Fr</script> proved very useful in understanding why the
<em>Great Eastern</em>, the largest ship in the world at the time, was a massive failure. In fact, the ship couldn’t earn enough to pay for its fuel.</p>
</li>
<li>
<p>Essentially, Froude found that large and small models of geometrically similar hulls produced similar wave patterns when their Froude numbers <script type="math/tex">Fr</script>
were equal. To be precise, <script type="math/tex">Fr</script> is equal to:</p>
<script type="math/tex; mode=display">\begin{equation}
Fr = \frac{\lVert v \rVert^2}{gL}
\end{equation}</script>
<p>where <script type="math/tex">v</script> is the velocity, <script type="math/tex">g</script> is the gravitational acceleration and <script type="math/tex">L</script> is the characteristic length.</p>
</li>
<li>
<p>While Froude concentrated on the movement of ships, it was D’Arcy Wentworth Thompson who first recognised the connection between the Froude
number and animal locomotion. On page 23 of <em>On Growth and Form</em>, Thompson notes:</p>
<blockquote>
<p>In two similar and closely related animals, as is also in two steam engines, the law is bound to hold that the rate of working must tend to
vary with the square of the linear dimension, according to Froude’s Law of steamship comparison.</p>
</blockquote>
</li>
<li>
<p>Despite the popularity of Thompson’s work, the importance of the Froude number as a tool for analysing locomotion wasn’t fully appreciated
until Robert Alexander, a Zoology professor at Leeds, empirically demonstrated that the movement of animals of geometrically similar
form but different size would be dynamically similar when they moved with the same Froude number.</p>
</li>
</ol>
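<p>To make the definition concrete, here is a minimal Python sketch of the Froude number and of the speed at which a larger, geometrically similar animal moves with the same <script type="math/tex">Fr</script>. The leg lengths and speeds below are illustrative round numbers of my own choosing, not measurements from the paper:</p>

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def froude(v, L):
    """Froude number Fr = |v|^2 / (g * L) for speed v (m/s) and characteristic length L (m)."""
    return v**2 / (G * L)

def similar_speed(v1, L1, L2):
    """Speed at which an animal with leg length L2 moves with the same Fr
    as an animal with leg length L1 moving at speed v1: v2 = v1 * sqrt(L2 / L1)."""
    return v1 * math.sqrt(L2 / L1)

# Hypothetical leg lengths: cat ~0.25 m, rhinoceros ~1.2 m.
v_cat = 3.0  # m/s, an illustrative speed for the cat
v_rhino = similar_speed(v_cat, 0.25, 1.2)

print(froude(v_cat, 0.25))   # Fr for the cat
print(froude(v_rhino, 1.2))  # the same Fr for the rhino, by construction
```

<p>By Alexander’s hypothesis, the two animals should then exhibit dynamically similar gaits at these two speeds.</p>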
<h2 id="alexanders-dynamic-similarity-criteria">Alexander’s dynamic similarity criteria:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/rhino.jpg" /></center>
<center>figure 2: African rhino in its natural habitat</center>
<p><br /></p>
<p>One of Alexander’s most striking observations was that the galloping movements of cats and rhinoceroses are remarkably similar even though the rhino is three
orders of magnitude larger. After much empirical analysis, Alexander postulated five dynamic similarity criteria in [3]:</p>
<ol>
<li>Each leg has the same phase relationship.</li>
<li>Corresponding feet have equal duty factors (% of cycle in ground contact).</li>
<li>Relative (i.e. dimensionless) stride lengths are equal.</li>
<li>Forces on corresponding feet are equal multiples of body weight.</li>
<li>Power outputs are proportional to body weight times speed.</li>
</ol>
<p>Alexander hypothesised, and provided the experimental evidence to demonstrate, that animals meet these five criteria when they travel at speeds corresponding to equal values of <script type="math/tex">Fr</script>. This important work indicates that although the Froude number may appear to oversimplify complex problems in biomechanics, it has empirically proved to be an important factor in the dimensionless analysis of dynamic similarity. At this stage we may marvel at the fact that the Froude number, which
emerged from a problem in hydrodynamics, should also play a key role in the comparative analysis of terrestrial locomotion.</p>
<p>While this theoretical issue isn’t addressed in [1], I have attempted to show the mathematical connection between the Froude number as it occurs in hydrodynamics
and the Froude number as it occurs in biomechanics.</p>
<h2 id="analysis-of-the-froude-number">Analysis of the Froude number:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/pendulum.png" /></center>
<center>figure 3: The inverted pendulum as a model for bipedal walking</center>
<p><br /></p>
<p>In this section I shall demonstrate that in both the cases of a surface water wave and a bipedal walker, the Froude number provides a similar description. In fact,
if we note that a surface water wave is approximately a transverse wave and the walking motion of a biped is approximately a longitudinal wave then the Froude number
is simply the force magnitude responsible for linear displacement divided by the magnitude of the gravitational force.</p>
<h3 id="surface-water-waves">Surface water waves:</h3>
<p>Given a surface water wave moving through a medium with density <script type="math/tex">\rho</script> at constant velocity <script type="math/tex">\lVert v \rVert = \frac{L}{T}</script> (i.e. longitudinal displacement <script type="math/tex">L</script> within a period <script type="math/tex">T</script>), the magnitude of the inertial force required to halt the motion of a volume <script type="math/tex">L^3</script> is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\lVert F_i \rVert = \text{mass}*\text{acceleration} & = \rho L^3 \cdot \frac{\lVert \Delta v \rVert}{\Delta t } \\
& = \rho L^3 \cdot \frac{\lVert v \rVert}{T} \\
& = \rho L^2 \cdot \lVert v \rVert \cdot \frac{L}{T} \\
& = \rho L^2 \cdot \lVert v \rVert^2 \\
\end{split}
\end{equation} %]]></script>
<p>On the other hand, the magnitude of the gravitational force acting on this volume is given by:</p>
<script type="math/tex; mode=display">\begin{equation}
\lVert F_g \rVert = \text{mass}*\text{gravitational acceleration} = \rho L^3 \cdot \lVert g \rVert
\end{equation}</script>
<p>and we may define the Froude number in terms of the ratio of these force magnitudes:</p>
<script type="math/tex; mode=display">\begin{equation}
Fr = \frac{\lVert F_i \rVert}{\lVert F_g \rVert} = \frac{\rho L^2 \cdot \lVert v \rVert^2}{\rho L^3 \cdot \lVert g \rVert } = \frac{\lVert v \rVert^2}{Lg}
\end{equation}</script>
<p>and this number describes the stability of the flowing wave as shown in <a href="https://www.youtube.com/watch?v=7AKNR881BFE">this video</a>. In particular, when <script type="math/tex">% <![CDATA[
Fr < 1 %]]></script> the shallow water wave is stable and its motion is dominated by gravitational forces, so surface waves generated by downstream disturbances can travel upstream; when <script type="math/tex">Fr > 1</script> this is impossible.</p>
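<p>A quick numerical sanity check of the force-ratio reduction above, together with the sub/supercritical classification just described; the sample density, length and speeds are arbitrary values chosen for illustration:</p>

```python
G = 9.81  # m/s^2

def froude_from_forces(rho, L, v):
    """Fr as the ratio of inertial to gravitational force magnitudes."""
    F_inertial = rho * L**2 * v**2  # rho L^2 |v|^2
    F_gravity = rho * L**3 * G      # rho L^3 g
    return F_inertial / F_gravity

def froude(v, L):
    """The reduced form Fr = |v|^2 / (g L); rho and one factor of L cancel."""
    return v**2 / (G * L)

def wave_regime(v, L):
    """Fr < 1: disturbances can travel upstream; Fr > 1: they cannot."""
    return "subcritical" if froude(v, L) < 1 else "supercritical"

# The density cancels, so any rho gives the same Fr as the reduced form:
print(froude_from_forces(rho=1000.0, L=0.5, v=1.0))  # equals froude(1.0, 0.5)
print(wave_regime(1.0, 0.5))  # slow flow over this characteristic length
print(wave_regime(4.0, 0.5))  # fast flow over the same length
```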
<h3 id="bipedal-walkers">Bipedal walkers:</h3>
<p>The inverted pendulum is a useful model for analysing bipedal walking as a leg forms the radius of an arc and the motion of the biped may then be approximated by a longitudinal wave. Furthermore, if we make reasonable modelling assumptions we may infer the speed limits on a bipedal walker.</p>
<p>In particular, we make the following assumptions:</p>
<ol>
<li>We neglect air resistance.</li>
<li>We assume that the legs are rigid and interact with a single point on the ground.</li>
<li>We neglect any pelvic motion.</li>
<li>We neglect the inertial role of arm motions.</li>
</ol>
<p>Given these assumptions, note that the force magnitude associated with motion in a circular arc is given by:</p>
<script type="math/tex; mode=display">\begin{equation}
\lVert F \rVert = \frac{M \lVert v \rVert^2}{L}
\end{equation}</script>
<p>where <script type="math/tex">M</script> is the mass of the biped, <script type="math/tex">\frac{\lVert v \rVert^2}{L}</script> is the inward acceleration of the mass, <script type="math/tex">v</script>
is the tangential velocity and <script type="math/tex">L</script> is the radius of the circular arc (i.e. the limb length).</p>
<p>Note further that during walking, the inward acceleration due to normal forces on the foot shouldn’t exceed the gravitational acceleration:</p>
<script type="math/tex; mode=display">\begin{equation}
g > \frac{\lVert v \rVert^2}{L}
\end{equation}</script>
<p>Meanwhile, if we consider the ratio of the centripetal force magnitude to the gravitational force magnitude we have:</p>
<script type="math/tex; mode=display">\begin{equation}
Fr = \frac{\lVert F_c \rVert}{\lVert F_g \rVert} = \frac{\frac{M \lVert v \rVert^2}{L}}{Mg}=\frac{\lVert v \rVert^2}{gL}
\end{equation}</script>
<p>and in order to allow stable walking we must have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
Fr < 1 \implies \lVert v \rVert < \sqrt{gL}
\end{equation} %]]></script>
<p>so the maximum walking speed of a biped is proportional to the square root of <script type="math/tex">L</script>, the length of its legs. Likewise,
when <script type="math/tex">Fr > 1</script> running is necessary. It follows that both in the case of a shallow water wave and bipedal walkers, <script type="math/tex">Fr \approx 1</script>
defines the boundary between fundamentally different dynamics.</p>
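<p>The inequality above yields a concrete speed limit. As a rough sketch (the leg length is an assumed typical value of my own, not a figure from the paper), a human with a leg length of about 0.9 m should have a maximum walking speed of roughly 3 m/s:</p>

```python
import math

G = 9.81  # m/s^2

def max_walking_speed(L):
    """Upper bound on walking speed from Fr < 1: |v| < sqrt(g * L)."""
    return math.sqrt(G * L)

leg_length = 0.9  # m, an assumed typical adult human leg length
v_max = max_walking_speed(leg_length)
print(round(v_max, 2))  # just under 3 m/s; faster travel requires running (Fr > 1)
```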
<h2 id="open-questions">Open questions:</h2>
<ol>
<li>Does the implication of dynamic similarity via equal Froude number hold for nonlinear motions?</li>
<li>The assumption of constant velocity appears to require steady-state assumptions. Can the Froude number be generalised to handle the case of intermittent locomotion?</li>
</ol>
<p>These questions might have been answered by researchers in the robotics and biomechanics community but at this point I myself don’t have
satisfactory answers.</p>
<h1 id="references">References:</h1>
<ol>
<li>Froude and the contribution of naval architecture to our understanding of bipedal locomotion. C. Vaughan &amp; M. O’Malley. 2004.</li>
<li>On Growth and Form. D’Arcy Wentworth Thompson. 1917.</li>
<li>A dynamic similarity hypothesis for the gaits of quadrupedal mammals. Alexander RM, Jayes AS. 1983.</li>
</ol>

<h1>Review: A new galloping gait in an insect (2013)</h1>
<p>Aidan Rocke, 2018-10-23</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/beetle_results.png" /></center>
<center>figure 1: obtained from the original paper</center>
<h2 id="main-results">Main results:</h2>
<ol>
<li>
<p>An estimated 3 million insect species all walk using variations of the alternating tripod gait, in which at any instant the organism
holds one stable triangle of legs steady while swinging the opposite triangle forward.</p>
</li>
<li>
<p>In this paper, the authors report the discovery that three different flightless beetles use an additional gallop-like gait which has
never been associated with any insect before.</p>
</li>
<li>
<p>Like a bounding hare, the three <em>Pachysoma</em> species propel their bodies forward by synchronously stepping with both middle legs
and then both front legs. No aerial phase occurs, but the leg coordination is that of a gallop.</p>
</li>
<li>
<p>Although <em>P. endroedyi</em>, <em>P. hippocrates</em> and <em>P. glentoni</em> can walk using the normal tripod gait, these beetles usually employ a unique
galloping gait in which they move each pair of legs synchronously, stepping alternately with the front and middle legs. The hind legs are dragged
behind even if the beetle carries no load, and seem to contribute little to propulsion.</p>
</li>
<li>
<p>The authors found no speed advantage for the <em>Pachysoma</em> galloping gait. On the contrary, when observing beetles on sandpaper or fabric
surfaces (to provide grip), the authors measured running speeds in the tripod-walking <em>P. striatum</em> that were significantly faster than those of the
sympatric, similarly-sized <em>P. endroedyi</em>. This is true both in absolute terms (<script type="math/tex">9.1 \pm 1.5</script> cm/s vs. <script type="math/tex">7.6 \pm 1.6</script> cm/s) and in relation
to body size (<script type="math/tex">4.2 \pm 0.9</script> body lengths/s vs. <script type="math/tex">3.1 \pm 0.6</script> body lengths/s).</p>
</li>
</ol>
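<p>Assuming the reported values are means with standard deviations over the 10 beetles per species mentioned in the methodology, a pooled two-sample t-test (my assumption about the analysis, not something the paper states) does put the absolute speed difference just past the conventional significance threshold:</p>

```python
import math

# Reported absolute speeds (cm/s), assumed to be mean and standard deviation, n = 10 each.
m1, s1, n1 = 9.1, 1.5, 10  # P. striatum (tripod gait)
m2, s2, n2 = 7.6, 1.6, 10  # P. endroedyi (galloping gait)

# Pooled two-sample t statistic.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

df = n1 + n2 - 2  # 18 degrees of freedom
t_crit = 2.101    # two-sided critical value at alpha = 0.05, df = 18
print(round(t, 3), abs(t) > t_crit)  # the difference clears the threshold, narrowly
```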
<p>I find this result important because I actually formulated my galloping hexapod hypothesis before learning about this paper. The authors’ observation
doesn’t contradict my hypothesis that hexapods, regardless of brain wiring, aren’t suitable for galloping gaits.</p>
<h2 id="methodology">Methodology:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/beetle.png" /></center>
<center>figure 2: obtained from the accompanying video</center>
<ol>
<li>
<p>10 beetles each of <em>P. striatum</em> and <em>P. endroedyi</em> were used, but the authors don’t provide details concerning how their speeds were
measured on sandpaper:</p>
<p>a. What devices were used?</p>
<blockquote>
<p><span style="color:steelblue">From my correspondence with the main author, Jochen Smolka, I learned that the beetles were filmed from above with a high-speed camera. The movement of their legs was manually tracked afterwards (by clicking on the position of each leg, frame by frame) using their own video-tracking software.</span></p>
</blockquote>
<p>b. What distance was used?</p>
<blockquote>
<p><span style="color:steelblue">If we analyse figure 2, the distance appears to be approximately one foot. According to Dr. Smolka, these platforms were also used for some other experiments, where they were introduced into the beetles’ walking path as ramps that the beetles had to cross; this was a consideration for the size. Moreover, Dr. Smolka claims that in a pilot experiment they calculated the speed of several beetles that were moving naturally in the field, and found no difference. I’ll assume they used a two-sided t-test here.</span></p>
</blockquote>
<p>c. Assuming the beetles had the same starting position <script type="math/tex">A</script>, how were they motivated to reach the endpoint <script type="math/tex">B</script>?</p>
<blockquote>
<p><span style="color:steelblue">According to Dr. Smolka, beetles keep going in a straight line even if you pick them up and place them elsewhere. Smolka claims that this is their natural exploration behaviour, as a straight line is the easiest way to make sure you don’t check the same spot twice (for food, mates, etc.). To be honest, I’m actually quite surprised by the exploration behaviour<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> of <em>P. endroedyi</em>. I would have expected some variation of a Lévy flight search rather than a straight line. I must add that the authors have mainly compared the average speed of exploration, which is probably different from their maximum speeds when fleeing from a predator. On this point Smolka notes that they selected the beetle species with the least developed predator-avoidance response, which are therefore easiest to handle. Moreover, he adds that there is usually nowhere to go for these small, flightless desert beetles, especially when they don’t have a burrow, so running away isn’t a good option.</span></p>
</blockquote>
</li>
<li>
<p>Why is it that of the three galloping <em>Pachysoma</em> beetles, only the galloping speeds of <em>P. endroedyi</em> were measured in a laboratory?</p>
</li>
</ol>
<blockquote>
<p><span style="color:steelblue">On this matter I just received an email from the main author, Jochen Smolka, and he provided two important reasons. First, the two other beetle species are quite rare, so finding a large enough sample size would have been difficult. Second, <em>P. hippocrates</em> and <em>P. glentoni</em> are extremely skittish, so filming from above the sandpaper platform would have been impossible.</span></p>
</blockquote>
<h2 id="open-questions">Open questions:</h2>
<p>Assuming that the authors’ analysis holds and <em>P. endroedyi</em> is representative of all galloping <em>Pachysoma</em>, we may ask why this rare gait evolved.
In particular, the authors ask:</p>
<ol>
<li>
<p>Does it provide an advantage in terms of energy consumption or mechanical stress?</p>
</li>
<li>
<p>Does it make it easier to move straight or stabilise the head and eyes while transporting large loads across shifting sands?</p>
</li>
</ol>
<p>Besides these last two questions, I think it may be interesting to explore a mechanical model of <em>P. endroedyi</em> in a computer simulation
and try to figure out whether there are other dynamically stable galloping gaits that <em>P. endroedyi</em> could use which involve the use of all
six limbs.</p>
<h1 id="references">References:</h1>
<ol>
<li>A new galloping gait in an insect. Smolka et al. 2013.</li>
</ol>
<h1 id="footnotes">Footnotes:</h1>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Dr. Smolka also provided the following anecdotal evidence concerning two phases in <em>P. endroedyi</em> exploratory behaviour. When they emerge from the ground (after feeding/hibernating for many days, weeks, or months), they perform straight-line search until they find a significant food source (dung or, for some species, plant detritus). They then build a burrow nearby (which can take an hour or more). Once they emerge from this new feeding burrow, they show more of a “random walk”-type search, presumably because they do not want to get too far away from this burrow. (Some of the larger species often only forage within a few centimetres distance once they have built a burrow.) <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

<h1>Review: The co-ordination of insect movements (1951)</h1>
<p>Aidan Rocke, 2018-10-23</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/slow_motion.png" /></center>
<center>figure 1: obtained from the original paper</center>
<h2 id="background">Background:</h2>
<ol>
<li>
<p>Most researchers at the time of publication (1951) considered that insect movement was accomplished via the alternation of two tripods
of support, each composed of the foreleg and hindleg on one side together with the middle leg on the opposite side. As a result, although
insects have six legs, their movement could be analysed as if they only had two alternating groups.</p>
</li>
<li>
<p>Demoor in particular extended the dynamics of the alternating groups of diagonal legs to include the majority of Arthropods. He concluded
that the movement of insects conformed to the general description that walking is a series of checked falls.</p>
</li>
<li>
<p>The concept that walking is a series of checked falls seems to have little meaning except in the case of bipeds or tetrapods that are
moving fast and therefore possess dynamic stability.</p>
</li>
<li>
<p>The walking tetrapod may stop at any phase of its walking cycle and not fall over, as the center of gravity lies within the area of support.</p>
</li>
<li>
<p>At the time of publication, a general consensus had emerged concerning the roles fulfilled by the three pairs of legs during walking. The forelegs
were considered to have a tractive function, the hind pair to propel, and the middle legs to act as fulcra.</p>
</li>
</ol>
<p>Finally, in the introduction the authors claim that in the present century the focus of research has moved from the mechanism of movement to a
study of the role of the brain and different parts of the nervous system in the co-ordination of locomotory movements. This statement is interesting
because the authors then make an original discovery concerning insect movement that is simple, purely biomechanical and yet general. I think it also
says something important about the effective complexity of neural circuits controlling the locomotion of terrestrial insects.</p>
<h2 id="materials-and-methods">Materials and Methods:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/sinclair_cine.jpg" /></center>
<center>figure 2: A Sinclair ciné camera</center>
<ol>
<li>The authors made detailed studies on the cockroaches <em>Periplaneta americana</em> and <em>Blatta orientalis</em> &amp; various beetles <em>Dytiscus marginalis</em>,
<em>Hydrophilus piceus</em>, <em>Carabus violaceus</em>, <em>Chrysomela orichalcea</em> and <em>Blaps mucronata</em>.</li>
<li>According to the authors, the walking movements of the other insects did not appear to differ from their selection of species in any essential
feature. <em>I’m not sure whether to take this statement at face value.</em></li>
<li>The insects were usually filmed from above as they walked over graph paper, but simultaneous side and ventral views, obtained with the aid of
a mirror, were most valuable in describing the cycles of individual leg movements.</li>
<li>A Sinclair ciné camera was used at speeds of 16-32 frames per second with illumination provided by two Photoflood lights at a distance of 3 ft.
According to the authors, this lighting disturbed the cockroaches but their negative phototaxis assisted the photography as they ran towards a
dark box placed at one side of the field.</li>
</ol>
<h2 id="results">Results:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/force_diagram.png" /></center>
<center>figure 3: An analysis of force on the legs of the cockroach</center>
<p>In order to understand the results it’s first necessary to introduce the following terms:</p>
<p><strong>Protraction:</strong> the complete movement forwards of the whole limb relative to its articulation with the body.</p>
<p><strong>Retraction:</strong> the remaining half of the cycle, between the instant when the leg is placed on the ground and the time it’s raised and protraction begins.</p>
<p>In theory, the durations of protraction and retraction should be equivalent, but there’s always a small delay between the protractions of a tripod of limbs.
As a result of their investigation, the authors discovered the following pair of rules:</p>
<ol>
<li>No fore or middle leg is protracted until the leg behind has taken up its supporting position.</li>
<li>Each leg alternates with the contralateral limb of the same segment.</li>
</ol>
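<p>The two rules are simple enough to state as executable constraints. Below is a minimal sketch (the leg labels and phase encoding are my own, not the authors’) that checks each ground-contact phase of a gait against both rules; the alternating tripod satisfies them, while a gallop-like gait with synchronous leg pairs does not:</p>

```python
# Legs: L1/R1 fore, L2/R2 middle, L3/R3 hind.  True = foot on the ground.
BEHIND = {"L1": "L2", "L2": "L3", "R1": "R2", "R2": "R3"}

def obeys_rules(phases):
    """Check a sequence of ground-contact states against the two rules above."""
    for phase in phases:
        # Rule 1: no fore or middle leg is protracted (off the ground)
        # unless the leg behind it is in its supporting position.
        for leg, back in BEHIND.items():
            if not phase[leg] and not phase[back]:
                return False
        # Rule 2: each leg alternates with the contralateral leg of the same segment.
        for seg in "123":
            if phase["L" + seg] == phase["R" + seg]:
                return False
    return True

# Alternating tripod: {L1, R2, L3} swing while {R1, L2, R3} support, then swap.
tripod = [
    {"L1": False, "R1": True, "L2": True, "R2": False, "L3": False, "R3": True},
    {"L1": True, "R1": False, "L2": False, "R2": True, "L3": True, "R3": False},
]

# Gallop-like phases: both legs of a pair swing together (as in the Pachysoma gait).
gallop = [
    {"L1": True, "R1": True, "L2": False, "R2": False, "L3": True, "R3": True},
    {"L1": False, "R1": False, "L2": True, "R2": True, "L3": True, "R3": True},
]

print(obeys_rules(tripod))  # True
print(obeys_rules(gallop))  # False: violates contralateral alternation
```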
<p>Furthermore, if we define <script type="math/tex">p</script> to be the protraction time and <script type="math/tex">r</script> to be the retraction time the authors make the following general observations:</p>
<ol>
<li>
<p>An increase in speed is generally accompanied by a decrease in the times of both protraction (p) and retraction (r). This increase in stride frequency
is then accompanied by a reduction in stride duration and an increase in the distance between successive points of support. The range of speeds is continuous
and no distinction could be made between walking and running.</p>
</li>
<li>
<p>A system of rigidly alternating tripods would result in <script type="math/tex">\frac{p}{r}=1</script> but this is never quite realised as there is always a small delay between
the protraction of three legs of a triangle.</p>
</li>
<li>
<p>It’s concluded that insects are the end-product of a process of limb reduction among terrestrial Arthropods in which <script type="math/tex">\frac{p}{r} \rightarrow 1</script>
and yet the animal retains static stability throughout the whole cycle.</p>
</li>
</ol>
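<p>As an aside, the limit <script type="math/tex">\frac{p}{r} \rightarrow 1</script> can be restated in terms of the duty factor used in Alexander’s criteria. Treating retraction as the support phase, the duty factor is <script type="math/tex">\frac{r}{p+r}</script>, which tends to <script type="math/tex">\frac{1}{2}</script> from above as <script type="math/tex">\frac{p}{r} \rightarrow 1</script>; this restatement is my own, not the authors’:</p>

```python
def duty_factor(p, r):
    """Fraction of the cycle in ground contact, with p = protraction time
    and r = retraction (support) time."""
    return r / (p + r)

# As p/r -> 1 the duty factor approaches 0.5 from above:
for ratio in [0.25, 0.5, 0.9, 0.99]:
    print(ratio, round(duty_factor(ratio, 1.0), 4))

# With duty factor >= 0.5, the two tripods can alternate so that a complete
# triangle of support is on the ground at every instant, which is the
# static-stability condition retained throughout the cycle.
```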
<p>I find the last point quite ambitious, and I think more detailed justifications are desirable, although the authors defend this proposition well.</p>
<h2 id="remarks">Remarks:</h2>
<ol>
<li>I find this paper very interesting as it suggests that most tripod gaits have a relatively simple physical explanation, and that the neural
mechanisms controlling terrestrial insect locomotion should be relatively simple.</li>
<li>I think that numerical experiments with a variety of tripod gaits using the rules given in the results section could provide insight into the
special stability advantages of six-limbed organisms. This is something I’ll look into soon but it would require specifying an insect morphology
that approximates its biomechanics quite well. I wouldn’t mind doing this for <em>P. endroedyi</em> described in [2] but this would take some time to do properly.</li>
<li>There is a lot more in this paper, entire pages, concerning the physiological constraints on Arthropod locomotion that I haven’t included in this review.
They aren’t of direct interest to me right now but they are also very well written and I’ll probably return to those sections in the near future.</li>
</ol>
<h1 id="references">References:</h1>
<ol>
<li>The co-ordination of insect movements. G.M. Hughes. 1951.</li>
<li>A new galloping gait in an insect. Smolka et al. 2013.</li>
</ol>

<h1>Why don’t hexapods gallop?</h1>
<p>Aidan Rocke, 2018-10-21</p>
<h2 id="motivation">Motivation:</h2>
<p>Considering that the fastest observed land animals are quadrupeds that employ a rectilinear gallop, we are led to consider
three natural questions. First, in the presence of important gravitational effects (i.e. greater than 1 newton) are four limbs necessary
and sufficient for the rectilinear gallop? Second, what mathematical argument could explain the relative lack of galloping hexapods?
Finally, thinking in terms of gait transitions, what other gaits are improbable for hexapods if galloping gaits are excluded?</p>
<p>In order to clarify the difficulty of these questions, the first two sections involve back-of-the-envelope calculations which provide a partial
understanding of the problem. While it’s acknowledged that these mathematical arguments are insufficient, the point of these calculations is to
motivate an altogether different approach to these questions. In particular, the author conjectures that all three questions may be suitably
addressed by combining developments in soft robotics and unsupervised reinforcement learning.</p>
<h2 id="four-limbs-are-necessary-and-sufficient-for-galloping">Four limbs are necessary and sufficient for galloping:</h2>
<p><em>Due to the absence of consistent axiomatic foundations for biomechanics, which precludes the possibility of a mathematical proof,
what follows is a back-of-the-envelope calculation.</em></p>
<p>Given that physical space is approximately Euclidean (i.e. <script type="math/tex">\mathbb{R}^3</script>), rapid rectilinear locomotion is a matter of survival for any
organism that must traverse the shortest distance between two points on approximately flat terrain. This is the case for many terrestrial
mammals, such as the rabbit rushing to its burrow and, by symmetry, its terrestrial predators. Mammalian quadrupeds such as the cheetah, greyhound
and rabbit provide an existence proof that four legs are sufficient for the rectilinear gallop.</p>
<p>To address the issue of necessity, we shall define the gallop as an asymmetric gait on rough surfaces with three properties at terminal velocity:</p>
<ol>
<li>It has two phases: one phase where its feet are on the ground and another phase where its feet are off the ground.</li>
<li>The forelimbs <script type="math/tex">\mathcal{T}</script> are specialised for traction whereas the hind limbs <script type="math/tex">\mathcal{P}</script> are specialised for propulsion.</li>
<li>In order to nullify the effect of moments around the organism’s center of mass, a rectilinear gallop must satisfy:</li>
</ol>
<script type="math/tex; mode=display">\begin{equation}
|\mathcal{T}|,|\mathcal{P}| \in 2\mathbb{N}
\end{equation}</script>
<p>and from this it follows that at least four legs are necessary for galloping gaits (i.e. the rotary or transverse gallop).</p>
<h2 id="are-hexapods-unsuitable-for-galloping">Are hexapods unsuitable for galloping?:</h2>
<p>Before we continue, it’s important to note that there is a serious problem with the previous argument.
Due to the existence of galloping quadrupeds, the argument of sufficiency can’t be contested, but what
about the argument of necessity? Have we actually proven that an organism with three limbs can’t
gallop? No, we haven’t.</p>
<p>Now, let’s consider the challenge of laying out an argument that hexapods aren’t suitable for galloping. This may be true
but we would need to have a general model of any physically-realisable hexapod. Even if we had irrefutable axioms of
biomechanics it’s not clear that such a conclusion may be deduced from pure analysis alone. For such a complex system
we may not be able to gain more than partial insights and computer simulations of some kind would probably be necessary.</p>
<h2 id="limitations-of-reductionist-models">Limitations of reductionist models:</h2>
<p>The challenge for any mathematically convincing explanation is that it implicitly requires dimensional analysis, which depends on the identification of a finite set of parameters governing the dynamics of locomotion. As pointed out by the authors of [4], this is problematic
for a complex system such as an animal, as an infinite number of parameters may be necessary to describe the system. There are all kinds of biological constraints,
such as the maximal heart rate, VO2 max and the organisation of the animal’s nervous system, as well as environmental constraints that we ignore.</p>
<p>In practice, a reduction to a small-parameter model is possible by specifying a particular set of dynamic observables to model (e.g. animal speed, specific locomotory behaviours). However, it must be noted that the original organism has now been projected onto a potentially arbitrary set of dynamic observables that are particularly easy for humans to measure in a lab. Such a simplified model might provide us with useful insights, but how do we arbitrate between a number of such models? In fact, given enough time
we may discover an infinite number of mathematical laws associated with the quadrupedal gallop, but none which explain why the quadrupedal gallop emerges in an ecologically realistic setting.</p>
<p>An even more serious problem in our case is that there are practically no galloping hexapods. In this case, how should a model of galloping hexapods be developed? Should we attempt to construct hexapods that gallop and analyse the determinants of their locomotory behaviour? How much time and money should we spend searching through the space of plausible hexapod morphologies? And what is a good method for conducting such a search?</p>
<h2 id="the-method-of-the-artificial">The method of the artificial:</h2>
<p>To simplify my task, I should probably start by addressing the core assumption in this article. Is galloping somehow optimal for rectilinear motion in quadrupeds?
This question requires the construction of a quadrupedal model embedded in an ecologically realistic setting(i.e. with quadrupedal predators/prey) which isn’t hardwired to gallop and from which galloping gaits spontaneously emerge. Clarifying what I mean by ecologically realistic will be an important challenge. In any case, it would probably require a departure from reductive approaches to addressing the nonexistence of galloping hexapods.</p>
<p>Two reasonable approaches to analysing complex systems are described by Pierre Oudeyer in [8]:</p>
<blockquote>
<p>The first, used by mathematicians and some theoretical biologists, consists in abstracting from the phenomenon of language a certain number of variables along with the rules of their evolution in the form of mathematical equations. Most often this resembles systems of coupled differential equations, and benefits from the framework of dynamic systems theory. The second type, which allows for modelling of more complex phenomena than the first, is that used by researchers in artificial intelligence: it consists in the construction of artificial systems implemented in computers or in robots. These artificial systems are made of programs which most often take the form of artificial software or robotic agents, endowed with artificial brains and bodies. These are then put into interaction with an artificial environment (or a real environment in the case of robots), and their dynamics can be studied. This is what one calls the “method of the artificial” (Steels, 2001)</p>
</blockquote>
<p>I think the method of the artificial would probably lead to more important insights into quadruped locomotion than exclusive use of reductive methods. That said, I am also not advocating the rejection of reductive methods. On the contrary, I find that they are a very useful starting point for analysing complex systems and generally provide economical insights into the behaviour of a complex system.</p>
<h2 id="discussion">Discussion:</h2>
<p>This question of whether quadrupeds are optimal for galloping locomotion first occurred to me approximately four years ago. In fact, <a href="https://biology.stackexchange.com/questions/21772/why-dont-mammals-have-more-than-4-limbs">one of my most popular questions</a> on the biology stackexchange queried the nonexistence of six-limbed mammals.
Two years later in 2016, I wondered whether <a href="https://physics.stackexchange.com/questions/267962/why-dont-hexapods-gallop">the nonexistence of galloping hexapods</a> could be explained by Newtonian mechanics. I actually had a positive email exchange with Steve Heim, a roboticist who answered my question. We agreed that my mathematical
arguments couldn’t settle the problem definitively, and by the end of our exchange I suspected that the problem might be impossible to resolve in a definitive manner regardless of my mathematical ability.</p>
<p>Two years on, in 2018, I think I have a reasonable strategy for approaching this problem. I am no longer aiming for a definitive resolution using reductive approaches, and my earlier suppositions may be considered working hypotheses in a weak sense. Instead, I think I can gain insight into this problem using a constructive approach whereby the morphology of the hexapod is an experimental variable and locomotory behaviours are discovered by the organism using unsupervised reinforcement learning.</p>
<h1 id="references">References:</h1>
<ol>
<li>The co-ordination of insect movements. G.M. Hughes. 1951.</li>
<li>A new galloping gait in an insect. Smolka. 2013.</li>
<li>Froude and the contribution of naval architecture to our understanding of bipedal locomotion. Vaughan et al. 2005.</li>
<li>Gait and speed selection in slender inertial swimmers. Gazzola, Argentina & Mahadevan. 2015.</li>
<li>Criteria for dynamic similarity in bouncing gaits. Bullimore & J.M. Donelan. 2007.</li>
<li>Comparing Smooth Arm Movements with the Two-Thirds Power Law and the Related Segmented-Control Hypothesis. Richardson & Flash. 2002.</li>
<li>Why Change Gaits? Dynamics of the Walk-Run Transition. F. Diedrich and W. Warren. 1995.</li>
<li>Self-Organization: Complex Dynamical Systems in the Evolution of Speech. P. Oudeyer. 2011.</li>
</ol>
<p>Aidan Rocke</p>
<h1 id="controlling-a-unicycle-with-policy-gradients">Controlling a unicycle with Policy Gradients (2018-05-09)</h1>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/unicycle_image.png" align="middle" /></center>
<h2 id="motivation">Motivation:</h2>
<p>A few weeks ago I spent some time reflecting on McGeer’s passive dynamic walkers, which walk smoothly down an inclined plane without any digital computation, and wondered whether a reinforcement learning algorithm could discover similar gaits for flat terrain [1]. From a reinforcement learning perspective, Policy Gradients are perhaps the simplest approach that may be applied to bipedal dynamics, yet these methods have demonstrated great effectiveness on continuous control tasks [2]. For this reason, I decided to start investigating this problem with Vanilla Policy Gradients.</p>
<p>A secondary motivation came from <a href="http://www.argmin.net/2018/02/20/reinforce/"><em>The Policy of Truth</em></a> article where Ben Recht, a professor of optimisation and machine learning at Berkeley, presented a harsh critique of Policy Gradient methods and the usage of stochastic policies in particular. While Policy Gradient methods definitely have important limitations, I demonstrate that the points raised by Ben Recht are non-issues. On the other hand, it must be noted that bipedal walkers can’t be reduced to reinforcement learning. If anything, considering that McGeer’s cleverly-designed walkers don’t do any computations, the embodiment of the bipedal walker must be taken seriously.</p>
<p>In order to get started on a simple problem<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, I decided to <a href="https://github.com/pauli-space/RL_unicycle_control">apply Policy Gradients to unicycles</a> which, like passive dynamic walkers, are dynamically similar to inverted pendulums.</p>
<h2 id="the-policy-gradients-formalism">The Policy Gradients formalism:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/rl.png" align="middle" /></center>
<h3 id="trajectories-policies-and-rewards">Trajectories, Policies and Rewards:</h3>
<p>A trajectory (a.k.a. a rollout) is a sequence of states <script type="math/tex">s_t</script> and actions <script type="math/tex">u_t</script> generated by a dynamical system:</p>
<script type="math/tex; mode=display">\begin{equation}
\tau_t = (u_0,...,u_t,s_0,...,s_t)
\end{equation}</script>
<p>and a policy <script type="math/tex">\pi_{\vartheta} (\cdot \lvert s)</script> is a conditional distribution over an agent’s actions <script type="math/tex">u_t \in A</script> given states <script type="math/tex">s_t \in S</script>. This conditional
distribution is typically parametrised by a function approximator (e.g. a neural network) with parameters <script type="math/tex">\vartheta</script> and serves as a stochastic policy from which we
can sample actions: <script type="math/tex">u_t \sim \pi_{\vartheta} (\cdot \lvert s_{t})</script>.</p>
<p>Now, the objective of Policy Gradients is to find a policy <script type="math/tex">\pi_{\vartheta}</script> that maximises the total reward after <script type="math/tex">H</script> time steps:</p>
<script type="math/tex; mode=display">\begin{equation}
R(\tau) = \sum_{t=0}^H R(s_t,u_t)
\end{equation}</script>
<h3 id="derivation-of-policy-gradients">Derivation of Policy Gradients:</h3>
<p>Given the reward function <script type="math/tex">(2)</script>, our objective function is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
U(\vartheta) & = \mathbb{E}\big[\sum_{t=0}^H R(s_t,u_t);\pi_{\vartheta}\big] \\
& = \sum_{\tau} P(\tau;\vartheta) R(\tau)
\end{split}
\end{equation} %]]></script>
<p>where <script type="math/tex">P(\tau;\vartheta)</script> is the probability distribution over trajectories induced by <script type="math/tex">\pi_{\vartheta}</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
P(\tau;\vartheta) = \prod_{t=0}^H \underbrace{P(s_{t+1} \lvert s_t,u_t)}_\text{dynamics} \underbrace{\pi_{\vartheta}(u_t \lvert s_t)}_\text{policy}
\end{equation}</script>
<p>Having defined <script type="math/tex">(3)</script>, the Policy Gradients update may be derived as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} U(\vartheta) & = \nabla_{\vartheta} \sum_{\tau} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} \nabla_{\vartheta} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \frac{\nabla_{\vartheta} P(\tau;\vartheta)}{P(\tau;\vartheta)}R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) R(\tau)
\end{split}
\end{equation} %]]></script>
<p>A reasonable Monte Carlo approximation to <script type="math/tex">(5)</script> would be:</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla_{\vartheta} U(\vartheta) \approx \widehat{g} = \frac{1}{N} \sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) R(\tau^i)
\end{equation}</script>
<p>where each <script type="math/tex">\tau^i</script> denotes a distinct trajectory generated by running a simulator with policy <script type="math/tex">\pi_{\vartheta}</script>.
Crucially, this approximation holds regardless of the dynamics of the environment:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} \ln P(\tau^i;\vartheta) & = \nabla_{\vartheta} \ln \big[\prod_{t=0}^H P(s_{t+1}^i \lvert s_t^i,u_t^i) \pi_{\vartheta}(u_t^i \lvert s_t^i)\big]\\
& = \nabla_{\vartheta} \big[\sum_{t=0}^H \ln P(s_{t+1}^i \lvert s_t^i,u_t^i) + \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i)\big] \\
& = \nabla_{\vartheta} \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i)
\end{split}
\end{equation} %]]></script>
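<p>Since the dynamics term vanishes from the gradient, the estimator can be computed from sampled actions and rewards alone. The following minimal one-step sketch (my own illustration, not code from the accompanying repository) uses a scalar Gaussian policy <script type="math/tex">\pi_{\vartheta}(u) = \mathcal{N}(\vartheta,1)</script>, for which <script type="math/tex">\nabla_{\vartheta} \ln \pi_{\vartheta}(u) = u-\vartheta</script>, and a toy reward <script type="math/tex">R(u) = -(u-3)^2</script>:</p>

```python
import numpy as np

# Score-function (REINFORCE) gradient estimate for a one-step problem:
# policy pi_theta(u) = N(theta, 1), reward R(u) = -(u - 3)^2.
# No model of the environment dynamics is needed anywhere.
rng = np.random.default_rng(0)
theta = 0.0
u = rng.normal(theta, 1.0, size=100_000)   # sampled actions
R = -(u - 3.0) ** 2                        # observed rewards
g_hat = np.mean((u - theta) * R)           # score * reward, averaged

# Analytically E[(u - theta) R(u)] = 6 at theta = 0, so the estimate
# is positive and pushes theta towards the optimal action at 3.
print(g_hat)
```

<p>Note that the estimate only increases the probability of actions in proportion to their observed reward; nothing about the environment’s transition model enters the computation.</p>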
<p>What is the Policy Gradients update doing?</p>
<ol>
<li>Increasing the probability of paths with positive reward.</li>
<li>Decreasing the probability of paths with negative reward.</li>
</ol>
<p>But, what if rewards are strictly non-negative?</p>
<h3 id="reducing-the-variance-of-widehatg-with-baselines">Reducing the variance of <script type="math/tex">\widehat{g}</script> with baselines:</h3>
<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/actor-critic.png" align="middle" /></center>
<p>In environments where there are no negative rewards we may extract more signal from our data by using baselines. For concreteness,
we may use a constant baseline:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\nabla_{\vartheta} U(\vartheta) \approx \widehat{g} & = \frac{1}{N} \sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) (R(\tau^i)-b) \\
b & = \frac{1}{N} \sum_{i=1}^N R(\tau^i)
\end{split}
\end{equation} %]]></script>
<p>or, even better, we may use a state-dependent baseline <script type="math/tex">b(s_t)</script> which estimates the expected value of the current state
by minimising <script type="math/tex">\lVert b(s_t)-R_t \rVert^2</script> over all trajectories during training. In this case <script type="math/tex">b</script> is a value estimator,
more often than not parametrised by a neural network, that allows the Policy Gradients model to do temporal difference learning
and therefore exploit the sequential structure of the decision problem. Moreover, using the advantage estimate <script type="math/tex">\widehat{A_t} = R_t-b(s_t)</script> rather
than the reward reduces the variance of the gradient estimate, as it extracts more signal from the observations by telling the model
how much better the current action is than what is normally done in state <script type="math/tex">s_t</script>.</p>
<p>Furthermore, it’s very useful to note that subtracting a baseline doesn’t introduce bias into the expectation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathbb{E}[\nabla_{\vartheta} \ln P(\tau;\vartheta) b] & = \int P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) b d\tau \\
& = \int \nabla_{\vartheta} P(\tau;\vartheta) b d\tau \\
& = b \nabla_{\vartheta} \int P(\tau;\vartheta) d\tau \\
& = b \nabla_{\vartheta} 1 = 0
\end{split}
\end{equation} %]]></script>
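<p>This unbiasedness is easy to verify empirically. In the sketch below (an illustration of mine, not repository code) a scalar Gaussian policy <script type="math/tex">\mathcal{N}(\vartheta,1)</script> with toy reward <script type="math/tex">R(u) = -(u-3)^2</script> yields the same expected gradient with and without a constant baseline, but with noticeably lower variance once the baseline is subtracted:</p>

```python
import numpy as np

# Check that subtracting a constant baseline leaves the score-function
# gradient estimate unbiased while shrinking its variance.
# Toy setup: pi_theta(u) = N(theta, 1), R(u) = -(u - 3)^2, theta = 0.
rng = np.random.default_rng(1)
theta = 0.0
u = rng.normal(theta, 1.0, size=1_000_000)
R = -(u - 3.0) ** 2
score = u - theta                          # grad_theta ln pi_theta(u)

g_plain = np.mean(score * R)               # no baseline
b = np.mean(R)                             # constant baseline
g_base = np.mean(score * (R - b))          # baselined estimate

print(g_plain, g_base)                     # both concentrate around 6
print(np.var(score * R), np.var(score * (R - b)))
```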
<h3 id="vanilla-policy-gradients">Vanilla Policy Gradients:</h3>
<p>By putting the above ideas together, we end up with a variant of the Vanilla Policy Gradients algorithm:</p>
<ol>
<li>Initialise the policy parameter <script type="math/tex">\vartheta</script> and baseline <script type="math/tex">b</script></li>
<li>For <script type="math/tex">\text{iter}=1,..,\text{maxiter}</script> do: <br />
a. Collect a set of trajectories <script type="math/tex">\{\tau^i\}_{i=1}^N</script> by executing the current policy <script type="math/tex">\pi_{\vartheta}</script> <br />
b. At each time step in the trajectory <script type="math/tex">\tau^i</script> compute <script type="math/tex">R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t}r_{t'}</script> and <script type="math/tex">\widehat{A_t} = R_t-b(s_t)</script>. <br />
c. Re-fit the baseline, by minimising <script type="math/tex">\lVert b(s_t)-R_t \rVert^2</script> summed over all trajectories and time steps. <br />
d. Update the policy <script type="math/tex">\pi_{\vartheta}</script> using a policy gradient estimate <script type="math/tex">\widehat{g}</script> which is a sum of terms <script type="math/tex">\nabla_{\vartheta} \ln \pi_{\vartheta}(u_t \lvert s_t) \widehat{A_t}</script></li>
<li>end for</li>
</ol>
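<p>Step 2b, which turns a rollout’s raw rewards into discounted returns, can be implemented with a single backward pass over the trajectory. A minimal sketch (the function name is my own, not from the repository):</p>

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = sum over t' >= t of gamma^(t' - t) * r_{t'}
    for every time step, via one backward pass over the rollout."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. three unit rewards with gamma = 0.5:
print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```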
<p>This is the algorithm I used to train the unicycle controller. But, before describing my implementation I’d like to address Ben Recht’s claim that continuous control researchers have no good reason to use stochastic policies.</p>
<h2 id="why-do-we-use-stochastic-policies">Why do we use stochastic policies?:</h2>
<p>In <a href="http://www.argmin.net/2018/02/20/reinforce/"><em>The Policy of Truth</em></a>, Ben Recht is rather dismissive of Policy Gradient methods and argues that stochastic policies are a modelling choice that is never better than using deterministic policies and optimal control methods in general. In particular, Ben argues that if the correct policy is deterministic then the probabilistic model class must satisfy several show-stopping constraints:</p>
<ol>
<li>rich enough to approximate delta functions</li>
<li>easy to search by gradient methods</li>
<li>easy to sample from</li>
</ol>
<p>Why then do reinforcement learning researchers use Policy Gradient methods? There are several good reasons:</p>
<ol>
<li>In high-dimensional state-spaces (e.g. images) where we might use convolutional neural networks to learn a low-dimensional state-representation <script type="math/tex">\widehat{s_t} \approx s_t</script>, state-aliasing is practically inevitable, so it makes sense to use stochastic policies to reflect the model’s uncertainty.</li>
<li>Sampling from a stochastic policy is a convenient method of exploring the state-space.</li>
<li>Stochastic policies effectively smooth out rough/discontinuous reward landscapes and allow the agent to obtain reward signals it couldn’t obtain otherwise.</li>
</ol>
<p>Moreover, Ben’s three points are actually non-issues:</p>
<ol>
<li>Neural networks, by Cybenko’s universal approximation theorem, may be used to parametrise a very broad class of probability distributions. Empirically, they are also easy to search in the space of
expressible functions using gradient methods. This is why Variational Inference has found many practical applications and <a href="http://edwardlib.org/tutorials/variational-inference">Edward</a> has gained a lot of traction among statistical machine learning researchers
and engineers.</li>
<li>Distributions parametrised by neural networks are easy to sample from. In fact, inference with neural networks is computationally cheap.</li>
<li>A corollary of my first point is that we may easily approximate delta functions. If your policy is a conditional Gaussian, which is what I used for controlling the unicycle, then your distribution definitely contains good approximations of delta functions.</li>
</ol>
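<p>The third point is easy to demonstrate: a conditional Gaussian whose (learned) standard deviation is small is, for all practical purposes, a deterministic policy. In the sketch below, <code>mu</code> and <code>log_sigma</code> stand in for the outputs of a policy network (hypothetical values, for illustration only):</p>

```python
import numpy as np

# A Gaussian policy with a small standard deviation behaves like a
# delta function around its mean: sampled actions are effectively
# the deterministic action mu.
rng = np.random.default_rng(2)
mu, log_sigma = 1.3, -6.0                  # sigma = exp(-6) ~ 0.0025
actions = rng.normal(mu, np.exp(log_sigma), size=10_000)
print(actions.std(), np.abs(actions - mu).max())
```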
<p>To be honest, I am surprised that Ben Recht didn’t bring up any real issues such as curriculum learning<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and transfer learning, where a lot of active research is currently ongoing [5]. Moreover, it’s important to note that Policy Gradient methods were developed for problems where optimal control theory, Ben Recht’s proposed alternative, doesn’t work at all.</p>
<h2 id="controlling-a-unicycle-with-policy-gradients">Controlling a unicycle with Policy Gradients:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/unicycle_image.png" align="middle" /></center>
<h3 id="modelling-assumptions">Modelling assumptions:</h3>
<p>As shown in [4], if we suppose that our unicyclist is riding on a level surface without turning, motion in the wheel plane may be modelled by a planar inverted pendulum of length <script type="math/tex">l</script> with a horizontally moving support. After cancelling out the mass term, an analysis of the force diagrams shows that we have the following Newtonian equation
of motion:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
l \ddot{\theta} & = g \sin(\theta)-\ddot{z}\cos(\theta) \\
\ddot{z} & = gh(\theta,\dot{\theta})
\end{split}
\end{equation} %]]></script>
<p>where <script type="math/tex">h</script> may be identified with a unicycle controller <script type="math/tex">\pi_{\vartheta}</script> and represents the rider’s acceleration of the wheel in reaction to the instantaneous angle and angular velocity. By defining <script type="math/tex">\alpha = \frac{g}{l}</script> we have the simpler
equation:</p>
<script type="math/tex; mode=display">\begin{equation}
\ddot{\theta} = \alpha(\sin(\theta)-h\cos(\theta))
\end{equation}</script>
<p>Now, assuming that the rider starts approximately upright (i.e. <script type="math/tex">\theta \approx 0</script>) we may linearise <script type="math/tex">(11)</script> so we have:</p>
<script type="math/tex; mode=display">\begin{equation}
\ddot{\theta} = \alpha(\theta-h)
\end{equation}</script>
<p>and there exist relatively simple stable solutions for this linearised equation such as:</p>
<script type="math/tex; mode=display">\begin{equation}
h(\theta) = a\theta, a > 1
\end{equation}</script>
<p>but a more sophisticated unicyclist would anticipate the future consequences of the rate <script type="math/tex">\dot{\theta}</script> as well as react to the angle of the fall <script type="math/tex">\theta</script>. This way, he/she may be able to overcome the effects of a finite reaction time. For this reason, I defined the state of the unicycle system to be:</p>
<script type="math/tex; mode=display">\begin{equation}
s_t = \begin{bmatrix}
\theta_t \\
\dot{\theta}_t \\
\end{bmatrix}
\end{equation}</script>
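<p>Before handing control over to a learned policy, the simple linear feedback <script type="math/tex">h(\theta) = a\theta</script> with <script type="math/tex">a > 1</script> can be sanity-checked numerically: substituted into <script type="math/tex">(12)</script> it gives a harmonic oscillator, so the lean angle should oscillate rather than diverge. A small sketch of mine, assuming <script type="math/tex">l = 1 \, \text{m}</script>:</p>

```python
# Numerical check that the linear feedback h(theta) = a*theta with a > 1
# keeps the linearised unicycle dd_theta = alpha*(theta - h) bounded.
alpha, a, dt = 9.81, 2.0, 0.01             # alpha = g/l with l = 1 m
theta, d_theta = 0.1, 0.0                  # small initial lean, at rest

def acc(th):
    # closed-loop linearised dynamics: alpha * (theta - a * theta)
    return alpha * (th - a * th)

dd_theta = acc(theta)
history = []
for _ in range(10_000):                    # velocity Verlet integration
    theta += d_theta * dt + 0.5 * dd_theta * dt**2
    dd_new = acc(theta)
    d_theta += 0.5 * (dd_theta + dd_new) * dt
    dd_theta = dd_new
    history.append(theta)

peak = max(abs(t) for t in history)
print(peak)                                # stays near the 0.1 amplitude
```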
<h3 id="the-simulator">The simulator:</h3>
<p>In order to generate rollouts with a policy <script type="math/tex">\pi_{\vartheta}</script> we must choose a method for numerically integrating <script type="math/tex">(11)</script>, and after some reflection I opted for
Velocity Verlet due to its simplicity and numerical stability properties. Assuming that we have evaluated <script type="math/tex">\ddot{\theta}</script> and fixed a reasonable value for <script type="math/tex">\Delta t</script> we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\theta_{i+1} & = \theta_i + \dot{\theta}_i\Delta t + \frac{1}{2}\ddot{\theta}_i \Delta t^2\\
\dot{\theta}_{i+1} & = \dot{\theta}_i + \frac{1}{2}(\ddot{\theta}_i + \ddot{\theta}_{i+1}) \Delta t
\end{split}
\end{equation} %]]></script>
<p>Doing this for both <script type="math/tex">\theta</script> and <script type="math/tex">z</script> in TensorFlow, I defined the following Velocity Verlet function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">velocity_verlet</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
Update the unicycle system using Velocity Verlet integration.
"""</span>
<span class="c">## update rules for theta:</span>
<span class="n">dd_theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span><span class="o">*</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="p">)</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">action</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="p">))</span>
<span class="n">theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">theta</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
<span class="n">d_theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">dd_theta</span><span class="o">+</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span>
<span class="c">## update rules for z:</span>
<span class="n">dd_z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">g</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">action</span>
<span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">z</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
<span class="n">d_z</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_z</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">dd_z</span><span class="o">+</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">time</span>
<span class="n">step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">dd_theta</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">theta</span><span class="p">),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">d_theta</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">dd_z</span><span class="p">),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">z</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">d_z</span><span class="p">))</span>
<span class="k">return</span> <span class="n">step</span>
</code></pre></div></div>
<p>Now, in order to encourage the policy network to discover controllers in the linear domain of the unicycle system, i.e. <script type="math/tex">(12)</script>,
I initialised the variables at the beginning of each rollout in the following manner:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">restart</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
A simple method for restarting the system where the angle is taken to be a slight
deviation from the ideal value of theta = 0.0 and the system has an important
initial horizontal acceleration.
"""</span>
<span class="n">step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">stddev</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">d_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dd_theta</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">d_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dd_z</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">random_normal</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">mean</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span><span class="n">stddev</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">step</span>
</code></pre></div></div>
<p>The key part in this snippet of code is that <script type="math/tex">\theta_0 \sim \mathcal{N}(0,0.1)</script>.</p>
<h3 id="the-policy-gradients-model">The Policy Gradients model:</h3>
<p>For the unicycle controller (i.e. the policy <script type="math/tex">\pi_{\vartheta}</script>) I defined a conditional Gaussian as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">two_layer_net</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">w_h</span><span class="p">,</span> <span class="n">w_h2</span><span class="p">,</span> <span class="n">w_o</span><span class="p">,</span><span class="n">bias_1</span><span class="p">,</span> <span class="n">bias_2</span><span class="p">):</span>
<span class="s">"""
A generic method for creating two-layer networks
input: weights
output: neural network
"""</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">elu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">w_h</span><span class="p">),</span><span class="n">bias_1</span><span class="p">))</span>
<span class="n">h2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">elu</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w_h2</span><span class="p">),</span><span class="n">bias_2</span><span class="p">))</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h2</span><span class="p">,</span> <span class="n">w_o</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">controller</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
The policy gradient model is a neural network that
parametrises a conditional Gaussian.
input: state(i.e. angular momenta)
output: action to be taken i.e. appropriate horizontal acceleration
"""</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">"policy_net"</span><span class="p">):</span>
<span class="n">tf</span><span class="o">.</span><span class="n">set_random_seed</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">seed</span><span class="p">)</span>
<span class="n">W_h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span><span class="s">"W_h"</span><span class="p">)</span>
<span class="n">W_h2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span><span class="mi">50</span><span class="p">],</span><span class="s">"W_h2"</span><span class="p">)</span>
<span class="n">W_o</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">50</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span><span class="s">"W_o"</span><span class="p">)</span>
<span class="c"># define bias terms:</span>
<span class="n">bias_1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">100</span><span class="p">],</span><span class="s">"bias_1"</span><span class="p">)</span>
<span class="n">bias_2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">50</span><span class="p">],</span><span class="s">"bias_2"</span><span class="p">)</span>
<span class="n">eta_net</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">two_layer_net</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pv</span><span class="p">,</span><span class="n">W_h</span><span class="p">,</span> <span class="n">W_h2</span><span class="p">,</span> <span class="n">W_o</span><span class="p">,</span><span class="n">bias_1</span><span class="p">,</span><span class="n">bias_2</span><span class="p">)</span>
<span class="n">W_mu</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span><span class="s">"W_mu"</span><span class="p">)</span>
<span class="n">W_sigma</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">init_weights</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span><span class="s">"W_sigma"</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mu</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">eta_net</span><span class="p">,</span><span class="n">W_mu</span><span class="p">)),</span><span class="bp">self</span><span class="o">.</span><span class="n">action_bound</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">eta_net</span><span class="p">,</span><span class="n">W_sigma</span><span class="p">)),</span><span class="bp">self</span><span class="o">.</span><span class="n">variance_bound</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span>
</code></pre></div></div>
<p>and sampling from this Gaussian is as simple as doing:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sample_action</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
Samples an action from the stochastic controller which happens
to be a conditional Gaussian.
"""</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">distributions</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span><span class="n">tf</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span><span class="p">))</span>
<span class="k">return</span> <span class="n">dist</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
</code></pre></div></div>
<p>Likewise, you may look at my <a href="https://github.com/pauli-space/RL_unicycle_control/blob/master/vanilla/vanilla_pg.py">Policy Gradients class</a> to see
how I calculated the baseline.</p>
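<p>For readers who don’t want to dig through the class, the simplest variant of a baseline can be sketched in a few lines (a NumPy sketch of a mean-across-rollouts baseline, not necessarily the exact scheme used in the linked class; the names are illustrative):</p>

```python
import numpy as np

def advantages(rewards):
    """Subtract a per-timestep baseline (the mean reward across rollouts)
    from a batch of reward sequences, reducing gradient variance without
    biasing the policy gradient.

    rewards: array of shape (num_rollouts, horizon)
    """
    baseline = rewards.mean(axis=0, keepdims=True)  # shape (1, horizon)
    return rewards - baseline

rewards = np.random.rand(10, 30)  # e.g. 10 rollouts with a horizon of 30
adv = advantages(rewards)         # zero mean across rollouts at every step
```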
<h3 id="defining-rewards">Defining rewards:</h3>
<p>Defining the reward might be the trickiest part of this experiment: it’s not obvious how the unicycle controller should be rewarded, and from a dynamical
systems perspective a reward doesn’t really make sense. Either the unicycle has a stable controller or it doesn’t. But in order to adhere to the Policy Gradients
formalism I opted for the instantaneous height of the unicycle as the reward, which is simply:</p>
<script type="math/tex; mode=display">\begin{equation}
\text{height} = l\sin(\theta)
\end{equation}</script>
<p>and as a result the REINFORCE loss (i.e. with no baseline subtracted) is simply:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reinforce_loss</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
The REINFORCE loss without subtracting a baseline.
"""</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">distributions</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">log_sigma</span><span class="p">))</span>
<span class="k">return</span> <span class="n">dist</span><span class="o">.</span><span class="n">log_prob</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mu</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">height</span>
</code></pre></div></div>
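<p>As a sanity check on the formalism, a generic REINFORCE gradient estimator scores the <em>sampled</em> action under the policy and weights it by the reward (a minimal NumPy sketch for a one-dimensional Gaussian policy; the names and the toy reward are illustrative, not part of the unicycle code):</p>

```python
import numpy as np

def reinforce_grad_mu(actions, rewards, mu, sigma):
    """Monte Carlo REINFORCE estimate of d E[R] / d mu for a Gaussian
    policy N(mu, sigma^2): the average of grad log pi(a) times reward."""
    grad_log_pi = (actions - mu) / sigma ** 2  # d/dmu of log N(a; mu, sigma^2)
    return np.mean(grad_log_pi * rewards)

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
actions = rng.normal(mu, sigma, 50_000)
rewards = actions  # toy reward R(a) = a, so dE[R]/dmu = 1 exactly
grad = reinforce_grad_mu(actions, rewards, mu, sigma)
```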
<h3 id="results">Results:</h3>
<p>After defining reasonable hyperparameters (<script type="math/tex">\Delta t = 0.01</script>, rollouts per batch = 10, horizon = 30, total epochs = 100,…) I ran a few tests in the <a href="https://github.com/pauli-space/RL_unicycle_control/blob/master/vanilla/does_it_work.ipynb">following
notebook</a> and found that the learned controller was
very good at bringing the unicycle to maximum height but terrible at keeping it there once it arrived. In a word: unstable.</p>
<p>Actually, if we evaluate the model <script type="math/tex">\pi_{\vartheta}</script> with <script type="math/tex">\dot{\theta}=0</script>, we find that the unicycle exhibits highly nonlinear behaviour in the neighborhood
of <script type="math/tex">\theta = -\frac{\pi}{2}</script>:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/phase_transition.png" align="middle" /></center>
<p>So the force is negative for angles greater than <script type="math/tex">-\frac{\pi}{2}</script> and positive for angles less than <script type="math/tex">-\frac{\pi}{2}</script>. Given that this transition occurs many standard deviations away
from <script type="math/tex">\theta = 0</script>, it is most likely an artifact of my choice of tanh activation for the final layer of the policy network.</p>
<p>Furthermore, if we analyse the learned controller’s behaviour as a function of the full state (i.e. <script type="math/tex">\theta</script> and <script type="math/tex">\dot{\theta}</script>) we observe the following:</p>
<center><img src="https://raw.githubusercontent.com/pauli-space/RL_unicycle_control/master/images/nonlinear_model.png" align="middle" /></center>
<p>The combination of the short-sighted policy and the tanh nonlinearity makes me wonder whether there are small tweaks to my TensorFlow functions which may lead to much
better results.</p>
<h2 id="discussion">Discussion:</h2>
<p>While I think the results are interesting and may be improved from a technical perspective, I don’t think any particular reward function makes sense for learning re-usable locomotion behaviours. Ideally, the agent would be able to propose and learn curricula of locomotion behaviours in an unsupervised manner in order to learn models of its affordances and its intrinsic locomotory options. One promising approach to this was articulated by Sébastien Forestier, Yoan Mollard and Pierre-Yves Oudeyer in [6], but I’ll have to think about how Intrinsically Motivated Goal Exploration can be assimilated within the Policy Gradients framework.</p>
<h1 id="references">References:</h1>
<ol>
<li>Passive Dynamic Walking. T. McGeer. 1990.</li>
<li>Emergence of Locomotion Behaviours in Rich Environments. Nicolas Heess et al. 2017.</li>
<li>Policy Gradients for Reinforcement Learning with Function Approximation. Richard S. Sutton, David McAllester, Satinder Singh & Yishay Mansour. 1999.</li>
<li>Unicycles and Bifurcations. R. C. Johnson. 2002.</li>
<li>Modular Multitask Reinforcement Learning with Policy Sketches. Jacob Andreas, Dan Klein, and Sergey Levine. 2017.</li>
<li>Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. 2017.</li>
</ol>
<h2 id="footnotes">Footnotes:</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>In my experience it always helps to work on tractable and conceptually interesting variants of complex problems (e.g. bipedal locomotion), as these simpler problems often provide deep insights into the harder problem of interest. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I’d be interested to see how optimal control theory may be used for curriculum learning. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Aidan RockeUnderstanding the free energy principle2018-04-12T00:00:00+00:002018-04-12T00:00:00+00:00/active/inference/2018/04/12/free_energy<h2 id="motivation">Motivation:</h2>
<p>My general interest in single-motivation theories stems from the belief that a common ancestor for all multi-cellular organisms might imply
common principles of intelligent behaviour. It’s a somewhat reductive hypothesis and, as I argued last week, <a href="http://paulispace.com/statistics/2018/04/07/causal_path_entropy.html">some of these theories might be
too reductive</a>, but I think it’s a useful working assumption that can take
behavioural scientists very far<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. However, until recently I wasn’t properly acquainted with the free energy principle, which, from a distance,
appears to be one of the more plausible single-motivation theories.</p>
<p>The free energy principle is a theory developed by Karl Friston and others to explain how biological systems tend to avoid disorder by limiting themselves
to a small number of favorable states. It comes across as a rather abstract mathematical theory, but thanks to a <a href="http://romainbrette.fr/what-is-computational-neuroscience-xxix-the-free-energy-principle/">critical thought experiment</a> proposed by <a href="https://twitter.com/RomainBrette">Romain Brette</a> I found an opportunity to take
a closer look. In fact, I promised Brette that I would run a computer simulation demonstrating that his thought experiment rests upon flawed assumptions (<a href="https://github.com/pauli-space/Free_Energy_experiments">code here</a>).</p>
<p>In this context, the goal of this blog post is to explain the main idea of the free energy principle and dissect Romain Brette’s thought experiment
in order to develop a practical understanding of this theory.</p>
<h2 id="the-free-energy-principle">The Free Energy Principle:</h2>
<p>In [1], Karl Friston proposes that the Free Energy principle may be a rough guide to the brain and makes the following points:</p>
<ol>
<li>The free energy principle basically applies to any biological system that resists a tendency to disorder.</li>
<li>It rests upon the observation that self-organising biological systems resist this tendency by minimising the entropy
of their sensory states.</li>
<li>Assuming that <script type="math/tex">m</script> corresponds to a generative model describing the biological system and <script type="math/tex">y</script> refers to the system’s sensory states, under ergodic assumptions, the entropy is:</li>
</ol>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation*}
\begin{split}
H(y) & = -\int P(y|m) \ln P(y|m) \,dy \\
& = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} - \ln P(y(t)|m) \,dt
\end{split}
\end{equation*} %]]></script>
<p>Now, given that entropy is the long-term average of surprise (think of a Monte Carlo simulation), agents must avoid surprising states, where surprise is defined
relative to the homeostatic conditions of that particular organism<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
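<p>The identity between entropy and long-term average surprisal is easy to check numerically. Assuming, purely for illustration, that <script type="math/tex">P(y|m)</script> is a standard normal, the sample mean of <script type="math/tex">-\ln P(y(t)|m)</script> over a long trajectory approaches the analytic entropy <script type="math/tex">\frac{1}{2}\ln(2\pi e)</script>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# a long "trajectory" of sensory samples from the generative model N(0, 1)
y = rng.standard_normal(100_000)

# surprisal -ln p(y_t) of each sample under the model
surprisal = 0.5 * np.log(2 * np.pi) + 0.5 * y ** 2

# the time average of surprisal approximates the entropy of the sensory states
entropy_estimate = surprisal.mean()
entropy_exact = 0.5 * np.log(2 * np.pi * np.e)
```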
<p>The three points above are sufficient to understand Romain Brette’s thought experiment, though I must emphasise that surprisal here is defined in terms of the
agent’s homeostatic conditions, so minimising surprisal corresponds to minimising both epistemic uncertainty (i.e. unknown unknowns) and
statistical uncertainty (i.e. known unknowns).</p>
<h2 id="romain-brettes-thought-experiment">Romain Brette’s thought experiment:</h2>
<p>In Brette’s article, he summarises the free energy principle in the following manner:</p>
<blockquote>
<p>The free energy principle is the theory that the brain manipulates a probabilistic generative model of its sensory inputs,
which it tries to optimise by either changing the model(learning) or changing the inputs(by acting).</p>
</blockquote>
<p>Although I haven’t mentioned anything about the human brain so far, this is a relatively good summary, and Brette proceeds with
the following food vs. no food thought experiment:</p>
<ol>
<li>An agent has two kinds of observations/stimuli: food and the absence of food.</li>
<li>This agent has two possible actions: seek food or don’t seek food.</li>
<li>When the agent seeks food there’s a 20% probability of getting food.</li>
<li>When the agent doesn’t seek food there’s a 100% probability of getting no food.</li>
</ol>
<p>What should a surprise minimising agent do? Romain presents the following argument:</p>
<blockquote>
<p>What does the free energy principle tell us? To minimize surprise, it seems clear that I should sit: I am certain to not see food. No surprise at all. The proposed solution is that you have a prior expectation to see food. So to minimize the surprise, you should put yourself into a situation where you might see food, ie to seek food. This seems to work. However, if there is any learning at all, then you will quickly observe that the probability of seeing food is actually 20%, and your expectations should be adjusted accordingly. Also, I will also observe that between two food expeditions, the probability to see food is 0%. Once this has been observed, surprise is minimal when I do not seek food. So, I die of hunger. It follows that the free energy principle does not survive Darwinian competition.</p>
</blockquote>
<p>Basically, Romain argues that surprise is minimal when the organism doesn’t seek food assuming that Friston’s definition of surprisal corresponds to minimisation of
statistical uncertainty. Given that Friston’s surprisal is defined in terms of the agent’s homeostatic conditions, this assumption is precisely where Romain’s analysis
breaks down. It also helps to simulate such toy problems on a computer, if possible, because in a simulation you have to make every modelling assumption clear.</p>
<h2 id="a-reasonable-model-of-brettes-problem">A reasonable model of Brette’s problem:</h2>
<center><img src="https://raw.githubusercontent.com/pauli-space/Free_Energy_experiments/master/diagram.png" align="middle" /></center>
<p>To simulate Romain’s problem, I made the following assumptions:</p>
<ol>
<li>We have an organism which has to eat <script type="math/tex">k</script> times per day on average and can eat at most once per hour.</li>
<li>The homeostatic conditions of our organism are given by a Gaussian distribution centered at <script type="math/tex">k</script> with unit variance, a Gaussian food critic if you will. This specifies that our organism shouldn’t eat much less than <script type="math/tex">k</script> times a day and shouldn’t eat a lot more than <script type="math/tex">k</script> times a day; indeed, this may help explain why living organisms tend to
have masses that are normally distributed during adulthood.</li>
<li>A food policy consists of a 24-dimensional vector whose values range from 0.0 to 1.0, and we want to minimise the negative log probability of the total consumption under the Gaussian food critic.</li>
<li>Food policies are the output of a generative neural network (set up using TensorFlow) whose inputs are either one or zero to indicate a survival prior, with one indicating a preference for survival.</li>
<li>The backpropagation algorithm, in this case Adagrad [5], functions as a homeostatic regulator by updating the network weights in proportion to the negative logarithmic loss (i.e. surprisal).</li>
</ol>
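<p>Stripped of the neural network, the quantity being minimised can be sketched directly: the surprisal of a candidate 24-hour policy is just the negative log probability of its expected consumption under the Gaussian food critic (a simplified sketch of the assumptions above, not the actual TensorFlow code; the policy vectors are illustrative):</p>

```python
import numpy as np

K = 3.0  # the organism should eat k = 3 times per day on average

def surprisal(policy, k=K):
    """Negative log probability of the policy's expected daily consumption
    under the Gaussian food critic N(k, 1)."""
    consumption = np.sum(policy)  # expected number of meals in 24 hours
    return 0.5 * np.log(2 * np.pi) + 0.5 * (consumption - k) ** 2

lazy = np.zeros(24)              # never seek food: consumption 0, far from k
sensible = np.full(24, K / 24)   # eat ~k times per day on average

lazy_loss = surprisal(lazy)
sensible_loss = surprisal(sensible)
```

<p>Under a homeostatically defined surprisal, the “never seek food” policy is the surprising one, which is the crux of the reply to Brette.</p>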
<p>Assuming <script type="math/tex">k=3</script>, I ran a simulation in the <a href="https://github.com/pauli-space/Free_Energy_experiments/blob/master/simulation.ipynb">following notebook</a> and found that the discovered food policy differs significantly from Romain’s expectation that the agent would choose to not look for food in order to minimise surprisal. In fact, our simple agent manages to get three meals per day on average so it survives.</p>
<p>Overall, this is a relatively simple problem with a fixed prior (i.e. a fixed belief), as the organism doesn’t have to do more than eat, so I can minimise surprisal directly. In general, if we have adjustable beliefs (ex. models of physics and their physical parameters/constants) then we have a much harder problem, and that’s where I would need to use the KL-divergence and invoke free energy minimisation rather than directly minimising surprisal. However, these models and their parameters would still be evaluated with respect to homeostatic constraints, which guarantees that the organism isn’t simply trying to minimise statistical uncertainty.</p>
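<p>For completeness, the free energy referred to here is the standard variational bound (a textbook identity, not specific to this simulation): for a recognition density <script type="math/tex">q(\vartheta)</script> over the model’s hidden parameters,</p>
<script type="math/tex; mode=display">\begin{equation*}
F = \mathbb{E}_{q}[\ln q(\vartheta) - \ln P(y,\vartheta|m)] = \mathcal{D}_{KL}(q(\vartheta) \Vert P(\vartheta|y,m)) - \ln P(y|m) \geq -\ln P(y|m)
\end{equation*}</script>
<p>Since the KL term is non-negative, minimising <script type="math/tex">F</script> over beliefs and actions indirectly minimises surprisal, which is what makes the scheme tractable when surprisal itself can’t be minimised directly.</p>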
<h2 id="conclusion">Conclusion:</h2>
<p>Until recently, the Free Energy Principle has been a constant source of mockery among neuroscientists who misunderstood it, and so I hope that by growing a collection
of <a href="https://github.com/pauli-space/Free_Energy_experiments">free-energy motivated reinforcement learning examples on Github</a> we may finally have a constructive discussion
between scientists. Moreover, I have been asked whether it isn’t immodest for Karl Friston to suggest that his theory might be a model for human behaviour. Well, my answer
to that question is the same answer I would give to the critics of Empowerment [6].</p>
<p>Let’s see how far ingenious implementations(i.e. experiments) using these formalisms can take us. That’s the only way we’ll know what the limitations of these
theories are.</p>
<h1 id="references">References:</h1>
<ol>
<li>The free-energy principle: a rough guide to the brain? (K. Friston. 2009.)</li>
<li>The Markov blankets of life: autonomy, active inference and the free energy principle (M. Kirchhoff, T. Parr, E. Palacios, K. Friston and J. Kiverstein. 2018.)</li>
<li>Free-Energy Minimization and the Dark-Room Problem (K. Friston, C. Thornton and A. Clark. 2012.)</li>
<li>What is computational neuroscience? (XXIX) The free energy principle (R. Brette. 2018.)</li>
<li>Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (J. Duchi, E. Hazan, and Y. Singer. 2011.)</li>
<li>Empowerment: An Introduction (C. Salge et al. 2013.)</li>
<li>Reward, Motivation, and Reinforcement Learning (P. Dayan and B. Balleine. 2002.)</li>
</ol>
<h1 id="footnotes">Footnotes:</h1>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The notion of utility maximisation in economics, though limited, has been very useful for example. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In [2], homeostatic conditions of an organism are defined in terms of Markov Blankets which are equivalent to the boundaries of a system in a statistical sense. I would encourage the reader to go into that paper after going through this blog post but this concept isn’t essential for understanding Romain’s thought experiment, so we’ll ignore this formalism for now. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Aidan RockeMotivation:Fractals with TensorFlow2018-03-25T00:00:00+00:002018-03-25T00:00:00+00:00/tensorflow/2018/03/25/tensorflow_fractals<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Last week it occurred to me to experiment with Mandelbrot sequences with variable exponents, and after
a few tries using <a href="https://github.com/AidanRocke/TensorFlow-Fractals">TensorFlow-Fractals</a> I made
a couple of <a href="https://math.stackexchange.com/questions/2705107/symmetries-of-mandelbrot-sets-with-integer-exponents">mathematical observations</a>
which surprised me a little. My principal interest in fractals, besides mathematical beauty, is that their
massively parallel nature makes them a good benchmark for GPUs. In fact, one of my projects in the near
future will be to simulate Quaternion fractals on GPUs with TensorFlow [2].</p>
<p>Before continuing, I must say that from a mathematical perspective everything here is rather naive
but my philosophy is that it’s always better to get started and add more layers of sophistication later.</p>
<h2 id="the-mandelbrot-sequence">The Mandelbrot sequence:</h2>
<p>Mandelbrot sets are defined in terms of the following quadratic sequence in the complex plane:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^2 + c \\
c = z_0 \in \mathbb{C}
\end{cases}
\end{equation}</script>
<p>Using this sequence, the Mandelbrot set is normally defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
M = \{z_0 \in \mathbb{C}: \limsup_{n \to \infty} |z_n| < \infty \}
\end{equation} %]]></script>
<p>Now, given that <script type="math/tex">z_n</script> might be an oscillating sequence we need to resort to a few approximations in order to simulate Mandelbrot sets
on a computer. Here’s a short list:</p>
<ol>
<li>Finite precision.</li>
<li>Stopping criteria for divergence.</li>
<li>Stopping criteria for the number of iterates.</li>
</ol>
<p>To address these issues we use 32-bit floating point numbers, a pre-defined upper-bound on the modulus of <script type="math/tex">z_n</script> and a limit on the number of iterations.
With an upper-bound of 7.0 and a limit of 500 iterations, the reader should obtain an image similar to the following figure:</p>
<center><img src="https://i.stack.imgur.com/MRDvL.png" align="middle" style="width:600px;height:600px;" /></center>
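<p>The escape-time procedure just described can be sketched in a few lines of NumPy (a plain sketch of the approximations above, with a modulus bound and a capped iteration count, rather than the TensorFlow implementation; the grid resolution is illustrative):</p>

```python
import numpy as np

def escape_times(c, bound=7.0, max_iter=500, exponent=2):
    """Iterate z <- z**exponent + c on a grid of starting points c = z_0 and
    count how many iterations each point survives below the modulus bound."""
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        mask = np.abs(z) < bound                  # points not yet escaped
        z = np.where(mask, z ** exponent + c, z)  # freeze escaped points
        counts += mask
    return counts

# a coarse grid over the complex plane; points that survive all iterations
# approximate the Mandelbrot set
re, im = np.meshgrid(np.linspace(-2, 1, 200), np.linspace(-1.5, 1.5, 200))
counts = escape_times(re + 1j * im, max_iter=100)
```

<p>Replacing <code class="highlighter-rouge">z ** exponent</code> with <code class="highlighter-rouge">np.conj(z) ** exponent</code> yields the conjugate variants discussed in the next section.</p>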
<p>This is as much as I will say about Mandelbrot sets although if the reader is interested in learning more, I highly recommend the
<a href="http://mathworld.wolfram.com/MandelbrotSet.html">primer on Wolfram MathWorld</a>.</p>
<h2 id="generalised-mandelbrot-sequences">Generalised Mandelbrot sequences:</h2>
<p>Things became interesting when I experimented with recursive equations of the form:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = z_n^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
z_{n+1} = \overline{z_n}^\alpha + c \\
c = z_0 \in \mathbb{C}, \alpha \in \mathbb{Z}
\end{cases}
\end{equation}</script>
<p>Using equation <script type="math/tex">(3)</script> I obtained the following images for exponents of <script type="math/tex">-2.0</script> and <script type="math/tex">-4.0</script>:</p>
<center><img src="https://i.stack.imgur.com/f4ned.png" align="middle" style="width:600px;height:600px;" /></center>
<p><br /></p>
<center><img src="https://i.stack.imgur.com/8FSK0.png" align="middle" style="width:600px;height:600px;" /></center>
<p>In fact, I made the following observations:</p>
<ol>
<li>Using equation <script type="math/tex">(3)</script>, the resulting structure has <script type="math/tex">\alpha-1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert +1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
<li>Using equation <script type="math/tex">(4)</script>, the resulting structure has <script type="math/tex">\alpha+1</script> symmetries when <script type="math/tex">\alpha \geq 2</script> and <script type="math/tex">\lvert \alpha \rvert-1</script> symmetries when <script type="math/tex">\alpha \leq -2</script>.</li>
</ol>
<p>So far I don’t have a good explanation for these results but I hope to discover the reason behind the symmetries of these fractal structures
before the end of next week.</p>
<h2 id="whats-next">What’s next:</h2>
<p>Before investigating Quaternion Mandelbrot sets on GPUs, I would like to take a closer look at the following questions:</p>
<ol>
<li>Numerical stability as a function of <script type="math/tex">\alpha</script> and <script type="math/tex">z_0</script>.</li>
<li>Might there be better stopping criteria besides hard-coded bounds on the modulus of <script type="math/tex">z_n</script> and the maximum number of iterates?</li>
<li>Is the Mandelbrot set computable? (Note: this has been <a href="https://cs.stackexchange.com/questions/42685/in-what-sense-is-the-mandelbrot-set-computable">discussed on the CS stackexchange</a>.)</li>
</ol>
<p>These questions don’t quite fall under the category of intelligent behaviour but who knows? On the one hand, the Universe might just be a set of simple rules which are applied in a recursive manner. On the other hand, fractals provide researchers with an effective(and beautiful) way of benchmarking hardware and software performance.</p>
<p>Either way, the moral of the story is that playing with Mandelbrot sets is always an opportunity to learn something new about computation.</p>
<h1 id="references">References:</h1>
<ol>
<li>Fractal Art Generation using GPUs. Mayfield et al. 2016.</li>
<li>Ray Tracing Quaternion Julia Sets on the GPU. Keenan Crane. 2005.</li>
<li>Non-computable Julia Sets. M. Braverman, M. Yampolsky. 2005.</li>
</ol>Aidan RockeNormal approximation to uniform distribution2018-03-13T00:00:00+00:002018-03-13T00:00:00+00:00/statistics/2018/03/13/normal_approximation<h2 id="motivation">Motivation:</h2>
<p>Earlier today I was talking to a researcher about how well a normal distribution can approximate a uniform distribution
over an interval <script type="math/tex">[a,b] \subset \mathbb{R}</script>. I gave a few arguments for why I thought a normal distribution wouldn’t be a good fit,
but I didn’t have the exact answer off the top of my head so I decided to find out. Although the following analysis involves
nothing fancy, I consider it useful as it’s easily generalised to higher dimensions (i.e. multivariate uniform distributions)
and we arrive at a result which I wouldn’t consider intuitive.</p>
<p>For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.</p>
<h2 id="statement-of-the-problem">Statement of the problem:</h2>
<p>We would like to minimise the KL-Divergence:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{D}_{KL}(P \Vert Q) = \int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}\,dx
\end{equation}</script>
<p>where <script type="math/tex">P</script> is the target uniform distribution and <script type="math/tex">Q</script> is the approximating Gaussian:</p>
<script type="math/tex; mode=display">\begin{equation}
p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]} \implies p(x \notin [a,b]) = 0
\end{equation}</script>
<p>and</p>
<script type="math/tex; mode=display">\begin{equation}
q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}
\end{equation}</script>
<p>Now, given that <script type="math/tex">\lim_{x \to 0} x\ln(x) = 0</script>, if we assume that <script type="math/tex">(a,b)</script> is fixed, our loss may be expressed in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\mathcal{L}(\mu,\sigma) & = \int_{a}^b p(x) \ln \frac{p(x)}{q(x)}\,dx \\
& = -\ln(b-a) + \frac{1}{2}\ln(2\pi\sigma^2)+\frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split}
\end{equation} %]]></script>
<h2 id="minimising-with-respect-to-mu-and-sigma">Minimising with respect to <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>:</h2>
<p>We can easily show that the mean and variance of the Gaussian which minimises <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> correspond to the
mean and variance of a uniform distribution over <script type="math/tex">[a,b]</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = -\frac{(b+a)}{2\sigma^2} + \frac{2\mu}{2\sigma^2}= 0 \implies \mu = \frac{a+b}{2}
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
\frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) = \frac{1}{\sigma}-\frac{\frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2}{\sigma^3} =0 \implies \sigma^2 = \frac{(b-a)^2}{12}
\end{equation}</script>
<p>Although I wouldn’t have guessed it in advance, the careful reader will notice that this result readily generalises to higher dimensions.</p>
<h2 id="analysing-the-loss-with-respect-to-optimal-gaussians">Analysing the loss with respect to optimal Gaussians:</h2>
<p>After entering the optimal values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> into <script type="math/tex">\mathcal{L}(\mu,\sigma)</script> and simplifying the resulting expression we have
the following residual loss:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathcal{L}^* = \frac{1}{2}\big(\ln \big(\frac{\pi}{6}\big)+1\big) \approx 0.18
\end{equation}</script>
<p>I find this result surprising because I didn’t expect the dependence on <script type="math/tex">\Delta = b-a</script> to vanish. That said, my current intuition for this result
is that if we tried fitting <script type="math/tex">\mathcal{U}(a,b)</script> to <script type="math/tex">\mathcal{N}(\mu,\sigma)</script> we would obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{cases}
a = \mu - \sqrt{3}\sigma \\
b = \mu + \sqrt{3}\sigma
\end{cases}
\end{equation}</script>
<p>so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>.</p>
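<p>These closed-form results are easy to verify numerically. The sketch below integrates the divergence <script type="math/tex">\int p \ln(p/q) \,dx</script> (standard positive-sign convention) on a grid and checks that the residual is the same for intervals of very different widths (a plain NumPy sketch rather than the accompanying TensorFlow script):</p>

```python
import numpy as np

def kl_uniform_gaussian(a, b, n=200_000):
    """KL divergence between U(a, b) and its best-fit Gaussian with
    mu = (a + b)/2 and sigma^2 = (b - a)^2 / 12, by trapezoidal
    integration of p(x) ln(p(x)/q(x)) over [a, b]."""
    mu, var = (a + b) / 2, (b - a) ** 2 / 12
    x = np.linspace(a, b, n)
    p = 1.0 / (b - a)
    log_q = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
    integrand = p * (np.log(p) - log_q)
    dx = x[1] - x[0]
    return (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])) * dx

residual = 0.5 * (np.log(np.pi / 6) + 1)  # the closed-form value, ~0.176
```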
<h2 id="remark">Remark:</h2>
<p>The reader may experiment with <a href="https://gist.github.com/AidanRocke/0a3ff41c8421a974640742d57bee8b71">the following TensorFlow function</a> which outputs
the approximating mean and variance of a Gaussian given a uniform distribution on the interval <script type="math/tex">[a,b]</script>.</p>Aidan RockeMotivation:What is the role of logic in Mathematics?2017-11-27T00:00:00+00:002017-11-27T00:00:00+00:00/mathematics/2017/11/27/platonic_math<h2 id="introduction">Introduction:</h2>
<p>The orthodox belief among pure mathematicians is that the foundations of mathematics are grounded in a few sacred axioms
and set theory, with logic naturally playing a central role in their development. However, by means of a simple thought experiment
I show that curiosity, more than logic, is essential for the development of mathematics. Moreover, I argue that
curiosity is firmly grounded in both our sensorimotor experience and the tools we use for doing mathematics.</p>
<p>This leads to a holistic account of the foundations of mathematics which challenges the Platonic notion that
‘pure’ mathematics is discovered, and makes the case that the envelope of potential mathematical
discoveries is parametrised by both human morphology and technologies for doing mathematics. Crucially, this ‘Cyborg’ view
of mathematics has important implications for investigations on the foundations of mathematics as well as the manner in which
mathematics is taught at the university level.</p>
<h2 id="the-role-of-logic-in-mathematics">The role of logic in mathematics:</h2>
<p>While the importance of axiomatics and set theory in structuring mathematics is undeniable, I think we should not lose sight
of what logic actually provides:</p>
<ol>
<li>A system for verifying our discoveries to an axiomatic level of detail.</li>
<li>A method for communicating our mathematical discoveries in a convincing manner.</li>
</ol>
<p>In truth, the second argument carries much greater weight than the first, since an important consequence of Gödel’s incompleteness
theorems is that logic doesn’t guarantee the permanence of our mathematical discoveries. Furthermore, very few mathematicians
use formal proof assistants such as Coq or Isabelle to write their mathematical proofs, although proof assistants are practically
essential for verification at an axiomatic level of detail. How can we explain this?</p>
<p>Like all humans, mathematicians pursue rigour only to the extent that the reward justifies its cost. That said, if logical verification
isn’t essential to mathematics, what could possibly be the vital force behind its development?</p>
<h2 id="the-importance-of-curiosity">The importance of curiosity:</h2>
<p>While I would grant that logical verification is important for problem solving in mathematics, if mathematics were reducible to
problem solving we would have no more than one mathematical question to answer (e.g. 2+2=?) and there wouldn’t be a field
of mathematics. In other words, there has to be some intrinsic motivation in all mathematicians which drives them not only to
solve problems but also to seek out problems to solve. From this it follows that intrinsic motivation (or curiosity) has a much greater
role than logic in explaining why there are multiple branches of mathematics. In fact, this implies that curiosity, not logic, has to
be the vital force which guides its development.</p>
<p>Such a line of reasoning is especially relevant to investigations on the foundations of mathematics as it immediately raises doubts
about the Platonic account of mathematics. This, however, raises important epistemological questions concerning the nature of curiosity.</p>
<h2 id="the-origin-and-development-of-mathematics">The origin and development of mathematics:</h2>
<p>In [2], Poincaré famously argues that primitive mathematical notions like size, continuity and number have imprecise perceptual origins. A child can learn to tell the difference in size between a big dog and a small dog without having to first learn the greater-than relation. Such perceptual faculties effectively serve as good priors for learning mathematics, a task which would otherwise be considerably harder. In addition, there is a wide range of scientific evidence presented in [1] demonstrating that, besides being the origin of our mathematical knowledge, our sensorimotor experience is an essential guide in our mathematical development. This means that our curiosity is constrained by both our morphology and the tools we use for doing mathematics.</p>
<p>While mathematical reasoning often conforms to mathematical principles, it is typically implemented in a sensorimotor loop which includes a device for data input (e.g. pen/pencil) and material for data storage (e.g. paper). In this context, the authors of [1] advance a Cyborg view of mathematics:</p>
<blockquote>
<p>…the active manipulation of physical notations plays the role of ‘guiding’ the biological machinery through an abstract mathematical problem space, one that may exceed the space of otherwise solvable problems.</p>
</blockquote>
<p>Although many mathematicians might contest this, I wonder whether any mathematician can do advanced mathematics without pen and paper, or a functional substitute. We must also acknowledge the increasingly important role of the computer for doing research-level mathematics.</p>
<p>In addition, we must note a more subtle but equally significant technology: mathematical notation has evolved over time by a process which isn’t arbitrary. While the space of satisfactory mathematical notations might be large, most randomly generated notations are bad for doing mathematics, which is why mathematicians define <a href="https://mathoverflow.net/questions/42929/suggestions-for-good-notation">rules of thumb for good notation</a>. The triumph of Leibniz’s notation over Newton’s is a concrete example of this. Moreover, Terence Tao once wrote a full <a href="https://terrytao.wordpress.com/advice-on-writing-papers/use-good-notation/">blog post</a> on this issue which includes the following quote due to Alfred North Whitehead:</p>
<blockquote>
<p>By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental power of the race.</p>
</blockquote>
<p>Yet, this statement flies in the face of Cognitive Science orthodoxy as stated in [1]:</p>
<blockquote>
<p>Cognitive scientists have traditionally viewed this capacity-the capacity for symbolic reasoning-as grounded in the ability to internally represent numbers, logical relationships, and mathematical rules in an abstract, amodal fashion.</p>
</blockquote>
<p>Clearly, this line of reasoning is absurd. If anything, both scientific and empirical evidence strongly indicate that our sensorimotor experience is an essential substrate for mathematical thought and not merely a translational medium. When combined with the importance of curiosity, it follows that we
have to encourage individual experimentation with technologies aiding mathematical activity in order to maximise the collective human potential for
mathematical discovery.</p>
<h2 id="conclusion">Conclusion:</h2>
<p>Having laid out these arguments, I think it’s clear that the Cyborg view of mathematics provides more stable foundations for mathematics than the orthodox view which is not only scientifically and empirically baseless, but also diminishes our collective potential for mathematical discovery. In particular, I would like to point out a few key innovations in the Cyborg tradition which have yet to be fully appreciated at the university level.</p>
<p>The first is the use of online blogs for communicating mathematical ideas, as written homework/projects can be isolating rather than engaging: you generally get very little feedback even when you receive a good mark, which trivialises the activity. The second is the creation of <a href="https://gowers.wordpress.com/2009/01/27/is-massively-collaborative-mathematics-possible/">Polymath projects</a> for exploring the role of large-scale self-organising collaboration among students. Finally, I think mathematicians of all levels of ability can benefit from using <a href="http://jupyter.org/">Jupyter notebooks</a> for interactive experimental mathematics, as I have whenever investigating problems in combinatorics or probability.</p>
<p>In my opinion, these innovations indicate yet-unrealised potential. Indeed, I believe that if the majority of mathematicians transition towards a Cyborg perspective of mathematical foundations, we shall witness a much more creative period of mathematics.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>A perceptual account of symbolic reasoning (David Landy, Colin Allen & Carlos Zednik. 2014. Frontiers in Psychology.)</p>
</li>
<li>
<p>La Science et L’Hypothèse (Henri Poincaré. 2014. Champs Sciences.)</p>
</li>
</ol>Aidan RockeIntroduction:The theoretical limitations of DQN2017-08-29T00:00:00+00:002017-08-29T00:00:00+00:00/inference/2017/08/29/dqn<center><img src="https://raw.githubusercontent.com/pauli-space/pauli-space.github.io/master/_images/dqn.jpg" align="middle" /></center>
<h1 id="introduction">Introduction:</h1>
<p>Less than three years after DeepMind published ‘Playing Atari with Deep Reinforcement Learning’,
the practical impact of this method on the RL literature has been profound, as evidenced by the above graphic. However, the
theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis
actually clarifies the evolution of DQN and highlights which research directions are worth prioritising.</p>
<h1 id="background-on-dqn">Background on DQN:</h1>
<p>The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions <script type="math/tex">a \in \mathcal{A}</script> and states <script type="math/tex">x \in X</script> in a Markov
Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return:</p>
<script type="math/tex; mode=display">\begin{equation}
Q_{\pi}(x,a) = \mathbb{E} \Big[\sum_{t=0}^{\infty} \gamma^t R(x_t,a_t) \, \Big| \, x_0 = x, a_0 = a\Big], \quad \gamma \in (0,1)
\end{equation}</script>
<p>In particular the aim is to approximate a parametrised value function <script type="math/tex">Q(x,a;\theta_t)</script> where estimation is shifted towards the target:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{\arg\max}\, Q(S_{t+1},a;\theta_{t});\theta_t)
\end{equation}</script>
<p>and gradient descent updates are done as follows:</p>
<script type="math/tex; mode=display">\begin{equation}
\theta_{t+1} = \theta_t + \alpha(Y_t^Q-Q(S_t,A_t;\theta_t)) \nabla_{\theta} Q(S_t,A_t;\theta_t)
\end{equation}</script>
<p>In addition, epsilon-greedy action selection is used for exploration. To avoid estimates that merely reflect
recent experience, the authors of DQN also use experience replay: minibatch updates
sampled from a buffer of stored past transitions.</p>
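<p>A tabular sketch may help make equations (2) and (3) concrete. This is not the neural-network implementation from the paper, just the update rule applied to a small table of values:</p>

```python
import numpy as np

gamma, alpha = 0.9, 0.1
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))  # tabular stand-in for Q(s, a; theta_t)

def q_update(s, a, r, s_next):
    """One update in the spirit of equations (2)-(3): bootstrap from the
    greedy action under the current estimates, then move Q(s, a) towards it."""
    y = r + gamma * Q[s_next, np.argmax(Q[s_next])]  # the target Y_t^Q
    Q[s, a] += alpha * (y - Q[s, a])                 # tabular analogue of the gradient step
    return y

y = q_update(0, 1, 1.0, 2)  # observed reward 1.0 after taking action 1 in state 0
```

<p>Note that selection (the <code>argmax</code>) and evaluation (the bootstrapped value) both use the same table, which is precisely the coupling discussed below.</p>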
<p>Given the above description of DQN, we may note the following:</p>
<ol>
<li>Selection and evaluation in DQN is done with respect to the same parameters <script type="math/tex">\theta_t</script>.</li>
<li>Assuming that variance is unavoidable, the <script type="math/tex">\max</script> operator in (2) leads to over-optimistic estimates.</li>
<li>The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.</li>
</ol>
<p>These issues shall be addressed in the sections that follow.</p>
<h1 id="asymptotic-nonsense-or-the-data-inefficiency-of-dqn">Asymptotic nonsense or the data-inefficiency of DQN:</h1>
<p>In the simple case of i.i.d. data <script type="math/tex">X_i</script> if <script type="math/tex">S_n = \sum_{i=1}^{n} X_i</script> and <script type="math/tex">\mathbb{E}[X_i] = \mu</script>, a simple application of Chebyshev’s inequality gives:</p>
<script type="math/tex; mode=display">\begin{equation}
\forall \epsilon > 0, \quad P\left(\left|\frac{S_n}{n}-\mu\right| > \epsilon\right) \leq \frac{\sigma^2}{n \epsilon^2}
\end{equation}</script>
<p>Essentially, this inequality shows that even in simple scenarios convergence of the sample mean requires a lot of data,
and the rate of convergence depends on the variance <script type="math/tex">\sigma^2</script>. Furthermore, we must note that this inequality ignores
the following facts:</p>
<ol>
<li>For fixed <script type="math/tex">(x,a)</script>, <script type="math/tex">Q_{\pi}(x,a)</script> is rarely unimodal in practice.</li>
<li><script type="math/tex">Q_{\pi}(x,a)</script> rarely has negligible variance.</li>
<li>Our data is sequential and hardly ever i.i.d.</li>
</ol>
<p>From these points it follows that important estimation errors are unavoidable, but as I will show, this isn’t the main
problem.</p>
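<p>The slow convergence quantified by Chebyshev’s inequality is easy to verify by simulation. The following sketch (i.i.d. Gaussian data, chosen purely for illustration) compares the empirical failure rate of the sample mean against the bound:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, eps, trials = 0.0, 1.0, 0.1, 1000

# Empirical failure rate P(|S_n/n - mu| > eps) versus the Chebyshev bound
results = {}
for n in (100, 1000, 4000):
    means = rng.normal(mu, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)
    empirical = float(np.mean(np.abs(means - mu) > eps))
    bound = min(1.0, sigma2 / (n * eps**2))  # sigma^2 / (n eps^2)
    results[n] = (empirical, bound)
```

<p>Even with a tolerance as loose as <script type="math/tex">\epsilon = 0.1</script>, the bound only becomes informative after thousands of samples per estimate, and matters are worse under the multimodal, high-variance, non-i.i.d. conditions listed above.</p>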
<h1 id="the-unreasonable-optimism-of-dqn">The unreasonable optimism of DQN:</h1>
<ol>
<li>
<p>Over-optimism with respect to estimation errors:</p>
<p>The authors in [3] highlight that in (2), evaluation of the target <script type="math/tex">Y_t^Q</script> and action selection are done with respect to
the same parameters <script type="math/tex">\theta_t</script>, which, combined with the <script type="math/tex">\max</script> operator, makes over-optimistic value estimates more likely.
This suggests that estimation errors of any kind are more likely to result in overly-optimistic policies.</p>
<p>While this is problematic, the authors of [3] discovered the following elegant solution:</p>
<script type="math/tex; mode=display">\begin{equation}
Y_t^Q = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{\arg\max}\, Q(S_{t+1},a;\theta_{t});\theta'_{t})
\end{equation}</script>
<p>The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights <script type="math/tex">\theta</script>
and <script type="math/tex">\theta'</script>.</p>
</li>
<li>
<p>Over-optimism with respect to risk regardless of estimation error:</p>
<p>Consider the classic problem in decision theory of having to choose between an envelope <script type="math/tex">A</script> which contains $90.00 and envelope
<script type="math/tex">B</script> which contains $200.00 or $0.00 with equal probability. Although <script type="math/tex">Var[A] \ll Var[B]</script>, our agent’s
ignorance of the bimodality of <script type="math/tex">B</script> would lead it to act in an over-optimistic fashion. Due to the <script type="math/tex">\max</script> operator
it would make a decision solely based on the fact that <script type="math/tex">\mathbb{E}[B] > \mathbb{E}[A]</script>.</p>
<p>The above problem clearly requires a very different perspective.</p>
</li>
</ol>
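<p>The first kind of over-optimism can be seen in a small numerical experiment: when the true action values are identical, noise alone makes the coupled target overestimate, while decoupling selection and evaluation in the style of Double DQN removes the bias. A sketch (a toy model, not the full training loop):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
trials, n_actions = 100_000, 4

# All true action values are zero; the two estimates differ only by noise.
q_online = rng.normal(0.0, 1.0, (trials, n_actions))  # Q(., .; theta_t)
q_target = rng.normal(0.0, 1.0, (trials, n_actions))  # Q(., .; theta'_t)

greedy = q_online.argmax(axis=1)
# DQN: select AND evaluate with theta_t -> systematic overestimation.
bias_dqn = float(q_online.max(axis=1).mean())
# Double DQN: select with theta_t, evaluate with theta'_t -> unbiased here.
bias_double = float(q_target[np.arange(trials), greedy].mean())
```

<p>Here <code>bias_dqn</code> is around 1 (the expected maximum of four standard Gaussians) even though every true value is zero, while <code>bias_double</code> is approximately zero. Note that this fixes only estimation-driven optimism, not the risk-blindness of the envelope example.</p>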
<p>Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail, I would recommend that the
reader start with [5], which provides an elegant and scalable solution via what can be thought of as a data-dependent
version of dropout [8]. Taking the distribution of values into account, rather than only their expectation, allows the agent to represent its uncertainty and improves inference.</p>
<h1 id="the-latent-value-of-hierarchical-models">The latent value of hierarchical models:</h1>
<p>Perhaps the most important question when considering the evolution of DQN is how these agents will develop rich conceptual abstractions
that allow scientific induction, or generalisation. Although one can argue that a DQN learns good statistical representations of
environmental states <script type="math/tex">x</script>, it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive
and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.</p>
<p>In particular, I would like to mention the promising work of Tejas Kulkarni, who investigated the use of hierarchical DQN [6], which has the following architecture:</p>
<ol>
<li>Controller: which learns policies in order to satisfy particular goals</li>
<li>Meta-Controller: which chooses goals</li>
<li>Critic: which evaluates whether a goal has been achieved</li>
</ol>
<p>Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned
over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting
research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a
grammar over policies. I think this might be a necessary component for the emergence of language in machines.</p>
<p>The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of
introspection.</p>
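<p>The interaction between the three components can be sketched schematically. All policies below are random stand-ins, not the learned networks from [6]; the point is only the nesting of the two control loops and the critic’s intrinsic reward:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n_goals, n_actions = 4, 2

def meta_controller(state):
    """Chooses a goal; a random stand-in for the learned high-level policy."""
    return int(rng.integers(0, n_goals))

def controller(state, goal):
    """Chooses an atomic action; a random stand-in for the low-level policy."""
    return int(rng.integers(0, n_actions))

def critic(state, goal):
    """Intrinsic reward: 1 when the current state matches the goal."""
    return 1.0 if state == goal else 0.0

def env_step(state, action):
    """Toy deterministic environment on four states."""
    return (state + action) % n_goals

state, intrinsic_return = 0, 0.0
for episode in range(20):          # meta-controller picks one goal per episode
    goal = meta_controller(state)
    for t in range(10):            # controller acts until the goal or a timeout
        state = env_step(state, controller(state, goal))
        r_int = critic(state, goal)
        intrinsic_return += r_int
        if r_int > 0:              # critic signals that the goal was achieved
            break
```

<p>In the actual method, the inner loop would train the controller on the critic’s intrinsic reward while the outer loop trains the meta-controller on the environment’s extrinsic reward.</p>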
<h1 id="conclusion">Conclusion:</h1>
<p>I find it remarkable that a method as simple as DQN should inspire so many new approaches. Perhaps it’s not so much the brilliance
of the method but rather its generality that allowed it to adapt and evolve. In particular, I think the coupling
of Distributional RL with Hierarchical Deep RL has a very bright future. Together, these will lead to significant improvements in terms of inference and generalisation.</p>
<p><strong>Note:</strong> The graphic is taken from [9].</p>
<h1 id="references">References:</h1>
<ol>
<li>C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.</li>
<li>V. Mnih, K. Kavukcuoglu, D. Silver et al. Playing Atari with Deep Reinforcement Learning. 2013.</li>
<li>H. van Hasselt, A. Guez and D. Silver. Deep Reinforcement Learning with Double Q-learning. 2015.</li>
<li>Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and Exploration via Randomized Value Functions. 2017.</li>
<li>Ian Osband, Charles Blundell, Alexander Pritzel and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. 2016.</li>
<li>Tejas Kulkarni et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. 2016.</li>
<li>Marc G. Bellemare, Will Dabney and Rémi Munos. A Distributional Perspective on Reinforcement Learning. 2017.</li>
<li>Yarin Gal & Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 2016.</li>
<li>Niels Justesen, Philip Bontrager, Julian Togelius, Sebastian Risi. Deep Learning for Video Game Playing. 2017.</li>
</ol>Aidan Rocke