## Introduction:

Sergio Hernandez, a Spanish mathematician, recently shared some very interesting results on the OpenAI Gym environments, based on a relatively unknown paper by Dr. Wissner-Gross, a physicist trained at MIT. What is impressive about Wissner-Gross's meta-heuristic is that it is succinctly described by three equations which try to maximize the future freedom of the agent. In this analysis, I summarize the method, present its strengths and weaknesses, and attempt to improve it by making an important modification to one of the equations.

## Causal entropic forces:

In the following summary of Wissner-Gross's meta-heuristic, it's assumed that the agent has access to an approximate or exact simulator of its environment. A close reading of the original paper [1] shows that this assumption is in fact necessary.

### Macrostates:

For any open thermodynamic system, we treat the phase-space paths taken by the system $x(t)$ over the time interval $[0,\tau]$ as microstates and partition them into macrostates $\{ X_i \}_{i \in I}$ using the equivalence relation [1]:
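From [1], two paths are equivalent exactly when they share the same present state:

$$x(t) \sim x'(t) \iff x(0) = x'(0) \qquad (1)$$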

As a result, we can identify each macrostate $X_i$ with a unique present system state $x(0)$. This defines a notion of causality over a time interval.

### Causal path entropy:

We can define the causal path entropy $S_c$ of a macrostate $X_i$ with the associated present system state $x(0)$ as the path integral:
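Following [1], this is

$$S_c(X_i, \tau) = -k_B \int_{x(t)} \Pr(x(t) \mid x(0)) \, \ln \Pr(x(t) \mid x(0)) \, \mathcal{D}x(t) \qquad (2)$$

with $k_B$ Boltzmann's constant and the integral taken over all paths on $[0, \tau]$.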

where we have:
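From [1], the conditional path probability is obtained by marginalising over the environment's paths:

$$\Pr(x(t) \mid x(0)) = \int_{x^*(t)} \Pr(x(t), x^*(t) \mid x(0)) \, \mathcal{D}x^*(t) \qquad (3)$$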

In (3) we marginalise over all possible paths $x^*(t)$ taken by the open system's environment. In practice this integral is intractable, so we must resort to approximations and to a sampling algorithm such as Hamiltonian Monte Carlo [3].
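As a concrete illustration of such an approximation, the path integral can be replaced by a crude Monte Carlo estimate: sample trajectories from the simulator and compute a plug-in entropy over where they end up. The sketch below assumes a hypothetical `simulate(state, horizon)` function returning a final state vector:

```python
import numpy as np

def estimate_causal_path_entropy(simulate, x0, n_samples=500, horizon=20, bins=10):
    """Crude Monte Carlo stand-in for the causal path entropy:
    sample future trajectories from the (assumed) simulator and
    estimate the entropy of the distribution of their end states."""
    finals = np.array([simulate(x0, horizon) for _ in range(n_samples)])
    # Histogram the sampled end states and form a plug-in entropy estimate.
    hist, _ = np.histogramdd(finals, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]  # drop empty bins so the log is well defined
    return -np.sum(p * np.log(p))
```

Note that this looks only at end states rather than whole trajectories, which is one possible simplification; a more faithful estimator would bin entire paths.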

### Causal entropic force:

A path-based causal entropic force $F$ may be expressed as:
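In [1] this is the gradient of the causal path entropy with respect to the present macrostate, scaled by a causal path temperature:

$$F(X_i, \tau) = T_c \, \nabla_{X} S_c(X, \tau) \Big|_{X_i} \qquad (4)$$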

where $T_c$ and $\tau$ are two free parameters. This force drives the agent towards macrostates $X_j$ that maximize $S_c(X_j, \tau)$. In essence, the combination of equations (2), (3) and (4) maximizes the number of future options of our agent. This isn't very different from what most people try to do in life, but the meta-heuristic does have very important limitations.
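In discrete-action settings such as the Gym environments, the gradient in (4) is often replaced by a greedy surrogate: evaluate each candidate action's successor state and follow the one with the largest estimated causal path entropy. A minimal sketch, where `step` and `entropy_estimate` are hypothetical helpers standing in for the simulator and an entropy estimator:

```python
import numpy as np

def entropic_action(x0, actions, step, entropy_estimate):
    """Greedy surrogate for the causal entropic force: rather than
    computing T_c * grad S_c, score each action by the estimated
    causal path entropy of its successor state and pick the best."""
    scores = [entropy_estimate(step(x0, a)) for a in actions]
    return actions[int(np.argmax(scores))]
```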

## Limitations of the Causal Entropic approach:

1. The Causal Entropic paper makes the implicit assumption that we have access to a reliable simulator of future states. In the case of the OpenAI environments this isn't a problem, because environment simulators are provided, but in general building one is hard. Two useful approaches are suggested in [4] and [5], which use recurrent neural networks to produce, at any instant, a distribution over potential future states, assuming a given branching factor $\alpha$ and recursion depth $\beta$.

2. Maximizing your number of future options is not always a good idea: sometimes fewer options are better, provided that they are more useful options. This is why, for example, football players don't always rush to the centre of the pitch, even though that position maximizes their number of future states, i.e. possible positions on the pitch.
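The branching idea in the first limitation can be made concrete with a small sketch: given a hypothetical one-step sampler `sample_next(state)` (e.g. a draw from an RNN simulator's predictive distribution), enumerate sampled futures to depth $\beta$ with branching factor $\alpha$:

```python
def expand_futures(x0, sample_next, alpha=3, beta=2):
    """Enumerate sampled future states to recursion depth beta,
    drawing alpha successors per state at each level."""
    frontier = [x0]
    for _ in range(beta):
        frontier = [sample_next(s) for s in frontier for _ in range(alpha)]
    return frontier  # alpha ** beta sampled terminal states
```

The frontier grows as $\alpha^{\beta}$, which is why both parameters must be kept small in practice.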

In the next section I would like to show that it’s possible to find a practical solution to the second limitation by modifying (3).

## Causal Path Utility:

Assuming that a recurrent neural network is used to define the potential macrostates $\{ X_i \}_{i \in I}$, it's reasonable to expect our agent's understanding of the future to evolve with time, making the macrostates themselves functions of time: we have $\{ X_i(t) \}_{i \in I}$ rather than $\{ X_i \}_{i \in I}$. In other words, our simulator, which might be an RNN, will probably change its parameters and perhaps even its topology over time.

In order to resolve the second limitation and encourage the agent to make confident decisions, I propose that we replace $S_c(X, \tau)$ with $U_c(X, \tau)$ where:
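One natural way to write this, replacing the $-k_B \ln \Pr$ term of the path entropy with a learned utility (a sketch on my part; the utility $U$ is defined below), is:

$$U_c(X_i, \tau) = \int_{x(t)} \Pr(x(t) \mid x(0)) \, U(x(t) \mid x(0)) \, \mathcal{D}x(t) \qquad (5)$$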

Here $U(x(t)\mid x(0))$ is the normalised relative utility of a state $x(t)$ given $x(0)$: a learned function taking values in $[0,1]$. In practice, we may model $U$ with a neural network and estimate $Var[U(x(t)\mid x(0))]$ using Monte Carlo dropout, as described in [2]. The value of calculating the variance is that it allows us to prioritise subspaces with high estimated utility and low variance. In this way, we try to make sure that the agent is confident in its decisions.
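As an illustration of the uncertainty estimate, here is a deliberately tiny utility network with Monte Carlo dropout in the spirit of [2]: dropout is kept active at prediction time and the spread of the stochastic forward passes is read as the variance. The architecture and weights are placeholders, not part of the original proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

class MCDropoutUtility:
    """Toy utility model U(x(t) | x(0)) with Monte Carlo dropout.
    Weights are random here; in practice they would be learned."""

    def __init__(self, dim, hidden=64, p_drop=0.5):
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden))
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
        self.p_drop = p_drop

    def _forward(self, x):
        h = np.maximum(x @ self.w1, 0.0)             # ReLU hidden layer
        mask = rng.random(h.shape) >= self.p_drop    # dropout stays ON at test time
        h = h * mask / (1.0 - self.p_drop)           # inverted dropout scaling
        return 1.0 / (1.0 + np.exp(-(h @ self.w2)))  # sigmoid keeps U in [0, 1]

    def predict(self, x, n_passes=100):
        """Return (mean, variance) of U over stochastic forward passes."""
        samples = np.array([self._forward(x) for _ in range(n_passes)])
        return samples.mean(axis=0), samples.var(axis=0)
```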

Alternatively, we may use the variance as a regularisation term as follows:
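With a regularisation weight $\lambda \ge 0$ (my notation), the objective becomes the expected utility penalised by its variance:

$$U_c(X_i, \tau) = \mathbb{E}\big[U(x(t) \mid x(0))\big] - \lambda \, \mathrm{Var}\big[U(x(t) \mid x(0))\big] \qquad (6)$$

with the expectation and variance taken over future paths sampled from the simulator.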

This not only simplifies the calculations but also allows us to disentangle the relative contributions of utility and uncertainty. It must also be noted that the two expressions in (6) can be calculated in parallel, although the uncertainty term is the more computationally expensive of the two.
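A minimal sketch of how such a regularised score might be used to choose between candidate macrostates, with a hypothetical trade-off weight `lam` that is my notation rather than part of the original formulation:

```python
import numpy as np

def select_macrostate(mean_utilities, variances, lam=1.0):
    """Score each candidate macrostate by expected utility minus
    lam times its variance, and return the index of the best one."""
    scores = np.asarray(mean_utilities) - lam * np.asarray(variances)
    return int(np.argmax(scores)), scores
```

A macrostate with high mean utility but high variance can then lose to a slightly worse but more confident alternative, which is exactly the behaviour we want from a confident agent.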

## Discussion:

If we assume that the agent's perception of the future doesn't change much, it may come to perceive some future states as ideal. This is consistent with the empirical observation that many people believe certain accomplishments would bring them 'genuine happiness'. In other words, if the state space is compact and approximately time-invariant, the agent's optimal future macrostate converges to a fixed point [6].

While the notion of Causal Path Utility only occurred to me today, I believe that it is a very promising approach, and I shall follow up with concrete implementations very soon.

# References:

1. Causal Entropic Forces (A. D. Wissner-Gross & C. E. Freer. 2013. Physical Review Letters.)

2. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (Yarin Gal & Zoubin Ghahramani. 2016. ICML.)

3. Stochastic Gradient Hamiltonian Monte Carlo (Tianqi Chen, Emily Fox & Carlos Guestrin. 2014. ICML.)

4. Recurrent Environment Simulators (Silvia Chiappa et al. 2017. ICLR.)

5. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)

6. Fixed Point Theorems with Applications to Economics and Game Theory (Border, Kim C. 1985. Cambridge University Press.)