Approximately one week ago, I defined a set of experiments to model the effects of dropout and unsupervised pre-training on deep rectifier networks. However, before running the experiments I realised that this was an opportunity to develop my own personal research workflow. After some reflection I decided to follow this process:

  1. Define experiments: including methodology, experimental setup and working hypotheses
  2. Share preliminary observations: so that readers understand where the scientific intuitions come from, and to help me overcome writer’s block
  3. Experimental analysis: detailed statistical analysis of experimental results including hypothesis testing
  4. Theoretical analysis: a theoretical treatment of the experimental results
  5. Further discussion: discuss phenomena that are worth investigating further

The present blog post covers part of stage 2. In particular, today I share interesting observations concerning vanilla three-layer rectifier networks with 500 nodes per layer, trained on the MNIST dataset without dropout or unsupervised pre-training.
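For concreteness, here is a minimal sketch of the kind of network studied here, written in PyTorch. It is a sketch only, not the exact training setup (that code is linked at the end of the post); inputs are assumed to be flattened 28×28 MNIST images.

```python
import torch.nn as nn

# Three hidden rectifier layers, 500 units each, with a 10-way readout.
model = nn.Sequential(
    nn.Linear(784, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 10),
)
```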

Visualizing digits in binary activation space:

Two-dimensional embedding of binary activations

Above is a two-dimensional linear embedding of binary representations, obtained by turning the output of each hidden layer into a binary (active/inactive) mask, concatenating the masks, and applying PCA. This method is inspired by [5], where the authors used a similar approach to study local competition among subnetworks within deep rectifier networks. Although my clusters are not as well separated as those reported by R. Srivastava et al., there is clear evidence of emergent organisation among subnetworks within deep rectifier networks.

In particular, we may note that 1 is very near 2, 7 is near 9, and 0 blends with 4. A Canadian AI researcher might argue that 0 is entangled with 4 [6]. However, the explained variance of the two-component PCA was only around 40%, which means that a lot of information was lost in going from 1500 dimensions down to 2. This suggests that we might need a more reliable method for analysing variable disentangling.
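Roughly, the binary codes and the embedding can be computed as follows. This is a simplified sketch rather than the exact code, assuming the post-ReLU activations of the three hidden layers have already been collected into arrays `h1`, `h2`, `h3`:

```python
import numpy as np
from sklearn.decomposition import PCA

def binary_codes(hidden_activations):
    """Concatenate the on/off masks of the hidden layers into one binary code per example."""
    # hidden_activations: list of three arrays of shape (n_examples, 500)
    masks = [(h > 0).astype(np.float32) for h in hidden_activations]
    return np.concatenate(masks, axis=1)        # shape: (n_examples, 1500)

# codes = binary_codes([h1, h2, h3])
# pca = PCA(n_components=2)
# embedding = pca.fit_transform(codes)          # 2D points, coloured by digit class
# print(pca.explained_variance_ratio_.sum())    # roughly 0.4 in the run described above
```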

Variable disentangling:

The average Euclidean distance between representations per class:

The above heatmap shows the average Euclidean distance between the binary representations of each pair of class labels. This is useful because the class-averaged representation gives an indication of the relative contribution of each node when predicting a particular class. In particular, we note that 7 appears to be quite close to 9, but 0 doesn’t appear to be particularly close to 4, contrary to what the PCA plot suggested. This is why I always treat low-dimensional visualizations with caution.
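One way to produce such a heatmap is to compare class-averaged codes directly. A simplified sketch, reusing `codes` from the previous snippet and assuming a `labels` array of digit classes:

```python
import numpy as np

def class_distance_matrix(codes, labels, n_classes=10):
    """Euclidean distance between the average binary codes of each pair of classes."""
    means = np.stack([codes[labels == c].mean(axis=0) for c in range(n_classes)])
    diffs = means[:, None, :] - means[None, :, :]
    return np.linalg.norm(diffs, axis=-1)       # (10, 10) matrix to plot as a heatmap
```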

I also tried a different approach for analysing variable disentangling which gave very interesting and unexpected results.

Fraction of nodes shared per class:

The above heatmap shows that the fraction of nodes shared between each pair of class-specific subnetworks is always above 90%, which is quite surprising. In other words, subnetworks tasked with predicting different classes share at least 90% of their nodes. This means that there is essentially a core representation that is reused across examples, with small variations from one example to the next, and these small variations are very important. In some sense the deep rectifier network is very efficient at sharing resources, and I believe this relates well to the notion of local competition described by R. Srivastava et al. in [5]. It certainly merits further study.
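For reference, a simplified sketch of how such a shared-node fraction can be computed. The membership criterion here (a node belongs to a class’s subnetwork if it fires for at least half of that class’s examples) is an assumption on my part rather than the exact rule used in the linked code:

```python
import numpy as np

def shared_node_fraction(codes, labels, n_classes=10, threshold=0.5):
    """Fraction of nodes shared between the subnetworks of each pair of classes."""
    # A node belongs to a class's subnetwork if it fires for >= `threshold`
    # of that class's examples (an assumed criterion).
    subnets = np.stack([codes[labels == c].mean(axis=0) >= threshold
                        for c in range(n_classes)])            # (10, 1500) booleans
    shared = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            union = np.logical_or(subnets[i], subnets[j]).sum()
            shared[i, j] = np.logical_and(subnets[i], subnets[j]).sum() / max(union, 1)
    return shared
```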

Prior to studying the fraction of shared nodes between subnetworks, I imagined that the relative sparsity of activity in deep rectifier networks would make the above observation quite improbable. This is also why I looked into the mean activations per hidden layer.

Mean activity per hidden layer per epoch:

If it’s not clear, the above set of histograms shows the mean activations for each of the three hidden layers at each of the five epochs (a small sketch of how such statistics can be computed follows the list). What I find interesting is that we observe:

  1. Convergence in distribution across epochs, which I quantified using conditional entropy.
  2. The mean activation for the first hidden layer has a mode around 0.5, whereas the mean activations for the second and third hidden layers have modes around 0.7.
  3. This indicates that, on average, (0.5 + 0.7 + 0.7)/3 ≈ 63% of the nodes are active at any given time. Based on what I’ve read in [1], I would expect this fraction to decrease if we fix the width while increasing the depth of the network, but it appears that we don’t yet have a good mathematical model for predicting the number of active nodes given a dataset with a particular sample complexity.
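As promised above, a simplified sketch of how these activity fractions can be computed, assuming the hidden activations are saved at the end of every epoch (the variable `activations_per_epoch` is hypothetical):

```python
import numpy as np

def activity_fractions(hidden_activations):
    """Per-example fraction of active nodes in each hidden layer."""
    # hidden_activations: list of arrays of shape (n_examples, 500), one per layer.
    # Histogramming each returned array reproduces one row of the plot above.
    return [(h > 0).mean(axis=1) for h in hidden_activations]

# for epoch, (h1, h2, h3) in enumerate(activations_per_epoch):
#     fractions = activity_fractions([h1, h2, h3])   # three arrays to histogram
```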

Now, although it wasn’t suggested in any of the papers I’ve read so far, I figured that I could probably use the mean activations per hidden layer to study variable-size representation as well as sparsity. My reasoning was that if one class requires subnetworks with more nodes than another class on average, then this would probably capture the notion of variable-size representation described in [6]:

Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision. - X. Glorot, A. Bordes & Y. Bengio

Variable-size representation:

| class               | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9  |
|---------------------|---|---|---|---|---|---|---|---|---|----|
| variable-size rank  | 1 | 9 | 3 | 7 | 6 | 5 | 2 | 4 | 8 | 10 |

This table effectively shows the relative dimensionality that the neural network assigns to each class. I obtained it by calculating the average number of active nodes used to predict each class and then ranking these values by size.
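A simplified sketch of the ranking computation, reusing `codes` and `labels` from the earlier snippets; note that the direction of the ranking (1 = fewest active nodes) is my convention here:

```python
import numpy as np

def variable_size_ranks(codes, labels, n_classes=10):
    """Rank classes by the average number of active nodes in their binary codes."""
    sizes = np.array([codes[labels == c].sum(axis=1).mean() for c in range(n_classes)])
    ranks = np.empty(n_classes, dtype=int)
    ranks[sizes.argsort()] = np.arange(1, n_classes + 1)   # 1 = smallest average code
    return ranks                                           # ranks[c] for digit class c
```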

An interesting and essential follow-up question is whether this relative order is preserved when we train a rectifier network with the same architecture on a subset of the 10 original classes. If we did pair-wise experiments, for example, we would need 45 of them. If the relative order is difficult to reproduce, then we have a problem with the notion of variable size. Right now I am not sure whether there’s a simple theory that would explain how a neural network controls variable size; the only way to find out is to do the experiments.

The example ordering problem:

Finally, I also took a look at the example ordering problem, a limitation of gradient descent for training neural networks noted in [3]. As the authors observe, the relative contribution of each epoch to the model that emerges within a neural network isn’t representative of the information available per epoch. In fact, we observe that the weights change much more during the earlier epochs than during the later ones:

The above plot shows that the change in the weight norm is larger during the earlier epochs than during the later ones. This is consistent with the model of gradient descent obtained in [4], which draws an approximate correspondence between gradient descent and high-dimensional damped oscillators, but it isn’t good news: it means that gradient descent is not a data-efficient method for learning signals from data.
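For completeness, a simplified sketch of the measurement behind the plot, assuming the weight matrices are snapshotted at the end of each epoch (the variable `weights_per_epoch` is hypothetical):

```python
import numpy as np

def weight_change_per_epoch(weights_per_epoch):
    """L2 norm of the total weight change between consecutive epoch snapshots."""
    # weights_per_epoch[e] is assumed to be a list of the network's weight arrays
    # saved at the end of epoch e.
    flat = [np.concatenate([w.ravel() for w in snapshot])
            for snapshot in weights_per_epoch]
    return [float(np.linalg.norm(b - a)) for a, b in zip(flat[:-1], flat[1:])]
```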

I think that the only way to avoid this is to add temporal memory to networks so that they can perform inference forwards and backwards in time. If I analysed this problem further I would probably rediscover one of the many recurrent neural network architectures, or perhaps discover my own. Very often it’s useful to approach problems as if they haven’t been investigated before; that’s the only way to become a good theoretician.


This marks the end of my first observational study, and I think you’ll agree that it has highlighted many questions worth further investigation. So what next? I plan to do more detailed observational studies on the following questions, in this order:

  1. Sparsity of representations as we increase the depth of a rectifier network while keeping the width constant
  2. Stability of relative variable size for randomly chosen subclasses
  3. Solutions to the example ordering problem

I will continue to use the MNIST dataset, but for each of the above questions I will try to find models that take sample complexity into account, so that the conclusions generalise well. Once I’ve gone through these semi-formal observational studies, which are useful for developing intuitions, I’ll proceed with the experiments I defined earlier.

Note: If you’d like to repeat this analysis, the code I used is available here, but I would wait until the weekend because I’m going to make some important changes. It’s a bit of a mess at the moment.


  1. Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)
  2. Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)
  3. Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)
  4. The Physical Systems behind Optimization (L. Yang et al. 2017.)
  5. Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)
  6. Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)