Two-dimensional embedding of binary activations


For my first set of Pauli Space experiments, I thought I would start with elementary questions whose answers might lead to more data-efficient deep models and algorithms. First, I shall explore the connection between unsupervised learning and representation learning, which is what leads to better generalisation [1]. Second, I shall investigate the example ordering problem, as I believe it represents a fundamental limitation of gradient-based algorithms [3]. One of these questions is whether Bernoulli dropout counteracts the example ordering problem by encouraging a globally weaker-than-exponential rate of convergence: at each iteration a new model encounters a new batch of examples, effectively allowing exponentially many models to discover exponentially many local minima.

Unlike most scientists, I publicly announce my hypotheses before performing my experiments, which avoids the revisionist tendency that is prevalent in many academic circles, including machine learning. My previous blog post made important remarks on the necessity for better scientific methodology within the field of machine learning. I plan to uphold this standard and improve upon it, as I believe this shall naturally lead to better science.

Finally, the ultimate goal of this series of experiments, and of all future experiments, is to find powerful mathematical abstractions of deep models that make verifiable predictions.

Experimental setup:

  1. Computing:
    1. Device: MacBook Air 13-inch
    2. Processor: 1.6 GHz Intel Core i5
    3. Memory: 8 GB 1600 MHz DDR3
  2. Data:
    1. MNIST: 28x28 handwritten digits with 10 classes // train: 60k examples // test: 10k examples
    2. CIFAR-10: 32x32 color images with 10 classes // train: 50k examples // test: 10k examples
  3. Baseline models:
    • fully-connected ReLU network: [784/1024,500,500,500,10]
  4. Infrastructure:
    1. Keras
    2. TensorFlow
  5. Timeline:
    1. Start: 2017-06-15
    2. End: 2017-06-19


The goal of this experiment is to perform the following analyses for deep rectifier networks trained with and without unsupervised pre-training, and to develop theoretical interpretations of the observed results.

  1. Visualize the activation space:
    1. Apply a binary mask to ReLU activations then concatenate binary activations so we have a binary vector per example
    2. Apply PCA(n=2) to the binary vectors from the training set and visualise the resulting clusters
    3. Do we observe nice clustering in the activation space?
    4. Do we observe quantitative indications of variable disentangling?
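Steps 1.1 and 1.2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the input batch is random noise standing in for flattened MNIST digits, and the helper names `binary_activation_codes` and `pca_2d` are hypothetical. PCA is computed through an SVD of the centred data, which is equivalent to PCA(n=2) up to the sign of each component.

```python
import numpy as np

def binary_activation_codes(x, weights, biases):
    """Return one concatenated 0/1 vector per example for the hidden ReLU layers."""
    codes = []
    h = x
    for w, b in zip(weights, biases):
        pre = h @ w + b
        codes.append(pre > 0)        # binary mask: which units fire for each example
        h = np.maximum(pre, 0.0)     # the ReLU output feeds the next layer
    return np.concatenate(codes, axis=1).astype(np.float64)

def pca_2d(codes):
    """Project the rows onto their first two principal components via SVD."""
    centred = codes - codes.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

# Assumption: random (untrained) weights with the hidden sizes of the baseline model.
rng = np.random.default_rng(0)
sizes = [784, 500, 500, 500]         # the 10-unit output layer is not a ReLU layer
weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.random((64, 784))            # stand-in for a batch of flattened digits

codes = binary_activation_codes(x, weights, biases)   # shape (64, 1500)
embedding = pca_2d(codes)                             # shape (64, 2)
```

With a trained Keras model, the pre-activations would instead be read from the hidden layers, and `embedding` would be scatter-plotted with one colour per class to inspect the clustering.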

  2. Analyse the total number of distinct binary vectors (i.e. representations) as a function of the number of classes:
    1. What happens when we train a model on only a subset of the 10 original classes?
    2. Is there an observable relationship between the number of representations per class and the number of classes?
    3. In particular, for fixed network size does the relative sparsity of subnetwork nodes increase as we increase the number of classes?
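Counting distinct representations, as analysis 2 requires, reduces to counting unique rows of the binary activation matrix. The sketch below assumes the binary vectors are already stacked into an array with one row per example; the function name `count_distinct_codes` and the toy data are hypothetical.

```python
import numpy as np

def count_distinct_codes(codes, labels):
    """Return (total distinct codes, {class: distinct codes within that class})."""
    codes = np.asarray(codes, dtype=np.uint8)
    labels = np.asarray(labels)
    total = len(np.unique(codes, axis=0))
    per_class = {int(c): len(np.unique(codes[labels == c], axis=0))
                 for c in np.unique(labels)}
    return total, per_class

# Toy data: five examples, three hidden units, two classes.
codes = np.array([[1, 0, 1],
                  [1, 0, 1],
                  [0, 1, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
labels = np.array([0, 0, 1, 1, 1])

total, per_class = count_distinct_codes(codes, labels)
print(total, per_class)   # 3 {0: 1, 1: 3}
```

Repeating this for models trained on 2, 3, ..., 10 classes would give the curve of representations per class against the number of classes that question 2.2 asks about.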

  3. Analyse the fraction of active units per class (i.e. variable-size representation):
    1. Is the order of the variable-size representation respected when I choose sub-classes?
    2. Can the empirical variable-size representation be assigned a theoretical interpretation?
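One possible way to make question 3.1 concrete is to rank classes by their mean fraction of active units and check whether the ranking of a subset of classes survives retraining on that subset. This is a sketch under that assumption; the helper names and the numbers for the subset model are hypothetical, not experimental results.

```python
import numpy as np

def active_fraction_per_class(codes, labels):
    """Mean fraction of active (non-zero) units over the examples of each class."""
    codes = np.asarray(codes, dtype=float)
    labels = np.asarray(labels)
    return {int(c): float(codes[labels == c].mean()) for c in np.unique(labels)}

def ranking(fractions):
    """Classes sorted from the smallest to the largest active fraction."""
    return sorted(fractions, key=fractions.get)

def order_respected(full, subset):
    """True if the classes in `subset` keep the relative order they have in `full`."""
    restricted = [c for c in ranking(full) if c in subset]
    return restricted == ranking(subset)

# Toy binary activations for a model trained on three classes.
codes = np.array([[1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [1, 1, 1, 1]])
labels = np.array([0, 0, 1, 2])
full = active_fraction_per_class(codes, labels)   # {0: 0.375, 1: 0.75, 2: 1.0}

# Hypothetical fractions from a second model trained on classes 0 and 2 only.
subset = {0: 0.4, 2: 0.9}
print(order_respected(full, subset))              # True
```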

  4. Analyse the example ordering problem:
    1. Is the rate of change of the weight norms much greater during the early epochs than during later epochs?
    2. Does this problem exist to the same degree for all gradient-based optimizers and can we describe this relationship mathematically?
    3. To what extent is this problem anticipated by models of optimizers?
    4. Can we find training algorithms that are both efficient and don’t suffer from the example ordering problem?
    5. Does Bernoulli dropout make this problem almost non-existent for gradient-based optimizers?
    6. How does weight normalisation alleviate this problem?
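Question 4.1 needs a measurement of how quickly the weight norms move from one epoch to the next. A minimal sketch, assuming the weights are saved once per epoch as a list of arrays (for example from `model.get_weights()` in Keras); the function names and the snapshot values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

def weight_norm(layers):
    """L2 norm of all weight arrays treated as one flat vector."""
    return float(np.sqrt(sum(np.sum(w ** 2) for w in layers)))

def norm_change_per_epoch(snapshots):
    """Absolute change in the global weight norm between consecutive snapshots."""
    norms = np.array([weight_norm(s) for s in snapshots])
    return np.abs(np.diff(norms))

# Toy snapshots of a single 2x2 weight matrix after epochs 0, 1, 2 and 3.
snapshots = [
    [np.zeros((2, 2))],
    [np.full((2, 2), 1.0)],
    [np.full((2, 2), 1.5)],
    [np.full((2, 2), 1.6)],
]

changes = norm_change_per_epoch(snapshots)   # approximately [2.0, 1.0, 0.2]
```

Comparing these per-epoch changes across optimizers, and with or without dropout, is one way to quantify how much more the early epochs move the weights than the later ones.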

  5. How do the above analyses generalise:
    1. To ReLU networks with really wide layers?
    2. To ReLU networks with much greater depth?
    3. Can we find concise mathematical descriptions for these generalisations?


Here are my conjectures for a handful of questions:

1.4 We will observe clear indications of variable disentangling that can be determined by the fraction of shared nodes per class, and as the training set increases the variance of the fraction of shared nodes per class will decrease accordingly.

2.3 For fixed network size, the relative sparsity of subnetwork nodes increases as we increase the number of classes. My argument for this is that sparsity is a result of local competition between subnetworks, and competition (i.e. complex co-adaptation) becomes a bigger issue as we increase the number of classes [5]. This is my own interpretation of the paper by R. Srivastava et al.

3.1 The order of the variable-size representation (as measured by the fraction of active units per class) will be respected when I choose subclasses. My argument is that if this turns out to be false then the notion of variable-size representation will have to be redefined. At present it’s meant to capture the notion of an adaptively efficient encoding.

4.4 I believe we can find training algorithms that are both efficient and don’t suffer from the example ordering problem, but we will need to augment networks with memory in order to do inference forwards and backwards in time. As a first approximation, we can probably use Monte Carlo dropout to perform inference, but this will have to be investigated further.
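For reference, Monte Carlo dropout keeps the Bernoulli dropout masks switched on at prediction time and averages the outputs of many stochastic forward passes. The sketch below shows this for a one-hidden-layer ReLU network in NumPy. The random weights, the layer sizes, and the function name `mc_dropout_predict` are all assumptions for illustration, and the sketch does not address the memory-augmented inference conjectured above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mc_dropout_predict(x, w1, b1, w2, b2, keep_prob=0.5, n_samples=100, seed=0):
    """Average class probabilities over n_samples forward passes with dropout on."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_samples):
        h = np.maximum(x @ w1 + b1, 0.0)             # hidden ReLU layer
        mask = rng.random(h.shape) < keep_prob       # Bernoulli(keep_prob) mask
        h = h * mask / keep_prob                     # inverted-dropout rescaling
        probs.append(softmax(h @ w2 + b2))
    probs = np.stack(probs)                          # (n_samples, batch, classes)
    return probs.mean(axis=0), probs.std(axis=0)

# Assumption: random weights and inputs, used only to exercise the function.
rng = np.random.default_rng(1)
x = rng.random((5, 20))
w1, b1 = rng.normal(0.0, 0.3, size=(20, 32)), np.zeros(32)
w2, b2 = rng.normal(0.0, 0.3, size=(32, 10)), np.zeros(10)

mean_probs, std_probs = mc_dropout_predict(x, w1, b1, w2, b2)
```

The per-class standard deviation across passes gives a rough indication of how much the prediction depends on which sub-network was sampled.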

5.1 I believe that if I can find quantitative theoretical justifications for experimental analyses 1-4, these analyses will generalise to large ReLU networks.

Note: I’ve listed the minimum number of papers which I think I must reference for this experiment, but this list may expand. In particular, if you believe there’s a paper I ought to reference, please let me know.


  1. Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)
  2. Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)
  3. Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)
  4. The Physical Systems behind Optimization (L. Yang et al. 2017.)
  5. Understanding Locally Competitive Networks (R. Srivastava et al. 2015.)
  6. Deep Sparse Rectifier Neural Networks (X. Glorot, A. Bordes & Y. Bengio. 2011. Journal of Machine Learning Research.)