Derivation of common activation functions
In this blog post I’d like to show how commonly used activation functions can be derived from the sigmoid activation function. As a result, we can show that these functions have a shared mathematical lineage with the sigmoid.

sigmoid:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

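hyperbolic tangent:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Now, we note that:
$$\frac{\tanh(x) + 1}{2} = \frac{e^{x}}{e^{x} + e^{-x}} = \frac{1}{1 + e^{-2x}} = \sigma(2x)$$
From this it follows that we have:
$$\tanh(x) = 2\sigma(2x) - 1$$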
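
As a quick sanity check, the identity tanh(x) = 2·sigmoid(2x) − 1 can be verified numerically. The sketch below uses only Python's standard `math` module; the helper names (`sigmoid`, `tanh_from_sigmoid`, `max_tanh_error`) are my own choices for illustration and not part of any library.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x: float) -> float:
    """tanh written in terms of the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def max_tanh_error(points) -> float:
    """Largest absolute gap between the rescaled sigmoid and math.tanh."""
    return max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in points)


# Evaluate on a grid over [-5, 5]; the gap should sit at floating-point noise.
points = [i / 10.0 for i in range(-50, 51)]
print(max_tanh_error(points))
```

On this grid the two expressions should agree up to rounding error, which is what we expect if tanh is just a shifted and rescaled sigmoid.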

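softplus:
Now, if we compute the integral of the sigmoid:
$$\int \sigma(x)\,dx = \int \frac{e^{x}}{1 + e^{x}}\,dx = \ln(1 + e^{x}) + C$$
where $C$ is an arbitrary constant. Setting $C = 0$ gives the softplus activation:
$$\mathrm{softplus}(x) = \ln(1 + e^{x})$$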
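
Since the softplus ln(1 + e^x) is an antiderivative of the sigmoid, its derivative should recover the sigmoid. Here is a minimal numerical check of that claim, assuming a central finite difference with a hypothetical step size `h = 1e-5`; the function names are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    """Softplus: ln(1 + e^x), the antiderivative of the sigmoid with C = 0."""
    return math.log1p(math.exp(x))


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def max_derivative_gap(points) -> float:
    """Largest absolute gap between d/dx softplus(x) and sigmoid(x)."""
    return max(abs(central_difference(softplus, x) - sigmoid(x)) for x in points)


# Compare the numerical derivative of softplus with the sigmoid on [-5, 5].
points = [i / 10.0 for i in range(-50, 51)]
print(max_derivative_gap(points))
```

The gap is limited by the finite-difference approximation rather than by the identity itself, so it will be small but not exactly zero.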

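ReLU:
Note that in the softplus $\ln(1 + e^{x})$, when $|x|$ is large,
$$\ln(1 + e^{x}) \approx x \ \text{ for } x \gg 0, \qquad \ln(1 + e^{x}) \approx 0 \ \text{ for } x \ll 0$$
From this we may deduce the much more computationally efficient ReLU activation:
$$\mathrm{ReLU}(x) = \max(0, x)$$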
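
To see how closely ReLU(x) = max(0, x) tracks the softplus ln(1 + e^x), we can tabulate both on a few sample points. This is only an illustrative sketch: the sample points are arbitrary, and `gap` is a helper name I introduce here.

```python
import math


def softplus(x: float) -> float:
    """Softplus: ln(1 + e^x). Note math.exp overflows for very large x."""
    return math.log1p(math.exp(x))


def relu(x: float) -> float:
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def gap(x: float) -> float:
    """Difference softplus(x) - relu(x), which equals ln(1 + e^(-|x|))."""
    return softplus(x) - relu(x)


# The gap is largest at x = 0 and shrinks as |x| grows.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  softplus={softplus(x):.6f}  relu={relu(x):.6f}  gap={gap(x):.6f}")
```

The gap peaks at ln 2 when x = 0 and decays exponentially in |x|, so ReLU can be read as the large-|x| limit of the softplus that avoids evaluating an exponential and a logarithm.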
What I find very interesting is that, although these activation functions can all be derived from the sigmoid, they have very different properties from it. I’m not sure we can derive all the emergent properties of a neural network that uses a particular activation function with the tools of real analysis alone, but this is an interesting question that I shall certainly revisit in the near future.
References:
 Understanding the difficulty of training deep feedforward neural networks (X. Glorot & Y. Bengio. 2010. AISTATS.)
 Rectified Linear Units Improve Restricted Boltzmann Machines (V. Nair & G. Hinton. 2010. ICML.)