I often see people mystifying artificial neural networks, as if they could think or reason. In fact they are doing simple curve fitting, and I have used the following example with students to get the message across. I’m distilling it into a blog post in the hope that it will be useful for others as well. The code to produce the images is available here.

## Fitting a sinusoid with a neural network

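For this example we are going to use a very simple neural network with one input node, one output node and one hidden layer. This allows us to examine each of the hidden nodes separately to see how they contribute to the end result. With \(n\) hidden nodes, the network computes the following, where \(w_i\) and \(b_i\) are the first-layer weights and biases, \(v_i\) and \(c\) are the second-layer weights and bias, and \(f\) is the activation function (here the sigmoid):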

\(h_i = f(w_i x + b_i)\)

\(y = \sum_{i=1}^{n} v_i h_i + c\)

\(f(x) = \frac{1}{1 + e^{-x}}\)
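The three formulas above translate almost line by line into NumPy. This is only a minimal sketch: the function names, the random parameter values and the choice of 10 hidden nodes here are illustrative assumptions, not the code behind the figures.

```python
import numpy as np

def sigmoid(z):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b, v, c, f=sigmoid):
    """x has shape (samples,); w, b, v have shape (nodes,); c is a scalar."""
    h = f(np.outer(x, w) + b)   # h_i = f(w_i x + b_i), shape (samples, nodes)
    y = h @ v + c               # y = sum_i v_i h_i + c
    return y, h

# Illustrative random parameters for a network with 10 hidden nodes.
rng = np.random.default_rng(0)
w, b, v = (rng.normal(size=10) for _ in range(3))
c = 0.0

x = np.linspace(-1.0, 1.0, 5)
y, h = forward(x, w, b, v, c)
print(y.shape, h.shape)
```

Returning `h` alongside `y` is a deliberate choice: it makes it easy to look at each hidden node separately later on.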

We are going to fit the sine function, meaning that the input to the network is a real value \(x\) and the network is expected to produce \(y=\sin(x)\). We use a fixed number of samples (50) as the training set, and we add a little bit of noise to the true function value to make it a bit more interesting (and the figures easier to read). In the following figure the original function is blue and the noisy training samples that the network sees are light green.
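A training set like the one described could be generated roughly as follows. The post does not state the input range, the noise level or how the samples are spaced, so the interval `[-10, 10]`, the standard deviation `0.1` and the uniform random sampling below are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 50               # from the text
x_min, x_max = -10.0, 10.0   # assumed input range
noise_std = 0.1              # assumed noise level

x_train = rng.uniform(x_min, x_max, size=n_samples)
y_clean = np.sin(x_train)                                        # the real function
y_train = y_clean + rng.normal(scale=noise_std, size=n_samples)  # what the network sees
```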

## Using sigmoid activation

As a first step we are going to use 10 hidden nodes and the sigmoid activation function. Below I have plotted the network output after each training iteration (1000 iterations in total). It’s fun to watch how the network shifts the sigmoids around and bends the curve to better match the original function:

As the final result looks fairly good, we can now examine the contribution of individual hidden nodes. In the figure below the output of each hidden node \(h_i\) is plotted (multiplied by its outgoing weight \(v_i\)) along with the final curve.

Notice how each hidden node takes care of one side of each sinusoid bump. Indeed, the shape of the sigmoid more or less matches the curve of the sine function, so sigmoids can be used efficiently to model it. Different node outputs saturate at different levels, but in the end they are all summed and cancel each other out (the bias takes care of the remainder).
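For readers who want to reproduce the idea without a deep learning framework, here is a minimal sketch of the training run and of the per-node contributions \(v_i h_i\). It uses plain full-batch gradient descent on the mean squared error with hand-written gradients. The optimizer, learning rate, data range and noise level are assumptions chosen for transparency, and plain gradient descent may well need more iterations or a better optimizer to reach a fit as good as the one described.

```python
import numpy as np

def sigmoid(z):
    # tanh form of 1 / (1 + exp(-z)); avoids overflow for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def loss_and_grads(w, b, v, c, x, t):
    """Mean squared error of y = sum_i v_i f(w_i x + b_i) + c, plus its gradients."""
    h = sigmoid(np.outer(x, w) + b)          # hidden outputs, shape (samples, nodes)
    y = h @ v + c
    dy = 2.0 * (y - t) / x.size              # d(loss)/dy
    dz = np.outer(dy, v) * h * (1.0 - h)     # back through the sigmoid
    grads = (dz.T @ x, dz.sum(axis=0), h.T @ dy, dy.sum())   # dw, db, dv, dc
    return np.mean((y - t) ** 2), grads

rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, size=50)                # assumed input range
t = np.sin(x) + rng.normal(scale=0.1, size=50)       # assumed noise level
w, b, v = (rng.normal(size=10) for _ in range(3))    # 10 hidden nodes
c = 0.0
lr = 0.01                                            # assumed learning rate

losses = []
for _ in range(1000):
    loss, (dw, db, dv, dc) = loss_and_grads(w, b, v, c, x, t)
    losses.append(loss)
    w, b, v, c = w - lr * dw, b - lr * db, v - lr * dv, c - lr * dc

# Per-node contributions v_i * h_i; summed together with the bias c they give y.
h = sigmoid(np.outer(x, w) + b)
contributions = h * v
y = contributions.sum(axis=1) + c
```

Plotting each column of `contributions` against `x` gives the kind of per-node picture discussed above.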

What happens if we use this model to predict the sine value outside of the training range? It is completely useless, and there is no reason to blame the model: all it saw was the training set, and there is no reason for it to have any particular assumptions about data outside of that!
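The failure outside the training range follows directly from the formula: far from the data every sigmoid is saturated at 0 or 1, so the output flattens to a constant. The sketch below checks this with a handful of made-up parameters standing in for a trained network (the values are hypothetical, not learned).

```python
import numpy as np

def sigmoid(z):
    # tanh form of 1 / (1 + exp(-z)); avoids overflow for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def net(x, w, b, v, c):
    return sigmoid(np.outer(x, w) + b) @ v + c

# Hypothetical parameters, chosen by hand for illustration.
w = np.array([1.0, -0.5, 2.0, -1.5])
b = np.array([0.3, -1.0, 2.0, 0.5])
v = np.array([0.8, -1.2, 0.5, 0.7])
c = 0.1

# Far to the right, nodes with w_i > 0 output 1 and the others output 0;
# far to the left it is the other way round.
right_limit = c + v[w > 0].sum()
left_limit = c + v[w < 0].sum()

far = np.array([-1000.0, 1000.0])
print(net(far, w, b, v, c), (left_limit, right_limit))
```

Whatever the learned values are, the prediction outside the data can only be one of these two constants, never a continuing wave.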

This gives an intuition for how the universal approximation theorem works: you just need enough hidden nodes (sigmoids) to model every wiggle of your function. It also makes it clear why this theorem is completely useless in practice: for the above range we would need 7 hidden nodes (we use 10, because random initialization might not position all of them in the optimal location), but if we attempted to model the entire input range of sine, we would need infinitely many.
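The node count can be made concrete by counting the monotone pieces of the sine curve, on the reasoning that each sigmoid handles one rising or falling side of a bump. The helper below is a hypothetical illustration. The interval `[-10, 10]` is an assumption (the post does not state the training range), although it happens to give the same count of 7.

```python
import numpy as np

def monotone_segments(x_min, x_max, num=100_001):
    """Count the rising/falling pieces of sin(x) on [x_min, x_max] using a dense grid."""
    x = np.linspace(x_min, x_max, num)
    rising = np.cos(x) > 0                        # sign of the derivative of sin
    changes = np.count_nonzero(rising[1:] != rising[:-1])
    return changes + 1

print(monotone_segments(-10.0, 10.0))     # 7
print(monotone_segments(-100.0, 100.0))   # grows with the width of the range
```

The count grows linearly with the width of the interval, which is why covering the whole real line this way would take infinitely many nodes.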

**Takeaway:**

Vanilla neural networks just interpolate between data points; they are completely unable to extrapolate. Do not expect the network to recognize a cartoon dog when you have trained it with only real dog images.

## What about ReLU?

For fun, let’s also try the above network with the popular ReLU activation: \(f(x)=\max(x,0)\). This turned out to be much more challenging: I had to increase the number of hidden nodes to 100 to reliably achieve a good fit (presumably because of dead ReLUs). But in the end it learned well:

Notice how the learned curve is piecewise linear, which comes directly from the piecewise linearity of ReLU. In the end the network uses ReLUs in a similar way to sigmoids to match the curve of the sine function. Individual neuron outputs look like this:

While a sigmoid saturates on both sides, a ReLU has to be counterbalanced by another ReLU to prevent the value from going to infinity. Can you see how the vanishing gradient problem is encoded into the nature of activation functions? If the activation function does not have a nearly constant side with (almost) zero gradient, it becomes really hard to balance the contributions of all hidden nodes.
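Both points, the piecewise linearity and the need to counterbalance, can be seen in a tiny hand-made example. The three-node parameters below are hypothetical. The sketch computes where each ReLU switches on or off and the slope of the network output on each piece.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_net(x, w, b, v, c):
    return relu(np.outer(x, w) + b) @ v + c

def slope(x, w, b, v):
    """Slope of the output at x: the sum of v_i * w_i over the currently active nodes."""
    active = (np.outer(x, w) + b > 0).astype(float)
    return active @ (v * w)

# Hypothetical parameters for a three-node network.
w = np.array([1.0, -2.0, 0.5])
b = np.array([1.0, 4.0, -2.0])
v = np.array([1.0, 0.5, -3.0])
c = 0.2

kinks = np.sort(-b / w)   # node i switches where w_i x + b_i = 0
probe = np.array([-5.0, 0.0, 3.0, 10.0])
print(kinks)
print(slope(probe, w, b, v))
```

Between two neighbouring kinks the slope is constant, and beyond the last kink it stays at whatever the active nodes add up to. Unless those contributions cancel exactly, the output keeps growing or falling linearly forever.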

Of course, using ReLU does not help with extrapolation in any way:

**Takeaway:**

If the shape of the activation function matches the fitted function, training is much easier and the network needs fewer hidden nodes. For example, in MuJoCo tasks people use hidden layers with tanh non-linearity instead of ReLU, presumably because the outputs of those policies are smooth torque values.

## Using the right prior

What can we do to fix the extrapolation? First notice that sine is a periodic function, so we would need to incorporate periodicity somehow in our neural network. An RNN, maybe? It turns out the easiest solution is to use a periodic activation function. In this case I opted for cosine, because it is similar to sine and readily available in libraries.

Some might say this is cheating, because sine is just a cosine with shifted phase. Yes, but the network still has to learn the right frequency and phase, because initialization sets both around zero while the right values are 1 and \(\frac{3}{2}\pi\) (equivalently \(-\frac{\pi}{2}\)), since \(\sin(x) = \cos(x - \frac{\pi}{2})\). It might be interesting to experiment with a periodic ReLU, the saw wave. This can be easily implemented with the mod operation, but it is unclear to me how to propagate the gradient to the modulo, which controls the frequency.
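One possible way to think about the gradient of the saw wave, offered only as a sketch and not as a tested recipe: writing the mod operation as \(z - p \lfloor z/p \rfloor\) makes the floor term locally constant, so away from the jumps the derivative with respect to the input \(z\) is 1 and the derivative with respect to the period \(p\) is \(-\lfloor z/p \rfloor\). The function names below are hypothetical, and whether such gradients actually train well is not something this post has verified.

```python
import numpy as np

def saw(z, p=1.0):
    """Sawtooth wave with period p: z mod p, written as z - p * floor(z / p)."""
    return z - p * np.floor(z / p)

def saw_grads(z, p=1.0):
    """Derivatives away from the jumps, where floor(z / p) is locally constant."""
    dz = np.ones_like(z)       # d saw / d z
    dp = -np.floor(z / p)      # d saw / d p
    return dz, dp

z = np.array([0.3, 1.7, -2.4])
p = 1.0
dz, dp = saw_grads(z, p)

# Finite-difference check of the derivative with respect to the period.
eps = 1e-6
num_dp = (saw(z, p + eps) - saw(z, p - eps)) / (2 * eps)
print(dp, num_dp)
```

If the period is kept fixed and the frequency is controlled by the first-layer weight instead, so that \(z = w x + b\), the same reasoning gives a derivative of \(x\) with respect to \(w\) away from the jumps. Note that these gradients ignore the jumps themselves.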

Anyway, the network with the cosine activation function learns really fast, and I can get pretty reliable results with just 3 nodes. If you are lucky with initialization or just train longer, even one node is enough (as you would guess).
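That one node is enough can be checked directly, without any training: a single cosine node with frequency 1 and phase \(-\frac{\pi}{2}\) is exactly the sine function, on any range. The parameters below are set by hand to those ideal values rather than learned.

```python
import numpy as np

def cos_net(x, w, b, v, c):
    return np.cos(np.outer(x, w) + b) @ v + c

# One hidden node: frequency 1, phase -pi/2, amplitude 1, no shift.
w = np.array([1.0])
b = np.array([-np.pi / 2])
v = np.array([1.0])
c = 0.0

# Evaluate far beyond any plausible training range.
x_far = np.linspace(-100.0, 100.0, 2001)
max_error = np.max(np.abs(cos_net(x_far, w, b, v, c) - np.sin(x_far)))
print(max_error)   # tiny, on the order of floating-point rounding
```

The hard part for the network is therefore not representing the function but finding these four numbers from a random start.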

As a matter of fact, the network performs a kind of Fourier analysis of the fitted function: the first-layer weights \(w_i\) control the frequencies of the cosine waves, the biases \(b_i\) control the phases, the second-layer weights \(v_i\) are the amplitudes, and the second-layer bias \(c\) applies the final shift if needed. In this simple case just one cosine is enough; the other two just attempt to model the noise:

As you would guess, this network handles the extrapolation task effortlessly:

**Takeaway:**

If you encode the right prior into your network, it makes generalization much easier. A prior can take any form, from the activation function to the network architecture. Examples of successful priors are convolution and attention.
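To illustrate why convolution counts as a prior, here is a toy one-dimensional sketch: the same small filter is slid over the whole signal, so a pattern is detected at whatever position it occurs, including positions never seen before. The pattern, the signal length and the helper names are all made up for the example.

```python
import numpy as np

pattern = np.array([1.0, -2.0, 1.0])   # hypothetical feature the filter responds to

def embed(position, length=20):
    """Place the pattern at the given position in an otherwise empty signal."""
    signal = np.zeros(length)
    signal[position:position + pattern.size] = pattern
    return signal

def detect(signal):
    """Slide the filter over the signal and return the position of the strongest response."""
    response = np.correlate(signal, pattern, mode="valid")
    return int(np.argmax(response))

print(detect(embed(3)), detect(embed(12)))   # 3 12
```

The filter weights are shared across positions, so nothing has to be relearned when the pattern moves.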

## Final words

I do not want to diminish the value of neural networks; they have achieved amazing things in image recognition, speech recognition and general game playing. The right priors help to generalize outside of the training set: for example, with a convolutional network a dog is recognized anywhere in the image, instead of only at those positions where it appeared in the training set. Sometimes encoding priors in the network can be replaced with data augmentation: for example, in speech recognition you can add background noise to the audio during training to make the speech recognizer more robust. Adding more layers can increase the expressive power of the network exponentially. But hoping that a stack of fully connected layers is going to solve your problem because neural networks are “universal approximators” is doomed to fail. You need the right priors in your network, and neuroscience might be useful for figuring those out.

*Thanks to Ilya Kuzovkin, Jaan Aru and Roman Ring for insightful comments and discussions.*