Intro
Let's say we have a system with 2 inputs, $x$ and $y$, that outputs a value according to the following mystery formula:
$$m(x,y)=4x+2y+6$$
Let's say we don't know this formula; we can only observe its output for a given input $(x,y)$.
Using trial and error, let's try to figure out what that formula looks like. For the sake of simplicity, let's assume we know the formula has the following form:
$$f(x,y)=ax+by+c$$
How would we go about it?
Using the mystery formula we can compute a valid result; for example, $m(4,3)=28$.
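To make the setup concrete, here is a tiny Python sketch of the black box (the function name `mystery` is my own choice, purely for illustration; in the real scenario we'd only be able to call it, not read it):

```python
def mystery(x, y):
    """The black box: we can ask for outputs, but we pretend not to see inside."""
    return 4 * x + 2 * y + 6

print(mystery(4, 3))  # 28, matching the example above
```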
Since we are using trial and error, we'd start by choosing some random values for $a$, $b$ and $c$, apply our formula $f(x,y)$, and then see how close we are to the function $m$.
The 'how close' is measured by computing the distance between the correct result and ours. In this case this distance is given by:
$$E(a,b,c) = (f(x,y,a,b,c)-m(x,y))^2$$
Notice we don't know the internals of $m$.
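Continuing the sketch above (and reusing `mystery` from it), the guess and its error could look something like this; the random starting values are arbitrary:

```python
import random

def f(x, y, a, b, c):
    """Our candidate formula with the unknown coefficients a, b, c."""
    return a * x + b * y + c

def error(a, b, c, x, y):
    """Squared distance between our guess and the black box's answer."""
    return (f(x, y, a, b, c) - mystery(x, y)) ** 2

# Random starting guess for the coefficients
a, b, c = random.random(), random.random(), random.random()
print(error(a, b, c, 4, 3))  # large at first, since a, b, c are random
```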
The question is, how does $E(a,b,c)$ change when we change $a$, $b$ and $c$?
In maths, this is given by the derivative:
$$dE(a,b,c) = \frac{\partial{E}}{\partial{a}}da + \frac{\partial{E}}{\partial{b}}db + \frac{\partial{E}}{\partial{c}}dc$$
In our case, using the chain rule, the gradient of the error is:
$$\nabla E(a,b,c) = \left(\frac{\partial{E}}{\partial{a}}, \frac{\partial{E}}{\partial{b}}, \frac{\partial{E}}{\partial{c}}\right) = 2\,\bigl(f(x,y,a,b,c)-m(x,y)\bigr)\left(\frac{\partial{f}}{\partial{a}}, \frac{\partial{f}}{\partial{b}}, \frac{\partial{f}}{\partial{c}}\right)$$
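For our particular $f(x,y,a,b,c)=ax+by+c$ the partial derivatives are trivial, so the gradient can be written out explicitly:
$$\frac{\partial{f}}{\partial{a}} = x, \qquad \frac{\partial{f}}{\partial{b}} = y, \qquad \frac{\partial{f}}{\partial{c}} = 1$$
$$\nabla E(a,b,c) = 2\,\bigl(f(x,y,a,b,c)-m(x,y)\bigr)\,(x,\ y,\ 1)$$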
Since the function $E$ is quadratic in $a$, $b$ and $c$, we know it has a minimum. Moreover, the gradient is a vector that points in the direction in which $E$ grows fastest, so moving against it takes us toward the minimum.
So the algorithm is: we start at a random position for $a$, $b$ and $c$, and then we move in small steps (of size $\lambda$) in the direction opposite the gradient:
$$ (a',b',c') = (a,b,c) - \lambda \nabla E(a,b,c)$$
By repeating this over and over we'll end up reaching the minimum. This method is called 'gradient descent'.
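Putting the pieces together, a minimal gradient descent loop might look like the sketch below; the step size, the number of iterations and the range of training points are arbitrary choices on my part, not tuned values:

```python
import random

def mystery(x, y):           # the black box we are trying to imitate
    return 4 * x + 2 * y + 6

def f(x, y, a, b, c):        # our candidate formula
    return a * x + b * y + c

a, b, c = random.random(), random.random(), random.random()
lam = 0.01                   # the step size (lambda)

for step in range(10000):
    # Pick a point and ask the black box for the correct answer
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    diff = f(x, y, a, b, c) - mystery(x, y)
    # Gradient of E = diff**2 with respect to (a, b, c) is 2*diff*(x, y, 1)
    a -= lam * 2 * diff * x
    b -= lam * 2 * diff * y
    c -= lam * 2 * diff * 1

print(a, b, c)  # should end up close to 4, 2 and 6
```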
Oh and BTW, our function $f(x,y,a,b,c)$ can be expressed as:
$$ f(x_1,x_2,w_1,w_2,b) = x_1w_1 + x_2w_2 + b$$
hence:
$$ f(x_i,w_i) = \sum_i{x_iw_i} + b$$
and this is starting to shape up as a neuron. All we'd need to add is the sigmoid function to smoothly squash the output into a value between 0 and 1:
$$ f(x_i,w_i)=\mathrm{sig}\left(\sum_i{w_ix_i} + b\right)$$
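A rough sketch of one such neuron in Python (the function names are just illustrative choices):

```python
import math

def sig(z):
    """The sigmoid: squashes any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def neuron(xs, ws, b):
    """Weighted sum of the inputs plus a bias, squashed by the sigmoid."""
    return sig(sum(x * w for x, w in zip(xs, ws)) + b)

print(neuron([4, 3], [0.5, -0.2], 0.1))  # some value between 0 and 1
```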
In a neural network the output of one neuron is connected to the input of another neuron. Mathematically this is expressed this way:
$$f(g(h(x_i, w_{hi}), w_{gi}), w_{fi})$$
So, once again the error is computed as before:
$$E(a,b,c) = (f(\ldots)-m(x,y))^2$$
and the derivative of the error would be computed as before, but now our $a$ variables are called $w_i$, so for a given $w_{hi}$ we'd have:
$$\frac{\partial{f(g(h(x_i, w_{hi}), w_{gi}), w_{fi})}}{\partial{w_{hi}}}= \frac{\partial{f}}{\partial{g}}\frac{\partial{g}}{\partial{h}}\frac{\partial{h(x_i, w_{hi})}}{\partial{w_{hi}}}$$
That is, thanks to the chain rule, the derivative of the full network can be expressed as a product of the derivatives of the individual neurons.
And $w_{hi}$ would be updated as before, yielding:
$$ w'_{hi} = w_{hi} - \lambda \frac{\partial{E}}{\partial{w_{hi}}}$$
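To make that concrete, here is a very small sketch of the idea: a chain of three single-input sigmoid neurons (biases omitted to keep it short) trained with the chain rule. The input, target value, starting weights and step size are all arbitrary choices, and a real network would of course have many inputs and many neurons per layer:

```python
import math

def sig(z):
    return 1 / (1 + math.exp(-z))

def dsig(z):
    """Derivative of the sigmoid, needed by the chain rule."""
    s = sig(z)
    return s * (1 - s)

# One weight per neuron in the chain f(g(h(x)))
w_h, w_g, w_f = 0.5, -0.3, 0.8
lam = 0.5
x, target = 1.0, 0.25

for step in range(2000):
    # Forward pass: run the chain, remembering each pre-activation
    z_h = w_h * x
    h = sig(z_h)
    z_g = w_g * h
    g = sig(z_g)
    z_f = w_f * g
    f = sig(z_f)

    # Error E = (f - target)**2, and its derivative with respect to f
    dE_df = 2 * (f - target)

    # Backward pass: the chain rule, one factor per neuron
    dE_dwf = dE_df * dsig(z_f) * g
    dE_dwg = dE_df * dsig(z_f) * w_f * dsig(z_g) * h
    dE_dwh = dE_df * dsig(z_f) * w_f * dsig(z_g) * w_g * dsig(z_h) * x

    # Gradient descent step, exactly as before
    w_f -= lam * dE_dwf
    w_g -= lam * dE_dwg
    w_h -= lam * dE_dwh

print(f)  # should end up close to the target
```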
Congratulations, you are one step closer to understanding the full monty.
Contact/Questions:
<my_github_account_username>$@gmail.com$.