Intro
Let's say we have a system with 2 inputs, $x$ and $y$, that outputs a value according to the following mystery formula:
$$m(x,y)=4x+2y+6$$
Let's say we don't know this formula; we can only observe its output for a given input $(x,y)$.
Using trial and error, let's try to figure out what that formula looks like. For the sake of simplicity, let's assume we know the formula has the following form:
$$f(x,y)=ax+by+c$$
How would we go about it?
Using the mystery formula we can compute a valid result; for example, $m(4,3)=28$.
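To make the setup concrete, here is a tiny Python sketch of the black box (the function name `mystery` is my own choice, purely for illustration; in the real scenario we'd only be able to call it, not read it):

```python
def mystery(x, y):
    """The black box: we can ask for outputs, but we pretend not to see inside."""
    return 4 * x + 2 * y + 6

print(mystery(4, 3))  # 28, matching the example above
```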
Since we are using trial and error, we'd start by choosing some random values for $a$, $b$ and $c$, apply our formula $f(x,y)$, and then see how close we are to the function $m$.
The 'how close' is measured by computing the distance between the correct result and ours. In this case this distance is given by:
$$E(a,b,c) = (f(x,y,a,b,c)-m(x,y))^2$$
Notice we don't know the internals of $m$.
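Continuing the sketch above (and reusing `mystery` from it), the guess and its error could look something like this; the random starting values are arbitrary:

```python
import random

def f(x, y, a, b, c):
    """Our candidate formula with the unknown coefficients a, b, c."""
    return a * x + b * y + c

def error(a, b, c, x, y):
    """Squared distance between our guess and the black box's answer."""
    return (f(x, y, a, b, c) - mystery(x, y)) ** 2

# Random starting guess for the coefficients
a, b, c = random.random(), random.random(), random.random()
print(error(a, b, c, 4, 3))  # large at first, since a, b, c are random
```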
The question is, how does $E(a,b,c)$ change when we change $a$, $b$ and $c$?
In maths, this is given by the derivative:
$$dE(a,b,c) = \frac{\partial{E}}{\partial{a}}da + \frac{\partial{E}}{\partial{b}}db + \frac{\partial{E}}{\partial{c}}dc$$
In our case, using the chain rule, the gradient of the error is:
$$\nabla E(a,b,c) = \left(\frac{\partial{E}}{\partial{a}}, \frac{\partial{E}}{\partial{b}}, \frac{\partial{E}}{\partial{c}}\right) = 2\,\bigl(f(x,y,a,b,c)-m(x,y)\bigr)\left(\frac{\partial{f}}{\partial{a}}, \frac{\partial{f}}{\partial{b}}, \frac{\partial{f}}{\partial{c}}\right)$$
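For our particular $f(x,y,a,b,c)=ax+by+c$ the partial derivatives are trivial, so the gradient can be written out explicitly:
$$\frac{\partial{f}}{\partial{a}} = x, \qquad \frac{\partial{f}}{\partial{b}} = y, \qquad \frac{\partial{f}}{\partial{c}} = 1$$
$$\nabla E(a,b,c) = 2\,\bigl(f(x,y,a,b,c)-m(x,y)\bigr)\,(x,\ y,\ 1)$$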
Since the function $E$ is quadratic in $a$, $b$ and $c$, we know it has a minimum. Moreover, the gradient is a vector that points in the direction in which $E$ grows fastest, so moving against it takes us toward the minimum.
So the algorithm is: we start at a random position for $a$, $b$ and $c$, and then we move in small steps (of size $\lambda$) in the direction opposite the gradient:
$$ (a',b',c') = (a,b,c) - \lambda \nabla E(a,b,c)$$
By repeating this over and over we'll end up reaching the minimum. This method is called 'gradient descent'.
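Putting the pieces together, a minimal gradient descent loop might look like the sketch below; the step size, the number of iterations and the range of training points are arbitrary choices on my part, not tuned values:

```python
import random

def mystery(x, y):           # the black box we are trying to imitate
    return 4 * x + 2 * y + 6

def f(x, y, a, b, c):        # our candidate formula
    return a * x + b * y + c

a, b, c = random.random(), random.random(), random.random()
lam = 0.01                   # the step size (lambda)

for step in range(10000):
    # Pick a point and ask the black box for the correct answer
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    diff = f(x, y, a, b, c) - mystery(x, y)
    # Gradient of E = diff**2 with respect to (a, b, c) is 2*diff*(x, y, 1)
    a -= lam * 2 * diff * x
    b -= lam * 2 * diff * y
    c -= lam * 2 * diff * 1

print(a, b, c)  # should end up close to 4, 2 and 6
```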
Oh and BTW, our function $f(x,y,a,b,c)$ can be expressed as:
$$ f(x_1,x_2,w_1,w_2,b) = x_1w_1 + x_2w_2 + b$$
hence:
$$ f(x_i,w_i) = \sum_i{x_iw_i} + b$$
and this is starting to shape up as a neuron. All we'd need to add is the sigmoid function to smoothly squash the output into a value between 0 and 1:
$$ f(x_i,w_i)=\mathrm{sig}\left(\sum_i{w_ix_i} + b\right)$$
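A rough sketch of one such neuron in Python (the function names are just illustrative choices):

```python
import math

def sig(z):
    """The sigmoid: squashes any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def neuron(xs, ws, b):
    """Weighted sum of the inputs plus a bias, squashed by the sigmoid."""
    return sig(sum(x * w for x, w in zip(xs, ws)) + b)

print(neuron([4, 3], [0.5, -0.2], 0.1))  # some value between 0 and 1
```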
In a neural network the output of one neuron is connected to the input of another neuron. Mathematically this is expressed this way:
$$f(g(h(x_i, w_{hi}), w_{gi}), w_{fi})$$
So, once again the error is computed as before:
$$E(a,b,c) = (f(\ldots)-m(x,y))^2$$
and the derivative of the error would be computed as before, but now our $a$ variables are called $w_i$, so for a given $w_{hi}$ we'd have:
$$\frac{\partial{f(g(h(x_i, w_{hi}), w_{gi}), w_{fi})}}{\partial{w_{hi}}}= \frac{\partial{f}}{\partial{g}}\frac{\partial{g}}{\partial{h}}\frac{\partial{h(x_i, w_{hi})}}{\partial{w_{hi}}}$$
That is, thanks to the chain rule, the derivative of the full network can be expressed as a product of the derivatives of the individual neurons.
And $w_{hi}$ would be updated as before, yielding:
$$ w'_{hi} = w_{hi} - \lambda \frac{\partial{E}}{\partial{w_{hi}}}$$
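To make that concrete, here is a very small sketch of the idea: a chain of three single-input sigmoid neurons (biases omitted to keep it short) trained with the chain rule. The input, target value, starting weights and step size are all arbitrary choices, and a real network would of course have many inputs and many neurons per layer:

```python
import math

def sig(z):
    return 1 / (1 + math.exp(-z))

def dsig(z):
    """Derivative of the sigmoid, needed by the chain rule."""
    s = sig(z)
    return s * (1 - s)

# One weight per neuron in the chain f(g(h(x)))
w_h, w_g, w_f = 0.5, -0.3, 0.8
lam = 0.5
x, target = 1.0, 0.25

for step in range(2000):
    # Forward pass: run the chain, remembering each pre-activation
    z_h = w_h * x
    h = sig(z_h)
    z_g = w_g * h
    g = sig(z_g)
    z_f = w_f * g
    f = sig(z_f)

    # Error E = (f - target)**2, and its derivative with respect to f
    dE_df = 2 * (f - target)

    # Backward pass: the chain rule, one factor per neuron
    dE_dwf = dE_df * dsig(z_f) * g
    dE_dwg = dE_df * dsig(z_f) * w_f * dsig(z_g) * h
    dE_dwh = dE_df * dsig(z_f) * w_f * dsig(z_g) * w_g * dsig(z_h) * x

    # Gradient descent step, exactly as before
    w_f -= lam * dE_dwf
    w_g -= lam * dE_dwg
    w_h -= lam * dE_dwh

print(f)  # should end up close to the target
```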
Congratulations, you are one step closer to understanding the full monty.
Contact/Questions:
<my_github_account_username>$@gmail.com$.