Intro
In the previous sample only the input layer uses a bias; it was handled by conveniently appending a 1 to the input vector.
Now we'll see what we need to do to add a bias term to every layer.
For the impatient:
- All you need to do is append a 1 to the vector output by each layer, and there you go.
- The weight matrices won't be 2x2 but 2x3 (one extra column for the bias, given the column-vector convention used below).
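As a minimal sketch of that idea (all names and values here are illustrative, not from the sample code): with the column-vector convention a 2x3 matrix carries the bias as its last column, and each layer's output gets a 1 appended before the next multiplication.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2x3 weight matrices: the last column plays the role of the bias.
W1 = np.zeros((2, 3))  # hidden layer weights + bias column
W2 = np.zeros((2, 3))  # output layer weights + bias column

x = np.array([0.0, 1.0])
h = sigmoid(W1 @ np.append(x, 1.0))   # append a 1 to the input
y = sigmoid(W2 @ np.append(h, 1.0))   # append a 1 to the hidden output
print(y.shape)  # (2,)
```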
Now the details:
This sample uses a 2x4x1 NN to learn the XOR operation.
This tutorial shows the following:
- How to compute the derivatives of the Error function of a 2x2 NN using the chain rule
- How to express the derivatives using matrices, which helps in many ways:
- Generalizing the 2x2 configuration into other configurations
- Vectorization, by doing many operations at once we can speed up computations.
- Batching, by applying the forward pass to the whole set, averaging all the gradients, and using them to update the weights
Todo:
- batched/mini-batch training
- dropout
- Add momentum to learning
Computing $\frac{\partial{Error}}{\partial{weights}}$
Let's consider this simple configuration; later on we'll generalize it to any configuration:
F H
i1 o-------a--o-------e--o
\ / \ /
\ / \ /
\ b \ f
\ / \ /
\/ \/
/\ /\
/ \ / \
/ c / g
/ \ / \
/ \ / \
i2 o-------d--o-------h--o
G I
$$\require{cancel}$$
This is a forward pass:
$$\begin{align*}
F &= \sigma(net_F) &= \sigma(a\cdot i_1 + b\cdot i_2 + b_F) \\
G &= \sigma(net_G) &= \sigma(c\cdot i_1 + d\cdot i_2 + b_G) \\
H &= \sigma(net_H) &= \sigma(e\cdot F+f\cdot G + b_H) \\
I &= \sigma(net_I) &= \sigma(g\cdot F+h\cdot G + b_I)
\end{align*}$$
We are starting to see some matrices here...
$$\begin{pmatrix}net_F\\
net_G\\
\end{pmatrix} = \begin{pmatrix} a & b \\ c & d
\end{pmatrix} \cdot \begin{pmatrix}i_1\\
i_2\\ \end{pmatrix} + \begin{pmatrix}b_F\\
b_G\\
\end{pmatrix}$$
$$\begin{pmatrix}net_H\\
net_I\\
\end{pmatrix} = \begin{pmatrix} e & f \\ g & h
\end{pmatrix} \cdot \begin{pmatrix}F\\
G\\ \end{pmatrix} + \begin{pmatrix}b_H\\
b_I\\
\end{pmatrix}$$
And the full forward operation would be given by:
$$\begin{pmatrix}H\\
I\\
\end{pmatrix} = \sigma \Bigg( \begin{pmatrix} e & f \\ g & h
\end{pmatrix} \cdot \sigma \Bigg( \begin{pmatrix} a & b \\ c & d
\end{pmatrix} \cdot \begin{pmatrix}i_1\\
i_2\\
\end{pmatrix} + \begin{pmatrix}b_F\\
b_G\\
\end{pmatrix} \Bigg) + \begin{pmatrix}b_H\\
b_I\\
\end{pmatrix} \Bigg)
$$
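This full forward pass transcribes directly into NumPy. A sketch with placeholder values for the weights $a \dots h$ and the biases (any values would do):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder values for the weights a..h and the biases.
a, b, c, d = 0.1, 0.2, 0.3, 0.4
e, f, g, h = 0.5, 0.6, 0.7, 0.8
b_F, b_G, b_H, b_I = 0.05, 0.05, 0.05, 0.05

i = np.array([1.0, 0.0])              # inputs i1, i2
W1 = np.array([[a, b], [c, d]])
W2 = np.array([[e, f], [g, h]])
b1 = np.array([b_F, b_G])
b2 = np.array([b_H, b_I])

net_hidden = W1 @ i + b1              # (net_F, net_G)
F, G = sigmoid(net_hidden)
net_out = W2 @ np.array([F, G]) + b2  # (net_H, net_I)
H, I = sigmoid(net_out)
```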
Now we want to see how $H$ and $I$ change when the weights $a \dots h$ change. This is given by the derivative.
For computing derivatives we'll be using the chain rule. We'll need some formulas:
$$ \frac{d\,\sigma(f(x))}{dx} = \sigma'(f(x)) \cdot f'(x)$$
$$\frac{\partial{H}}{\partial{e}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{e}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{e}} = \sigma'(net_H) F$$
$$\frac{\partial{H}}{\partial{f}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{f}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{f}} = \sigma'(net_H) G$$
$$\frac{\partial{H}}{\partial{b_H}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_H}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{b_H}} = \sigma'(net_H) \cdot 1$$
$$\frac{\partial{I}}{\partial{g}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{g}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{g}} = \sigma'(net_I) F$$
$$\frac{\partial{I}}{\partial{h}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{h}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{h}} = \sigma'(net_I) G$$
$$\frac{\partial{I}}{\partial{b_I}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_I}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{b_I}} = \sigma'(net_I) \cdot 1$$
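Each of these scalar partials can be spot-checked numerically against a finite difference; for instance $\frac{\partial H}{\partial e} = \sigma'(net_H)\,F$ (all values illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

F, G = 0.6, 0.4          # pretend hidden activations
e, f, b_H = 0.5, 0.6, 0.05

def H_of(e_val):
    return sigmoid(e_val * F + f * G + b_H)

net_H = e * F + f * G + b_H
analytic = dsigmoid(net_H) * F            # sigma'(net_H) * F

eps = 1e-6
numeric = (H_of(e + eps) - H_of(e)) / eps  # finite-difference estimate
```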
Starting from the output we immediately see a Jacobian (with $\sigma'$ applied elementwise):
$$\begin{pmatrix}\frac{\partial{H}}{\partial{e}} & \frac{\partial{H}}{\partial{f}} & \frac{\partial{H}}{\partial{b_H}}\\
\frac{\partial{I}}{\partial{g}} & \frac{\partial{I}}{\partial{h}} & \frac{\partial{I}}{\partial{b_I}}\\
\end{pmatrix} = \begin{pmatrix} \sigma'(net_H) F & \sigma'(net_H) G & \sigma'(net_H)1\\ \sigma'(net_I)F & \sigma'(net_I)G & \sigma'(net_I)1\end{pmatrix} = \sigma'(\begin{pmatrix}net_H\\
net_I\\
\end{pmatrix}) \cdot \begin{pmatrix} F & G & 1\end{pmatrix}$$
Let's see how the outputs change with the weights $a \dots d$ and the biases $b_F$, $b_G$:
For H:
$$\frac{\partial{H}}{\partial{a}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{a}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G)}}{\partial{a}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_1$$
$$\frac{\partial{H}}{\partial{b}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G)}}{\partial{b}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_2$$
$$\frac{\partial{H}}{\partial{b_F}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_F}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G)}}{\partial{b_F}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot 1$$
$$\frac{\partial{H}}{\partial{c}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{c}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G))}}{\partial{c}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_1$$
$$\frac{\partial{H}}{\partial{d}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{d}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G))}}{\partial{d}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_2$$
$$\frac{\partial{H}}{\partial{b_G}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_G}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G))}}{\partial{b_G}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot 1$$
For I:
$$\frac{\partial{I}}{\partial{a}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{a}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G)}}{\partial{a}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_1$$
$$\frac{\partial{I}}{\partial{b}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G)}}{\partial{b}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_2$$
$$\frac{\partial{I}}{\partial{b_F}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_F}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G)}}{\partial{b_F}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot 1$$
$$\frac{\partial{I}}{\partial{c}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{c}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G))}}{\partial{c}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_1$$
$$\frac{\partial{I}}{\partial{d}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{d}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G))}}{\partial{d}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_2$$
$$\frac{\partial{I}}{\partial{b_G}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_G}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G))}}{\partial{b_G}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot 1$$
Now we have all the ingredients to train our network.
First we do a forward pass, take the output, and compute how far we are from the targets $t_1$ and $t_2$:
$$Error = (H-t_1)^2 + (I-t_2)^2$$
Its differential is:
$$d(Error) = 2(H-t_1)dH + 2(I-t_2)dI= 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
dH\\
dI
\end{pmatrix}$$
Its partial derivatives are:
$$\frac{\partial{Error}}{\partial{e}} = 2(H-t_1)\frac{\partial{H}}{\partial{e}} + 2(I-t_2)\frac{\partial{I}}{\partial{e}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{e}}\\
\cancelto{0}{\frac{\partial{I}}{\partial{e}}}
\end{pmatrix} = 2(H-t_1) \frac{\partial{H}}{\partial{e}} $$
$$\frac{\partial{Error}}{\partial{e}} = 2(H-t_1) \frac{\partial{H}}{\partial{e}} = 2(H-t_1) \sigma'(net_H)F $$
$$\frac{\partial{Error}}{\partial{f}} = 2(H-t_1) \frac{\partial{H}}{\partial{f}} = 2(H-t_1) \sigma'(net_H)G $$
$$\frac{\partial{Error}}{\partial{b_H}} = 2(H-t_1) \frac{\partial{H}}{\partial{b_H}} = 2(H-t_1) \sigma'(net_H)1 $$
$$\frac{\partial{Error}}{\partial{g}} = 2(I-t_2) \frac{\partial{I}}{\partial{g}} = 2(I-t_2) \sigma'(net_I)F $$
$$\frac{\partial{Error}}{\partial{h}} = 2(I-t_2) \frac{\partial{I}}{\partial{h}} = 2(I-t_2) \sigma'(net_I)G $$
$$\frac{\partial{Error}}{\partial{b_I}} = 2(I-t_2) \frac{\partial{I}}{\partial{b_I}} = 2(I-t_2) \sigma'(net_I)1 $$
rearranging:
$$ \begin{pmatrix}\frac{\partial{E}}{\partial{e}} & \frac{\partial{E}}{\partial{f}} & \frac{\partial{E}}{\partial{b_H}}\\
\frac{\partial{E}}{\partial{g}} & \frac{\partial{E}}{\partial{h}} & \frac{\partial{E}}{\partial{b_I}}\\
\end{pmatrix} = 2 \cdot
\begin{pmatrix}
\sigma'(net_H) \cdot (H-t_1) \\
\sigma'(net_I) \cdot (I-t_2) \\
\end{pmatrix}
\cdot
\begin{pmatrix} F & G & 1\end{pmatrix}
= \begin{pmatrix}
\sigma'(net_H) & 0 \\
0 & \sigma'(net_I)\\
\end{pmatrix}
\cdot
2
\cdot
\begin{pmatrix}
(H-t_1) \\
(I-t_2) \\
\end{pmatrix}
\cdot
\begin{pmatrix} F & G & 1\end{pmatrix}
$$
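This rearrangement is just an outer product between the error-times-$\sigma'$ column and the row $(F \;\; G \;\; 1)$. A sketch with illustrative values, plus a finite-difference spot-check on $\frac{\partial E}{\partial e}$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

F, G = 0.6, 0.4              # pretend hidden activations
t = np.array([1.0, 0.0])     # targets t1, t2
W2 = np.array([[0.5, 0.6], [0.7, 0.8]])   # weights e, f, g, h
b2 = np.array([0.05, 0.05])               # biases b_H, b_I

def error(W2, b2):
    out = sigmoid(W2 @ np.array([F, G]) + b2)
    return np.sum((out - t) ** 2)

net = W2 @ np.array([F, G]) + b2
out = sigmoid(net)

delta = 2.0 * dsigmoid(net) * (out - t)        # the column vector in the text
grad = np.outer(delta, np.array([F, G, 1.0]))  # 2x3: rows (e f b_H), (g h b_I)

# Finite-difference check of dE/de (entry [0, 0]):
eps = 1e-6
W2p = W2.copy(); W2p[0, 0] += eps
numeric = (error(W2p, b2) - error(W2, b2)) / eps
```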
and in a similar way:
$$\frac{\partial{Error}}{\partial{a}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{a}}\\
\frac{\partial{I}}{\partial{a}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_1\\
\sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_1
\end{pmatrix} $$
$$\frac{\partial{Error}}{\partial{b}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{b}}\\
\frac{\partial{I}}{\partial{b}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_2\\
\sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_2
\end{pmatrix} $$
$$\frac{\partial{Error}}{\partial{b_F}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{b_F}}\\
\frac{\partial{I}}{\partial{b_F}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot 1\\
\sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot 1
\end{pmatrix} $$
$$\frac{\partial{Error}}{\partial{c}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{c}}\\
\frac{\partial{I}}{\partial{c}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_1\\
\sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_1
\end{pmatrix} $$
$$\frac{\partial{Error}}{\partial{d}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{d}}\\
\frac{\partial{I}}{\partial{d}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_2\\
\sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_2
\end{pmatrix} $$
$$\frac{\partial{Error}}{\partial{b_G}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\frac{\partial{H}}{\partial{b_G}}\\
\frac{\partial{I}}{\partial{b_G}}
\end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot 1\\
\sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot 1
\end{pmatrix} $$
simplifying a bit
$$\frac{\partial{Error}}{\partial{a}} =
2\cdot \sigma'(net_F) \cdot i_1 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot e \\
\sigma'(net_I) \cdot g
\end{pmatrix} = 2\cdot \sigma'(net_F) \cdot i_1 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot e) + ((I-t_2) \cdot \sigma'(net_I) \cdot g))
$$
$$\frac{\partial{Error}}{\partial{b}} =
2\cdot \sigma'(net_F) \cdot i_2 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot e\\
\sigma'(net_I) \cdot g
\end{pmatrix} = 2\cdot \sigma'(net_F) \cdot i_2 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot e) + ((I-t_2) \cdot \sigma'(net_I) \cdot g))$$
$$\frac{\partial{Error}}{\partial{c}} =
2\cdot \sigma'(net_G) \cdot i_1 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot f \\
\sigma'(net_I) \cdot h
\end{pmatrix} = 2\cdot \sigma'(net_G) \cdot i_1 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot f) + ((I-t_2) \cdot \sigma'(net_I) \cdot h))$$
$$\frac{\partial{Error}}{\partial{d}} =
2\cdot \sigma'(net_G) \cdot i_2 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix}
\sigma'(net_H) \cdot f \\
\sigma'(net_I) \cdot h
\end{pmatrix} = 2\cdot \sigma'(net_G) \cdot i_2 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot f) + ((I-t_2) \cdot \sigma'(net_I) \cdot h))$$
hence
$$ \begin{pmatrix}\frac{\partial{E}}{\partial{a}} & \frac{\partial{E}}{\partial{b}} & \frac{\partial{E}}{\partial{b_F}}\\
\frac{\partial{E}}{\partial{c}} & \frac{\partial{E}}{\partial{d}} & \frac{\partial{E}}{\partial{b_G}}\\
\end{pmatrix} =
\begin{pmatrix}
\sigma'(net_F) & 0 \\
0 & \sigma'(net_G)\\
\end{pmatrix}
\cdot
\begin{pmatrix}
e & g \\
f & h\\
\end{pmatrix}
\cdot
\Bigg[
2 \cdot
\begin{pmatrix}
\sigma'(net_H) \cdot (H-t_1) \\
\sigma'(net_I) \cdot (I-t_2) \\
\end{pmatrix}
\Bigg]
\cdot
\begin{pmatrix}
i_1 & i_2 & 1\\
\end{pmatrix}
$$
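The hidden-layer gradient follows the same outer-product pattern, reusing the bracketed output term (note that $\begin{pmatrix} e & g \\ f & h \end{pmatrix}$ is the transpose of the second weight matrix). A sketch with illustrative values and a numeric spot-check on $\frac{\partial E}{\partial a}$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

i = np.array([1.0, 0.0])   # inputs i1, i2
t = np.array([1.0, 0.0])   # targets t1, t2
W1 = np.array([[0.1, 0.2], [0.3, 0.4]]); b1 = np.array([0.05, 0.05])
W2 = np.array([[0.5, 0.6], [0.7, 0.8]]); b2 = np.array([0.05, 0.05])

def error(W1, b1):
    hid = sigmoid(W1 @ i + b1)
    out = sigmoid(W2 @ hid + b2)
    return np.sum((out - t) ** 2)

net1 = W1 @ i + b1
hid = sigmoid(net1)
net2 = W2 @ hid + b2
out = sigmoid(net2)

delta2 = 2.0 * dsigmoid(net2) * (out - t)   # the bracketed term, reused
delta1 = dsigmoid(net1) * (W2.T @ delta2)   # diag(sigma') . W2^T . [...]
grad1 = np.outer(delta1, np.append(i, 1.0)) # 2x3: last column is b_F, b_G

# Finite-difference check of dE/da (entry [0, 0]):
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (error(W1p, b1) - error(W1, b1)) / eps
```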
Notice how the expression within square brackets was already computed earlier.
Let's now generalize the above structure to any network.
Contact/Questions:
<my_github_account_username>@gmail.com