Neural network backprop using simple gradient descent, with matrix formulation + bias for each layer

Intro

In the previous sample only the input layer uses a bias. The bias was handled by conveniently appending a 1 to the input vector.

Now we'll see what we need to do to add a bias term to every layer.

For the impatient: this sample uses a 2x4x1 NN to learn the XOR operation. Now the details.

Computing $\frac{\partial{Error}}{\partial{weights}}$

Let's consider this simple configuration; later on we'll generalize it to any configuration:
              F          H
i1 o-------a--o-------e--o
    \        / \        /
     \      /   \      /
      \    b     \    f
       \  /       \  /          
        \/         \/  
        /\         /\                                   
       /  \       /  \ 
      /    c     /    g                               
     /      \   /      \                                   
    /        \ /        \                                       
i2 o-------d--o-------h--o
              G          I
$$\require{cancel}$$ This is a forward pass:

$$\begin{align*} F &= \sigma(net_F) &= \sigma(a\cdot i_1 + b\cdot i_2 + b_F) \\ G &= \sigma(net_G) &= \sigma(c\cdot i_1 + d\cdot i_2 + b_G) \\ H &= \sigma(net_H) &= \sigma(e\cdot F+f\cdot G + b_H) \\ I &= \sigma(net_I) &= \sigma(g\cdot F+h\cdot G + b_I) \end{align*}$$

We are starting to see some matrices here...

$$\begin{pmatrix}net_F\\ net_G\\ \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot \begin{pmatrix}i_1\\ i_2\\ \end{pmatrix} + \begin{pmatrix}b_F\\ b_G\\ \end{pmatrix}$$

$$\begin{pmatrix}net_H\\ net_I\\ \end{pmatrix} = \begin{pmatrix} e & f \\ g & h \end{pmatrix} \cdot \begin{pmatrix}F\\ G\\ \end{pmatrix} + \begin{pmatrix}b_H\\ b_I\\ \end{pmatrix}$$

And the full forward operation would be given by:

$$\begin{pmatrix}H\\ I\\ \end{pmatrix} = \sigma \Bigg( \begin{pmatrix} e & f \\ g & h \end{pmatrix} \cdot \sigma \Bigg( \begin{pmatrix} a & b \\ c & d \end{pmatrix} \cdot \begin{pmatrix}i_1\\ i_2\\ \end{pmatrix} + \begin{pmatrix}b_F\\ b_G\\ \end{pmatrix} \Bigg) + \begin{pmatrix}b_H\\ b_I\\ \end{pmatrix} \Bigg)$$

Now we want to see how $H$ and $I$ change when the weights $a \ldots h$ change. This is given by the derivatives.
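The matrix form of the forward pass maps directly onto two matrix-vector products. A minimal sketch in NumPy, assuming the usual sigmoid activation; the weight and input values below are arbitrary placeholders, not from a trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer 1: rows (a, b) and (c, d); biases (b_F, b_G). Placeholder values.
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.array([0.5, 0.5])

# Layer 2: rows (e, f) and (g, h); biases (b_H, b_I). Placeholder values.
W2 = np.array([[0.6, 0.7],
               [0.8, 0.9]])
b2 = np.array([0.5, 0.5])

i = np.array([1.0, 0.0])          # inputs (i1, i2)

net_FG = W1 @ i + b1              # (net_F, net_G)
FG = sigmoid(net_FG)              # (F, G)
net_HI = W2 @ FG + b2             # (net_H, net_I)
HI = sigmoid(net_HI)              # (H, I) -- the network output
```

Each layer is the same operation, `sigmoid(W @ x + b)`, which is what makes the generalization at the end of this tutorial straightforward.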

For computing derivatives we'll be using the chain rule. We'll need some formulas:

$$\frac{d\sigma(f(x))}{dx} = \sigma'(f(x)) \cdot f'(x)$$

$$\frac{\partial{H}}{\partial{e}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{e}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{e}} = \sigma'(net_H) F$$

$$\frac{\partial{H}}{\partial{f}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{f}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{f}} = \sigma'(net_H) G$$

$$\frac{\partial{H}}{\partial{b_H}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_H}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot G + b_H)}}{\partial{b_H}} = \sigma'(net_H) \cdot 1$$

$$\frac{\partial{I}}{\partial{g}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{g}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{g}} = \sigma'(net_I) F$$

$$\frac{\partial{I}}{\partial{h}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{h}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{h}} = \sigma'(net_I) G$$

$$\frac{\partial{I}}{\partial{b_I}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_I}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot G + b_I)}}{\partial{b_I}} = \sigma'(net_I) \cdot 1$$

Starting from the output we immediately see a Jacobian-like structure:

$$\begin{pmatrix}\frac{\partial{H}}{\partial{e}} & \frac{\partial{H}}{\partial{f}} & \frac{\partial{H}}{\partial{b_H}}\\ \frac{\partial{I}}{\partial{g}} & \frac{\partial{I}}{\partial{h}} & \frac{\partial{I}}{\partial{b_I}}\\ \end{pmatrix} = \begin{pmatrix} \sigma'(net_H) F & \sigma'(net_H) G & \sigma'(net_H)\cdot 1\\ \sigma'(net_I)F & \sigma'(net_I)G & \sigma'(net_I)\cdot 1\end{pmatrix} = \sigma'\Bigg(\begin{pmatrix}net_H\\ net_I\\ \end{pmatrix}\Bigg) \cdot \begin{pmatrix} F & G & 1\end{pmatrix}$$

Let's see how the outputs change with the weights $a \ldots d$:
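As a quick aside before computing those: the chain-rule formulas above are easy to sanity-check numerically. Here's a sketch verifying $\frac{\partial H}{\partial e} = \sigma'(net_H) \cdot F$ against a central finite difference; it assumes the standard sigmoid, whose derivative is $\sigma(x)(1-\sigma(x))$, and all numeric values are arbitrary placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma'(x) = sigma(x) * (1 - sigma(x))

F, G = 0.6, 0.3           # hidden activations (placeholders)
f, b_H = -0.2, 0.1        # fixed weight and bias (placeholders)
e = 0.7                   # the weight we differentiate with respect to

def H(e):
    return sigmoid(e * F + f * G + b_H)

analytic = dsigmoid(e * F + f * G + b_H) * F   # sigma'(net_H) * F

eps = 1e-6
numeric = (H(e + eps) - H(e - eps)) / (2.0 * eps)
assert abs(analytic - numeric) < 1e-8          # both estimates agree
```

The same check works for any of the partials in this tutorial; it's a cheap way to catch sign and index mistakes in hand-derived gradients.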

For $H$:

$$\frac{\partial{H}}{\partial{a}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{a}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G + b_H)}}{\partial{a}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_1$$

$$\frac{\partial{H}}{\partial{b}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G + b_H)}}{\partial{b}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_2$$

$$\frac{\partial{H}}{\partial{b_F}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_F}} =\sigma'(net_H) \frac{\partial{(e\cdot \sigma(net_F)+f\cdot G + b_H)}}{\partial{b_F}} = \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot 1$$

$$\frac{\partial{H}}{\partial{c}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{c}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G) + b_H)}}{\partial{c}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_1$$

$$\frac{\partial{H}}{\partial{d}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{d}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G) + b_H)}}{\partial{d}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_2$$

$$\frac{\partial{H}}{\partial{b_G}} = \sigma'(net_H) \frac{\partial{(net_H)}}{\partial{b_G}} =\sigma'(net_H) \frac{\partial{(e\cdot F+f\cdot \sigma(net_G) + b_H)}}{\partial{b_G}} = \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot 1$$

For $I$:

$$\frac{\partial{I}}{\partial{a}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{a}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G + b_I)}}{\partial{a}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_1$$

$$\frac{\partial{I}}{\partial{b}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G + b_I)}}{\partial{b}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_2$$

$$\frac{\partial{I}}{\partial{b_F}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_F}} =\sigma'(net_I) \frac{\partial{(g\cdot \sigma(net_F)+h\cdot G + b_I)}}{\partial{b_F}} = \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot 1$$

$$\frac{\partial{I}}{\partial{c}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{c}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G) + b_I)}}{\partial{c}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_1$$

$$\frac{\partial{I}}{\partial{d}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{d}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G) + b_I)}}{\partial{d}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_2$$

$$\frac{\partial{I}}{\partial{b_G}} = \sigma'(net_I) \frac{\partial{(net_I)}}{\partial{b_G}} =\sigma'(net_I) \frac{\partial{(g\cdot F+h\cdot \sigma(net_G) + b_I)}}{\partial{b_G}} = \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot 1$$

Now we have all the ingredients to train our network. First we do a forward pass; then we take the output and compute how far we are from the target results $t_1$ and $t_2$:

$$Error = (H-t_1)^2 + (I-t_2)^2$$

Its differential is:

$$d(Error) = 2(H-t_1)dH + 2(I-t_2)dI = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} dH\\ dI \end{pmatrix}$$

Its partial derivatives are (note that $I$ does not depend on $e$, $f$, $b_H$, nor $H$ on $g$, $h$, $b_I$):

$$\frac{\partial{Error}}{\partial{e}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{e}}\\ \cancelto{0}{\frac{\partial{I}}{\partial{e}}} \end{pmatrix} = 2(H-t_1) \frac{\partial{H}}{\partial{e}}$$

$$\frac{\partial{Error}}{\partial{e}} = 2(H-t_1) \frac{\partial{H}}{\partial{e}} = 2(H-t_1) \sigma'(net_H)F$$

$$\frac{\partial{Error}}{\partial{f}} = 2(H-t_1) \frac{\partial{H}}{\partial{f}} = 2(H-t_1) \sigma'(net_H)G$$

$$\frac{\partial{Error}}{\partial{b_H}} = 2(H-t_1) \frac{\partial{H}}{\partial{b_H}} = 2(H-t_1) \sigma'(net_H)\cdot 1$$

$$\frac{\partial{Error}}{\partial{g}} = 2(I-t_2) \frac{\partial{I}}{\partial{g}} = 2(I-t_2) \sigma'(net_I)F$$

$$\frac{\partial{Error}}{\partial{h}} = 2(I-t_2) \frac{\partial{I}}{\partial{h}} = 2(I-t_2) \sigma'(net_I)G$$

$$\frac{\partial{Error}}{\partial{b_I}} = 2(I-t_2) \frac{\partial{I}}{\partial{b_I}} = 2(I-t_2) \sigma'(net_I)\cdot 1$$

Rearranging:

$$ \begin{pmatrix}\frac{\partial{E}}{\partial{e}} & \frac{\partial{E}}{\partial{f}} & \frac{\partial{E}}{\partial{b_H}}\\ \frac{\partial{E}}{\partial{g}} & \frac{\partial{E}}{\partial{h}} & \frac{\partial{E}}{\partial{b_I}}\\ \end{pmatrix} = 2 \cdot \begin{pmatrix} \sigma'(net_H) \cdot (H-t_1) \\ \sigma'(net_I) \cdot (I-t_2) \\ \end{pmatrix} \cdot \begin{pmatrix} F & G & 1\end{pmatrix} = \begin{pmatrix} \sigma'(net_H) & 0 \\ 0 & \sigma'(net_I)\\ \end{pmatrix} \cdot 2 \cdot \begin{pmatrix} (H-t_1) \\ (I-t_2) \\ \end{pmatrix} \cdot \begin{pmatrix} F & G & 1\end{pmatrix} $$

And in a similar way:

$$\frac{\partial{Error}}{\partial{a}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{a}}\\ \frac{\partial{I}}{\partial{a}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_1\\ \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_1 \end{pmatrix} $$

$$\frac{\partial{Error}}{\partial{b}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{b}}\\ \frac{\partial{I}}{\partial{b}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot i_2\\ \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot i_2 \end{pmatrix} $$

$$\frac{\partial{Error}}{\partial{b_F}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{b_F}}\\ \frac{\partial{I}}{\partial{b_F}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot e\cdot \sigma'(net_F) \cdot 1\\ \sigma'(net_I) \cdot g\cdot \sigma'(net_F) \cdot 1 \end{pmatrix} $$

$$\frac{\partial{Error}}{\partial{c}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{c}}\\ \frac{\partial{I}}{\partial{c}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_1\\ \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_1 \end{pmatrix} $$

$$\frac{\partial{Error}}{\partial{d}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{d}}\\ \frac{\partial{I}}{\partial{d}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot i_2\\ \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot i_2 \end{pmatrix} $$

$$\frac{\partial{Error}}{\partial{b_G}} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial{H}}{\partial{b_G}}\\ \frac{\partial{I}}{\partial{b_G}} \end{pmatrix} = 2\begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot f\cdot \sigma'(net_G) \cdot 1\\ \sigma'(net_I) \cdot h\cdot \sigma'(net_G) \cdot 1 \end{pmatrix} $$

Simplifying a bit:

$$\frac{\partial{Error}}{\partial{a}} = 2\cdot \sigma'(net_F) \cdot i_1 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot e \\ \sigma'(net_I) \cdot g \end{pmatrix} = 2\cdot \sigma'(net_F) \cdot i_1 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot e) + ((I-t_2) \cdot \sigma'(net_I) \cdot g))$$

$$\frac{\partial{Error}}{\partial{b}} = 2\cdot \sigma'(net_F) \cdot i_2 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot e\\ \sigma'(net_I) \cdot g \end{pmatrix} = 2\cdot \sigma'(net_F) \cdot i_2 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot e) + ((I-t_2) \cdot \sigma'(net_I) \cdot g))$$

$$\frac{\partial{Error}}{\partial{c}} = 2\cdot \sigma'(net_G) \cdot i_1 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot f \\ \sigma'(net_I) \cdot h \end{pmatrix} = 2\cdot \sigma'(net_G) \cdot i_1 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot f) + ((I-t_2) \cdot \sigma'(net_I) \cdot h))$$

$$\frac{\partial{Error}}{\partial{d}} = 2\cdot \sigma'(net_G) \cdot i_2 \cdot \begin{pmatrix} (H-t_1) & (I-t_2) \end{pmatrix} \cdot \begin{pmatrix} \sigma'(net_H) \cdot f \\ \sigma'(net_I) \cdot h \end{pmatrix} = 2\cdot \sigma'(net_G) \cdot i_2 \cdot (((H-t_1) \cdot \sigma'(net_H) \cdot f) + ((I-t_2) \cdot \sigma'(net_I) \cdot h))$$

Hence (including the bias columns, so the dimensions match):

$$ \begin{pmatrix}\frac{\partial{E}}{\partial{a}} & \frac{\partial{E}}{\partial{b}} & \frac{\partial{E}}{\partial{b_F}} \\ \frac{\partial{E}}{\partial{c}} & \frac{\partial{E}}{\partial{d}} & \frac{\partial{E}}{\partial{b_G}}\\ \end{pmatrix} = \begin{pmatrix} \sigma'(net_F) & 0 \\ 0 & \sigma'(net_G)\\ \end{pmatrix} \cdot \begin{pmatrix} e & g \\ f & h\\ \end{pmatrix} \cdot \Bigg[ 2 \cdot \begin{pmatrix} \sigma'(net_H) \cdot (H-t_1) \\ \sigma'(net_I) \cdot (I-t_2) \\ \end{pmatrix} \Bigg] \cdot \begin{pmatrix} i_1 & i_2 & 1\\ \end{pmatrix} $$

Notice how the expression within square brackets was already computed for the output layer. Let's now generalize the above structure to any network.
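Putting all of this together: below is a minimal gradient-descent training loop for XOR, using the layer-by-layer structure derived above generalized to the 2x4x1 network mentioned in the intro. This is a sketch, not the sample's actual code; the learning rate, epoch count, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1 = rng.normal(0.0, 1.0, (4, 2)); b1 = np.zeros(4)  # hidden layer (4 units)
W2 = rng.normal(0.0, 1.0, (1, 4)); b2 = np.zeros(1)  # output layer (1 unit)

def total_error():
    return sum(
        float(np.sum((sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) - t) ** 2))
        for x, t in zip(X, T)
    )

err_before = total_error()
lr = 0.5
for _ in range(10000):
    for x, t in zip(X, T):
        # Forward pass (for the sigmoid, sigma'(net) = sigma(net)*(1-sigma(net)))
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # Backward pass: d2 is the bracketed term from the derivation,
        # 2 * sigma'(net) * (output - target); d1 propagates it backwards.
        d2 = 2.0 * y * (1.0 - y) * (y - t)
        d1 = h * (1.0 - h) * (W2.T @ d2)
        # Gradient-descent updates; the outer products give dE/dW
        W2 -= lr * np.outer(d2, h); b2 -= lr * d2
        W1 -= lr * np.outer(d1, x); b1 -= lr * d1

err_after = total_error()  # should be far below err_before
```

Note how `d2` is computed once per sample and reused both for the output-layer gradient and for the backpropagated hidden-layer term, which is exactly the reuse of the bracketed expression observed above.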

Contact/Questions:

<my_github_account_username>@gmail.com