cost function and backpropagation
cost function
\(L\) = total number of layers in the network
\(s_l\) = number of units (not counting the bias unit) in layer \(l\)
two types of classification
- binary classification
\(y = 0 \text{ or } 1\)
one output unit: \(h_{\Theta}(x) \in \mathbb{R}\)
- multi-class classification (\(K\) classes)
\(y \in \mathbb{R}^K\), e.g.: \(\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]\)
\(K\) output units: \(h_{\Theta}(x) \in \mathbb{R}^K\)
cost function
\(h_{\Theta}(x) \in \mathbb{R}^K\)
\((h_{\Theta}(x))_i = i^{th} \text{ output} \)
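The regularized cost function generalizes the logistic regression cost to \(K\) output units (the regularization term sums over all weights except those multiplying the bias units):
\[J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log (h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\log \left(1-(h_{\Theta}(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2\]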
backpropagation
to adjust \(\Theta\) in order to minimize \(J(\Theta)\)
Need to compute:
- \(J(\Theta)\)
- \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)
Gradient computation: backpropagation algorithm
intuition: \(\delta_j^{(l)}\) = “error” of node \(j\) in layer \(l\).
For each output unit (layer \(L=4\))
\(\delta_j^{(4)}=a_j^{(4)} - y_j\)
where \(a_j^{(4)} = (h_{\Theta}(x))_j\)
vectorized : \(\delta^{(4)}=a^{(4)} - y\)
For the hidden layers (e.g. layer 3):
\(\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)})\)
where \(g'(z^{(3)}) = a^{(3)}.*(1-a^{(3)})\), since the sigmoid satisfies \(g'(z) = g(z)(1-g(z))\). There is no \(\delta^{(1)}\): layer 1 is the input and has no error term.
\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}\) (ignoring regularization)
backpropagation algorithm
Training set \(\{(x^{(1)},y^{(1)}), \ldots , (x^{(m)},y^{(m)})\}\)
- Set \(\Delta_{ij}^{(l)} = 0 \) (for all \(l,i,j\)); these accumulators are used to compute \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)
- For \(t=1 \text{ to } m\):
  - Set \(a^{(1)} = x^{(t)}\)
  - Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)
  - Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)}-y^{(t)}\)
  - Compute \(\delta^{(L-1)}, \delta^{(L-2)},\ldots, \delta^{(2)}\)
  - \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\)
  - vectorized: \(\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)
- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)} \text{ if } j \neq 0\)
- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0\)
\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)
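As a concrete illustration, the accumulation loop above can be sketched in NumPy for an assumed 3-layer network with sigmoid activations (the function and variable names here are ours, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, Y, lam):
    """Compute D1, D2 for a 3-layer network.
    Theta1: (s2, s1+1), Theta2: (K, s2+1), X: (m, s1), Y: (m, K)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        # forward propagation (prepend the bias unit a_0 = 1)
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # output-layer error, then propagate back (bias column dropped)
        d3 = a3 - Y[t]
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Delta^{(l)} += delta^{(l+1)} (a^{(l)})^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    D1, D2 = Delta1 / m, Delta2 / m
    # regularize every column except the bias column (j = 0)
    D1[:, 1:] += lam * Theta1[:, 1:]
    D2[:, 1:] += lam * Theta2[:, 1:]
    return D1, D2
```

Here `np.outer` plays the role of the vectorized update \(\delta^{(l+1)}(a^{(l)})^T\).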
backpropagation intuition
\(\delta_j^{(l)}\) = “error” of cost for \(a_j^{(l)}\) (unit \(j\) in layer \(l\) )
Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i) \text{ for } j \ge 0\), where \(cost(i) = y^{(i)}\log h_{\Theta}(x^{(i)}) + (1-y^{(i)})\log(1-h_{\Theta}(x^{(i)}))\)
implementation practice
unrolling parameters
Neural Network (L=4):
\(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\) - matrices (Theta1, Theta2, Theta3)
\(D^{(1)},D^{(2)},D^{(3)}\) -matrices (D1, D2, D3)
Unroll into vectors
Example
\(s_1 = 10\), \(s_2 = 10\), \(s_3 = 10\), \(s_4 = 1\)
\(\Theta^{(1)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(2)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(3)} \in \mathbb{R}^{1 \times 11}\)
\(D^{(1)} \in \mathbb{R}^{10 \times 11}\), \(D^{(2)} \in \mathbb{R}^{10 \times 11}\), \(D^{(3)} \in \mathbb{R}^{1 \times 11}\)
% thetaVec is a 231 by 1 vector (110 + 110 + 11 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];
% recover the matrices from the unrolled vector
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
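The same unrolling can be sketched in NumPy (illustrative; Octave's `(:)` stacks columns, so `order='F'` is used to match exactly):

```python
import numpy as np

# hypothetical matrices with the shapes from the example above
Theta1 = np.arange(110.0).reshape(10, 11)
Theta2 = np.arange(110.0, 220.0).reshape(10, 11)
Theta3 = np.arange(220.0, 231.0).reshape(1, 11)

# unroll into one 231-element vector, column by column
thetaVec = np.concatenate([Theta1.ravel(order='F'),
                           Theta2.ravel(order='F'),
                           Theta3.ravel(order='F')])

# recover the matrices (mirrors the Octave reshape calls)
T1 = thetaVec[0:110].reshape((10, 11), order='F')
T2 = thetaVec[110:220].reshape((10, 11), order='F')
T3 = thetaVec[220:231].reshape((1, 11), order='F')
```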
learning algorithm
Start with initial parameters \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\).
Unroll to get initialTheta, which is passed to fminunc(@costFunction, initialTheta, options)
function [jval, gradientVec] = costFunction(thetaVec)
- From thetaVec, get \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\)
- Use forward prop/back prop to compute \(D^{(1)},D^{(2)},D^{(3)}\) and \(J(\Theta)\)
- Unroll \(D^{(1)},D^{(2)},D^{(3)}\) to get gradientVec
gradient checking
Make sure the implementation is correct.
Numerical estimation of gradients:
\(\theta \in \mathbb{R}\)
\(\epsilon\) is a small value like \(10^{-4}\)
\[\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}\]
parameter vector \(\theta\)
\(\theta \in \mathbb{R}^n\), the unrolled version
Can calculate approximate partial derivative of \(J(\theta)\) with respect to each component in \(\theta\)
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = theta(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = theta(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
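The same loop as a self-contained NumPy sketch (`grad_approx` and the quadratic example are ours, for illustration):

```python
import numpy as np

def grad_approx(J, theta, eps=1e-4):
    """Two-sided numerical estimate of the gradient of J at theta
    (theta is the unrolled parameter vector)."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return approx

# example: for J(theta) = sum(theta^2), the true gradient is 2*theta
theta = np.array([1.0, -2.0, 3.0])
estimate = grad_approx(lambda th: np.sum(th ** 2), theta)
```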
Calculate gradApprox and check that \(gradApprox \approx DVec\), where DVec is the gradient computed by backprop.
Implementation
- Implement backprop to compute DVec (unrolled \(D^{(1)},D^{(2)},D^{(3)}\))
- Implement numerical gradient check to compute gradApprox
- Make sure they give similar values
- Turn off gradient checking. Use the backprop code for learning.
important
- Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, your code will be very slow.
random initialization
initial value of \(\Theta\)
Initializing all weights to zero is not a good choice: the units within a layer then compute identical activations and receive identical gradients, so their parameters stay equal to each other after every update.
symmetry breaking
initialize each \(\Theta_{ij}^{(l)}\) to a random value in \([-\epsilon_{init}, \epsilon_{init}]\).
A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}\), where \(L_{in}\) and \(L_{out}\) are the numbers of input and output units of the layer.
Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;
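A NumPy version of the same initialization (a sketch; `rand_init` is our name, not part of the notes):

```python
import numpy as np

def rand_init(L_in, L_out):
    """Weight matrix for a layer with L_in inputs and L_out outputs,
    drawn uniformly from [-eps_init, eps_init]; the +1 column is the bias."""
    eps_init = np.sqrt(6) / (np.sqrt(L_in) + np.sqrt(L_out))
    return np.random.rand(L_out, L_in + 1) * 2 * eps_init - eps_init

Theta1 = rand_init(10, 10)   # analogous to the Octave line above
```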
putting it together
No. of input units: Dimension of features \(x^{(i)}\)
No. of output units: Number of classes.
Reasonable default: 1 hidden layer, or if more than 1 hidden layer, have same no. of hidden units in every layer (usually the more hidden units the better)
Training a neural network
- randomly initialize weights
- implement forward propagation to get \(h_{\Theta}(x^{(i)}) \) for any \(x^{(i)}\)
- implement code to compute cost function \(J(\Theta)\)
- implement backprop to compute partial derivatives \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\)
- use gradient checking to compare \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\) computed using backpropagation vs. using numerical estimate of gradient of \(J(\Theta)\).
- Then disable gradient checking code.
- use gradient descent or an advanced optimization method with backpropagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\)
Use a for loop over the training examples, performing forward propagation and backpropagation on each example to get the activations \(a^{(l)}\) and delta terms \(\delta^{(l)} \text{ for } l = 2,\ldots,L\).
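For the final minimization step, a bare-bones gradient descent sketch over the unrolled parameter vector (illustrative only; `grad` stands in for the unrolled D vector that backprop would return, and in practice fminunc or another advanced optimizer is preferred):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, iters=200):
    """Minimize J by repeatedly stepping against its gradient.
    grad(theta) would be the unrolled D vector from backprop."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# toy example: J(theta) = ||theta - 3||^2 has gradient 2*(theta - 3)
theta_min = gradient_descent(lambda th: 2 * (th - 3.0), np.zeros(2))
```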