2017-01-09

# cost function and backpropagation

## cost function

$$L$$ = total no. of layers in the network
$$s_l$$ = no. of units (not counting the bias unit) in layer $$l$$

### two types of classification

1. binary classification
$$y = 0 \text{ or } 1$$
one output unit $$h_{\Theta}(x) \in \mathbb{R}$$ in the output layer
2. multi-class classification ($$K$$ classes)
$$y \in \mathbb{R}^K$$, E.g.: $$\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\0\\0\\1 \end{array} \right]$$
$$K$$ output units, $$h_{\Theta}(x) \in \mathbb{R}^K$$
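The one-hot re-coding of integer labels used above can be sketched in NumPy (the helper name `to_one_hot` is mine, not from the course):

```python
import numpy as np

def to_one_hot(y, K):
    """Re-code integer class labels 0..K-1 as K-dimensional one-hot vectors."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

# three labels re-coded for a K = 4 classification problem
Y = to_one_hot(np.array([0, 2, 1]), K=4)
```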

### cost function

$$h_{\Theta}(x) \in \mathbb{R}^K$$
$$(h_{\Theta}(x))_i = i^{th} \text{ output}$$
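The cost function itself generalizes the regularized logistic-regression cost to $$K$$ outputs; the inner sum runs over the output units and the regularization term skips the bias weights:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log (h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)}) \log (1-(h_{\Theta}(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2$$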

## backpropagation

Goal: adjust $$\Theta$$ in order to minimize $$J(\Theta)$$.

Need to compute:

• $$J(\Theta)$$
• $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$$

intuition: $$\delta_j^{(l)}$$ = “error” of node $$j$$ in layer $$l$$.

For each output unit (layer $$L=4$$)
$$\delta_j^{(4)}=a_j^{(4)} - y_j$$
where $$a_j^{(4)} = (h_{\Theta}(x))_j$$
vectorized : $$\delta^{(4)}=a^{(4)} - y$$

For the other layers:
$$\delta^{(3)}=(\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)})$$
$$\delta^{(2)}=(\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})$$
where $$g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$ (.* denotes the element-wise product)

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}$$ (ignoring regularization, i.e. $$\lambda = 0$$)
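A minimal NumPy sketch of the forward pass and the $$\delta$$ recurrence for one example in an $$L=4$$ network (all sizes and weights here are made-up illustrations, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# made-up L=4 network: 3 inputs, two hidden layers of 5 units, 4 outputs
Theta1 = rng.normal(size=(5, 4))    # layer 1 (+ bias) -> layer 2
Theta2 = rng.normal(size=(5, 6))    # layer 2 (+ bias) -> layer 3
Theta3 = rng.normal(size=(4, 6))    # layer 3 (+ bias) -> layer 4

x = rng.normal(size=3)
y = np.array([0.0, 1.0, 0.0, 0.0])

# forward propagation (prepend the bias unit 1 at each non-output layer)
a1 = np.concatenate(([1.0], x))
z2 = Theta1 @ a1; a2 = np.concatenate(([1.0], sigmoid(z2)))
z3 = Theta2 @ a2; a3 = np.concatenate(([1.0], sigmoid(z3)))
z4 = Theta3 @ a3; a4 = sigmoid(z4)          # h_Theta(x)

# delta terms: output layer first, then propagate backwards,
# dropping the bias column of Theta and using g'(z) = a .* (1 - a)
delta4 = a4 - y
delta3 = (Theta3[:, 1:].T @ delta4) * (sigmoid(z3) * (1 - sigmoid(z3)))
delta2 = (Theta2[:, 1:].T @ delta3) * (sigmoid(z2) * (1 - sigmoid(z2)))
```

Note the `[:, 1:]` slices: there is no $$\delta$$ for the bias units, so the bias column of each $$\Theta$$ is skipped when propagating the error backwards.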

### backpropagation algorithm

Training set $$\{ (x^{(1)},y^{(1)}), \dots , (x^{(m)},y^{(m)}) \}$$

1. Set $$\Delta_{ij}^{(l)} = 0$$, (for all $$l,i,j$$). (used to compute $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$$)
2. For $$i=1 \text{ to } m$$:
• Set $$a^{(1)} = x^{(i)}$$
• Perform forward propagation to compute $$a^{(l)}$$ for $$l = 2,3,…,L$$
• Using $$y^{(i)}$$, compute $$\delta^{(L)} = a^{(L)}-y^{(i)}$$
• Compute $$\delta^{(L-1)}, \delta^{(L-2)},…, \delta^{(2)}$$
• $$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$$
• vectorized: $$\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$
• $$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)} \text{ if } j \neq 0$$
• $$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0$$

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$$
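The numbered steps above can be sketched end-to-end in NumPy for a one-hidden-layer network. The sizes, data, and $$\lambda/m$$ regularization convention are illustrative assumptions, not the course's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta1, Theta2, x):
    """Forward propagation through one hidden layer; returns all activations."""
    a1 = np.concatenate(([1.0], x))
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    a3 = sigmoid(Theta2 @ a2)              # h_Theta(x)
    return a1, a2, a3

def cost(Theta1, Theta2, X, Y, lam):
    """Regularized cross-entropy cost; bias columns are not regularized."""
    m = X.shape[0]
    J = 0.0
    for i in range(m):
        h = forward(Theta1, Theta2, X[i])[2]
        J -= (Y[i] * np.log(h) + (1 - Y[i]) * np.log(1 - h)).sum()
    J /= m
    J += (lam / (2 * m)) * ((Theta1[:, 1:] ** 2).sum() + (Theta2[:, 1:] ** 2).sum())
    return J

def backprop(Theta1, Theta2, X, Y, lam):
    """Steps 1-2 above: accumulate Delta over the m examples, then form D."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        a1, a2, a3 = forward(Theta1, Theta2, X[i])
        delta3 = a3 - Y[i]
        delta2 = (Theta2[:, 1:].T @ delta3) * (a2[1:] * (1 - a2[1:]))
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    D1, D2 = Delta1 / m, Delta2 / m
    # regularize everything except the bias column (j = 0)
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2

# made-up sizes: 3 inputs, 4 hidden units, K = 2 outputs, m = 5 examples
rng = np.random.default_rng(1)
Theta1 = rng.normal(scale=0.5, size=(4, 4))
Theta2 = rng.normal(scale=0.5, size=(2, 5))
X = rng.normal(size=(5, 3))
Y = np.eye(2)[rng.integers(0, 2, size=5)]
D1, D2 = backprop(Theta1, Theta2, X, Y, lam=1.0)
```

The D matrices produced this way agree with a two-sided numerical gradient of `cost`, which is exactly the check described in the gradient checking section below.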

## backpropagation intuition

$$\delta_j^{(l)}$$ = “error” of cost for $$a_j^{(l)}$$ (unit $$j$$ in layer $$l$$ )
Formally, $$\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i) \text{ for } j \ge 0$$
E.g.: $$\delta_2^{(2)} = \Theta_{12}^{(2)} \cdot \delta_1^{(3)} + \Theta_{22}^{(2)} \cdot \delta_2^{(3)}$$

# implementation practice

## unrolling parameters

Neural Network (L=4):

$$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$ - matrices (Theta1, Theta2, Theta3)
$$D^{(1)},D^{(2)},D^{(3)}$$ - matrices (D1, D2, D3)
Unroll them into vectors.

### Example

$$s_1 = 10$$, $$s_2 = 10$$, $$s_3 = 10$$, $$s_4 = 1$$
$$\Theta^{(1)} \in \mathbb{R}^{10 \times 11}$$, $$\Theta^{(2)} \in \mathbb{R}^{10 \times 11}$$, $$\Theta^{(3)} \in \mathbb{R}^{1 \times 11}$$
$$D^{(1)} \in \mathbb{R}^{10 \times 11}$$, $$D^{(2)} \in \mathbb{R}^{10 \times 11}$$, $$D^{(3)} \in \mathbb{R}^{1 \times 11}$$

```matlab
% thetaVec is a 231 by 1 vector (110 + 110 + 11 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
```
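An equivalent sketch in NumPy, using `order="F"` so the unroll matches Octave's column-major `(:)` operator (the matrices here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(10, 11))
Theta2 = rng.normal(size=(10, 11))
Theta3 = rng.normal(size=(1, 11))

# unroll into one 231-element vector (110 + 110 + 11)
thetaVec = np.concatenate([T.ravel(order="F") for T in (Theta1, Theta2, Theta3)])

# recover the original matrices with the matching reshape
T1 = thetaVec[0:110].reshape(10, 11, order="F")
T2 = thetaVec[110:220].reshape(10, 11, order="F")
T3 = thetaVec[220:231].reshape(1, 11, order="F")
```

Any consistent order works; what matters is that the unroll and the reshape use the same one.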


### learning algorithm

Having initial parameters $$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options)

```matlab
function [jval, gradientVec] = costFunction(thetaVec)
```

1. From thetaVec, get $$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$.
2. Use forward prop/back prop to compute $$D^{(1)},D^{(2)},D^{(3)}$$ and $$J(\Theta)$$
3. Unroll $$D^{(1)},D^{(2)},D^{(3)}$$ to get gradientVec.

## gradient checking

Gradient checking is used to make sure the backpropagation implementation is correct.

For a scalar $$\theta \in \mathbb{R}$$:
$$\epsilon$$ is a small value, e.g. $$10^{-4}$$

$$\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$$
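A quick sanity check of the two-sided approximation on a made-up scalar cost $$J(\theta) = \theta^3$$:

```python
def J(theta):
    return theta ** 3        # made-up cost; dJ/dtheta = 3 * theta**2

eps = 1e-4
theta = 1.0
# two-sided difference; the true derivative at theta = 1 is 3
approx = (J(theta + eps) - J(theta - eps)) / (2 * eps)
```

The approximation error of the two-sided formula is $$O(\epsilon^2)$$, which is why it is preferred over the one-sided $$\frac{J(\theta+\epsilon)-J(\theta)}{\epsilon}$$.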

### parameter vector $$\theta$$

$$\theta \in \mathbb{R}^n$$, the unrolled version
Can calculate approximate partial derivative of $$J(\theta)$$ with respect to each component in $$\theta$$

```matlab
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = theta(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = theta(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;
```


Calculate and check that $$\text{gradApprox} \approx \text{DVec}$$, where DVec is the gradient computed by backprop.
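The same loop as a small NumPy helper, checked against a cost whose gradient is known analytically (the quadratic here is a made-up example):

```python
import numpy as np

def grad_approx(J, theta, eps=1e-4):
    """Two-sided numerical gradient of J at theta, one component at a time."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return approx

theta = np.array([1.0, -2.0, 0.5])
g = grad_approx(lambda t: (t ** 2).sum(), theta)   # analytic gradient: 2 * theta
```

Note that each component needs two full cost evaluations, so the whole check costs $$2n$$ forward passes; this is why it is only run once, not on every training iteration.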

### Implementation

• implement backprop to compute DVec (unrolled $$D^{(1)},D^{(2)},D^{(3)}$$ ).
• Implement numerical gradient check to compute gradApprox.
• Make sure they give similar values.
• Turn off gradient checking. Use the backprop code for learning.

### important

• Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, training will be very slow.

## random initialization

### initial value of $$\Theta$$

Initializing all weights to zero is not a good choice: after every update, the units in the same layer have identical incoming weights and compute identical values, so the hidden units stay redundant.

### symmetry breaking

initialize each $$\Theta_{ij}^{(l)}$$ to a random value in $$[-\epsilon_{init}, \epsilon_{init}]$$.

A good choice is $$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}$$, which depends on the number of input and output units of the layer ($$L_{in} = s_l$$, $$L_{out} = s_{l+1}$$).

```matlab
% each element of Theta1 is a random value in [-INIT_EPSILON, INIT_EPSILON]
Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;
```
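The same initialization in NumPy, with $$\epsilon_{init}$$ computed from the formula above for $$L_{in} = L_{out} = 10$$ (the 10x11 shape mirrors the example, not any particular network):

```python
import numpy as np

L_in, L_out = 10, 10                                        # units feeding in/out
eps_init = np.sqrt(6) / (np.sqrt(L_in) + np.sqrt(L_out))    # about 0.39

rng = np.random.default_rng(42)
# uniform in [-eps_init, eps_init], shape s_{l+1} x (s_l + 1)
Theta1 = rng.uniform(-eps_init, eps_init, size=(10, 11))
```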


## putting it together

No. of input units: Dimension of features $$x^{(i)}$$
No. of output units: Number of classes.
Reasonable default: 1 hidden layer; if more than 1 hidden layer, use the same no. of hidden units in every layer (usually, the more hidden units the better)

### Training a neural network

1. randomly initialize weights
2. implement forward propagation to get $$h_{\Theta}(x^{(i)})$$ for any $$x^{(i)}$$
3. implement code to compute cost function $$J(\Theta)$$
4. implement backprop to compute partial derivatives $$\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)$$
5. use gradient checking to compare $$\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)$$ computed using backpropagation vs. using numerical estimate of gradient of $$J(\Theta)$$.
• Then disable gradient checking code.
6. use gradient descent or an advanced optimization method with backpropagation to try to minimize $$J(\Theta)$$ as a function of the parameters $$\Theta$$

Use a for loop over the training examples, performing forward propagation and backpropagation on each (to get the activations $$a^{(l)}$$ and delta terms $$\delta^{(l)}$$ for $$l = 2,\dots,L$$).
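The whole procedure can be sketched in NumPy on a tiny made-up problem (XOR); the layer sizes, learning rate, and iteration count are arbitrary illustrative choices, and regularization is omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up tiny problem: XOR with 2 inputs, 4 hidden units, 1 output
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
m = X.shape[0]

# step 1: random initialization (symmetry breaking)
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.5, 0.5, size=(4, 3))
Theta2 = rng.uniform(-0.5, 0.5, size=(1, 5))

def cost():
    # step 3: cross-entropy cost over all examples (no regularization here)
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    return -(y * np.log(H) + (1 - y) * np.log(1 - H)).sum() / m

J_before = cost()
alpha = 1.0
for step in range(1000):
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # steps 2 and 4: forward prop, then backprop, one example at a time
        a1 = np.concatenate(([1.0], X[i]))
        a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
        a3 = sigmoid(Theta2 @ a2)
        delta3 = a3 - y[i]
        delta2 = (Theta2[:, 1:].T @ delta3) * (a2[1:] * (1 - a2[1:]))
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    # step 6: one gradient descent update on each Theta
    Theta1 -= alpha * Delta1 / m
    Theta2 -= alpha * Delta2 / m

J_after = cost()
```

Step 5 (gradient checking) is left out of the loop on purpose: it would be run once on the `Delta`-based gradients before training, then disabled.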