# cost function and backpropagation

## cost function

\(L\) = total no. of layers in the network

\(s_l\) = no. of units (not counting the bias unit) in layer \(l\)

### two types of classification

- binary classification

  \(y = 0 \text{ or } 1\)

  one output unit: \(h_{\Theta}(x) \in \mathbb{R}\)

- multi-class classification (\(K\) classes)

  \(y \in \mathbb{R}^K\), e.g.: \(\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\0\\1 \end{array} \right]\)

  \(K\) output units: \(h_{\Theta}(x) \in \mathbb{R}^K\)

### cost function

\(h_{\Theta}(x) \in \mathbb{R}^K\), \((h_{\Theta}(x))_i = i^{th} \text{ output} \)

\[J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log (h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\log (1-(h_{\Theta}(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2\]
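As a sketch, the unregularized part of the cost sums the logistic (cross-entropy) loss over the \(K\) output units and averages over the \(m\) examples. A minimal pure-Python illustration (the function name and data below are made up, not course code):

```python
import math

def nn_cost(outputs, labels):
    """Unregularized cost: cross-entropy summed over the K output units,
    averaged over the m examples. outputs[t][k] = (h_Theta(x^(t)))_k,
    labels[t] = one-hot y^(t)."""
    m = len(outputs)
    total = 0.0
    for h, y in zip(outputs, labels):
        for h_k, y_k in zip(h, y):
            total -= y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    return total / m

# two examples, K = 2 output units
h = [[0.9, 0.1], [0.2, 0.8]]
y = [[1, 0], [0, 1]]
cost = nn_cost(h, y)
```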

## backpropagation

to adjust \(\Theta\) in order to minimize \(J(\Theta)\)

Need to compute:

- \(J(\Theta)\)
- \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)

### Gradient computation: backpropagation algorithm

intuition: \(\delta_j^{(l)}\) = “error” of node \(j\) in layer \(l\).

For each output unit (layer \(L=4\))

\(\delta_j^{(4)}=a_j^{(4)} - y_j\)

where \(a_j^{(4)} = (h_{\Theta}(x))_j\)

vectorized: \(\delta^{(4)}=a^{(4)} - y\)

For the earlier layers (e.g. layer 3):

\(\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)})\)

where \(g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})\)

Ignoring regularization (\(\lambda = 0\)):

\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}\)

### backpropagation algorithm

Training set \(\{ (x^{(1)},y^{(1)}), \dots , (x^{(m)},y^{(m)}) \}\)

- Set \(\Delta_{ij}^{(l)} = 0 \) for all \(l,i,j\) (used to compute \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\))
- For \(t=1 \text{ to } m\):
  - Set \(a^{(1)} = x^{(t)}\)
  - Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\dots,L\)
  - Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)}-y^{(t)}\)
  - Compute \(\delta^{(L-1)}, \delta^{(L-2)},\dots, \delta^{(2)}\)
  - \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\)
  - vectorized: \(\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)

- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)} \text{ if } j \neq 0\)
- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0\)

\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)
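The whole procedure above can be sketched in pure Python for a small fully-connected sigmoid network. This is an illustrative, unvectorized sketch that ignores regularization (\(\lambda = 0\)); all names are made up, and a real implementation would use matrix code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(Thetas, X, Y):
    """One pass of backpropagation. Thetas[l] is a weight matrix (list of
    rows) with a bias column prepended. Returns the unregularized
    gradients D (lambda = 0)."""
    m = len(X)
    # accumulators Delta, same shapes as the Thetas
    Deltas = [[[0.0] * len(T[0]) for _ in T] for T in Thetas]
    for x, y in zip(X, Y):
        # forward propagation; every layer but the output gets a bias unit
        a = [[1.0] + list(x)]
        for l, T in enumerate(Thetas):
            z = [sum(w * v for w, v in zip(row, a[-1])) for row in T]
            act = [sigmoid(v) for v in z]
            a.append(act if l == len(Thetas) - 1 else [1.0] + act)
        # output-layer error: delta^(L) = a^(L) - y
        delta = [ak - yk for ak, yk in zip(a[-1], y)]
        for l in range(len(Thetas) - 1, -1, -1):
            T = Thetas[l]
            # Delta_ij += a_j^(l) * delta_i^(l+1)
            for i in range(len(T)):
                for j in range(len(T[0])):
                    Deltas[l][i][j] += a[l][j] * delta[i]
            if l > 0:
                # delta^(l) = (Theta^(l))^T delta^(l+1) .* g'(z^(l)),
                # dropping the bias component (j starts at 1)
                delta = [
                    sum(T[i][j] * delta[i] for i in range(len(T)))
                    * a[l][j] * (1 - a[l][j])
                    for j in range(1, len(T[0]))
                ]
    return [[[d / m for d in row] for row in D] for D in Deltas]
```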

## backpropagation intuition

\(\delta_j^{(l)}\) = “error” of cost for \(a_j^{(l)}\) (unit \(j\) in layer \(l\) )

Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}\text{cost}(i) \text{ for } j \ge 0\), where \(\text{cost}(i) = y^{(i)}\log h_{\Theta}(x^{(i)}) + (1-y^{(i)})\log (1-h_{\Theta}(x^{(i)}))\)


# implementation practice

## unrolling parameters

Neural Network (L=4):

\(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\) - matrices (Theta1, Theta2, Theta3)

\(D^{(1)},D^{(2)},D^{(3)}\) -matrices (D1, D2, D3)

Unroll into vectors

### Example

\(s_1 = 10\), \(s_2 = 10\), \(s_3 = 10\), \(s_4 = 1\)

\(\Theta^{(1)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(2)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(3)} \in \mathbb{R}^{1 \times 11}\)

\(D^{(1)} \in \mathbb{R}^{10 \times 11}\), \(D^{(2)} \in \mathbb{R}^{10 \times 11}\), \(D^{(3)} \in \mathbb{R}^{1 \times 11}\)

```
% thetaVec is a 231 by 1 vector (110 + 110 + 11 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];
% recover the matrices from the unrolled vector
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
```
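The same unrolling can be sketched in plain Python, flattening column-major to match Octave's `(:)` (the helper names and tiny matrices are illustrative):

```python
def unroll(*mats):
    """Flatten matrices column-major (like Octave's (:)) into one vector."""
    return [m[i][j] for m in mats for j in range(len(m[0])) for i in range(len(m))]

def reshape(vec, rows, cols):
    """Inverse operation: rebuild a rows-by-cols matrix column-major."""
    return [[vec[j * rows + i] for j in range(cols)] for i in range(rows)]

Theta1 = [[1, 2], [3, 4]]
Theta2 = [[5, 6]]
thetaVec = unroll(Theta1, Theta2)          # column-major: [1, 3, 2, 4, 5, 6]
assert reshape(thetaVec[0:4], 2, 2) == Theta1
```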

### learning algorithm

Having initial parameters \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\)

Unroll to get `initialTheta`

to pass to `fminunc(@costFunction, initialTheta, options)`

`function [jval, gradientVec] = costFunction(thetaVec)`

- From `thetaVec`, get \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\)
- Use forward prop/back prop to compute \(D^{(1)},D^{(2)},D^{(3)}\) and \(J(\Theta)\)
- Unroll \(D^{(1)},D^{(2)},D^{(3)}\) to get `gradientVec`

## gradient checking

Make sure the implementation is correct.

### numerical estimation of gradients

\(\theta \in \mathbb{R}\)

\(\epsilon\) is a small value like \(10^{-4}\)

\[\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}\]

### parameter vector \(\theta\)

\(\theta \in \mathbb{R}^n\), the unrolled version

Can calculate approximate partial derivative of \(J(\theta)\) with respect to each component in \(\theta\)

```
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = theta(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = theta(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
```

Check that \(gradApprox \approx DVec\), where `DVec` is computed by backprop.
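The same check can be sketched in pure Python; the cost function `J` and test point below are made-up examples with a known analytic gradient:

```python
EPSILON = 1e-4

def grad_approx(J, theta):
    """Two-sided (central) difference approximation to the gradient of J."""
    approx = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += EPSILON
        theta_minus = list(theta)
        theta_minus[i] -= EPSILON
        approx.append((J(theta_plus) - J(theta_minus)) / (2 * EPSILON))
    return approx

# illustrative cost J(theta) = theta_0^2 + 3*theta_1, gradient [2*theta_0, 3]
J = lambda t: t[0] ** 2 + 3 * t[1]
g = grad_approx(J, [1.0, 5.0])    # close to [2.0, 3.0]
```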

### Implementation

- Implement backprop to compute `DVec` (unrolled \(D^{(1)},D^{(2)},D^{(3)}\))
- Implement numerical gradient check to compute `gradApprox`
- Make sure they give similar values
- Turn off gradient checking. Use backprop code for learning.

### important

- Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, training will be very slow.

## random initialization

### initial value of \(\Theta\)

Initializing all parameters to zero is not a good choice: after each update, the units within a layer remain equal to each other, so they all compute the same feature.

### symmetry breaking

initialize each \(\Theta_{ij}^{(l)}\) to a random value in \([-\epsilon_{init}, \epsilon_{init}]\).

A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}\), where \(L_{in}\) and \(L_{out}\) are the numbers of units in the layers on either side of \(\Theta^{(l)}\).

```
Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;
```
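A pure-Python equivalent of the Octave line above (the function name and the value 0.12 are illustrative):

```python
import random

def rand_initialize_weights(rows, cols, init_epsilon):
    """Each weight drawn uniformly from [-init_epsilon, init_epsilon]."""
    return [[random.uniform(-init_epsilon, init_epsilon) for _ in range(cols)]
            for _ in range(rows)]

INIT_EPSILON = 0.12              # illustrative small value
Theta1 = rand_initialize_weights(10, 11, INIT_EPSILON)
```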

## putting it together

No. of input units: Dimension of features \(x^{(i)}\)

No. of output units: Number of classes.

Reasonable default: 1 hidden layer; if more than 1 hidden layer, use the same no. of hidden units in every layer (usually, the more hidden units the better)

### Training a neural network

- randomly initialize weights
- implement forward propagation to get \(h_{\Theta}(x^{(i)}) \) for any \(x^{(i)}\)
- implement code to compute cost function \(J(\Theta)\)
- implement backprop to compute partial derivatives \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\)
- use gradient checking to compare \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\) computed using backpropagation vs. using numerical estimate of gradient of \(J(\Theta)\).
- Then disable gradient checking code.

- use gradient descent or an advanced optimization method with backpropagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\)

Use a for loop over the \(m\) training examples, performing forward propagation and backpropagation for each one (to get the activations \(a^{(l)}\) and delta terms \(\delta^{(l)}\) for \(l = 2,\dots,L\)).
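A minimal end-to-end sketch of this recipe, shrunk to a single sigmoid output unit (a "network" with no hidden layer) so the whole training loop fits in a few lines; the toy data, initial weights, and hyperparameters are all illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy training set: learn the OR function
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]
theta = [0.01, -0.02, 0.015]           # small "random" initial weights (bias first)

alpha = 1.0                            # learning rate
for epoch in range(2000):
    grad = [0.0] * 3
    for x, y in zip(X, Y):             # loop over the m training examples
        a = [1.0] + x                  # forward propagation (add bias unit)
        h = sigmoid(sum(t * v for t, v in zip(theta, a)))
        delta = h - y                  # output "error" term
        for j in range(3):
            grad[j] += delta * a[j]    # accumulate partial derivatives
    # gradient descent step on the averaged gradient
    theta = [t - alpha * g / len(X) for t, g in zip(theta, grad)]

predict = lambda x: sigmoid(sum(t * v for t, v in zip(theta, [1.0] + x)))
```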