2017-01-09

cost function and backpropagation

cost function

$$L$$ = total no. of layers in the network
$$s_l$$ = no. of units (not counting bias unit) in layer $$l$$

two types of classification

1. binary classification
$$y = 0 \text{ or } 1$$
one output unit, $$h_{\Theta}(x) \in \mathbb{R}$$, in the output layer
2. multi-class classification (K class)
$$y \in \mathbb{R}^K$$, E.g.: $$\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]$$, $$\left[ \begin{array} {c} 0\\0\\0\\1 \end{array} \right]$$
K output units, $$h_{\Theta}(x) \in \mathbb{R}^K$$

cost function

$$h_{\Theta}(x) \in \mathbb{R}^K$$, $$(h_{\Theta}(x))_k = k^{th} \text{ output}$$

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log (h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)}) \log \left(1-(h_{\Theta}(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2$$

This generalizes the regularized logistic regression cost to $$K$$ output units; the bias weights ($$i = 0$$) are not regularized.
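The course works in Octave, but the K-class cross-entropy cost $$J(\Theta)$$ is easy to sketch in NumPy. A minimal sketch, assuming `h` holds the network outputs for all $$m$$ examples, `Y` the one-hot labels, and `thetas` the weight matrices with the bias column first (all names are illustrative):

```python
import numpy as np

def nn_cost(h, Y, thetas, lam):
    """Cross-entropy cost for a K-class network.

    h      : (m, K) array of outputs h_Theta(x), one row per example
    Y      : (m, K) one-hot labels
    thetas : list of weight matrices Theta^(l), bias column first
    lam    : regularization strength lambda
    """
    m = Y.shape[0]
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # regularize every weight except the bias column (j = 0)
    reg = sum(np.sum(T[:, 1:] ** 2) for T in thetas) * lam / (2 * m)
    return cost + reg
```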

backpropagation

to adjust $$\Theta$$ in order to minimize $$J(\Theta)$$

Need to compute:

• $$J(\Theta)$$
• $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$$

intuition: $$\delta_j^{(l)}$$ = “error” of node $$j$$ in layer $$l$$.

For each output unit (layer $$L=4$$)
$$\delta_j^{(4)}=a_j^{(4)} - y_j$$
where $$a_j^{(4)} = (h_{\Theta}(x))_j$$
vectorized: $$\delta^{(4)}=a^{(4)} - y$$

For the other layers (there is no $$\delta^{(1)}$$, since the input layer has no error term):
$$\delta^{(3)}=(\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)})$$
$$\delta^{(2)}=(\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})$$
where $$g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$$ (derivative of the sigmoid)

Ignoring regularization ($$\lambda = 0$$):
$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}$$

backpropagation algorithm

Training set $$\{ (x^{(1)},y^{(1)}), \dots , (x^{(m)},y^{(m)}) \}$$

1. Set $$\Delta_{ij}^{(l)} = 0$$ for all $$l,i,j$$ (used to accumulate $$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$$)
2. For $$t=1 \text{ to } m$$:
• Set $$a^{(1)} = x^{(t)}$$
• Perform forward propagation to compute $$a^{(l)}$$ for $$l = 2,3,\dots,L$$
• Using $$y^{(t)}$$, compute $$\delta^{(L)} = a^{(L)}-y^{(t)}$$
• Compute $$\delta^{(L-1)}, \delta^{(L-2)},\dots, \delta^{(2)}$$
• $$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$$
• vectorized: $$\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$
3. After the loop over all examples, compute:
• $$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)} \text{ if } j \neq 0$$
• $$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0$$ (bias weights are not regularized)

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$$
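The algorithm above can be sketched in NumPy for a 3-layer sigmoid network (the course implements this in Octave; the function name, argument layout, and sigmoid activation are assumptions here, not the course's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, Y, lam):
    """One pass of the algorithm above for a 3-layer network (L = 3).

    Theta1 : (s2, n+1), Theta2 : (K, s2+1) -- bias column first
    X      : (m, n) inputs, Y : (m, K) one-hot labels
    Returns (D1, D2), the gradients of J(Theta).
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # forward propagation
        a1 = np.concatenate(([1.0], X[i]))          # add bias unit
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)                   # = h_Theta(x^(i))
        # back propagation
        d3 = a3 - Y[i]                              # delta^(L)
        d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))
        Delta2 += np.outer(d3, a2)                  # Delta += delta a^T
        Delta1 += np.outer(d2, a1)
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += lam * Theta1[:, 1:]                # skip the bias column
    D2[:, 1:] += lam * Theta2[:, 1:]
    return D1, D2
```

A quick way to trust such a sketch is the gradient check described below: compare `D1`, `D2` against a two-sided numerical difference of the cost.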

backpropagation intuition

$$\delta_j^{(l)}$$ = “error” of cost for $$a_j^{(l)}$$ (unit $$j$$ in layer $$l$$ )
Formally, $$\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i) \text{ for } j \ge 0$$
E.g.: $$\delta_2^{(2)} = \Theta_{12}^{(2)} \cdot \delta_1^{(3)} + \Theta_{22}^{(2)} \cdot \delta_2^{(3)}$$

implementation practice

unrolling parameters

Neural Network (L=4):

$$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$ - matrices (Theta1, Theta2, Theta3)
$$D^{(1)},D^{(2)},D^{(3)}$$ - matrices (D1, D2, D3)
Unroll into vectors

Example

$$s_1 = 10$$, $$s_2 = 10$$, $$s_3 = 1$$
$$\Theta^{(1)} \in \mathbb{R}^{10 \times 11}$$, $$\Theta^{(2)} \in \mathbb{R}^{10 \times 11}$$, $$\Theta^{(3)} \in \mathbb{R}^{1 \times 11}$$
$$D^{(1)} \in \mathbb{R}^{10 \times 11}$$, $$D^{(2)} \in \mathbb{R}^{10 \times 11}$$, $$D^{(3)} \in \mathbb{R}^{1 \times 11}$$

% thetaVec is a 231 by 1 vector (110 + 110 + 11 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
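The same unrolling can be sketched in NumPy for comparison (the values are made up for illustration; `order='F'` mimics Octave's column-major `Theta1(:)`; the vector has 110 + 110 + 11 = 231 elements):

```python
import numpy as np

# hypothetical matrices with the shapes from the example above
Theta1 = np.arange(110.0).reshape(10, 11)
Theta2 = np.arange(110.0, 220.0).reshape(10, 11)
Theta3 = np.arange(220.0, 231.0).reshape(1, 11)

# unroll into one long vector; order='F' mirrors Octave's column-major (:)
thetaVec = np.concatenate([T.ravel(order='F')
                           for T in (Theta1, Theta2, Theta3)])

# recover the matrices, mirroring reshape(thetaVec(1:110), 10, 11)
Theta1_back = thetaVec[:110].reshape((10, 11), order='F')
Theta2_back = thetaVec[110:220].reshape((10, 11), order='F')
Theta3_back = thetaVec[220:].reshape((1, 11), order='F')
```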


learning algorithm

Having initial parameters $$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options)

function [jval, gradientVec] = costFunction(thetaVec)

1. From thetaVec, get $$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$$.
2. Use forward prop/back prop to compute $$D^{(1)},D^{(2)},D^{(3)}$$ and $$J(\Theta)$$
3. Unroll $$D^{(1)},D^{(2)},D^{(3)}$$ to get gradientVec.

gradient checking

Make sure the implementation is correct. Start with a scalar parameter:

$$\theta \in \mathbb{R}$$
$$\epsilon$$ is a small value, e.g. $$10^{-4}$$

$$\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$$
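A minimal sketch of the two-sided approximation for scalar $$\theta$$ (the example function $$J(\theta) = \theta^3$$ and the numbers are made up for illustration):

```python
def numeric_grad(J, theta, eps=1e-4):
    """Two-sided difference approximation of dJ/dtheta (scalar theta)."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

# e.g. for J(theta) = theta^3 the true derivative at theta = 2 is 12
approx = numeric_grad(lambda t: t ** 3, 2.0)   # close to 12.0
```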

parameter vector $$\theta$$

$$\theta \in \mathbb{R}^n$$, the unrolled version
Can calculate approximate partial derivative of $$J(\theta)$$ with respect to each component in $$\theta$$

for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = theta(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = theta(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;


Check that $$gradApprox \approx DVec$$, where DVec comes from backprop.

Implementation

• Implement backprop to compute DVec (unrolled $$D^{(1)},D^{(2)},D^{(3)}$$).
• Implement the numerical gradient check to compute gradApprox.
• Make sure they give similar values.
• Turn off gradient checking and use the backprop code for learning.

important

• be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, it will be very slow.

random initialization

initial value of $$\Theta$$

Initializing all components of $$\Theta$$ to zero is a bad choice: all units in the same layer then compute the same function and receive identical updates, so their parameters stay equal to each other forever.

symmetry breaking

initialize each $$\Theta_{ij}^{(l)}$$ to a random value in $$[-\epsilon_{init}, \epsilon_{init}]$$.

A good choice is $$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}$$, where $$L_{in}$$ and $$L_{out}$$ are the numbers of input and output units of the layer.

Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;
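An equivalent NumPy sketch, with $$\epsilon_{init}$$ computed from the formula above (the helper name `rand_init` is my own, not from the course):

```python
import numpy as np

def rand_init(L_in, L_out):
    """Weights for a layer with L_in inputs and L_out outputs, drawn
    uniformly from [-eps_init, eps_init]; the +1 column is for the bias."""
    eps_init = np.sqrt(6) / (np.sqrt(L_in) + np.sqrt(L_out))
    return np.random.rand(L_out, L_in + 1) * 2 * eps_init - eps_init

Theta1 = rand_init(10, 10)   # counterpart of the Octave line above
```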


putting it together

No. of input units: Dimension of features $$x^{(i)}$$
No. of output units: Number of classes.
Reasonable default: 1 hidden layer; if more than 1 hidden layer, use the same no. of hidden units in every layer (usually, the more hidden units the better)

Training a neural network

1. randomly initialize weights
2. implement forward propagation to get $$h_{\Theta}(x^{(i)})$$ for any $$x^{(i)}$$
3. implement code to compute cost function $$J(\Theta)$$
4. implement backprop to compute partial derivatives $$\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)$$
5. use gradient checking to compare $$\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)$$ computed using backpropagation vs. using numerical estimate of gradient of $$J(\Theta)$$.
• Then disable gradient checking code.
6. use gradient descent or an advanced optimization method with backpropagation to try to minimize $$J(\Theta)$$ as a function of the parameters $$\Theta$$

Use a for loop over all training examples, performing forward propagation and backpropagation on each to get the activations $$a^{(l)}$$ and delta terms $$\delta^{(l)}$$ for $$l = 2,\dots,L$$.
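Steps 1-6 can be sketched end to end in NumPy (plain gradient descent, no regularization, one hidden layer; the function name, learning rate, and iteration count are all illustrative choices, not the course's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, s2, alpha=0.5, iters=2000, seed=0):
    """Steps 1-6 above with plain (batch) gradient descent.

    X : (m, n) inputs, Y : (m, K) one-hot labels, s2 : hidden units.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    K = Y.shape[1]
    eps = np.sqrt(6) / (np.sqrt(n) + np.sqrt(s2))
    T1 = rng.uniform(-eps, eps, (s2, n + 1))      # step 1: random init
    T2 = rng.uniform(-eps, eps, (K, s2 + 1))
    A1 = np.hstack([np.ones((m, 1)), X])          # inputs + bias column
    for _ in range(iters):
        Z2 = A1 @ T1.T                            # step 2: forward prop
        A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])
        H = sigmoid(A2 @ T2.T)                    # h_Theta(x) for all examples
        D3 = H - Y                                # step 4: backprop deltas
        D2 = (D3 @ T2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))
        T2 -= alpha * (D3.T @ A2) / m             # step 6: gradient descent
        T1 -= alpha * (D2.T @ A1) / m
    return T1, T2
```

Because every example is processed in one matrix operation per layer, this version replaces the per-example for loop with vectorized forward and backward passes over the whole batch.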