cost function and backpropagation
cost function
\(L\) = total number of layers in the network
\(s_l\) = number of units (not counting the bias unit) in layer \(l\)
two types of classification
- binary classification
\(y = 0 \text{ or } 1\)
one output unit: \(h_{\Theta}(x) \in \mathbb{R}\)
- multi-class classification (\(K\) classes)
\(y \in \mathbb{R}^K\), e.g.: \(\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]\)
\(K\) output units: \(h_{\Theta}(x) \in \mathbb{R}^K\)
cost function
\(h_{\Theta}(x) \in \mathbb{R}^K\)
\((h_{\Theta}(x))_i = i^{th} \text{ output} \)
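The regularized cost function generalizes the logistic regression cost to \(K\) output units (the regularization term sums over all weights except those multiplying the bias units):
\[J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log (h_{\Theta}(x^{(i)}))_k + (1-y_k^{(i)})\log \left(1-(h_{\Theta}(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2\]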
backpropagation
to adjust \(\Theta\) in order to minimize \(J(\Theta)\)
Need to compute:
- \(J(\Theta)\)
- \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)
Gradient computation: backpropagation algorithm
intuition: \(\delta_j^{(l)}\) = “error” of node \(j\) in layer \(l\).
For each output unit (layer \(L=4\))
\(\delta_j^{(4)}=a_j^{(4)} - y_j\)
where \(a_j^{(4)} = (h_{\Theta}(x))_j\)
vectorized : \(\delta^{(4)}=a^{(4)} - y\)
For the hidden layers (e.g. layer 3):
\(\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)})\)
where \(g'(z^{(3)}) = a^{(3)}.*(1-a^{(3)})\), since the sigmoid satisfies \(g'(z) = g(z)(1-g(z))\). There is no \(\delta^{(1)}\): layer 1 is the input and has no error term.
\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}\) (ignoring regularization)
backpropagation algorithm
Training set \(\{(x^{(1)},y^{(1)}), \ldots , (x^{(m)},y^{(m)})\}\)
- Set \(\Delta_{ij}^{(l)} = 0 \) (for all \(l,i,j\)); these accumulators are used to compute \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)
- For \(t=1 \text{ to } m\):
  - Set \(a^{(1)} = x^{(t)}\)
  - Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)
  - Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)}-y^{(t)}\)
  - Compute \(\delta^{(L-1)}, \delta^{(L-2)},\ldots, \delta^{(2)}\)
  - \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\)
  - vectorized: \(\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)
- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)} \text{ if } j \neq 0\)
- \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0\)
\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)
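As a concrete illustration, the accumulation loop above can be sketched in NumPy for an assumed 3-layer network with sigmoid activations (the function and variable names here are ours, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, Y, lam):
    """Compute D1, D2 for a 3-layer network.
    Theta1: (s2, s1+1), Theta2: (K, s2+1), X: (m, s1), Y: (m, K)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        # forward propagation (prepend the bias unit a_0 = 1)
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # output-layer error, then propagate back (bias column dropped)
        d3 = a3 - Y[t]
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Delta^{(l)} += delta^{(l+1)} (a^{(l)})^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    D1, D2 = Delta1 / m, Delta2 / m
    # regularize every column except the bias column (j = 0)
    D1[:, 1:] += lam * Theta1[:, 1:]
    D2[:, 1:] += lam * Theta2[:, 1:]
    return D1, D2
```

Here `np.outer` plays the role of the vectorized update \(\delta^{(l+1)}(a^{(l)})^T\).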
backpropagation intuition
\(\delta_j^{(l)}\) = “error” of cost for \(a_j^{(l)}\) (unit \(j\) in layer \(l\) )
Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i) \text{ for } j \ge 0\), where \(cost(i) = y^{(i)}\log h_{\Theta}(x^{(i)}) + (1-y^{(i)})\log(1-h_{\Theta}(x^{(i)}))\)
implementation practice
unrolling parameters
Neural Network (L=4):
\(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\) - matrices (Theta1, Theta2, Theta3)
\(D^{(1)},D^{(2)},D^{(3)}\) -matrices (D1, D2, D3)
Unroll into vectors
Example
\(s_1 = 10\), \(s_2 = 10\), \(s_3 = 10\), \(s_4 = 1\)
\(\Theta^{(1)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(2)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(3)} \in \mathbb{R}^{1 \times 11}\)
\(D^{(1)} \in \mathbb{R}^{10 \times 11}\), \(D^{(2)} \in \mathbb{R}^{10 \times 11}\), \(D^{(3)} \in \mathbb{R}^{1 \times 11}\)
% thetaVec is a 231 by 1 vector (110 + 110 + 11 elements)
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];
% recover the matrices from the unrolled vector
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
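The same unrolling can be sketched in NumPy (illustrative; Octave's `(:)` stacks columns, so `order='F'` is used to match exactly):

```python
import numpy as np

# hypothetical matrices with the shapes from the example above
Theta1 = np.arange(110.0).reshape(10, 11)
Theta2 = np.arange(110.0, 220.0).reshape(10, 11)
Theta3 = np.arange(220.0, 231.0).reshape(1, 11)

# unroll into one 231-element vector, column by column
thetaVec = np.concatenate([Theta1.ravel(order='F'),
                           Theta2.ravel(order='F'),
                           Theta3.ravel(order='F')])

# recover the matrices (mirrors the Octave reshape calls)
T1 = thetaVec[0:110].reshape((10, 11), order='F')
T2 = thetaVec[110:220].reshape((10, 11), order='F')
T3 = thetaVec[220:231].reshape((1, 11), order='F')
```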
learning algorithm
Start with initial parameters \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\).
Unroll to get initialTheta, which is passed to fminunc(@costFunction, initialTheta, options)
function [jval, gradientVec] = costFunction(thetaVec)
- From thetaVec, get \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\)
- Use forward prop/back prop to compute \(D^{(1)},D^{(2)},D^{(3)}\) and \(J(\Theta)\)
- Unroll \(D^{(1)},D^{(2)},D^{(3)}\) to get gradientVec
gradient checking
Make sure the implementation is correct.
Numerical estimation of gradients:
\(\theta \in \mathbb{R}\)
\(\epsilon\) is a small value like \(10^{-4}\)
\[\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}\]
parameter vector \(\theta\)
\(\theta \in \mathbb{R}^n\), the unrolled version
Can calculate approximate partial derivative of \(J(\theta)\) with respect to each component in \(\theta\)
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = theta(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = theta(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
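The same loop as a self-contained NumPy sketch (`grad_approx` and the quadratic example are ours, for illustration):

```python
import numpy as np

def grad_approx(J, theta, eps=1e-4):
    """Two-sided numerical estimate of the gradient of J at theta
    (theta is the unrolled parameter vector)."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return approx

# example: for J(theta) = sum(theta^2), the true gradient is 2*theta
theta = np.array([1.0, -2.0, 3.0])
estimate = grad_approx(lambda th: np.sum(th ** 2), theta)
```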
Calculate gradApprox and check that \(gradApprox \approx DVec\), where DVec is the gradient computed by backprop.
Implementation
- Implement backprop to compute DVec (unrolled \(D^{(1)},D^{(2)},D^{(3)}\))
- Implement numerical gradient check to compute gradApprox
- Make sure they give similar values
- Turn off gradient checking. Use the backprop code for learning.
important
- Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, your code will be very slow.
random initialization
initial value of \(\Theta\)
Initializing all weights to zero is not a good choice: the units within a layer then compute identical activations and receive identical gradients, so their parameters stay equal to each other after every update.
symmetry breaking
initialize each \(\Theta_{ij}^{(l)}\) to a random value in \([-\epsilon_{init}, \epsilon_{init}]\).
A good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}\), where \(L_{in}\) and \(L_{out}\) are the numbers of input and output units of the layer.
Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;
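A NumPy version of the same initialization (a sketch; `rand_init` is our name, not part of the notes):

```python
import numpy as np

def rand_init(L_in, L_out):
    """Weight matrix for a layer with L_in inputs and L_out outputs,
    drawn uniformly from [-eps_init, eps_init]; the +1 column is the bias."""
    eps_init = np.sqrt(6) / (np.sqrt(L_in) + np.sqrt(L_out))
    return np.random.rand(L_out, L_in + 1) * 2 * eps_init - eps_init

Theta1 = rand_init(10, 10)   # analogous to the Octave line above
```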
putting it together
No. of input units: Dimension of features \(x^{(i)}\)
No. of output units: Number of classes.
Reasonable default: 1 hidden layer, or if more than 1 hidden layer, have same no. of hidden units in every layer (usually the more hidden units the better)
Training a neural network
- randomly initialize weights
- implement forward propagation to get \(h_{\Theta}(x^{(i)}) \) for any \(x^{(i)}\)
- implement code to compute cost function \(J(\Theta)\)
- implement backprop to compute partial derivatives \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\)
- use gradient checking to compare \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\) computed using backpropagation vs. using numerical estimate of gradient of \(J(\Theta)\).
- Then disable gradient checking code.
- use gradient descent or an advanced optimization method with backpropagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\)
Use a for loop over the training examples, performing forward propagation and backpropagation on each example to get the activations \(a^{(l)}\) and delta terms \(\delta^{(l)} \text{ for } l = 2,\ldots,L\).
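For the final minimization step, a bare-bones gradient descent sketch over the unrolled parameter vector (illustrative only; `grad` stands in for the unrolled D vector that backprop would return, and in practice fminunc or another advanced optimizer is preferred):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, iters=200):
    """Minimize J by repeatedly stepping against its gradient.
    grad(theta) would be the unrolled D vector from backprop."""
    theta = theta0.copy()
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# toy example: J(theta) = ||theta - 3||^2 has gradient 2*(theta - 3)
theta_min = gradient_descent(lambda th: 2 * (th - 3.0), np.zeros(2))
```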