Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Stanford ML 5 Neural Networks Learning

2017-01-09

cost function and backpropagation

cost function

L = total no. of layers in network.
\(S_l\) is no. of units (not counting bias unit) in layer \(l\)

two classification

  1. binary classification
    \(y = 0 \text{ or } 1\)
    one output unit \(h_{\Theta} \in \mathbb{R}\) in output layer
  2. multi-class classification (K class)
    \(y \in \mathbb{R}^K\), E.g.: \(\left[ \begin{array} {c} 1\\0\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\1\\0\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\1\\0 \end{array} \right]\), \(\left[ \begin{array} {c} 0\\0\\0\\1 \end{array} \right]\)
    K output units \(h_{\Theta} \in \mathbb{R}^K\)

cost function

\(h_{\Theta} \in \mathbb{R}^K\)
\((h_{\Theta}(x))_i = i^{th} \text{ output} \)

backpropagation

to adjust \(\Theta\) in order to minimize \(J(\Theta)\)

Need to compute:

  • \(J(\Theta)\)
  • \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\)

Gradient computation: backpropagation algorithm

intuition: \(\delta_j^{(l)}\) = “error” of node \(j\) in layer \(l\).

For each output unit (layer \(L=4\))
\(\delta_j^{(4)}=a_j^{(4)} - y_j\)
where \(a_j^{(4)} = (h_{\Theta}(x))_j\)
vectorized : \(\delta^{(4)}=a^{(4)} - y\)

For other layer:


where \(g’(z^{(3)}) = a^{(3)}.*(1-a^{(3)})\)

\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = \frac{1}{m} \sum_{t=1}^{m}a_j^{(t)(l)}\delta_i^{(t)(l+1)}\)

backpropagation algorithm

Training set \({ (x^{(1)},y^{(1)}), … , (x^{(m)},y^{(m)}) }\)

  1. Set \(\Delta_{ij}^{(l)} = 0 \), (for all \(l,i,j\)). (used to compute \(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)\))
  2. For \(i=1 \text{ to } m\):
    • Set \(a^{(1)} = x^{(i)}\)
    • Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,…,L\)
    • Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)}-y^{(i)}\)
    • Compute \(\delta^{(L-1)}, \delta^{(L-2)},…, \delta^{(2)}\)
    • \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}\)
      • vectorized: \(\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)
    • \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)} \text{ if } j \neq 0\)
    • \(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \text{ if } j = 0\)

\(\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)

backpropagation intuition

\(\delta_j^{(l)}\) = “error” of cost for \(a_j^{(l)}\) (unit \(j\) in layer \(l\) )
Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i) \text{ for } j \ge 0\)
E.g.:

implementatin practice

unrolling parameters

Neural Network (L=4):

\(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\) - matrices (Theta1, Theta2, Theta3)
\(D^{(1)},D^{(2)},D^{(3)}\) -matrices (D1, D2, D3)
Unroll into vectors

Example

\(s_1 = 10\), \(s_2 = 10\), \(s_3 = 1\)
\(\Theta^{(1)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(2)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(3)} \in \mathbb{R}^{1 \times 11}\)
\(D^{(1)} \in \mathbb{R}^{10 \times 11}\), \(D^{(2)} \in \mathbb{R}^{10 \times 11}\), \(D^{(3)} \in \mathbb{R}^{1 \times 11}\)

% thetaVec is a 232 by 1 vector
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11);

learning algorithm

Having initial parameters \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\)
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options)

function [jval, gradientVec] = costFunction(thetaVec)

  1. From thetaVec, get \(\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}\).
  2. Use forward prop/back prop to compute \(D^{(1)},D^{(2)},D^{(3)}\) and \(J(\Theta)\)
  3. Unroll \(D^{(1)},D^{(2)},D^{(3)}\) to get gradientVec.

gradient checking

Make sure the implementation is correct.

Numerical estimation of gradients:

\(\theta \in \mathbb{R}\)
\(\epsilon\) is a small value like \(10^{-4}\)

\[\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}\]

parameter vector \(\theta\)

\(\theta \in \mathbb{R}^n\), the unrolled version
Can calculate approximate partial derivative of \(J(\theta)\) with respect to each component in \(\theta\)

for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) = theta(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = theta(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / 2 / EPSILON;

Calculate and check that \(gradApprox \approx DVec\). DVec is from back prop.

Implementation

  • implement backprop to compute DVec (unrolled \(D^{(1)},D^{(2)},D^{(3)}\) ).
  • Implement numerical gradient check to compute gradApprox.
  • Make sure they give similar values.
  • Turn off gradient checking. Using backprop code for learning.

important

  • be sure to disable your gradient checking code before training your classifier, If you run numerical gradient computation on every iteration of gradient descent it will be slow.

random initialization

initial value of \(\Theta\)

All zero is not a good choice. Some components of the same layer will be equal to each other.

symmetry breaking

initialize each \(\Theta_{ij}^{(l)}\) to a random value in \([-\epsilon_{init}, \epsilon_{init}]\).

Good choice is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in}}+\sqrt{L_{out}}}\). It depends on number of input and out units in the layers.

Theta1 = rand(10,11)*2*INIT_EPSILON - INIT_EPSILON;

putting it together

No. of input units: Dimension of features \(x^{(i)}\)
No. of output units: Number of classes.
Reasonable default: 1 hidden layer, or if more than 1 hidden layer, have same no. of hidden units in every layer (usually the more hidden units the better)

Training a neural network

  1. randomly initialize weights
  2. implement forward propagation to get \(h_{\Theta}(x^{(i)}) \) for any \(x^{(i)}\)
  3. implement code to compute cost function \(J(\Theta)\)
  4. implement backprop to compute partial derivatives \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\)
  5. use gradient checking to compare \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\) computed using backpropagation vs. using numerical estimate of gradient of \(J(\Theta)\).
    • Then disable gradient checking code.
  6. using gradient descent or advanced optimization method with backpropogation to try to minimize \(J(\Theta)\) as a function of parameter \(\Theta\)

Use a for loop to go over all training examples, where performing forward propagation and back propagation (get activations \(a^{(l)}\) and delta terms \(\delta^{(l)} \text{ for } l = 2,…,L\) ).


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2025. All rights reserved by melonskin. Powered by Jekyll.