TAMU Neural Network 4 Multi-Layer Perceptrons

2017-02-28

Introduction

Networks consist of input, hidden and output layers.

A popular learning algorithm for such networks is the error backpropagation algorithm.

  • Forward pass: activate the network, layer by layer
  • Backward pass: the error signal propagates backward from the output layer to the hidden layers, and from the hidden layers to the input layer.

Activation function

sigmoid:
\[\phi(v)=\frac{1}{1+\exp(-v)}\]
\[\frac{d\phi(v)}{dv} = \phi(v) (1-\phi(v))\]

another common choice, the hyperbolic tangent:
\[\tanh(v) = \frac{1 - \exp(-2v)}{1 + \exp(-2v)}\]
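
Below is a minimal sketch in Python/NumPy of these activation functions and the sigmoid derivative used in the backpropagation derivation; the function names are mine, not from the lecture.

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid: phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_derivative(v):
    """d(phi)/dv = phi(v) * (1 - phi(v))."""
    s = sigmoid(v)
    return s * (1.0 - s)

def tanh_activation(v):
    """tanh(v) = (1 - exp(-2v)) / (1 + exp(-2v)); identical to np.tanh(v)."""
    return (1.0 - np.exp(-2.0 * v)) / (1.0 + np.exp(-2.0 * v))

v = np.linspace(-3.0, 3.0, 7)
print(sigmoid(v))
print(sigmoid_derivative(v))
print(np.allclose(tanh_activation(v), np.tanh(v)))  # True
```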

Backpropagation

Error Gradient

For input-output pairs \((\mathbf{x}_k, d_k)\), where \(y_k\) is the sigmoid unit's output for input \(\mathbf{x}_k\) and \(v_k\) its induced local field:

\[\frac{\partial E}{\partial w_i} = \frac{\partial }{\partial w_i} \left(\frac{1}{2} \sum_k (d_k - y_k)^2\right) = -\sum_k (d_k-y_k) \frac{\partial y_k}{\partial v_k}\frac{\partial v_k}{\partial w_i} = - \sum_k (d_k-y_k) \cdot y_k (1-y_k) \cdot x_{i,k}\]
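
As a quick numerical sanity check of this gradient for a single sigmoid unit, a central-difference approximation should agree with the analytic formula; the toy data and names below are mine.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy data: 4 input-output pairs with 3 inputs each (values chosen arbitrarily).
X = np.array([[ 0.1,  0.5, -0.2],
              [ 0.9, -0.3,  0.4],
              [-0.5,  0.2,  0.8],
              [ 0.3,  0.7, -0.6]])
d = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.2, -0.1, 0.05])

def error(w):
    y = sigmoid(X @ w)
    return 0.5 * np.sum((d - y) ** 2)

# Analytic gradient from the formula above.
y = sigmoid(X @ w)
analytic = -((d - y) * y * (1.0 - y)) @ X

# Central-difference approximation of the gradient.
eps = 1e-6
numeric = np.array([(error(w + eps * e) - error(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])
print(np.allclose(analytic, numeric, atol=1e-8))   # True
```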

Algorithm

Initialize all weights to small random numbers.

Until satisfied, do:

  1. output unit
    • Error: \(\delta_k = y_k (1-y_k)(d_k - y_k)\)
  2. hidden unit
    • Error: \(\delta_j = y_j (1-y_j) \sum_{k \in Downstream(j)} w_{kj}\,\delta_k\)
  3. update each network weight: \(w_{ji} \leftarrow w_{ji} + \Delta w_{ji}\), with \(\Delta w_{ji} = \eta\,\delta_j\,x_i\)

Here \(w_{ji}\) denotes the weight from unit \(i\) to unit \(j\), and \(x_i\) is the input that unit \(i\) feeds through that weight.

Derivation of \(\Delta w_{ji}\) for a hidden unit \(j\):

\[\Delta w_{ji} = - \eta \frac{\partial E}{\partial w_{ji}} \\ = - \eta \frac{\partial E}{\partial v_{j}} \frac{\partial v_{j}}{\partial w_{ji}} \\ = \eta \left[ y_j (1-y_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}\right]x_i\]
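
A compact sketch of the whole procedure for a network with one hidden layer of sigmoid units, using sequential (online) updates; the layer sizes, the XOR toy task, and all names are my own choices, and bias terms are included even though the formulas above omit them.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
eta = 0.5

# One hidden layer of sigmoid units; sizes and the XOR task are for illustration.
n_in, n_hid, n_out = 2, 3, 1
W1 = rng.normal(scale=0.5, size=(n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.5, size=(n_out, n_hid)); b2 = np.zeros(n_out)

def train_step(x, d):
    """One sequential (online) backpropagation update for a single pattern."""
    global W1, b1, W2, b2
    # Forward pass: activate the network layer by layer.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward pass: output-unit error, then hidden-unit error.
    delta_out = y * (1.0 - y) * (d - y)               # delta_k
    delta_hid = h * (1.0 - h) * (W2.T @ delta_out)    # delta_j
    # Weight updates: Delta w_ji = eta * delta_j * x_i (plus bias updates).
    W2 += eta * np.outer(delta_out, h); b2 += eta * delta_out
    W1 += eta * np.outer(delta_hid, x); b1 += eta * delta_hid
    return 0.5 * np.sum((d - y) ** 2)

# XOR: a classic toy problem that needs the hidden layer.
data = [([0., 0.], [0.]), ([0., 1.], [1.]), ([1., 0.], [1.]), ([1., 1.], [0.])]
for epoch in range(10000):
    for x, d in data:
        train_step(np.array(x), np.array(d))
print([round(float(sigmoid(W2 @ sigmoid(W1 @ np.array(x) + b1) + b2)[0]), 2)
       for x, _ in data])   # typically close to [0, 1, 1, 0]
```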

Learning rate and momentum

A momentum term is introduced to overcome the trade-off between a small learning rate (slow convergence) and a large learning rate (oscillation and instability).

\[\Delta w_{ji}(n) = \eta \delta_j(n)y_i (n) + \alpha \Delta w_{ji}(n-1)\]
\[\Delta w_{ji}(n) = - \eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}\]

  • When successive partial derivatives \(\partial E / \partial w_{ji}\) take the same sign: the weight update is accelerated (speeds up descent downhill)
  • When successive partial derivatives take opposite signs: the weight update is damped (stabilizes oscillation)
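
A minimal sketch of this momentum (heavy-ball) update on a toy quadratic cost, writing the update as \(\Delta w(n) = -\eta\,\partial E/\partial w(n) + \alpha\,\Delta w(n-1)\); the cost function and constants are mine.

```python
import numpy as np

def momentum_update(grad, prev_delta, eta=0.1, alpha=0.9):
    """Delta w(n) = -eta * dE/dw(n) + alpha * Delta w(n-1)."""
    return -eta * grad + alpha * prev_delta

# Toy quadratic cost E(w) = ||w||^2, so dE/dw = 2w; minimum at the origin.
w = np.array([0.5, -0.2, 0.1])
delta = np.zeros_like(w)
for step in range(500):
    grad = 2.0 * w
    delta = momentum_update(grad, delta)
    w = w + delta
print(w)   # close to the minimum at the origin
```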

Sequential (online) vs. Batch

  • sequential
    • helps avoid local minima
    • harder to establish theoretical conditions for convergence
  • batch
    • provides an accurate estimate of the gradient vector

Overfitting

  • Monitor both the training-set error and the validation-set error; the training error keeps decreasing while the validation error eventually starts to rise.
  • Early stopping at the minimum of the validation error helps maintain good performance on unobserved samples (a minimal sketch follows this list).
  • Solutions: weight decay, use of validation sets, use of k-fold cross-validation
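
The early-stopping sketch referenced above; `train_one_epoch` and `validation_error` are hypothetical callbacks standing in for whatever training loop and validation split are used.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop once the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                  # one pass over the training set
        err = validation_error()           # error on the held-out validation set
        if err < best_error:
            best_error = err
            epochs_since_improvement = 0   # in practice, also snapshot the weights here
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                      # validation error has stopped improving
    return best_error
```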

Recurrent Networks

  • Recognize sequences
  • Store tree structures
  • Can be trained with plain backpropagation
  • Can represent a stack (push & pop)

Universal approximation theorem

An MLP can be seen as performing a nonlinear input-output mapping.

For a case with one hidden layer:
\[F(x_1,…,x_{m_0})=\sum_{i=1}^{m_1} \alpha_i \phi \left( \sum_{j=1}^{m_0} w_{ij}x_j +b_i\right)\]

For any continuous target function \(f\) and any \(\epsilon > 0\), there exist parameters such that:
\[\vert F(x_1,…,x_{m_0}) - f(x_1,…,x_{m_0})\vert \lt \epsilon\]
for all inputs in the domain.

  • This implies that one hidden layer is sufficient for the approximation, but it may not be optimal (e.g. in terms of learning time or ease of implementation).
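
For intuition, the sum-of-sigmoids form above can already build localized bumps with a single hidden layer; the parameters below are hand-picked by me purely for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def one_hidden_layer(x, alpha, W, b):
    """F(x) = sum_i alpha_i * phi(w_i . x + b_i), the form in the theorem."""
    return alpha @ sigmoid(W @ x + b)

# Hand-picked parameters (mine, for illustration only): the difference of two
# steep sigmoids forms a bump that is roughly 1 on [1, 2] and 0 elsewhere.
alpha = np.array([1.0, -1.0])
W = np.array([[10.0], [10.0]])
b = np.array([-10.0, -20.0])
for x in (0.0, 1.5, 3.0):
    print(x, round(float(one_hidden_layer(np.array([x]), alpha, W, b)), 3))
# prints values close to 0, 1 and 0 respectively
```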

Generalization

Smoothness in the mapping is desired, and this is related to criteria like Occam’s razor.

Affected by 3 factors:

  • The size of the training set, and how representative its samples are.
  • The architecture of the network
  • Physical complexity of the problem

Sample complexity and VC dimension are related.

Heuristics for Accelerating Convergence

  • Separate learning rate for each tunable weight
  • Allow learning rate to be adjusted

Making BP better

  • Sequential update works well for large and highly redundant data sets.
  • Maximization of info content
    • use examples with largest error
    • use examples radically different from those previously used
    • shuffle input order
    • present more difficult inputs
  • Activation function: learning is usually faster with an antisymmetric activation function, e.g. the hyperbolic tangent.
  • Target values: choose values within the range of the sigmoid activation function.
  • Normalize the inputs
  • Random Initialization
  • Learning from hints
  • Learning rates
    • learning rates in the hidden layers should be higher than in the output layer
    • small rates for neurons with many inputs

Optimization Problem

Derive the update rule from a Taylor series expansion of the average cost around the current weight vector:
\[\mathcal{E}_{av}(w(n)+\Delta w(n)) \approx \mathcal{E}_{av}(w(n)) + g^T(n)\,\Delta w(n) + \frac{1}{2}\Delta w^T(n)\,\mathbf{H}(n)\,\Delta w(n)\]
where \(g(n)\) is the gradient and \(\mathbf{H}(n)\) the Hessian of the cost.

First order

\[\Delta w(n) = - \eta g(n)\]

Second order

Newton’s method (quadratic approx)

\[\Delta w^*(n) = - \mathbf{H}^{-1} g(n)\]

Limitations:

  • computing (and inverting) the Hessian is expensive
  • the Hessian must be nonsingular (in fact positive definite), which is rarely guaranteed in practice
  • convergence is not guaranteed if the cost function is not well approximated by a quadratic
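
To make the two update rules concrete, here is a sketch comparing one steepest-descent step with one Newton step on a toy quadratic cost; the matrix \(A\), vector \(b\), and step size are my own choices.

```python
import numpy as np

# Toy quadratic cost E(w) = 0.5 * w^T A w - b^T w, with A symmetric positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def gradient(w):
    return A @ w - b            # g(n)

def hessian(w):
    return A                    # constant Hessian for a quadratic cost

w = np.zeros(2)

# First-order step (steepest descent): Delta w = -eta * g(n).
eta = 0.1
w_gd = w - eta * gradient(w)

# Second-order step (Newton): Delta w = -H^{-1} g(n); exact for a quadratic.
w_newton = w - np.linalg.solve(hessian(w), gradient(w))
print(w_newton, np.linalg.solve(A, b))   # identical: one Newton step reaches the minimum
```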

Conjugate gradient method

The conjugate gradient method overcomes the above problems.

Conjugate vectors: given a symmetric positive-definite matrix \(A\), a set of nonzero vectors \(s(0), s(1), …, s(W-1)\) is A-conjugate if
\[s^T(n)A s(j)=0 \mbox{ for all } n \ne j\]

The square root of the matrix \(A\) is the symmetric matrix \(A^{1/2}\) satisfying \(A = A^{1/2}A^{1/2}\), so that \(\left(A^{1/2}\right)^T = A^{1/2}\).

Given any two A-conjugate vectors \(x_i\) and \(x_j\) with \(i \ne j\),
\[x_i^T A x_j=\left(A^{1/2} x_i\right)^T A^{1/2} x_j = 0\]

So transforming each (nonzero) vector \(x_i\) into
\[v_i = A^{1/2} x_i\]
results in vectors that are mutually orthogonal; they form a basis spanning the vector space.

Conjugate Direction Method

For a quadratic error function \(f(x)=\frac{1}{2}x^T A x - b^T x + c\), update
\[x(n+1)=x(n)+\eta(n)s(n), \quad n=0,1,…,W-1\]
where \(x(0)\) is an arbitrary initial vector and \(\eta(n)\) is defined by
\[f(x(n)+\eta(n)s(n)) = \min\limits_{\eta} f(x(n)+\eta s(n))\]

Line search:
\[\eta(n) = - \frac{s^T(n)A \epsilon(n)}{s^T(n)A s(n)}\]
where \(\epsilon(n) = x(n) - x^*\) is the error vector relative to the minimizer \(x^*\). But \(\epsilon(n)\) is not known in practice, so \(\eta(n)\) is instead found by a numerical line search.

For each iteration:
\[x(n+1) = \arg \min\limits_{x \in \mathcal{D}_n} f(x)\]
where \(\mathcal{D}_n\) is spanned by the A-conjugate vectors \(s(0), …, s(n)\).

Conjugate Gradient Method

Residual:
\[r(n)=b-Ax(n)\]
Recursive step to get conjugate vectors:
\[s(n) = r(n)+\beta(n) s(n-1), \quad n=1,…,W-1, \qquad s(0)=r(0)\]
where the scaling factor \(\beta(n)\) is determined as
\[\beta(n) = - \frac{s^T(n-1)A r(n)}{s^T(n-1)A s(n-1)}\]

This requires knowing \(A\). But by using the Polak-Ribière formula
\[\beta(n) = \frac{r^T(n)\left(r(n)-r(n-1)\right)}{r^T(n-1)\,r(n-1)}\]
or the Fletcher-Reeves formula
\[\beta(n) = \frac{r^T(n)\,r(n)}{r^T(n-1)\,r(n-1)}\]
we can evaluate \(\beta(n)\) based only on the residuals.

Use a line search to estimate \(\eta(n)\); this can be achieved with inverse parabolic approximation (interpolation), which involves two phases:

  • Bracketing phase
  • Sectioning phase
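
A sketch of the resulting conjugate gradient iteration for a quadratic cost \(f(x)=\frac{1}{2}x^T A x - b^T x\), with \(\beta(n)\) computed from residuals only via the Fletcher-Reeves ratio. For a quadratic, the line-search step \(\eta(n)\) has the closed form used below, so the bracketing/sectioning phases are not needed; the function name and the toy \(A\), \(b\) are mine.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                          # residual r(n) = b - A x(n)
    s = r.copy()                           # first direction: s(0) = r(0)
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        As = A @ s
        eta = (r @ r) / (s @ As)           # exact line-search step for a quadratic
        x = x + eta * s
        r_new = r - eta * As               # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)   # Fletcher-Reeves: residuals only, no A
        s = r_new + beta * s               # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # both give the same minimizer
```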
