# Introduction

Multilayer networks consist of an input layer, one or more hidden layers, and an output layer.

A popular learning algorithm is the **error backpropagation algorithm**.

- Forward pass: activate the network, layer by layer
- Backward pass: the error signal propagates backward, from the output layer to the hidden layers and from the hidden layers to the input layer

## Activation function

The sigmoid (logistic) function:

\[\phi(v)=\frac{1}{1+\exp(-v)}\]

\[\frac{d\phi(v)}{dv} = \phi(v) (1-\phi(v))\]

Another common choice is the hyperbolic tangent:

\[\tanh(v) = \frac{1 - \exp(-2v)}{1 + \exp(-2v)}\]
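
A minimal sketch of these activation functions and the derivative identities above (NumPy assumed; names are my own):

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid: phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    """Derivative expressed through the output: phi(v) * (1 - phi(v))."""
    y = sigmoid(v)
    return y * (1.0 - y)

def tanh_prime(v):
    """Derivative of tanh(v): 1 - tanh(v)^2."""
    return 1.0 - np.tanh(v) ** 2

v = np.linspace(-4, 4, 5)
print(sigmoid(v), sigmoid_prime(v), tanh_prime(v))
```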

# Backpropagation

## Error Gradient

For input-output pairs \((x_k, d_k)\) with network outputs \(y_k\):

\[\begin{aligned}\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i}\left(\frac{1}{2}\sum_k (d_k - y_k)^2\right)\\ &= -\sum_k (d_k - y_k)\,\frac{\partial y_k}{\partial v_k}\,\frac{\partial v_k}{\partial w_i}\\ &= -\sum_k (d_k - y_k)\, y_k (1-y_k)\, x_{i,k}\end{aligned}\]
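
As a hedged illustration of this gradient for a single sigmoid unit (the names `w`, `x`, `d` are my own assumptions; one input pattern per row of `x`):

```python
import numpy as np

def single_unit_gradient(w, x, d):
    """Gradient of E = 1/2 * sum_k (d_k - y_k)^2 for one sigmoid unit.

    w: (n,) weights, x: (K, n) inputs, d: (K,) targets.
    Implements dE/dw_i = -sum_k (d_k - y_k) * y_k * (1 - y_k) * x_{i,k}.
    """
    v = x @ w                        # induced local fields v_k
    y = 1.0 / (1.0 + np.exp(-v))     # sigmoid outputs y_k
    delta = (d - y) * y * (1.0 - y)  # per-example error terms
    return -x.T @ delta              # dE/dw, one entry per weight
```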

## Algorithm

Initialize all weights to small random numbers.

Until satisfied, repeat for each training example:

- For each output unit \(k\), compute its error term:
  \[\delta_k = y_k (1-y_k)(d_k - y_k)\]
- For each hidden unit \(j\), compute its error term:
  \[\delta_j = y_j (1-y_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}\]
- Update each network weight: \(w_{ji} \leftarrow w_{ji} + \Delta w_{ji}\)

The weight update is defined as \(\Delta w_{ji} = \eta\, \delta_j\, x_i\), where \(x_i\) is the input that weight \(w_{ji}\) multiplies.

## Derivation of \(\Delta w_{ji}\) for a hidden unit

\[\begin{aligned}\Delta w_{ji} &= -\eta\, \frac{\partial E}{\partial w_{ji}}\\ &= -\eta\, \frac{\partial E}{\partial v_j}\, \frac{\partial v_j}{\partial w_{ji}}\\ &= \eta\left[y_j (1-y_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}\right] x_i\end{aligned}\]
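
Putting the algorithm and the derivation together, here is a minimal sketch of one online backpropagation step for a single-hidden-layer network of sigmoid units (the shapes and names `W1`, `W2`, `eta` are assumptions, not from the original notes):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W1, W2, x, d, eta=0.1):
    """One online update for a network x -> hidden -> output (all sigmoid).

    W1: (n_hidden, n_in), W2: (n_out, n_hidden), x: (n_in,), d: (n_out,).
    Returns the updated weight matrices.
    """
    # Forward pass: activate the network layer by layer.
    h = sigmoid(W1 @ x)          # hidden activations y_j
    y = sigmoid(W2 @ h)          # output activations y_k

    # Backward pass: local error terms (deltas), output layer first.
    delta_out = y * (1.0 - y) * (d - y)             # delta_k
    delta_hid = h * (1.0 - h) * (W2.T @ delta_out)  # delta_j = y_j(1-y_j) sum_k delta_k w_kj

    # Weight updates: Delta w_ji = eta * delta_j * x_i.
    W2 = W2 + eta * np.outer(delta_out, h)
    W1 = W1 + eta * np.outer(delta_hid, x)
    return W1, W2
```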

## Learning rate and momentum

Momentum is introduced to overcome the trade-off in choosing the learning rate: a rate that is too small makes learning slow, while a rate that is too large causes oscillation and instability.

\[\Delta w_{ji}(n) = \eta \delta_j(n)y_i (n) + \alpha \Delta w_{ji}(n-1)\]

\[\Delta w_{ji}(n) = - \eta \sum_{t=0}^{n} \alpha^{n-t}\, \frac{\partial E(t)}{\partial w_{ji}(t)}\]

- When successive partial derivatives \(\partial E / \partial w_{ji}\) have the same sign: the weight update is accelerated (speeds up descent downhill)
- When successive partial derivatives alternate in sign: the weight update is damped (stabilizes oscillation)
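
A small sketch of the momentum update above (the previous step is carried along in `dW_prev`; the names are assumptions):

```python
import numpy as np

def momentum_update(W, grad, dW_prev, eta=0.1, alpha=0.9):
    """Delta w(n) = -eta * dE/dw(n) + alpha * Delta w(n-1)."""
    dW = -eta * grad + alpha * dW_prev
    return W + dW, dW  # keep dW to use as dW_prev in the next iteration
```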

# Sequential (online) vs. Batch

- Sequential (online) mode
  - helps avoid local minima
  - theoretical convergence is hard to establish
- Batch mode
  - provides an accurate estimate of the gradient
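
A hedged sketch contrasting the two modes (a per-example `grad` function is assumed here, it is not defined in the notes):

```python
import numpy as np

def sequential_epoch(w, xs, ds, grad, eta=0.05):
    """Online mode: update after every (x, d) pair, in shuffled order."""
    order = np.random.permutation(len(xs))
    for k in order:
        w = w - eta * grad(w, xs[k], ds[k])
    return w

def batch_epoch(w, xs, ds, grad, eta=0.05):
    """Batch mode: accumulate the gradient over the whole set, then update once."""
    g = sum(grad(w, x, d) for x, d in zip(xs, ds))
    return w - eta * g / len(xs)
```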

# Overfitting

- Monitor both the training-set error and the validation-set error: training error keeps decreasing, while validation error eventually starts to rise.
- Early stopping (halting training when the validation error starts to increase) helps maintain good performance on unobserved samples (a minimal loop is sketched after this list).
- Solutions: weight decay, use of a validation set, and k-fold cross-validation.
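
A minimal early-stopping loop under these assumptions (the `train_epoch` and `validation_error` callables are hypothetical placeholders):

```python
def train_with_early_stopping(w, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_w, best_err, bad_epochs = w, float("inf"), 0
    for _ in range(max_epochs):
        w = train_epoch(w)               # returns the updated weights
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, bad_epochs = w, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_w  # weights at the point of lowest validation error
```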

# Recurrent Networks

- Sequence recognition
- Store tree structure
- Can be trained with plain backpropagation
- Represent a stack (push & pop)

# Universal approximation theorem

An MLP can be viewed as performing a nonlinear input-output mapping.

For the case of a single hidden layer with \(m_1\) hidden units and \(m_0\) inputs:

\[F(x_1,…,x_{m_0})=\sum_{i=1}^{m_1} \alpha_i \phi \left( \sum_{j=1}^{m_0} w_{ij}x_j +b_i\right)\]

The theorem states that, for any continuous target function \(f\) and any \(\epsilon > 0\), the parameters can be chosen so that

\[\vert F(x_1,…,x_{m_0}) - f(x_1,…,x_{m_0})\vert \lt \epsilon\]

for all inputs in the domain.

- This implies that one hidden layer is sufficient, but it may not be optimal.
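
A small sketch that simply evaluates the one-hidden-layer mapping \(F\) above for given parameters (the parameter values here are arbitrary, for illustration only):

```python
import numpy as np

def F(x, W, b, alpha):
    """F(x) = sum_i alpha_i * phi( sum_j w_ij x_j + b_i ) with a sigmoid phi.

    x: (m0,), W: (m1, m0), b: (m1,), alpha: (m1,).
    """
    phi = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return alpha @ phi

# Example: m0 = 2 inputs, m1 = 3 hidden units, random parameters.
rng = np.random.default_rng(0)
print(F(np.array([0.5, -1.0]), rng.normal(size=(3, 2)),
        rng.normal(size=3), rng.normal(size=3)))
```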

# Generalization

Smoothness in the mapping is desired, and this is related to criteria like Occam’s razor.

Generalization is affected by three factors:

- The size of the training set, and how representative it is
- The architecture of the network
- Physical complexity of the problem

The sample complexity needed for good generalization is related to the VC dimension of the network.

# Heuristics for Accelerating Convergence

- Use a separate learning rate for each tunable weight
- Allow each learning rate to be adjusted from one iteration to the next

## Making BP better

- Sequential updating works well for large and highly redundant data sets.
- Maximization of information content:
  - use examples with the largest error
  - use examples radically different from those previously used
  - shuffle the input order
  - present more difficult inputs

- Activation function: learning is usually faster with an antisymmetric activation function, e.g. the hyperbolic tangent.
- Target values: choose target values within the range of the sigmoid activation function.
- Normalize the inputs (mean removal and variance scaling; see the sketch after this list).
- Random initialization of the weights
- Learning from hints
- Learning rates:
  - the hidden layers' rates should be higher than the output layer's
  - use smaller rates for neurons with many inputs
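
As an illustration of the input-normalization heuristic above, a sketch that removes the mean and scales each input variable to unit variance (decorrelation, e.g. via PCA, is omitted for brevity; all names are my own):

```python
import numpy as np

def normalize_inputs(X):
    """Shift each input variable to zero mean and scale to unit variance.

    X: (num_examples, num_inputs). Returns the normalized copy plus the
    statistics needed to apply the same transform to test data.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0            # guard against constant inputs
    return (X - mean) / std, mean, std
```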

# Optimization Problem

The update rule can be derived from a Taylor series expansion of the average cost around the current weight vector \(w(n)\):

\[\mathcal{E}_{av}(w(n)+\Delta w(n)) \approx \mathcal{E}_{av}(w(n)) + g^T(n)\,\Delta w(n) + \tfrac{1}{2}\,\Delta w^T(n)\,\mathbf{H}(n)\,\Delta w(n)\]

where \(g(n)\) is the gradient and \(\mathbf{H}(n)\) is the Hessian of \(\mathcal{E}_{av}\) evaluated at \(w(n)\).

## First order

\[\Delta w(n) = - \eta g(n)\]

## Second order

Newton’s method (quadratic approx)

\[\Delta w^*(n) = - \mathbf{H}^{-1} g(n)\]

Limitations:

- computing the Hessian and its inverse is expensive
- the Hessian must be nonsingular (and positive definite), which is rarely guaranteed in practice
- convergence is not guaranteed if the cost function is non-quadratic
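
A hedged comparison of the two updates on a small quadratic cost (the particular matrix and vector are arbitrary examples):

```python
import numpy as np

# Quadratic cost E(w) = 1/2 w^T H w - b^T w, so g(w) = H w - b.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w = np.zeros(2)
g = H @ w - b

# First order (gradient descent): Delta w = -eta * g.
w_gd = w - 0.1 * g

# Second order (Newton): Delta w = -H^{-1} g; exact minimizer for a quadratic.
w_newton = w - np.linalg.solve(H, g)

print(w_gd, w_newton)  # Newton lands on H^{-1} b in a single step
```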

## Conjugate gradient method

The conjugate gradient method overcomes these problems.

**Conjugate vectors**: Given a \(W \times W\) symmetric positive-definite matrix \(A\), a set of nonzero vectors \(s(0), s(1), \dots, s(W-1)\) is A-conjugate if

\[s^T(n)\,A\,s(j)=0 \mbox{ for all } n \ne j\]

The square root of the matrix \(A\) is the matrix \(A^{1/2}\) satisfying \(A^{1/2}A^{1/2} = A\).

Since \(A\) is symmetric, \(\left(A^{1/2}\right)^T = A^{1/2}\).

Given any pair of A-conjugate vectors \(x_i\) and \(x_j\),

\[x_i^T A x_j=\left(A^{1/2} x_i\right)^T A^{1/2} x_j = 0\]

So transform each (nonzero) vector \(x_i\) into:

\[v_i = A^{1/2} x_i\]

The resulting vectors are mutually orthogonal, so they can form a basis that spans the vector space.
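
A small numerical sketch of this argument: build \(A^{1/2}\) for a symmetric positive-definite \(A\) and check that A-conjugate vectors become orthogonal after the transform (the matrix and vectors are arbitrary examples):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])      # symmetric positive definite

# Matrix square root via the eigendecomposition A = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(A)
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T    # A_half @ A_half == A

x_i = np.array([1.0, 0.0])
x_j = np.array([-1.0, 4.0])                 # chosen so that x_i^T A x_j = 0

v_i, v_j = A_half @ x_i, A_half @ x_j
print(x_i @ A @ x_j, v_i @ v_j)             # both (numerically) zero: conjugacy <-> orthogonality
```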

### Conjugate Direction Method

For a quadratic error function \(f(x)=\frac{1}{2}x^T A x - b^T x\), the conjugate direction method generates the sequence

\[x(n+1)=x(n)+\eta(n)s(n), \quad n=0,1,…,W-1\]

where \(x(0)\) is an arbitrary initial vector and \(\eta(n)\) is defined by the line minimization

\[f(x(n)+\eta(n)s(n)) = \min\limits_{\eta} f(x(n)+\eta\, s(n))\]

Line search:

\[\eta(n) = - \frac{s^T(n)A \epsilon(n)}{s^T(n)A s(n)}\]

where \(\epsilon(n) = x(n) - x^*\) is the error vector and \(x^*\) is the minimizer of \(f\).

But the method achieves more than a sequence of one-dimensional minimizations.

For each iteration:

\[x(n+1) = \arg \min\limits_{x \in \mathcal{D}_n} f(x)\]

where \(\mathcal{D}_n\) is the subspace spanned by the A-conjugate vectors \(s(0), s(1), \dots, s(n)\).

### Conjugate Gradient Method

Residual:

\[r(n)=b-Ax(n)\]

Recursive step to get conjugate vectors:

\[s(n) = r(n)+\beta(n) s(n-1), \quad n=1,2,…,W-1, \quad s(0)=r(0)\]

where the scaling factor \(\beta(n)\) is determined as

\[\beta(n) = - \frac{s^T(n-1)\,A\, r(n)}{s^T(n-1)\,A\, s(n-1)}\]

This still requires knowledge of \(A\). However, by using the Polak-Ribière formula or the Fletcher-Reeves formula, \(\beta(n)\) can be evaluated based only on the residuals.
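
For reference, the two residual-only formulas in their standard textbook forms (not reproduced in the original notes), using the residual \(r(n)=b-Ax(n)\) defined above:

\[\beta_{FR}(n) = \frac{r^T(n)\, r(n)}{r^T(n-1)\, r(n-1)}, \qquad \beta_{PR}(n) = \frac{r^T(n)\,\left(r(n)-r(n-1)\right)}{r^T(n-1)\, r(n-1)}\]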

The step size \(\eta(n)\) is also estimated with a line search (without using \(A\)), which can be carried out by inverse parabolic interpolation in two phases:

- Bracketing phase
- Sectioning phase
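
A hedged sketch of the method for the quadratic case, using only residuals (Fletcher-Reeves form of \(\beta\); for a non-quadratic cost the step size would instead come from the bracketing/sectioning line search above):

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10):
    """Minimize f(x) = 1/2 x^T A x - b^T x for symmetric positive-definite A."""
    x = np.zeros_like(b) if x is None else x
    r = b - A @ x                            # residual r(0)
    s = r.copy()                             # first direction s(0) = r(0)
    for _ in range(len(b)):                  # at most W steps in exact arithmetic
        eta = (r @ r) / (s @ (A @ s))        # exact line search along s
        x = x + eta * s
        r_new = r - eta * (A @ s)            # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)     # Fletcher-Reeves scaling
        s = r_new + beta * s                 # next A-conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))  # the two should agree
```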