# History

Period | Technique |
---|---|

1980 | MLP-bp |

1990 | SVM & RBF |

2000 | SVM, Liquid State Machine |

2006 | Deep learning |

2010 | Deep learning RNN |

2017 | DL+RNN+Reinforcement |

# Learning in MLP

## Supervised learning in MLP

- Recursive technique of stochastic approximation, e.g. bp
- Design of nnet as a curve-fitting (approximation) problem, e.g. RBF

## Curve-fitting

- Finding a surface in a multidimensional space that provides a best fit to the training data
- “Best fit” measure in a certain statistical sense
- RBF is an example

# RBF networks

Has 3 layers:

- Input
- Hidden: nonlinear transformation from input to hidden
- Output: linear activation

## Cover’s theorem

Pattern classification casted in high-dimensional space is more likely to be linear separable than in low-dimensional space.

Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input. Once it’s done, separating hyperplane can be found simply and quickly.

input patterns in -dimensional space.

The inputs belong to either of two sets and : form a dichotomy.

The dichotomy is separable wrt a family of surfaces if a surface exists in the family that separates the points in class from .

Nonlinear mapping to hidden: For each , define an vector, :

A dichotomy is separable if there exists an D vector such that:

\[w^T \phi(x)>0, x\in \mathcal{X}_1\]

\[w^T \phi(x) < 0, x\in \mathcal{X}_2\]

denote the probability that a particular dichotomy picked at random is separable, where the family of surfaces has DOF.

\[P(N,m_1) = \left(\frac{1}{2} \right)^{N-1} \sum_{m=0}^{m_1-1}\left( \begin{array}{c} N-1 \\ m \end{array} \right)\]

where:

\[\left( \begin{array}{c} l \\ m \end{array} \right) = \frac{l!}{m!(l-m)!}\]

Corollary: A maximum of patterns can be linearly separated by a hidden space of -D.

Gaussian function: .

# RBF and Interpolation

input to output So:

Given and labels , find such that

\[F(x_i)=d_i, \text{ for all } i\]

## Interpolation

\[F(x)= \sum_{i=1}^N w_i\phi(\Vert x-x_i\Vert)\]

Or:

\[F(x)= \sum_{i=1}^N w_i\phi_i(x)\]

where is a set of nonlinear functions known as RBF.

The known data points are treated as the center of RBFs.

Vecterized:

\[\mathbf{\phi}^{N \times N} \mathbf{w}^{N \times 1} = \mathbf{d}^{N \times 1}\]

where

When No. hidden units :

\[w = \left( \phi^T \phi\right)^{-1} \phi^T d\]

Note is an rectangular matrix.

## Typocal RBFs

### Multiquatrics (non-local)

\[\phi(r) = (r^2+c^2)^{1/2}\]

### Inverse multiquatrics (local)

\[\phi(r) = (r^2+c^2)^{-1/2}\]

### Gaussian (local)

\[\phi(r) = \exp \left(-\frac{r^2}{2\sigma^2} \right)\]

# Ill-posed vs Well-posed

From domain to , want to construct mapping .

Treating supervised learning as an ill-posed prblem, and using certain prior knowledge (regularization), we can derive RBF.

## Well-posed

- Existence:
- Uniqueness
- Continuity: mapping

# Regularization

Main idea: stabilize the solution by means of prior info.

## Tikonov’s Regularization Theory

Similar to RBF.

Standard error term:

Regularization term:

where is a linear differential operator.

Minimize:

\[\mathcal{E}(F) = \mathcal{E}_s(F) + \lambda \mathcal{E}_c(F)\]

### Solution:

Euler-Lagrange equation.

\[F_{\lambda}(x) = \frac{1}{\lambda} \sum_{i=1}^{N}(d_i - F_{\lambda}(x_i)) G(x,x_i)\]

\[F_{\lambda}(x_j) =\sum_{i=1}^{N}w_i G(x_j,x_i)\]

Vectorized:

\[w = (G+\lambda I)^{-1}d\]

where is invertible while positive definite, ensured by a large

\[F_{\lambda} = G w\]

#### Step:

- Determine
- Find Green’s function matrix associated with
- Obtain weight by

\[w = (G+\lambda I)^{-1}d\]

An example of :

\[G(x,x_i)=\exp \left(-\frac{1}{2\sigma_i^2} \Vert x-x_i \Vert^2\right)\]

The associated is called universal approximator.

# Generalized RBF Networks

Having hidden units.

Minimize:

Solution:

\[(G^T G+\lambda G_0)w=G^T d\]

As approaches 0, we get:

\[w = (G^T G)^{-1} G^T d = G^+ d\]

## Weighted Norm

When individual input lines in the input vector are from different classes.

\[\Vert x \Vert_C^2=(Cx)^T(Cx)=x^T C^T C x\]

Approximation function:

\[F^*(x)=\sum_{i=1}^{m_1} w_i G(\Vert x-t_i \Vert_C)\]

Gaussian RBF:

\[G(\Vert x-t_i \Vert_C) = \exp\left(-\frac{1}{2}(x-t_i)^T C^T C (x-t_i)\right) \\ = \exp\left(-\frac{1}{2}(x-t_i)^T \Sigma^{-1}(x-t_i)\right)\]

is the covariance matrix of a multivariate Gaussian function.

Generalized RBF needs to determine:

- weights
- the RBF centers
- the norm weighting matrix

Regularization networks known RBF centers (same as all the inputs), and only the weights need to be determined.

## Estimating the Parameters

- Weights , discussed before.
- Regularization parameter :
- Minimize averaged squared error
- Use generalized cross-validation

- RBF centers:
- Randomly select fixed centers
- Self-organized selection
- Supervised selection

### Estimating the Regularization Parameter

Minimize average squared error:

For a fixed , for all inputs, calculate the squared error between the true function value and the estimated RBF network output using the . Find the optimal that minimizes this error.

Generalized cross-validation:

Only on the training set, use leave-one-out cross validation.

## RBF learning

Basic idea is to learn in two different time scales:

- Nonlinear, slow learning of the RBF parameters (center, variance)
- Linear, fast learning of the hidden-to-output weights

RBF centers are picked by random from the available inputs .

RBF std is fixed to:

\[\sigma = \frac{d_{max}}{\sqrt{2m_1}}\]

where is the max distance between the chosen centers .

Linear weights can be learned:

\[w= G^+ d = (G^T G)^{-1} G^T d\]

where the matrix ,

### Finding with SVD

SVD: singular value decomposition

For a real matrix , there exists orthogonal matrices:

\[U = [u_1,u_2,…,u_N]\]

\[V = [v_1,v_2,…,v_M]\]

Such that:

\[U^T G V = diag(\sigma_1,\sigma_2,…,\sigma_K) = \Sigma; K=\min(M,N)\]

is left singular matrix, is the right singular matrix, and is the singular values of the matrix .

\[G^+ = V \Sigma^+ U^T\]

where

Properties:

\[U^{-1} = U^T\]

\[V^{-1} = V^T\]

\[\Sigma \Sigma^+ = I\]

\[G G^+ = I\]