2017-03-02

# History

| Period | Technique |
|--------|-----------|
| 1980 | MLP with backpropagation (BP) |
| 1990 | SVM & RBF |
| 2000 | SVM, Liquid State Machine |
| 2006 | Deep learning |
| 2010 | Deep learning, RNN |
| 2017 | DL + RNN + Reinforcement learning |

# Learning in MLP

## Supervised learning in MLP

• Recursive techniques of stochastic approximation, e.g. backpropagation (BP)
• Design of the network as a curve-fitting (approximation) problem, e.g. RBF

## Curve-fitting

• Finding a surface in a multidimensional space that provides a best fit to the training data
• “Best fit” measure in a certain statistical sense
• RBF is an example

# RBF networks

An RBF network has three layers:

• Input
• Hidden: nonlinear transformation from the input space to the hidden space
• Output: linear activation

## Cover’s theorem

Pattern classification cast into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Basic idea: nonlinearly map points in the input space to a hidden space of higher dimension than the input. Once this is done, a separating hyperplane can be found simply and quickly.

$N$ input patterns $\mathcal{X}=\{x_1,x_2,...,x_N\}$ in $m_0$-dimensional space.
The inputs belong to either of two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: form a dichotomy.
The dichotomy is separable wrt a family of surfaces if a surface exists in the family that separates the points in class $\mathcal{X}_1$ from $\mathcal{X}_2$.

Nonlinear mapping to the hidden space: for each $x \in \mathcal{X}$, define an $m_1$-dimensional vector, $m_1 > m_0$: $\phi(x) = [\phi_1(x),...,\phi_{m_1}(x)]^T$

A dichotomy is $\phi$-separable if there exists an $m_1$-dimensional vector $w$ such that:
$w^T \phi(x)>0, x\in \mathcal{X}_1$
$w^T \phi(x) < 0, x\in \mathcal{X}_2$

Let $P(N,m_1)$ denote the probability that a particular dichotomy picked at random is $\phi$-separable, where the family of surfaces has $m_1$ degrees of freedom. Then:

$P(N,m_1) = \left(\frac{1}{2} \right)^{N-1} \sum_{m=0}^{m_1-1}\binom{N-1}{m}$
where:
$\binom{l}{m} = \frac{l!}{m!(l-m)!}$

Corollary: A maximum of $2m_1$ patterns can be linearly separated by a hidden space of $m_1$-D.
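As a quick sanity check of Cover's formula, here is a minimal Python sketch (the function name is mine) that evaluates $P(N, m_1)$ and confirms the corollary's boundary case $P(2m_1, m_1) = 1/2$:

```python
from math import comb

def p_separable(N: int, m1: int) -> float:
    """Probability that a random dichotomy of N points is
    phi-separable with m1 degrees of freedom (Cover's formula)."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

# At N = 2*m1 the probability is exactly 1/2, matching the corollary.
for m1 in (1, 5, 20):
    print(m1, p_separable(2 * m1, m1))  # -> 0.5 each time
```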

Example: the Gaussian function $\phi_i(x) = \exp(-\Vert x-t_i \Vert^2)$, where $t_i$ is the center of the $i$-th hidden unit.

# RBF and Interpolation

The task maps an $m_0$-D input to a $1$-D output. So:
Given $\{x_i \in \mathbb{R}^{m_0} \vert i=1,2,...,N\}$ and $N$ labels $\{d_i \in \mathbb{R} \vert i=1,2,...,N\}$, find $F: \mathbb{R}^{m_0} \rightarrow \mathbb{R}$ such that
$F(x_i)=d_i, \text{ for all } i$

## Interpolation

$F(x)= \sum_{i=1}^N w_i\phi(\Vert x-x_i\Vert)$
Or:
$F(x)= \sum_{i=1}^N w_i\phi_i(x)$

where $\{\phi_i(x)=\phi(\Vert x-x_i\Vert) \vert i = 1,2,...,N\}$ is a set of $N$ nonlinear functions known as radial basis functions (RBFs).
The known data points are treated as the centers of the RBFs.

Vectorized:

$\mathbf{\phi}^{N \times N} \mathbf{w}^{N \times 1} = \mathbf{d}^{N \times 1}$
where $\phi_{ji} = \phi( \Vert \underbrace{x_j}_{\text{input}} - \underbrace{x_i}_{\text{center}} \Vert)$

When the number of hidden units $m_1 < N$ (fewer centers than data points), the system is overdetermined and the least-squares solution is used:
$w = \left( \phi^T \phi\right)^{-1} \phi^T d$

Note $\phi$ is then an $N \times m_1$ rectangular matrix.
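A minimal numpy sketch of exact RBF interpolation under these equations, using the Gaussian basis; the toy data and the kernel width are arbitrary choices of mine:

```python
import numpy as np

def gaussian(r, sigma):
    """Gaussian RBF phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2 * sigma**2))

# Toy 1-D data: N points, each point also serves as an RBF center.
x = np.linspace(0, 1, 10)[:, None]          # inputs, shape (N, m0=1)
d = np.sin(2 * np.pi * x[:, 0])             # targets, shape (N,)

# Interpolation matrix: Phi[j, i] = phi(||x_j - x_i||)
Phi = gaussian(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1), sigma=0.2)

w = np.linalg.solve(Phi, d)                 # solve Phi w = d exactly

# F(x_j) = sum_i w_i phi(||x_j - x_i||) reproduces d at the data points.
assert np.allclose(Phi @ w, d)
```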

## Typical RBFs

### Multiquadrics (non-local)

$\phi(r) = (r^2+c^2)^{1/2}$

### Inverse multiquadrics (local)

$\phi(r) = (r^2+c^2)^{-1/2}$

### Gaussian (local)

$\phi(r) = \exp \left(-\frac{r^2}{2\sigma^2} \right)$
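For reference, the three kernels as one-liners (a sketch; the parameter defaults are arbitrary):

```python
import numpy as np

def multiquadric(r, c=1.0):          # non-local: grows as r -> inf
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):  # local: decays to 0 as r -> inf
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):          # local: decays to 0 as r -> inf
    return np.exp(-r**2 / (2 * sigma**2))
```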

# Ill-posed vs Well-posed

We want to construct a mapping $f$ from a domain $\mathcal{X}$ to a range $\mathcal{Y}$.
Treating supervised learning as an ill-posed problem, and using certain prior knowledge (regularization), we can derive RBF networks.

## Well-posed

• Existence: for every $x \in \mathcal{X}$ there exists an output $y = f(x) \in \mathcal{Y}$
• Uniqueness: $f(x) = f(t)$ if and only if $x = t$
• Continuity: the mapping is continuous, i.e. small changes in $x$ produce small changes in $y$

An ill-posed problem violates at least one of these conditions.

# Regularization

Main idea: stabilize the solution by means of prior information.

## Tikhonov's Regularization Theory

The resulting solution has the same form as an RBF network.

Standard error term:
$\mathcal{E}_s(F) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - F(x_i)\right)^2$

Regularization term:
$\mathcal{E}_c(F) = \frac{1}{2}\Vert DF \Vert^2$

where $D$ is a linear differential operator that embeds prior (smoothness) knowledge.

Minimize:
$\mathcal{E}(F) = \mathcal{E}_s(F) + \lambda \mathcal{E}_c(F)$

### Solution

Minimizing $\mathcal{E}(F)$ via the Euler-Lagrange equation gives, in terms of the Green's function $G(x,x_i)$ associated with $D$:

$F_{\lambda}(x) = \frac{1}{\lambda} \sum_{i=1}^{N}(d_i - F_{\lambda}(x_i))\, G(x,x_i)$

Writing $w_i = \frac{1}{\lambda}(d_i - F_{\lambda}(x_i))$:

$F_{\lambda}(x_j) =\sum_{i=1}^{N}w_i G(x_j,x_i)$

Vectorized:
$w = (G+\lambda I)^{-1}d$
where $G+\lambda I$ is invertible when it is positive definite, which a sufficiently large $\lambda$ ensures.
$F_{\lambda} = G w$

#### Steps

1. Determine $D$
2. Find the Green's function matrix $G$ associated with $D$
3. Obtain the weights $w$ by:

$w = (G+\lambda I)^{-1}d$

An example of $G$:
$G(x,x_i)=\exp \left(-\frac{1}{2\sigma_i^2} \Vert x-x_i \Vert^2\right)$
The associated $F_{\lambda}(x)$ is a universal approximator.
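A minimal sketch of the whole recipe with this Gaussian Green's function; the data, width $\sigma$, and $\lambda$ are placeholders of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(20, 1))             # N inputs in R^{m0}
d = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(20)

sigma, lam = 0.2, 1e-2                          # kernel width, regularizer

# Green's function matrix: G[j, i] = exp(-||x_j - x_i||^2 / (2 sigma^2))
sq_dist = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)
G = np.exp(-sq_dist / (2 * sigma**2))

# Regularized weights: w = (G + lambda I)^{-1} d
w = np.linalg.solve(G + lam * np.eye(len(x)), d)

F = G @ w                                       # fitted values F_lambda(x_j)
```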

# Generalized RBF Networks

Has $m_1 < N$ hidden units with centers $t_i$, so a basis function is no longer placed on every data point.

Minimize:
$\mathcal{E}(F^*) = \sum_{i=1}^{N}\left(d_i - \sum_{j=1}^{m_1} w_j G(\Vert x_i - t_j \Vert)\right)^2 + \lambda \Vert D F^* \Vert^2$

Solution:
$(G^T G+\lambda G_0)w=G^T d$
where $G$ is the $N \times m_1$ matrix $G_{ji} = G(\Vert x_j - t_i \Vert)$ and $G_0$ is the $m_1 \times m_1$ matrix $(G_0)_{ji} = G(\Vert t_j - t_i \Vert)$.

As $\lambda$ approaches 0, we get:
$w = (G^T G)^{-1} G^T d = G^+ d$
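A sketch of the generalized network in the $\lambda \to 0$ limit, with centers taken as a random subset of the inputs (one common simplification):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(50, 1))                 # N = 50 inputs
d = np.sin(2 * np.pi * x[:, 0])

m1, sigma = 10, 0.2
t = x[rng.choice(len(x), size=m1, replace=False)]   # m1 < N centers

# Rectangular design matrix: G[j, i] = gaussian(||x_j - t_i||)
sq = np.sum((x[:, None, :] - t[None, :, :])**2, axis=-1)
G = np.exp(-sq / (2 * sigma**2))                    # shape (N, m1)

w = np.linalg.pinv(G) @ d                           # w = G^+ d  (lambda -> 0)
pred = G @ w                                        # network output at the inputs
```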

## Weighted Norm

Used when the individual elements of the input vector $x$ have different scales or come from different classes.
$\Vert x \Vert_C^2=(Cx)^T(Cx)=x^T C^T C x$
Approximation function:
$F^*(x)=\sum_{i=1}^{m_1} w_i G(\Vert x-t_i \Vert_C)$
Gaussian RBF:
$G(\Vert x-t_i \Vert_C) = \exp\left(-\frac{1}{2}(x-t_i)^T C^T C (x-t_i)\right) = \exp\left(-\frac{1}{2}(x-t_i)^T \Sigma^{-1}(x-t_i)\right)$

where $\Sigma^{-1} = C^T C$, so $\Sigma$ plays the role of the covariance matrix of a multivariate Gaussian.
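A small sketch of the weighted-norm Gaussian for a single center; the matrix $C$ below is an arbitrary illustration:

```python
import numpy as np

def weighted_gaussian(x, t, C):
    """exp(-1/2 (x-t)^T C^T C (x-t)), i.e. a Gaussian under ||.||_C."""
    diff = x - t
    return np.exp(-0.5 * diff @ (C.T @ C) @ diff)

C = np.diag([1.0, 5.0])             # weight the second input dimension more
x = np.array([0.5, 0.1])
t = np.array([0.0, 0.0])
print(weighted_gaussian(x, t, C))   # Sigma^{-1} = C^T C is the precision
```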

Generalized RBF needs to determine:

• weights
• the RBF centers $t_i$
• the norm weighting matrix

Regularization networks have known RBF centers (one per input), so only the weights need to be determined.

## Estimating the Parameters

• Weights $w_i$: discussed before.
• Regularization parameter $\lambda$:
  • Minimize the averaged squared error
  • Use generalized cross-validation
• RBF centers:
  • Randomly select fixed centers
  • Self-organized selection
  • Supervised selection

### Estimating the Regularization Parameter

Minimize average squared error:
For a fixed $\lambda$, calculate the averaged squared error between the true function values and the RBF network outputs over all $N$ inputs; find the optimal $\lambda$ that minimizes this error. (This requires knowing the true function, which is what generalized cross-validation avoids.)

Generalized cross-validation:
Use leave-one-out cross-validation on the training set only.
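A minimal sketch of the leave-one-out procedure, assuming the $N \times N$ matrix $G$ and targets $d$ from the regularization-network sketch above; the $\lambda$ grid is a placeholder:

```python
import numpy as np

def loo_error(G, d, lam):
    """Mean leave-one-out squared error for w = (G + lam I)^{-1} d."""
    N = len(d)
    err = 0.0
    for k in range(N):                      # hold out point k
        keep = np.arange(N) != k
        Gk = G[np.ix_(keep, keep)]
        w = np.linalg.solve(Gk + lam * np.eye(N - 1), d[keep])
        pred = G[k, keep] @ w               # predict the held-out target
        err += (d[k] - pred) ** 2
    return err / N

# pick the lambda with the smallest LOO error from a small grid, e.g.:
# best = min((loo_error(G, d, lam), lam) for lam in [1e-3, 1e-2, 1e-1, 1.0])[1]
```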

## RBF learning

The basic idea is to learn on two different time scales:

• Nonlinear, slow learning of the RBF parameters (center, variance)
• Linear, fast learning of the hidden-to-output weights

RBF centers $t_i$ are picked at random from the available inputs $x_j$.
The RBF standard deviation is fixed to:
$\sigma = \frac{d_{max}}{\sqrt{2m_1}}$
where $d_{max}$ is the maximum distance between the chosen centers $t_i$.

The linear weights can then be learned as before:
$w= G^+ d = (G^T G)^{-1} G^T d$
where the matrix $G = \{g_{ji}\}$, $g_{ji}=G(\Vert x_j - t_i \Vert^2)$
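Putting the two time scales together in a sketch: random fixed centers, the $d_{max}$ width rule, then the linear solve (all data here are toy placeholders of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(100, 2))        # inputs in R^{m0}
d = np.sin(x[:, 0]) * np.cos(x[:, 1])       # toy targets

m1 = 15
t = x[rng.choice(len(x), size=m1, replace=False)]  # random fixed centers

# "Slow" RBF parameter, here set in closed form: shared sigma from d_max
d_max = max(np.linalg.norm(a - b) for a in t for b in t)
sigma = d_max / np.sqrt(2 * m1)

# "Fast" linear learning of the hidden-to-output weights
sq = np.sum((x[:, None, :] - t[None, :, :])**2, axis=-1)
G = np.exp(-sq / (2 * sigma**2))
w = np.linalg.pinv(G) @ d
```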

### Finding $G^+$ with SVD

SVD: singular value decomposition

For a real $N \times M$ matrix $G$, there exist orthogonal matrices:
$U = [u_1,u_2,...,u_N]$
$V = [v_1,v_2,...,v_M]$
such that:
$U^T G V = diag(\sigma_1,\sigma_2,...,\sigma_K) = \Sigma, \quad K=\min(M,N)$

$U$ is the left singular matrix, $V$ is the right singular matrix, and $\sigma_1,\sigma_2,...,\sigma_K$ are the singular values of the matrix $G$.

$G^+ = V \Sigma^+ U^T$
where $\Sigma^+ = diag \left( \frac{1}{\sigma_1},\frac{1}{\sigma_2},...,\frac{1}{\sigma_K}\right)$, reciprocating only the nonzero singular values (zeros stay zero).

Properties:
$U^{-1} = U^T$
$V^{-1} = V^T$
$\Sigma \Sigma^+ = I$ (when all $\sigma_i$ are nonzero)
$G G^+ = I$ (only when $G$ has full row rank)
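A sketch verifying $G^+ = V \Sigma^+ U^T$ against numpy's built-in pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 4))                    # real N x M matrix

U, s, Vt = np.linalg.svd(G, full_matrices=False)   # G = U diag(s) V^T

# G+ = V diag(1/sigma_i) U^T, reciprocating only nonzero singular values
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)
G_pinv = Vt.T @ np.diag(s_inv) @ U.T

assert np.allclose(G_pinv, np.linalg.pinv(G))
```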