History
| Period | Technique |
|---|---|
| 1980 | MLP with backpropagation (BP) |
| 1990 | SVM & RBF |
| 2000 | SVM, Liquid State Machine |
| 2006 | Deep learning |
| 2010 | Deep learning, RNN |
| 2017 | DL + RNN + Reinforcement Learning |
Learning in MLP
Supervised learning in MLP
- Recursive technique of stochastic approximation, e.g. backpropagation (BP)
- Design of nnet as a curve-fitting (approximation) problem, e.g. RBF
Curve-fitting
- Finding a surface in a multidimensional space that provides a best fit to the training data
- “Best fit” measure in a certain statistical sense
- RBF is an example
RBF networks
An RBF network has 3 layers:
- Input
- Hidden: nonlinear transformation from input to hidden
- Output: linear activation
Cover’s theorem
Pattern classification cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input. Once that is done, a separating hyperplane can be found simply and quickly.
Consider $N$ input patterns $\{x_i\}_{i=1}^N$ in $m_0$-dimensional space.
The inputs belong to either of two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: they form a dichotomy.
The dichotomy is separable w.r.t. a family of surfaces if a surface exists in the family that separates the points in class $\mathcal{X}_1$ from those in $\mathcal{X}_2$.
Nonlinear mapping to hidden: for each input $x$, define an $m_1$-dimensional vector $\phi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_{m_1}(x)]^T$.
A dichotomy is $\phi$-separable if there exists an $m_1$-dimensional vector $w$ such that:
\[w^T \phi(x)>0, x\in \mathcal{X}_1\]
\[w^T \phi(x) < 0, x\in \mathcal{X}_2\]
Let $P(N, m_1)$ denote the probability that a particular dichotomy of $N$ patterns picked at random is $\phi$-separable, where the family of surfaces has $m_1$ degrees of freedom (DOF):
\[P(N,m_1) = \left(\frac{1}{2} \right)^{N-1} \sum_{m=0}^{m_1-1}\left( \begin{array}{c} N-1 \\ m \end{array} \right)\]
where:
\[\left( \begin{array}{c} l \\ m \end{array} \right) = \frac{l!}{m!(l-m)!}\]
Corollary: a maximum of $2m_1$ patterns can be linearly separated by a hidden space of $m_1$ dimensions.
Gaussian function: $\varphi(x) = \exp\left(-\frac{\Vert x - t\Vert^2}{2\sigma^2}\right)$, where $t$ is a center.
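To make the counting formula concrete, here is a minimal sketch in plain Python (the function name `p_separable` is just an illustrative choice) that evaluates $P(N, m_1)$ and illustrates the corollary: at $N = 2m_1$ the probability is exactly 1/2.

```python
from math import comb

def p_separable(N, m1):
    """P(N, m1) = (1/2)^(N-1) * sum_{m=0}^{m1-1} C(N-1, m)."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

m1 = 5
for N in (5, 10, 15, 20):
    # At N = 2 * m1 = 10 the probability equals 0.5, as the corollary states.
    print(N, round(p_separable(N, m1), 4))
```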
RBF and Interpolation
The map is from an $m_0$-dimensional input space to a 1-dimensional output space. So:
Given inputs $\{x_i \in \mathbb{R}^{m_0} \mid i = 1, \ldots, N\}$ and labels $\{d_i \in \mathbb{R} \mid i = 1, \ldots, N\}$, find $F: \mathbb{R}^{m_0} \to \mathbb{R}$ such that
\[F(x_i)=d_i, \text{ for all } i\]
Interpolation
\[F(x)= \sum_{i=1}^N w_i\phi(\Vert x-x_i\Vert)\]
Or:
\[F(x)= \sum_{i=1}^N w_i\phi_i(x)\]
where $\{\phi_i(x) = \phi(\Vert x - x_i \Vert) \mid i = 1, \ldots, N\}$ is a set of nonlinear functions known as radial basis functions (RBFs).
The known data points are treated as the center of RBFs.
Vectorized:
\[\mathbf{\phi}^{N \times N} \mathbf{w}^{N \times 1} = \mathbf{d}^{N \times 1}\]
where $\phi_{ji} = \phi(\Vert x_j - x_i \Vert)$.
When the number of hidden units $m_1 < N$:
\[w = \left( \phi^T \phi\right)^{-1} \phi^T d\]
Note that in this case $\phi$ is an $N \times m_1$ rectangular matrix.
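A minimal NumPy sketch of both cases, strict interpolation ($\phi$ square, one center per input) and the rectangular least-squares case ($m_1 < N$ centers); the toy 1-D data and the Gaussian width are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(20, 1))                      # N = 20 toy inputs
d = np.sin(x).ravel() + 0.05 * rng.standard_normal(20)    # toy labels

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

# Strict interpolation: phi[j, i] = phi(||x_j - x_i||), solve phi w = d.
Phi = gaussian(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1))
w_exact = np.linalg.solve(Phi, d)

# Rectangular case (m1 < N centers): w = (phi^T phi)^{-1} phi^T d.
centers = x[:5]                                           # m1 = 5 (illustrative)
Phi_r = gaussian(np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1))
w_ls, *_ = np.linalg.lstsq(Phi_r, d, rcond=None)
```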
Typical RBFs
Multiquadrics (non-local)
\[\phi(r) = (r^2+c^2)^{1/2}\]
Inverse multiquadrics (local)
\[\phi(r) = (r^2+c^2)^{-1/2}\]
Gaussian (local)
\[\phi(r) = \exp \left(-\frac{r^2}{2\sigma^2} \right)\]
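The three basis functions above, written directly as NumPy functions of the radius $r = \Vert x - x_i \Vert$; the default shape parameters are arbitrary choices for illustration.

```python
import numpy as np

def multiquadric(r, c=1.0):          # non-local: grows with r
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):  # local: decays toward 0 as r grows
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):          # local: decays toward 0 as r grows
    return np.exp(-r**2 / (2 * sigma**2))
```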
Ill-posed vs Well-posed
From a domain $\mathcal{X}$ to a range $\mathcal{Y}$, we want to construct a mapping $f: \mathcal{X} \to \mathcal{Y}$.
Treating supervised learning as an ill-posed problem, and using certain prior knowledge (regularization), we can derive RBF.
Well-posed
- Existence: for every input $x \in \mathcal{X}$, there exists an output $y = f(x) \in \mathcal{Y}$
- Uniqueness: for any pair $x, t \in \mathcal{X}$, $f(x) = f(t)$ if and only if $x = t$
- Continuity: the mapping is continuous, i.e., small changes in the input produce small changes in the output
Regularization
Main idea: stabilize the solution by means of prior information.
Tikhonov's Regularization Theory
Similar to RBF.
Standard error term:
\[\mathcal{E}_s(F) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - F(x_i)\right)^2\]
Regularization term:
\[\mathcal{E}_c(F) = \frac{1}{2}\Vert DF \Vert^2\]
where $D$ is a linear differential operator.
Minimize:
\[\mathcal{E}(F) = \mathcal{E}_s(F) + \lambda \mathcal{E}_c(F)\]
Solution: obtained from the Euler-Lagrange equation, in terms of the Green's function $G$ associated with the operator $D$:
\[F_{\lambda}(x) = \frac{1}{\lambda} \sum_{i=1}^{N}(d_i - F_{\lambda}(x_i)) G(x,x_i)\]
\[F_{\lambda}(x_j) =\sum_{i=1}^{N}w_i G(x_j,x_i)\]
Vectorized:
\[w = (G+\lambda I)^{-1}d\]
where $(G + \lambda I)$ is positive definite, and hence invertible, for a sufficiently large $\lambda$.
\[F_{\lambda} = G w\]
Steps:
- Determine the differential operator $D$ (the prior knowledge)
- Find the Green's function matrix $G$ associated with $D$, with entries $G(x_j, x_i)$
- Obtain the weights by
\[w = (G+\lambda I)^{-1}d\]
An example of $G$:
\[G(x,x_i)=\exp \left(-\frac{1}{2\sigma_i^2} \Vert x-x_i \Vert^2\right)\]
The associated regularization network is a universal approximator.
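A minimal NumPy sketch of a regularization network with the Gaussian Green's function above: one center per training input, weights $w = (G + \lambda I)^{-1} d$, outputs $F_\lambda = Gw$. The toy data, $\sigma$, and $\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(30, 1))                      # toy inputs
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(30)     # toy labels

sigma, lam = 1.0, 1e-2                                    # assumed width / lambda
sq_dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** 2
G = np.exp(-sq_dist / (2 * sigma**2))                     # G[j, i] = G(x_j, x_i)

w = np.linalg.solve(G + lam * np.eye(len(x)), d)          # w = (G + lambda I)^{-1} d
F_train = G @ w                                           # F_lambda = G w
```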
Generalized RBF Networks
Having $m_1 < N$ hidden units, with centers $\{t_i\}_{i=1}^{m_1}$.
Minimize:
\[\mathcal{E}(F^*) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - \sum_{j=1}^{m_1} w_j G(\Vert x_i - t_j\Vert)\right)^2 + \frac{1}{2}\lambda \Vert D F^* \Vert^2\]
Solution:
\[(G^T G+\lambda G_0)w=G^T d\]
where $G$ is the $N \times m_1$ matrix with entries $G(x_i, t_j)$ and $G_0$ is the $m_1 \times m_1$ matrix with entries $G(t_i, t_j)$.
As $\lambda$ approaches 0, we get:
\[w = (G^T G)^{-1} G^T d = G^+ d\]
Weighted Norm
When the individual elements of the input vector are from different classes, use a weighted norm:
\[\Vert x \Vert_C^2=(Cx)^T(Cx)=x^T C^T C x\]
Approximation function:
\[F^*(x)=\sum_{i=1}^{m_1} w_i G(\Vert x-t_i \Vert_C)\]
Gaussian RBF:
\[G(\Vert x-t_i \Vert_C) = \exp\left(-\frac{1}{2}(x-t_i)^T C^T C (x-t_i)\right) \\ = \exp\left(-\frac{1}{2}(x-t_i)^T \Sigma^{-1}(x-t_i)\right)\]
$\Sigma$, with $\Sigma^{-1} = C^T C$, is the covariance matrix of a multivariate Gaussian function.
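A small sketch of the weighted-norm Gaussian, i.e. a Mahalanobis-distance kernel with $\Sigma^{-1} = C^T C$; the weighting matrix $C$ below is an arbitrary illustrative choice.

```python
import numpy as np

def weighted_gaussian(x, t, C):
    """exp(-1/2 (x - t)^T C^T C (x - t)) -- Gaussian under the C-weighted norm."""
    diff = x - t
    return np.exp(-0.5 * diff @ (C.T @ C) @ diff)

C = np.diag([1.0, 0.25])             # e.g. down-weight the second input dimension
x = np.array([1.0, 2.0])
t = np.zeros(2)                      # center
print(weighted_gaussian(x, t, C))
```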
Generalized RBF needs to determine:
- the weights $w_i$
- the RBF centers $t_i$
- the norm weighting matrix $C$
Regularization networks have known RBF centers (the same as all the inputs), so only the weights need to be determined.
Estimating the Parameters
- Weights $w$: discussed before.
- Regularization parameter $\lambda$:
- Minimize averaged squared error
- Use generalized cross-validation
- RBF centers:
- Randomly select fixed centers
- Self-organized selection
- Supervised selection
Estimating the Regularization Parameter
Minimize average squared error:
For a fixed $\lambda$, calculate the squared error between the true function value and the estimated RBF network output (obtained with that $\lambda$) over all inputs. Find the optimal $\lambda$ that minimizes this error.
Generalized cross-validation:
Use leave-one-out cross-validation on the training set only.
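A minimal NumPy sketch of picking $\lambda$ by leave-one-out cross-validation on the training set: each point is predicted by a regularization network fitted to the remaining $N-1$ points, and the $\lambda$ with the smallest average held-out squared error wins. The toy data and the $\lambda$ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(25, 1))
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(25)
sigma = 1.0
K = np.exp(-np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** 2
           / (2 * sigma**2))                              # Gaussian Green's matrix

def loo_error(lam):
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                     # leave point i out
        G = K[np.ix_(keep, keep)]
        w = np.linalg.solve(G + lam * np.eye(G.shape[0]), d[keep])
        pred = K[i, keep] @ w                             # predict the held-out point
        errs.append((d[i] - pred) ** 2)
    return np.mean(errs)

lams = [1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lams, key=loo_error)
```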
RBF learning
Basic idea is to learn in two different time scales:
- Nonlinear, slow learning of the RBF parameters (center, variance)
- Linear, fast learning of the hidden-to-output weights
RBF centers $\{t_i\}$ are picked at random from the available inputs $\{x_i\}$.
The RBF standard deviation (width) is fixed to:
\[\sigma = \frac{d_{max}}{\sqrt{2m_1}}\]
where $d_{max}$ is the maximum distance between the chosen centers and $m_1$ is the number of centers.
Linear weights can be learned:
\[w= G^+ d = (G^T G)^{-1} G^T d\]
where the matrix $G$ is $N \times m_1$, with entries $g_{ji} = G(\Vert x_j - t_i \Vert)$, $j = 1, \ldots, N$, $i = 1, \ldots, m_1$.
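A minimal NumPy sketch of the fixed-centers recipe above: pick $m_1$ centers at random from the inputs, set $\sigma = d_{max}/\sqrt{2 m_1}$, build the $N \times m_1$ matrix $G$, and solve for the linear weights with the pseudoinverse. The toy data and $m_1 = 6$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(40, 1))                      # toy inputs
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(40)     # toy labels

m1 = 6
t = x[rng.choice(len(x), size=m1, replace=False)]         # random centers
d_max = np.max(np.linalg.norm(t[:, None, :] - t[None, :, :], axis=-1))
sigma = d_max / np.sqrt(2 * m1)                           # sigma = d_max / sqrt(2 m1)

# G[j, i] = exp(-||x_j - t_i||^2 / (2 sigma^2)), an N x m1 matrix.
G = np.exp(-np.linalg.norm(x[:, None, :] - t[None, :, :], axis=-1) ** 2
           / (2 * sigma**2))
w = np.linalg.pinv(G) @ d                                 # w = G^+ d
```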
Finding $G^+$ with SVD
SVD: singular value decomposition
For a real $N \times M$ matrix $G$, there exist orthogonal matrices:
\[U = [u_1, u_2, \ldots, u_N]\]
\[V = [v_1, v_2, \ldots, v_M]\]
Such that:
\[U^T G V = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_K) = \Sigma, \quad K=\min(M,N)\]
$U$ is the left singular matrix, $V$ is the right singular matrix, and $\sigma_1, \ldots, \sigma_K$ are the singular values of the matrix $G$.
\[G^+ = V \Sigma^+ U^T\]
where $\Sigma^+ = \mathrm{diag}\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_K}\right)$.
Properties:
\[U^{-1} = U^T\]
\[V^{-1} = V^T\]
\[\Sigma \Sigma^+ = I\]
\[G^+ G = I \quad \text{(when } G \text{ has full column rank)}\]
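A short sketch of computing $G^+$ via the SVD and checking it against NumPy's built-in pseudoinverse; $G$ here is just a random full-rank matrix for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))                           # N = 8, M = 5 (illustrative)

U, s, Vt = np.linalg.svd(G, full_matrices=False)          # G = U diag(s) V^T
Sigma_plus = np.diag(1.0 / s)                             # reciprocals of the singular values
G_plus = Vt.T @ Sigma_plus @ U.T                          # G^+ = V Sigma^+ U^T

assert np.allclose(G_plus, np.linalg.pinv(G))             # matches NumPy's pinv
```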