History
| Period | Technique |
|---|---|
| 1980 | MLP with backpropagation (BP) |
| 1990 | SVM & RBF |
| 2000 | SVM, Liquid State Machine |
| 2006 | Deep learning |
| 2010 | Deep learning, RNN |
| 2017 | DL + RNN + Reinforcement Learning |
Learning in MLP
Supervised learning in MLP
- Recursive technique of stochastic approximation, e.g. backpropagation (BP)
- Design of nnet as a curve-fitting (approximation) problem, e.g. RBF
Curve-fitting
- Finding a surface in a multidimensional space that provides a best fit to the training data
- “Best fit” measure in a certain statistical sense
- RBF is an example
RBF networks
An RBF network has 3 layers:
- Input
- Hidden: nonlinear transformation from input to hidden
- Output: linear activation
Cover’s theorem
Pattern classification cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input. Once that is done, a separating hyperplane can be found simply and quickly.
Consider $N$ input patterns $\{x_i\}_{i=1}^N$ in $m_0$-dimensional space.
The inputs belong to either of two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: they form a dichotomy.
The dichotomy is separable w.r.t. a family of surfaces if a surface exists in the family that separates the points in class $\mathcal{X}_1$ from those in $\mathcal{X}_2$.
Nonlinear mapping to hidden: for each input $x$, define an $m_1$-dimensional vector $\phi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_{m_1}(x)]^T$.
A dichotomy is $\phi$-separable if there exists an $m_1$-dimensional vector $w$ such that:
\[w^T \phi(x)>0, x\in \mathcal{X}_1\]
\[w^T \phi(x) < 0, x\in \mathcal{X}_2\]
Let $P(N, m_1)$ denote the probability that a particular dichotomy of $N$ patterns picked at random is $\phi$-separable, where the family of surfaces has $m_1$ degrees of freedom (DOF):
\[P(N,m_1) = \left(\frac{1}{2} \right)^{N-1} \sum_{m=0}^{m_1-1}\left( \begin{array}{c} N-1 \\ m \end{array} \right)\]
where:
\[\left( \begin{array}{c} l \\ m \end{array} \right) = \frac{l!}{m!(l-m)!}\]
Corollary: a maximum of $2m_1$ patterns can be linearly separated by a hidden space of $m_1$ dimensions.
Gaussian function: $\varphi(x) = \exp\left(-\frac{\Vert x - t\Vert^2}{2\sigma^2}\right)$, where $t$ is a center.
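To make the counting formula concrete, here is a minimal sketch in plain Python (the function name `p_separable` is just an illustrative choice) that evaluates $P(N, m_1)$ and illustrates the corollary: at $N = 2m_1$ the probability is exactly 1/2.

```python
from math import comb

def p_separable(N, m1):
    """P(N, m1) = (1/2)^(N-1) * sum_{m=0}^{m1-1} C(N-1, m)."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

m1 = 5
for N in (5, 10, 15, 20):
    # At N = 2 * m1 = 10 the probability equals 0.5, as the corollary states.
    print(N, round(p_separable(N, m1), 4))
```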
RBF and Interpolation
The map is from an $m_0$-dimensional input space to a 1-dimensional output space. So:
Given inputs $\{x_i \in \mathbb{R}^{m_0} \mid i = 1, \ldots, N\}$ and labels $\{d_i \in \mathbb{R} \mid i = 1, \ldots, N\}$, find $F: \mathbb{R}^{m_0} \to \mathbb{R}$ such that
\[F(x_i)=d_i, \text{ for all } i\]
Interpolation
\[F(x)= \sum_{i=1}^N w_i\phi(\Vert x-x_i\Vert)\]
Or:
\[F(x)= \sum_{i=1}^N w_i\phi_i(x)\]
where $\{\phi_i(x) = \phi(\Vert x - x_i \Vert) \mid i = 1, \ldots, N\}$ is a set of nonlinear functions known as radial basis functions (RBFs).
The known data points are treated as the center of RBFs.
Vectorized:
\[\mathbf{\phi}^{N \times N} \mathbf{w}^{N \times 1} = \mathbf{d}^{N \times 1}\]
where $\phi_{ji} = \phi(\Vert x_j - x_i \Vert)$.
When the number of hidden units $m_1 < N$:
\[w = \left( \phi^T \phi\right)^{-1} \phi^T d\]
Note that in this case $\phi$ is an $N \times m_1$ rectangular matrix.
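A minimal NumPy sketch of both cases, strict interpolation ($\phi$ square, one center per input) and the rectangular least-squares case ($m_1 < N$ centers); the toy 1-D data and the Gaussian width are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(20, 1))                      # N = 20 toy inputs
d = np.sin(x).ravel() + 0.05 * rng.standard_normal(20)    # toy labels

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

# Strict interpolation: phi[j, i] = phi(||x_j - x_i||), solve phi w = d.
Phi = gaussian(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1))
w_exact = np.linalg.solve(Phi, d)

# Rectangular case (m1 < N centers): w = (phi^T phi)^{-1} phi^T d.
centers = x[:5]                                           # m1 = 5 (illustrative)
Phi_r = gaussian(np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1))
w_ls, *_ = np.linalg.lstsq(Phi_r, d, rcond=None)
```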
Typical RBFs
Multiquadrics (non-local)
\[\phi(r) = (r^2+c^2)^{1/2}\]
Inverse multiquadrics (local)
\[\phi(r) = (r^2+c^2)^{-1/2}\]
Gaussian (local)
\[\phi(r) = \exp \left(-\frac{r^2}{2\sigma^2} \right)\]
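The three basis functions above, written directly as NumPy functions of the radius $r = \Vert x - x_i \Vert$; the default shape parameters are arbitrary choices for illustration.

```python
import numpy as np

def multiquadric(r, c=1.0):          # non-local: grows with r
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):  # local: decays toward 0 as r grows
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):          # local: decays toward 0 as r grows
    return np.exp(-r**2 / (2 * sigma**2))
```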
Ill-posed vs Well-posed
From a domain $\mathcal{X}$ to a range $\mathcal{Y}$, we want to construct a mapping $f: \mathcal{X} \to \mathcal{Y}$.
Treating supervised learning as an ill-posed problem, and using certain prior knowledge (regularization), we can derive RBF.
Well-posed
- Existence: for every input $x \in \mathcal{X}$, there exists an output $y = f(x) \in \mathcal{Y}$
- Uniqueness: for any pair $x, t \in \mathcal{X}$, $f(x) = f(t)$ if and only if $x = t$
- Continuity: the mapping is continuous, i.e., small changes in the input produce small changes in the output
Regularization
Main idea: stabilize the solution by means of prior information.
Tikhonov's Regularization Theory
Similar to RBF.
Standard error term:
\[\mathcal{E}_s(F) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - F(x_i)\right)^2\]
Regularization term:
\[\mathcal{E}_c(F) = \frac{1}{2}\Vert DF \Vert^2\]
where $D$ is a linear differential operator.
Minimize:
\[\mathcal{E}(F) = \mathcal{E}_s(F) + \lambda \mathcal{E}_c(F)\]
Solution: obtained from the Euler-Lagrange equation, in terms of the Green's function $G$ associated with the operator $D$:
\[F_{\lambda}(x) = \frac{1}{\lambda} \sum_{i=1}^{N}(d_i - F_{\lambda}(x_i)) G(x,x_i)\]
\[F_{\lambda}(x_j) =\sum_{i=1}^{N}w_i G(x_j,x_i)\]
Vectorized:
\[w = (G+\lambda I)^{-1}d\]
where $(G + \lambda I)$ is positive definite, and hence invertible, for a sufficiently large $\lambda$.
\[F_{\lambda} = G w\]
Steps:
- Determine the differential operator $D$ (the prior knowledge)
- Find the Green's function matrix $G$ associated with $D$, with entries $G(x_j, x_i)$
- Obtain the weights by
\[w = (G+\lambda I)^{-1}d\]
An example of $G$:
\[G(x,x_i)=\exp \left(-\frac{1}{2\sigma_i^2} \Vert x-x_i \Vert^2\right)\]
The associated regularization network is a universal approximator.
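A minimal NumPy sketch of a regularization network with the Gaussian Green's function above: one center per training input, weights $w = (G + \lambda I)^{-1} d$, outputs $F_\lambda = Gw$. The toy data, $\sigma$, and $\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(30, 1))                      # toy inputs
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(30)     # toy labels

sigma, lam = 1.0, 1e-2                                    # assumed width / lambda
sq_dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** 2
G = np.exp(-sq_dist / (2 * sigma**2))                     # G[j, i] = G(x_j, x_i)

w = np.linalg.solve(G + lam * np.eye(len(x)), d)          # w = (G + lambda I)^{-1} d
F_train = G @ w                                           # F_lambda = G w
```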
Generalized RBF Networks
Having $m_1 < N$ hidden units, with centers $\{t_i\}_{i=1}^{m_1}$.
Minimize:
\[\mathcal{E}(F^*) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - \sum_{j=1}^{m_1} w_j G(\Vert x_i - t_j\Vert)\right)^2 + \frac{1}{2}\lambda \Vert D F^* \Vert^2\]
Solution:
\[(G^T G+\lambda G_0)w=G^T d\]
where $G$ is the $N \times m_1$ matrix with entries $G(x_i, t_j)$ and $G_0$ is the $m_1 \times m_1$ matrix with entries $G(t_i, t_j)$.
As $\lambda$ approaches 0, we get:
\[w = (G^T G)^{-1} G^T d = G^+ d\]
Weighted Norm
When the individual elements of the input vector are from different classes, use a weighted norm:
\[\Vert x \Vert_C^2=(Cx)^T(Cx)=x^T C^T C x\]
Approximation function:
\[F^*(x)=\sum_{i=1}^{m_1} w_i G(\Vert x-t_i \Vert_C)\]
Gaussian RBF:
\[G(\Vert x-t_i \Vert_C) = \exp\left(-\frac{1}{2}(x-t_i)^T C^T C (x-t_i)\right) \\ = \exp\left(-\frac{1}{2}(x-t_i)^T \Sigma^{-1}(x-t_i)\right)\]
$\Sigma$, with $\Sigma^{-1} = C^T C$, is the covariance matrix of a multivariate Gaussian function.
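A small sketch of the weighted-norm Gaussian, i.e. a Mahalanobis-distance kernel with $\Sigma^{-1} = C^T C$; the weighting matrix $C$ below is an arbitrary illustrative choice.

```python
import numpy as np

def weighted_gaussian(x, t, C):
    """exp(-1/2 (x - t)^T C^T C (x - t)) -- Gaussian under the C-weighted norm."""
    diff = x - t
    return np.exp(-0.5 * diff @ (C.T @ C) @ diff)

C = np.diag([1.0, 0.25])             # e.g. down-weight the second input dimension
x = np.array([1.0, 2.0])
t = np.zeros(2)                      # center
print(weighted_gaussian(x, t, C))
```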
Generalized RBF needs to determine:
- the weights $w_i$
- the RBF centers $t_i$
- the norm weighting matrix $C$
Regularization networks have known RBF centers (the same as all the inputs), so only the weights need to be determined.
Estimating the Parameters
- Weights $w$: discussed before.
- Regularization parameter $\lambda$:
- Minimize averaged squared error
- Use generalized cross-validation
- RBF centers:
- Randomly select fixed centers
- Self-organized selection
- Supervised selection
Estimating the Regularization Parameter
Minimize average squared error:
For a fixed $\lambda$, calculate the squared error between the true function value and the estimated RBF network output (obtained with that $\lambda$) over all inputs. Find the optimal $\lambda$ that minimizes this error.
Generalized cross-validation:
Use leave-one-out cross-validation on the training set only.
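A minimal NumPy sketch of picking $\lambda$ by leave-one-out cross-validation on the training set: each point is predicted by a regularization network fitted to the remaining $N-1$ points, and the $\lambda$ with the smallest average held-out squared error wins. The toy data and the $\lambda$ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(25, 1))
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(25)
sigma = 1.0
K = np.exp(-np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** 2
           / (2 * sigma**2))                              # Gaussian Green's matrix

def loo_error(lam):
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                     # leave point i out
        G = K[np.ix_(keep, keep)]
        w = np.linalg.solve(G + lam * np.eye(G.shape[0]), d[keep])
        pred = K[i, keep] @ w                             # predict the held-out point
        errs.append((d[i] - pred) ** 2)
    return np.mean(errs)

lams = [1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lams, key=loo_error)
```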
RBF learning
Basic idea is to learn in two different time scales:
- Nonlinear, slow learning of the RBF parameters (center, variance)
- Linear, fast learning of the hidden-to-output weights
RBF centers $\{t_i\}$ are picked at random from the available inputs $\{x_i\}$.
The RBF standard deviation (width) is fixed to:
\[\sigma = \frac{d_{max}}{\sqrt{2m_1}}\]
where $d_{max}$ is the maximum distance between the chosen centers and $m_1$ is the number of centers.
Linear weights can be learned:
\[w= G^+ d = (G^T G)^{-1} G^T d\]
where the matrix $G$ is $N \times m_1$, with entries $g_{ji} = G(\Vert x_j - t_i \Vert)$, $j = 1, \ldots, N$, $i = 1, \ldots, m_1$.
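A minimal NumPy sketch of the fixed-centers recipe above: pick $m_1$ centers at random from the inputs, set $\sigma = d_{max}/\sqrt{2 m_1}$, build the $N \times m_1$ matrix $G$, and solve for the linear weights with the pseudoinverse. The toy data and $m_1 = 6$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(40, 1))                      # toy inputs
d = np.sin(x).ravel() + 0.1 * rng.standard_normal(40)     # toy labels

m1 = 6
t = x[rng.choice(len(x), size=m1, replace=False)]         # random centers
d_max = np.max(np.linalg.norm(t[:, None, :] - t[None, :, :], axis=-1))
sigma = d_max / np.sqrt(2 * m1)                           # sigma = d_max / sqrt(2 m1)

# G[j, i] = exp(-||x_j - t_i||^2 / (2 sigma^2)), an N x m1 matrix.
G = np.exp(-np.linalg.norm(x[:, None, :] - t[None, :, :], axis=-1) ** 2
           / (2 * sigma**2))
w = np.linalg.pinv(G) @ d                                 # w = G^+ d
```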
Finding $G^+$ with SVD
SVD: singular value decomposition
For a real $N \times M$ matrix $G$, there exist orthogonal matrices:
\[U = [u_1, u_2, \ldots, u_N]\]
\[V = [v_1, v_2, \ldots, v_M]\]
Such that:
\[U^T G V = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_K) = \Sigma, \quad K=\min(M,N)\]
$U$ is the left singular matrix, $V$ is the right singular matrix, and $\sigma_1, \ldots, \sigma_K$ are the singular values of the matrix $G$.
\[G^+ = V \Sigma^+ U^T\]
where $\Sigma^+ = \mathrm{diag}\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_K}\right)$.
Properties:
\[U^{-1} = U^T\]
\[V^{-1} = V^T\]
\[\Sigma \Sigma^+ = I\]
\[G^+ G = I \quad \text{(when } G \text{ has full column rank)}\]
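A short sketch of computing $G^+$ via the SVD and checking it against NumPy's built-in pseudoinverse; $G$ here is just a random full-rank matrix for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))                           # N = 8, M = 5 (illustrative)

U, s, Vt = np.linalg.svd(G, full_matrices=False)          # G = U diag(s) V^T
Sigma_plus = np.diag(1.0 / s)                             # reciprocals of the singular values
G_plus = Vt.T @ Sigma_plus @ U.T                          # G^+ = V Sigma^+ U^T

assert np.allclose(G_plus, np.linalg.pinv(G))             # matches NumPy's pinv
```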