
TAMU Neural Network 5 Radial-Basis Function Networks

2017-03-02

History

  Period   Technique
  1980     MLP-bp
  1990     SVM & RBF
  2000     SVM, Liquid State Machine
  2006     Deep learning
  2010     Deep learning, RNN
  2017     DL + RNN + Reinforcement learning

Learning in MLP

Supervised learning in MLP

  • A recursive technique of stochastic approximation, e.g. backpropagation
  • Design of the neural network as a curve-fitting (approximation) problem, e.g. RBF

Curve-fitting

  • Finding a surface in a multidimensional space that provides a best fit to the training data
  • “Best fit” is measured in a certain statistical sense
  • RBF is an example

RBF networks

An RBF network has 3 layers:

  • Input
  • Hidden: nonlinear transformation from input to hidden
  • Output: linear activation

Cover’s theorem

A pattern-classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input. Once that is done, a separating hyperplane can be found simply and quickly.

Consider $N$ input patterns $x_1, x_2, \ldots, x_N$ in an $m_0$-dimensional space.
The inputs belong to either of two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: they form a dichotomy.
The dichotomy is separable w.r.t. a family of surfaces if a surface exists in the family that separates the points in class $\mathcal{X}_1$ from those in $\mathcal{X}_2$.

Nonlinear mapping to hidden: for each $x$, define an $m_1$-dimensional vector $\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_{m_1}(x)]^T$:

A dichotomy is $\phi$-separable if there exists an $m_1$-dimensional vector $w$ such that:
\[w^T \phi(x)>0, x\in \mathcal{X}_1\]
\[w^T \phi(x) < 0, x\in \mathcal{X}_2\]

Let $P(N, m_1)$ denote the probability that a particular dichotomy of $N$ points picked at random is $\phi$-separable, where the family of surfaces has $m_1$ degrees of freedom (DOF).

\[P(N,m_1) = \left(\frac{1}{2} \right)^{N-1} \sum_{m=0}^{m_1-1}\left( \begin{array}{c} N-1 \\ m \end{array} \right)\]
where:
\[\left( \begin{array}{c} l \\ m \end{array} \right) = \frac{l!}{m!(l-m)!}\]

Corollary: a maximum of $2m_1$ patterns can be expected to be linearly separated by a hidden space of $m_1$ dimensions.

Example of such a surface family: the Gaussian function $\phi(x) = \exp\left(-\frac{\Vert x - t \Vert^2}{2\sigma^2}\right)$, centered at $t$.
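
As a sanity check on the formula, a short Python sketch (the function name `separability_prob` is only illustrative) evaluates $P(N, m_1)$ directly; note that $P(2m_1, m_1) = 1/2$, consistent with the corollary above.

```python
from math import comb

def separability_prob(N, m1):
    """P(N, m1): probability that a random dichotomy of N points is
    phi-separable when the surface family has m1 degrees of freedom."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

# At N = 2*m1 the probability is exactly 1/2, matching the corollary
# that about 2*m1 patterns can be separated with m1 hidden units.
for m1 in (2, 5, 10):
    print(m1, separability_prob(2 * m1, m1))  # each prints 0.5
```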

RBF and Interpolation

The map goes from an $m_0$-dimensional input space to a 1-D output space: $F: \mathbb{R}^{m_0} \rightarrow \mathbb{R}^1$. So:
Given inputs $\{x_i \in \mathbb{R}^{m_0}, i = 1, \ldots, N\}$ and labels $\{d_i \in \mathbb{R}, i = 1, \ldots, N\}$, find $F$ such that
\[F(x_i)=d_i, \text{ for all } i\]

Interpolation

\[F(x)= \sum_{i=1}^N w_i\phi(\Vert x-x_i\Vert)\]
Or:
\[F(x)= \sum_{i=1}^N w_i\phi_i(x)\]

where $\{\phi_i(x) = \phi(\Vert x - x_i \Vert), i = 1, \ldots, N\}$ is a set of nonlinear functions known as radial-basis functions (RBFs).
The known data points $x_i$ are treated as the centers of the RBFs.

Vectorized:

\[\mathbf{\phi}^{N \times N} \mathbf{w}^{N \times 1} = \mathbf{d}^{N \times 1}\]
where $\mathbf{\phi} = \{\phi_{ji}\}$ with $\phi_{ji} = \phi(\Vert x_j - x_i \Vert)$, $\mathbf{w} = [w_1, \ldots, w_N]^T$ and $\mathbf{d} = [d_1, \ldots, d_N]^T$.

When the number of hidden units $m_1 < N$:
\[w = \left( \phi^T \phi\right)^{-1} \phi^T d\]

Note that $\phi$ is then an $N \times m_1$ rectangular matrix.
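
A minimal NumPy sketch of exact interpolation under these definitions, using the Gaussian basis with one center per data point; the width `sigma` and the function names are illustration choices, not from the notes.

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2*sigma^2))
    return np.exp(-r**2 / (2.0 * sigma**2))

def interpolation_weights(X, d, sigma=1.0):
    """Exact interpolation: one RBF centered at every training point,
    so Phi is N x N and w solves Phi w = d."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    Phi = gaussian_rbf(r, sigma)
    return np.linalg.solve(Phi, d)

def predict(Xq, X, w, sigma=1.0):
    r = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=-1)
    return gaussian_rbf(r, sigma) @ w

# Toy 1-D example: the fit passes through every training point exactly.
X = np.linspace(0, np.pi, 8).reshape(-1, 1)
d = np.sin(X).ravel()
w = interpolation_weights(X, d, sigma=0.5)
print(np.allclose(predict(X, X, w, sigma=0.5), d))  # True
```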

Typical RBFs

Multiquadrics (non-local)

\[\phi(r) = (r^2+c^2)^{1/2}\]

Inverse multiquadrics (local)

\[\phi(r) = (r^2+c^2)^{-1/2}\]

Gaussian (local)

\[\phi(r) = \exp \left(-\frac{r^2}{2\sigma^2} \right)\]
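
Written out as small Python helpers (the constants `c` and `sigma` are arbitrary illustration defaults), the three kernels are:

```python
import numpy as np

def multiquadric(r, c=1.0):          # non-local: grows as r grows
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):  # local: decays to 0 as r grows
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):          # local: decays to 0 as r grows
    return np.exp(-r**2 / (2.0 * sigma**2))
```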

Ill-posed vs Well-posed

From a domain $\mathcal{X}$ to a range $\mathcal{Y}$, we want to construct a mapping $F: \mathcal{X} \rightarrow \mathcal{Y}$.
Treating supervised learning as an ill-posed problem, and using certain prior knowledge (regularization), we can derive RBF networks.

Well-posed

  • Existence: for every input $x \in \mathcal{X}$, there exists an output $y = F(x) \in \mathcal{Y}$
  • Uniqueness: for any pair $x, t \in \mathcal{X}$, $F(x) = F(t)$ if and only if $x = t$
  • Continuity: the mapping is continuous; small changes in the input cause small changes in the output

Regularization

Main idea: stabilize the solution by means of prior info.

Tikhonov’s Regularization Theory

The resulting solution is similar in form to an RBF network.

Standard error term:
\[\mathcal{E}_s(F) = \frac{1}{2}\sum_{i=1}^{N}\left(d_i - F(x_i)\right)^2\]

Regularization term:
\[\mathcal{E}_c(F) = \frac{1}{2}\Vert DF \Vert^2\]

where $D$ is a linear differential operator.

Minimize:
\[\mathcal{E}(F) = \mathcal{E}_s(F) + \lambda \mathcal{E}_c(F)\]

Solution:

Minimizing $\mathcal{E}(F)$ leads to an Euler-Lagrange equation, whose solution can be written in terms of the Green’s function $G$ associated with $D$:

\[F_{\lambda}(x) = \frac{1}{\lambda} \sum_{i=1}^{N}(d_i - F_{\lambda}(x_i)) G(x,x_i)\]
\[F_{\lambda}(x_j) =\sum_{i=1}^{N}w_i G(x_j,x_i)\]

Vectorized, with $w_i = \frac{1}{\lambda}\left(d_i - F_{\lambda}(x_i)\right)$:
\[w = (G+\lambda I)^{-1}d\]
where $(G+\lambda I)$ is invertible when it is positive definite, which is ensured by a sufficiently large $\lambda$.
\[F_{\lambda} = G w\]

Steps:

  1. Determine the differential operator $D$ (the prior knowledge / smoothness constraint)
  2. Find the Green’s function matrix $G$ associated with $D$
  3. Obtain the weights by

\[w = (G+\lambda I)^{-1}d\]

An example of $G$:
\[G(x,x_i)=\exp \left(-\frac{1}{2\sigma_i^2} \Vert x-x_i \Vert^2\right)\]
The associated regularization network is a universal approximator.
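
A minimal sketch of the three steps, assuming the Gaussian Green’s function above; `lam` stands for $\lambda$ and the width `sigma` is an assumed illustration value.

```python
import numpy as np

def green_matrix(X, sigma=1.0):
    """G[j, i] = G(x_j, x_i) with the Gaussian Green's function."""
    r2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-r2 / (2.0 * sigma**2))

def regularization_network(X, d, lam=0.1, sigma=1.0):
    G = green_matrix(X, sigma)
    w = np.linalg.solve(G + lam * np.eye(len(X)), d)  # w = (G + lam*I)^{-1} d
    return G @ w, w                                   # fitted values F_lambda = G w
```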

Generalized RBF Networks

Having $m_1 < N$ hidden units, with centers $t_i$, $i = 1, \ldots, m_1$.

Minimize $\mathcal{E}(F^*) = \mathcal{E}_s(F^*) + \lambda \mathcal{E}_c(F^*)$, with the approximation $F^*(x) = \sum_{i=1}^{m_1} w_i G(\Vert x - t_i \Vert)$:

Solution:
\[(G^T G+\lambda G_0)w=G^T d\]
As $\lambda$ approaches 0, we get:
\[w = (G^T G)^{-1} G^T d = G^+ d\]
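
A sketch of the generalized solution, assuming Gaussian basis functions and centers chosen as a subset of the inputs; for $\lambda \rightarrow 0$ it reduces to the pseudoinverse solution $w = G^+ d$, and for $\lambda > 0$ it solves $(G^T G + \lambda G_0) w = G^T d$.

```python
import numpy as np

def design_matrix(X, T, sigma=1.0):
    """G[j, i] = G(||x_j - t_i||): an N x m1 rectangular matrix."""
    r2 = np.sum((X[:, None, :] - T[None, :, :])**2, axis=-1)
    return np.exp(-r2 / (2.0 * sigma**2))

def generalized_rbf_weights(X, d, T, lam=0.0, sigma=1.0):
    G = design_matrix(X, T, sigma)
    if lam == 0.0:
        return np.linalg.pinv(G) @ d      # w = G^+ d
    G0 = design_matrix(T, T, sigma)       # m1 x m1 matrix over the centers
    return np.linalg.solve(G.T @ G + lam * G0, G.T @ d)
```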

Weighted Norm

Used when the individual elements of the input vector belong to different classes.
\[\Vert x \Vert_C^2=(Cx)^T(Cx)=x^T C^T C x\]
Approximation function:
\[F^*(x)=\sum_{i=1}^{m_1} w_i G(\Vert x-t_i \Vert_C)\]
Gaussian RBF:
\[G(\Vert x-t_i \Vert_C) = \exp\left(-\frac{1}{2}(x-t_i)^T C^T C (x-t_i)\right) \\ = \exp\left(-\frac{1}{2}(x-t_i)^T \Sigma^{-1}(x-t_i)\right)\]

$\Sigma^{-1} = C^T C$, where $\Sigma$ is the covariance matrix of a multivariate Gaussian function.
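
A small sketch of the weighted-norm Gaussian above; here the matrix `C` is supplied by the user (for example, it could be chosen as $\Sigma^{-1/2}$ for some covariance estimate $\Sigma$).

```python
import numpy as np

def weighted_gaussian(x, t, C):
    """G(||x - t||_C) = exp(-1/2 (x - t)^T C^T C (x - t))."""
    diff = x - t
    return np.exp(-0.5 * diff @ (C.T @ C) @ diff)
```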

Generalized RBF needs to determine:

  • weights $w_i$
  • the RBF centers $t_i$
  • the norm weighting matrix $C$

Regularization networks have known RBF centers (the same as all the input points), so only the weights need to be determined.

Estimating the Parameters

  • Weights $w$: discussed before.
  • Regularization parameter $\lambda$:
    • Minimize averaged squared error
    • Use generalized cross-validation
  • RBF centers:
    • Randomly select fixed centers
    • Self-organized selection
    • Supervised selection

Estimating the Regularization Parameter

Minimize the average squared error:
For a fixed $\lambda$, over all inputs, compute the average squared error between the true function value and the RBF network output estimated with that $\lambda$. Find the optimal $\lambda$ that minimizes this error.

Generalized cross-validation:
Using only the training set, apply leave-one-out cross-validation.
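
A brute-force sketch of the leave-one-out procedure for choosing $\lambda$, assuming a Gaussian regularization network and an arbitrary candidate grid of $\lambda$ values:

```python
import numpy as np

def loocv_error(X, d, lam, sigma=1.0):
    """Leave-one-out squared error of the regularization network for one lambda."""
    err = 0.0
    for k in range(len(X)):
        mask = np.arange(len(X)) != k
        Xtr, dtr = X[mask], d[mask]
        G = np.exp(-np.sum((Xtr[:, None] - Xtr[None, :])**2, axis=-1) / (2 * sigma**2))
        w = np.linalg.solve(G + lam * np.eye(len(Xtr)), dtr)
        g = np.exp(-np.sum((X[k] - Xtr)**2, axis=-1) / (2 * sigma**2))
        err += (d[k] - g @ w)**2
    return err / len(X)

def pick_lambda(X, d, grid=(1e-3, 1e-2, 1e-1, 1.0)):
    # keep the lambda with the smallest leave-one-out error
    return min(grid, key=lambda lam: loocv_error(X, d, lam))
```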

RBF learning

The basic idea is to learn on two different time scales:

  • Nonlinear, slow learning of the RBF parameters (center, variance)
  • Linear, fast learning of the hidden-to-output weights

RBF centers are picked at random from the available inputs $\{x_i\}$.
The RBF width (standard deviation) is fixed to:
\[\sigma = \frac{d_{max}}{\sqrt{2m_1}}\]
where $d_{max}$ is the maximum distance between the chosen centers and $m_1$ is the number of centers.

Linear weights can be learned:
\[w= G^+ d = (G^T G)^{-1} G^T d\]
where the matrix $G = \{g_{ji}\}$, with $g_{ji} = \exp\left(-\frac{m_1}{d_{max}^2} \Vert x_j - t_i \Vert^2\right)$, $j = 1, \ldots, N$, $i = 1, \ldots, m_1$.
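
Putting this recipe together, a sketch of the fixed-centers strategy: random centers, the width rule above, and pseudoinverse output weights (function and variable names are illustrative).

```python
import numpy as np

def train_fixed_centers(X, d, m1, rng=np.random.default_rng(0)):
    # 1. pick m1 centers at random from the training inputs
    T = X[rng.choice(len(X), size=m1, replace=False)]
    # 2. fix the common width from the maximum distance between centers
    d_max = np.max(np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1))
    sigma = d_max / np.sqrt(2 * m1)
    # 3. solve for the linear output weights with the pseudoinverse
    r2 = np.sum((X[:, None, :] - T[None, :, :])**2, axis=-1)
    G = np.exp(-r2 / (2.0 * sigma**2))
    w = np.linalg.pinv(G) @ d            # w = G^+ d
    return T, sigma, w
```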

Finding $G^+$ with SVD

SVD: singular value decomposition

For a real $N \times M$ matrix $G$, there exist orthogonal matrices:
\[U = [u_1,u_2,\ldots,u_N]\]
\[V = [v_1,v_2,\ldots,v_M]\]
Such that:
\[U^T G V = diag(\sigma_1,\sigma_2,\ldots,\sigma_K) = \Sigma; \quad K=\min(M,N)\]

$U$ is the left singular matrix, $V$ is the right singular matrix, and $\sigma_1, \sigma_2, \ldots, \sigma_K$ are the singular values of the matrix $G$.

\[G^+ = V \Sigma^+ U^T\]
where $\Sigma^+ = diag\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_K}, 0, \ldots, 0\right)$.

Properties:
\[U^{-1} = U^T\]
\[V^{-1} = V^T\]
\[\Sigma \Sigma^+ = I\]
\[G G^+ = I\]
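
A sketch of forming $G^+$ directly from the SVD, reciprocating only the nonzero singular values; it matches `np.linalg.pinv` on a random matrix.

```python
import numpy as np

def pinv_via_svd(G, tol=1e-12):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)  # G = U diag(s) V^T
    s_plus = np.where(s > tol, 1.0 / s, 0.0)          # invert nonzero singular values only
    return Vt.T @ np.diag(s_plus) @ U.T               # G^+ = V Sigma^+ U^T

G = np.random.default_rng(1).normal(size=(6, 4))
print(np.allclose(pinv_via_svd(G), np.linalg.pinv(G)))  # True
```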

