
TAMU Neural Network 9 Principal Component Analysis

2017-04-22

Motivation

Project the given data so that the variance of the projected points is maximized.

Variance probe

  • \(\mathbf{X}\) is an \(m\)-dimensional random vector
    • a vector random variable following a certain probability distribution
  • Assume \(E[\mathbf{X}]=\mathbf{0}\)
  • Project a unit vector \(\mathbf{q}\) (\(\Vert\mathbf{q}\Vert=1\)) on \(\mathbf{X}\); get \(A=\mathbf{q}^T\mathbf{X}=\mathbf{X}^T\mathbf{q}\)
  • Variance: \(\sigma^2=E[A^2]=E[(\mathbf{q}^T\mathbf{X})(\mathbf{X}^T\mathbf{q})]=\mathbf{q}^T\mathbf{R}\mathbf{q}\), where \(\mathbf{R}=E[\mathbf{X}\mathbf{X}^T]\) is the covariance matrix

We thus have a variance probe: \(\psi(\mathbf{q})=\mathbf{q}^T\mathbf{R}\mathbf{q}\). Different unit vectors \(\mathbf{q}\) will result in smaller or larger variance in the projected points.

To find the unit vectors along which the variance probe has extremal values, solve for the unit vectors satisfying:

\[\mathbf{Rq}=\lambda\mathbf{q}\]

where \(\lambda\) is a scaling factor. So here we have an eigenvalue problem.
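As a quick numerical check (a minimal NumPy sketch; the synthetic data, sample size, and variable names are illustrative and not from the notes), the empirical variance of the projection \(\mathbf{q}^T\mathbf{x}\) can be compared against the variance probe \(\mathbf{q}^T\mathbf{R}\mathbf{q}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean, correlated m-dimensional data (illustrative only).
n, m = 10000, 3
A = rng.normal(size=(m, m))
X = rng.normal(size=(n, m)) @ A.T        # rows are samples x^T
X -= X.mean(axis=0)                      # enforce zero mean

R = X.T @ X / n                          # sample covariance matrix R = E[x x^T]

q = rng.normal(size=m)
q /= np.linalg.norm(q)                   # unit projection vector

a = X @ q                                # projection a = q^T x for every sample
print(np.var(a))                         # empirical variance of the projections
print(q @ R @ q)                         # variance probe psi(q) = q^T R q (should match)
```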

PCA

With an \(m \times m\) covariance matrix \(\mathbf{R}\), we can get \(m\) eigenvectors and eigenvalues.

\[\mathbf{R}\mathbf{q}_j = \lambda_j\mathbf{q}_j, j=1,2,…,m\]

  • The eigenvalues \(\lambda_1 \ge \lambda_2 \ge … \ge \lambda_m\) are sorted from largest to smallest.
  • The corresponding eigenvectors are arranged as \[\mathbf{Q} = [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_m]\]
  • Then we can write \[\mathbf{R}\mathbf{Q}=\mathbf{Q}\boldsymbol{\Lambda}\] where \(\boldsymbol{\Lambda}=\mathrm{diag}(\lambda_1,\lambda_2,…,\lambda_m)\)
  • \(\mathbf{Q}\) is orthogonal: \(\mathbf{Q}^T\mathbf{Q}=\mathbf{I}\) (a NumPy sketch of this decomposition follows the list)
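A minimal sketch of this eigendecomposition with NumPy, continuing from the covariance matrix `R` in the snippet above (the sorting step and variable names are my own):

```python
# Eigendecomposition of the symmetric covariance matrix R.
eigvals, eigvecs = np.linalg.eigh(R)     # eigh returns ascending eigenvalues for symmetric R

order = np.argsort(eigvals)[::-1]        # sort from largest to smallest
lam = eigvals[order]                     # lambda_1 >= lambda_2 >= ... >= lambda_m
Q = eigvecs[:, order]                    # columns are q_1, q_2, ..., q_m

print(np.allclose(R @ Q, Q @ np.diag(lam)))   # R Q = Q Lambda  -> True
print(np.allclose(Q.T @ Q, np.eye(m)))        # Q is orthogonal -> True
```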

Summary

  • Eigenvectors \(\mathbf{q}_j\) of the covariance matrix \(\mathbf{R}\) of the zero-mean random input vector \(\mathbf{X}\) define the principal directions, along which the variance of the projected inputs has extremal values.
  • The associated eigenvalues \(\lambda_j\) define those extremal values of the variance probe.

Usage

  • Project: \[\mathbf{a} = \mathbf{Q}^T\mathbf{x}\]
  • Recover: \[\mathbf{x} = \mathbf{Q} \mathbf{a}\]
  • We don’t need all \(m\) principal directions; depending on how much variance is captured by the first few eigenvalues:
    • Can do dimensionality reduction

Dimensionality reduction

  • Encoding: We can use the first \(l\) eigenvectors to encode \(\mathbf{x}\). \[[a_1,a_2,…,a_l]^T=[\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l]^T\mathbf{x}\]
    • Only need to calculate \(l\) projections
  • Decoding: Reconstruct the full \(\hat{\mathbf{x}}\). \[\mathbf{x}=\mathbf{Q}\mathbf{a} \approx [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l][a_1,a_2,…,a_l]^T=\hat{\mathbf{x}}\] Or alternatively, \[\hat{\mathbf{x}}=\mathbf{Q}[a_1,a_2,…,a_l,0,0,…,0]^T\]
    with \(m-l\) zeros. (A NumPy sketch of encoding and decoding follows this list.)
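A sketch of encoding with the first \(l\) eigenvectors and decoding back to \(\hat{\mathbf{x}}\), reusing `X` and `Q` from the snippets above (the choice \(l=2\) is arbitrary, for illustration):

```python
l = 2                                    # keep only the first l principal directions

Q_l = Q[:, :l]                           # [q_1, ..., q_l]
a = X @ Q_l                              # encode: a = Q_l^T x for each sample (n x l)
X_hat = a @ Q_l.T                        # decode: x_hat = Q_l a  (n x m reconstruction)

print(np.mean((X - X_hat) ** 2))         # small if the first l directions capture most variance
```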

Total variance

The total variance of the \(m\) components of the data vector \(\mathbf{x}\) is:

\[\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m}\lambda_j\]

The truncated version with the first \(l\) components:

\[\sum_{j=1}^{l} \sigma_j^2 = \sum_{j=1}^{l}\lambda_j\]

The larger the share of the total variance retained by the truncated version, the more accurate the dimensionality reduction.
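A one-liner (reusing `lam` and `l` from the snippets above) that checks how much of the total variance the first \(l\) eigenvalues capture:

```python
print(lam[:l].sum() / lam.sum())         # fraction of total variance retained by the first l components
```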

Relation to NN: Hebbian-Based Maximum Eigenfilter

Oja (1982) showed that a single linear neuron with Hebbian synapses can evolve into a filter for the first principal component of the input distribution.

  • Activation: \[y=\sum_{i=1}^{m} w_i x_i\]
  • Learning rule: \[w_i(n+1)=\frac{w_i(n)+\eta y(n)x_i(n)}{\left(\sum_{i=1}^{m} [w_i(n)+\eta y(n) x_i(n)]^2\right)^{1/2}}\]

Hebbian-Based Maximum Eigenfilter

Expanding the denominator as a power series (using \(\Vert\mathbf{w}(n)\Vert=1\) and \(\sum_{i=1}^{m} w_i(n)x_i(n)=y(n)\)) and dropping the higher-order terms:

\[w_i(n+1) = w_i(n) + \eta y(n)[x_i(n)-y(n)w_i(n)] +O(\eta^2)\]

Algorithm

  • Activation: \[y(n)=\mathbf{x}^T(n)\mathbf{w}(n)=\mathbf{w}^T(n)\mathbf{x}(n)\]
  • Learning: \[\mathbf{w}(n+1)=\mathbf{w}(n)+\eta y(n)[\mathbf{x}(n)-y(n)\mathbf{w}(n)]\]
  • Combining the above: \[\mathbf{w}(n+1)=\mathbf{w}(n)+\eta[\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n) \\ -\mathbf{w}^T(n)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)\mathbf{w}(n)]\]

This is a nonlinear stochastic difference equation, which is hard to analyze.
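A minimal simulation of the maximum eigenfilter under this update rule, reusing the data `X` and the eigenvectors `Q` from the snippets above (the learning rate and number of epochs are arbitrary choices, not prescribed by the notes):

```python
eta = 0.001
w = rng.normal(size=m)
w /= np.linalg.norm(w)                   # start from a random unit weight vector

for epoch in range(20):
    for x in X:                          # present one sample x(n) at a time
        y = w @ x                        # activation y(n) = w^T(n) x(n)
        w = w + eta * y * (x - y * w)    # Oja's rule: Hebbian term minus normalization term

q1 = Q[:, 0]                             # eigenvector of the largest eigenvalue
print(abs(w @ q1))                       # should approach 1 (w aligns with q_1 up to sign)
print(np.linalg.norm(w))                 # should approach 1
```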

Asymptotic stability theorem

To ease the analysis, the learning rule is rewritten as:

\[\mathbf{w}(n+1)=\mathbf{w}(n)+\eta(n)h(\mathbf{w}(n),\mathbf{x}(n))\]

The goal is to associate a deterministic ODE with the stochastic equation.

Under certain reasonable conditions on \(\eta(n)\), \(h(\cdot,\cdot)\), and \(\mathbf{w}(n)\), we get the asymptotic stability theorem stating that \[\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1\] infinitely often with probability 1, where \(\mathbf{q}_1\) is the normalized eigenvector associated with the largest eigenvalue \(\lambda_1\) of the covariance matrix \(\mathbf{R}\).

Conditions for stability

  1. \(\eta(n)\) is a decreasing sequence of positive real numbers such that \(\sum_{n=1}^{\infty}\eta(n)=\infty\), \(\sum_{n=1}^{\infty}\eta^p(n)<\infty\) for \(p>1\), and \(\eta(n)\rightarrow 0\) as \(n\rightarrow\infty\).
  2. The sequence of parameter vectors \(\mathbf{w}(n)\) is bounded with probability 1.
  3. The update function \(h(\mathbf{w},\mathbf{x})\) is continuously differentiable w.r.t. \(\mathbf{w}\) and \(\mathbf{x}\), and its derivatives are bounded in time.
  4. The limit \(\hat{h}(\mathbf{w})=\lim_{n \rightarrow \infty}E[h(\mathbf{w},\mathbf{X}(n))]\) exists for each \(\mathbf{w}\), where \(\mathbf{X}(n)\) is a random vector.
  5. There is a locally asymptotically stable solution to the ODE \[\frac{d}{dt}\mathbf{w}(t)=\hat{h}(\mathbf{w}(t))\]
  6. Let \(\mathbf{q}_1\) denote the locally asymptotically stable solution to the ODE above, with basin of attraction \(\mathcal{B}(\mathbf{q}_1)\). The parameter vector \(\mathbf{w}(n)\) enters a compact subset \(\mathcal{A}\) of the basin of attraction \(\mathcal{B}(\mathbf{q}_1)\) infinitely often with probability 1.

Summary

The Hebbian-based linear neuron converges with probability 1 to a fixed point, characterized as follows:

  • Variance of output: \[\lim_{n \rightarrow \infty} \sigma^2(n) = \lim_{n \rightarrow \infty} E[Y^2(n)] = \lambda_1\]
  • Synaptic weight approaches: \[\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1\]
    • with \[\lim_{n \rightarrow \infty} \Vert \mathbf{w}(n)\Vert = 1\]

Generalized Hebbian Algorithm for full PCA

  • Sanger (1989) showed how to construct a feedforward network that learns the first \(l\) eigenvectors of \(\mathbf{R}\) (the full set when \(l=m\)). A NumPy sketch follows this list.
  • Activation \[y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n), j = 1,2,…,l\]
  • Learning \[\Delta w_{ji}(n) = \eta \left[ y_j(n)x_i(n)-y_j(n)\sum_{k=1}^{j}w_{ki}(n)y_k(n)\right]; i= 1,2,…,m; j=1,2,…,l\]
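A sketch of the GHA updates for \(l\) output neurons, reusing `X`, `Q`, and `m` from the snippets above (hyperparameters are illustrative):

```python
l, eta = 2, 0.001
W = rng.normal(size=(l, m)) * 0.1        # row j is the weight vector w_j of output neuron j

for epoch in range(20):
    for x in X:
        y = W @ x                        # y_j(n) = sum_i w_ji(n) x_i(n)
        for j in range(l):
            # Sanger's rule: Hebbian term minus projections onto directions k <= j.
            W[j] += eta * y[j] * (x - W[:j + 1].T @ y[:j + 1])

print(np.abs(W @ Q[:, :l]))              # rows of W roughly align with q_1, ..., q_l (near identity)
```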
