2017-04-22

## Motivation

Project the given data onto a direction such that the variance of the projected points is maximized.

## Variance probe

• $\mathbf{X}$ is an $m$-dimensional random vector
• i.e., a vector random variable following some probability distribution
• Assume $E[\mathbf{X}] = \mathbf{0}$
• Project $\mathbf{X}$ onto a unit vector $\mathbf{q}$ to get the scalar $A = \mathbf{q}^T\mathbf{X}$
• Variance: $\sigma^2=E[A^2]=\mathbf{q}^T\mathbf{R}\mathbf{q}$, where $\mathbf{R}=E[\mathbf{X}\mathbf{X}^T]$ is the covariance matrix of $\mathbf{X}$

This gives the variance probe $\psi(\mathbf{q})=\mathbf{q}^T\mathbf{R}\mathbf{q}$: different unit vectors $\mathbf{q}$ result in smaller or larger variance of the projected points.

To find the unit vectors where the variance probe $\psi(\mathbf{q})$ has extremal values, solve:

$\mathbf{Rq}=\lambda\mathbf{q}$

where $\lambda$ is a scaling factor, so this is an eigenvalue problem.
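
As a quick numerical check (a minimal sketch; the $2 \times 2$ matrix below is made up purely for illustration), the eigenvectors of $\mathbf{R}$ are exactly the unit vectors at which the variance probe is extremal:

```python
import numpy as np

# Hypothetical 2x2 covariance matrix, chosen only for illustration
R = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def psi(q):
    """Variance probe psi(q) = q^T R q for a unit vector q."""
    return q @ R @ q

eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
q1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
print(psi(q1), eigvals[-1])            # equal: psi(q1) = lambda_1

print(psi(np.array([1.0, 0.0])) <= psi(q1))   # any other unit vector gives less variance
```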

## PCA

With an $m \times m$ covariance matrix $\mathbf{R}$, we can get $m$ eigenvectors and $m$ eigenvalues.

$\mathbf{R}\mathbf{q}_j = \lambda_j\mathbf{q}_j, j=1,2,…,m$

• $\lambda_j$ are sorted from largest to smallest.
• The eigenvectors are arranged as $\mathbf{Q} = [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_m]$
• Then we can write $\mathbf{R}\mathbf{Q}=\mathbf{Q}\boldsymbol{\Lambda}$ where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1,\lambda_2,...,\lambda_m)$
• $\mathbf{Q}$ is orthogonal
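
A minimal numpy sketch of this eigendecomposition (the data matrix `X` below is synthetic and purely illustrative; later sketches reuse `X`, `Q`, and `eigvals`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic zero-mean data: 1000 samples, m = 3, with anisotropic variance
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)                     # enforce the zero-mean assumption

R = (X.T @ X) / len(X)                  # sample covariance matrix, m x m

eigvals, Q = np.linalg.eigh(R)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
eigvals, Q = eigvals[order], Q[:, order]

Lam = np.diag(eigvals)
print(np.allclose(R @ Q, Q @ Lam))      # R Q = Q Lambda -> True
print(np.allclose(Q.T @ Q, np.eye(3)))  # Q is orthogonal -> True
```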

### Summary

• Eigenvectors of the covariance matrix $\mathbf{R}$ of the zero-mean random input vector $\mathbf{X}$ define the principal directions $\mathbf{q}_j$, along which the variance of the projected inputs has extremal values.
• Associated eigenvalues define the extremal values of the variance probe.

### Usage

• Project: $\mathbf{a} = \mathbf{Q}^T\mathbf{x}$
• Recover: $\mathbf{x} = \mathbf{Q} \mathbf{a}$
• We may not need all $m$ principal directions, depending on how much variance is captured by the first few eigenvalues
• Can do dimensionality reduction
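
A tiny check of the project/recover pair above, reusing `X` and `Q` from the previous sketch:

```python
x = X[0]                        # one zero-mean sample
a = Q.T @ x                     # project: principal components of x
x_rec = Q @ a                   # recover: exact, since Q is orthogonal
print(np.allclose(x, x_rec))    # True
```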

### Dimensionality reduction

• Encoding: We can use the first $l$ eigenvectors to encode $\mathbf{x}$. $[a_1,a_2,…,a_l]^T=[\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l]^T\mathbf{x}$
• Only need to calculate $l$ projections
• Decoding: reconstruct the full $\mathbf{x}=[x_1,x_2,...,x_m]^T$ via $\mathbf{x}=\mathbf{Q}\mathbf{a} \approx [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l][a_1,a_2,…,a_l]^T=\hat{\mathbf{x}}$, or equivalently $\hat{\mathbf{x}}=\mathbf{Q}[a_1,a_2,…,a_l,0,0,…,0]^T$ with $m-l$ zeros.
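
A minimal sketch of the truncated encode/decode (the choice $l=2$ is arbitrary), continuing the example above:

```python
l = 2                           # keep only the first l principal directions
Ql = Q[:, :l]                   # m x l matrix [q_1, ..., q_l]

a_l = Ql.T @ x                  # encoding: only l projections are computed
x_hat = Ql @ a_l                # decoding: rank-l approximation of x

# Equivalent form: pad the code with m - l zeros and use the full Q
a_pad = np.concatenate([a_l, np.zeros(len(x) - l)])
print(np.allclose(x_hat, Q @ a_pad))    # True
```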

### Total variance

The total variance of the $m$ components of the data vector is:

$\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m}\lambda_j$

The truncated version with the first $l$ components is:

$\sum_{j=1}^{l} \sigma_j^2 = \sum_{j=1}^{l}\lambda_j$

The larger the fraction of the total variance retained by the truncated version, the more accurate the dimensionality reduction.
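
For instance, the fraction of total variance retained by the first $l$ components can be read directly off the eigenvalues (continuing the sketch above):

```python
retained = eigvals[:l].sum() / eigvals.sum()
print(f"first {l} components retain {retained:.1%} of the total variance")
```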

## Relation to NN: Hebbian-Based Maximum Eigenfilter

Oja (1982) showed that a single linear neuron with a Hebbian synapse can evolve into a filter for the first principal component of the input distribution.

• Activation: $y=\sum_{i=1}^{m} w_i x_i$
• Learning rule: $w_i(n+1)=\frac{w_i(n)+\eta y(n)x_i(n)}{\left(\sum_{i=1}^{m} [w_i(n)+\eta y(n) x_i(n)]^2\right)^{1/2}}$

### Hebbian-Based Maximum Eigenfilter

Expanding the denominator as a power series in $\eta$ and dropping the higher-order terms gives:

$w_i(n+1) = w_i(n) + \eta y(n)[x_i(n)-y(n)w_i(n)] +O(\eta^2)$

#### Algorithm

• Activation: $y(n)=\mathbf{x}^T(n)\mathbf{w}(n)=\mathbf{w}^T(n)\mathbf{x}(n)$
• Learning: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta y(n)[\mathbf{x}(n)-y(n)\mathbf{w}(n)]$
• Combining the above: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta[\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)-\mathbf{w}^T(n)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)\mathbf{w}(n)]$

This is a nonlinear stochastic difference equation, which is hard to analyze.
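
Despite that, the update itself is easy to simulate. A minimal sketch of the maximum eigenfilter (the learning rate and iteration count are arbitrary illustrative choices), reusing the synthetic `X` and the eigendecomposition from above:

```python
eta = 0.01
w = rng.normal(size=X.shape[1])
w /= np.linalg.norm(w)                     # start from a random unit vector

for n in range(20_000):
    x_n = X[rng.integers(len(X))]          # draw a random input sample
    y = w @ x_n                            # activation y(n) = w^T(n) x(n)
    w += eta * y * (x_n - y * w)           # Hebbian term minus stabilizing term

q1 = Q[:, 0]                               # first principal direction
print(abs(w @ q1), np.linalg.norm(w))      # both approach 1 (w -> +/- q1)
```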

### Asymptotic stability theorem

To ease the analysis, the learning rule is rewritten as:

$\mathbf{w}(n+1)=\mathbf{w}(n)+\eta(n)h(\mathbf{w}(n),\mathbf{x}(n))$

The goal is to associate a deterministic ODE with the stochastic equation.

Under certain reasonable conditions on $\eta$, $h(\cdot,\cdot)$, and $\mathbf{w}$ (listed below), the asymptotic stability theorem states that $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$ infinitely often with probability 1, where $\mathbf{q}_1$ is the normalized eigenvector associated with the largest eigenvalue $\lambda_1$ of the covariance matrix $\mathbf{R}$.

### Conditions for stability

1. $\eta(n)$ is a decreasing sequence of positive real numbers such that $\sum_{n=1}^{\infty}\eta(n)=\infty$, $\sum_{n=1}^{\infty}\eta^p(n)<\infty$ for $p>1$, and $\eta(n) \rightarrow 0$ as $n \rightarrow \infty$.
2. Sequence of parameter vectors $\mathbf{w}(\centerdot)$ is bounded with probability 1.
3. The update function $h(\mathbf{w},\mathbf{x})$ is continuously differentiable w.r.t. $\mathbf{w}$ and $\mathbf{x}$, and its derivatives are bounded in time.
4. The limit $\bar{h}(\mathbf{w})=\lim_{n \rightarrow \infty}E[h(\mathbf{w},\mathbf{X})]$ exists for each $\mathbf{w}$, where $\mathbf{X}$ is a random vector
5. There is a locally asymptotically stable solution to the ODE $\frac{d}{dt}\mathbf{w}(t)=\bar{h}(\mathbf{w}(t))$
6. Let $\mathbf{q}_1$ denote the locally asymptotically stable solution to the ODE above, with basin of attraction $\mathcal{B}(\mathbf{q}_1)$. The parameter vector $\mathbf{w}(n)$ enters the compact subset $\mathcal{A}$ of $\mathcal{B}(\mathbf{q}_1)$ infinitely often with probability 1.

### Summary

The Hebbian-based linear neuron converges with probability 1 to a fixed point:

• Variance of output: $\lim_{n \rightarrow \infty} \sigma^2(n) = \lim_{n \rightarrow \infty} E[Y^2(n)] = \lambda_1$
• Synaptic weight approaches: $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$
• with $\lim_{n \rightarrow \infty} \Vert \mathbf{w}(n)\Vert = 1$
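
Continuing the sketch above (with `w` the weight vector after training), these limits can be checked numerically:

```python
y_all = X @ w                      # filter output over the whole synthetic data set
print(np.var(y_all), eigvals[0])   # output variance approaches lambda_1
print(np.linalg.norm(w))           # ||w|| approaches 1
```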

### Generalized Hebbian Algorithm for full PCA

• Sanger (1989) shows how to construct a feedforward network to learn all the eigenvectors of $\mathbf{R}$.
• Activation $y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n), j = 1,2,…,l$
• Learning $\Delta w_{ji}(n) = \eta \left[ y_j(n)x_i(n)-y_j(n)\sum_{k=1}^{j}w_{ki}(n)y_k(n)\right]; i= 1,2,…,m; j=1,2,…,l$
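
A minimal sketch of this rule in matrix form ($l$, $\eta$, and the iteration count are arbitrary illustrative choices; `X` and `Q` are reused from the sketches above):

```python
l, eta = 2, 0.01
W = 0.1 * rng.normal(size=(l, X.shape[1]))     # row j of W is w_j

for n in range(30_000):
    x_n = X[rng.integers(len(X))]
    y = W @ x_n                                 # y_j(n) = sum_i w_ji(n) x_i(n)
    # Sanger's rule: the lower-triangular part of y y^T implements the sum over k <= j
    W += eta * (np.outer(y, x_n) - np.tril(np.outer(y, y)) @ W)

# Rows of W converge (up to sign) to the first l eigenvectors q_1, ..., q_l
print(np.abs(W @ Q[:, :l]))                     # approximately the identity matrix
```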
