2017-04-22

## Motivation

Project given data so that the variance in the projected points is maximized.

## Variance probe

• $\mathbf{X}$ is a m-dimentional random vector
• vector random variable following a certain probability distribution
• Assume $E[\mathbf{X}] = \mathbf{0}$
• Project a unit vector $\mathbf{q}$ on $\mathbf{X}$; get $A$
• Variance: $\sigma^2=E[A^2]=\mathbf{q}^T\mathbf{R}\mathbf{q}$

We would have a variance probe: $\psi(\mathbf{q})=\mathbf{q}^T\mathbf{R}\mathbf{q}$. With different unit vectors $\mathbf{q}$ will result in smaller or larger variance in the projected points.

To find the solution where the variance $\psi(\mathbf{q})$ has extremal value, solve for the unit vectors satisfying:

$\mathbf{Rq}=\lambda\mathbf{q}$

Where $\lambda$ is a scaling factor. So here we have an eigenvalue problem.

## PCA

With an $m \times m$ covariance matrix $\mathbf{R}$, we can get $m$ eigenvectors and $m$ eigenvalues.

$\mathbf{R}\mathbf{q}_j = \lambda_j\mathbf{q}_j, j=1,2,…,m$

• $\lambda_j$ are sorted from largest to smallest.
• eigenvectors are arranged $Q = [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_m]$
• Then we can write $\mathbf{R}\mathbf{Q}=\mathbf{Q}\lambda$ where $\lambda = \text{diag}(\lambda_1,\lambda_2,...,\lambda_m)$
• $\mathbf{Q}$ is orthogonal

### Summary

• Eigenvectors of the covariance matrix $\mathbf{R}$ of zero-mean random input vector $\mathbf{X}$ define the principal directions $\mathbf{q}_j$, along which the variance of the projected inputs have extremal values.
• Associated eigenvalues define the extremal values of the variance probe.

### Usage

• Project: $\mathbf{a} = \mathbf{Q}^T\mathbf{x}$
• Recover: $\mathbf{x} = \mathbf{Q} \mathbf{a}$
• We don’t need all $m$ principal directions, depending on how much variance is captured in the first few eigenvalues:
• Can do dimensionality reduction

### Dimensionality reduction

• Encoding: We can use the first $l$ eigenvectors to encode $\mathbf{x}$. $[a_1,a_2,…,a_l]^T=[\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l]^T\mathbf{x}$
• Only need to calculate $l$ projections
• Decoding: Reconstruct full $[x_1,x_2,...,x_m]^T$ $\mathbf{x}=\mathbf{Q}\mathbf{a} \approx [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l][a_1,a_2,…,a_l]^T=\hat{\mathbf{x}}$ Or alternatively, $\hat{\mathbf{x}}=Q[a_1,a_2,…,a_l,0,0,…,0]^T$
with $m-l$ zeros.

### Total variance

Total variance of the $m$ components of the data vector is:

$\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m}\lambda_j$

Truncated version with first $l$ components

$\sum_{j=1}^{l} \sigma_j^2 = \sum_{j=1}^{l}\lambda_j$

The larger the variance in the truncated version, the more accurate the dimensionality reduction.

## Relation to NN: Hebbian-Based Maximum Eigenfilter

Oja(1982) shows that a single linear with Hebbian synapse can evolve into a filter for the first principal component of the input distribution.

• Activation: $y=\sum_{i=1}^{m} w_i x_i$
• Learning rule: $w_i(n+1)=\frac{w_i(n)+\eta y(n)x_i(n)}{\left(\sum_{i=1}^{m} [w_i(n)+\eta y(n) x_i(n)]^2\right)^{1/2}}$

### Hebbian-Based Maximum Eigenfilter

Expanding the denominator as a power series, dropping the higher order terms

$w_i(n+1) = w_i(n) + \eta y(n)[x_i(n)-y(n)w_i(n)] +O(\eta^2)$

#### Algorithm

• Activation: $y(n)=\mathbf{x}^T(n)\mathbf{w}(n)=\mathbf{w}^T(n)\mathbf{x}(n)$
• Learning: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta y(n)[\mathbf{x}(n)-y(n)\mathbf{w}(n)]$
• Combining the above: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta[\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n) \\ -\mathbf{w}^T(n)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)\mathbf{w}(n)]$

This is a nonlinear stochastic difference equation, which is hard to analyze.

### Asymptotic stability theorem

To ease the analysis, learning rule is rewriten as:

$\mathbf{w}(n+1)=\mathbf{w}(n)+\eta(n)h(\mathbf{w}(n),\mathbf{x}(n))$

The goal is to associate a deterministic ODE with the stochastic equation.

Under certain reasonable conditions on $\eta$, $h(\centerdot,\centerdot)$, and $\mathbf{w}$, we get the asymptotic stability theorem stating that $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$ infinitely often with probability 1. $\mathbf{q}_1$ is the normalized eigenvector associated with the largest eigenvalue $\lambda_1$ of the covariance matrix $\mathbf{R}$.

### Conditions for stability

1. $\eta(n)$ is a decreasing sequence of positive real numbers such that
2. Sequence of parameter vectors $\mathbf{w}(\centerdot)$ is bounded with probability 1.
3. The update function $h(\mathbf{w},\mathbf{x})$ is continuously differentiable w.r.t $\mathbf{w}$ and $\mathbf{x}$, and it derivatives are bounded in time.
4. The limit $\bar{h}(\mathbf{w})=\lim_{n \rightarrow \infty}E[h(\mathbf{w},\mathbf{X})]$ exists for each $\mathbf{w}$, where $\mathbf{X}$ is a random vector
5. There is a locally asymptotically stable solution to the ODE $\frac{d}{dt}\mathbf{w}(t)=\hat{h}(\mathbf{w}(t))$
6. Let $\mathbf{q}_1$ denote the solution the the ODE above with a basin of attraction $\mathcal{B}(\mathbf{q})$. The parameter vector $\mathbf{w}(n)$ enters the compact subset $\mathcal{A}$ of $\mathcal{B}(\mathbf{q})$ infinitely often with prob 1.

### Summary

Hebbian-based linear neuron converges with prob 1 to a fixed point.

• Variance of output: $\lim_{n \rightarrow \infty} \sigma^2(n) = \lim_{n \rightarrow \infty} E[Y^2(n)] = \lambda_1$
• Synaptic weight approaches: $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$
• with $\lim_{n \rightarrow \infty} \Vert \mathbf{w}(n)\Vert = 1$

### Generalized Hebbian Algorithm for full PCA

• Sanger(1989) show how to construct a feedforward network to learn all the eigenvectors of $\mathbf{R}$.
• Activation $y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n), j = 1,2,…,l$
• Learning $\Delta w_{ji}(n) = \eta \left[ y_j(n)x_i(n)-y_j(n)\sum_{k=1}^{j}w_{ki}(n)y_k(n)\right]; i= 1,2,…,m; j=1,2,…,l$