2017-04-22

## Motivation

Project the given data onto a direction such that the variance of the projected points is maximized.

## Variance probe

• $\mathbf{X}$ is an $m$-dimensional random vector
• i.e., a vector random variable following some probability distribution
• Assume $E[\mathbf{X}] = \mathbf{0}$
• Project $\mathbf{X}$ onto a unit vector $\mathbf{q}$ to get the scalar $A = \mathbf{q}^T\mathbf{X}$
• Variance: $\sigma^2=E[A^2]=\mathbf{q}^T\mathbf{R}\mathbf{q}$, where $\mathbf{R}=E[\mathbf{X}\mathbf{X}^T]$ is the covariance matrix of $\mathbf{X}$

This gives the variance probe $\psi(\mathbf{q})=\mathbf{q}^T\mathbf{R}\mathbf{q}$: different unit vectors $\mathbf{q}$ result in smaller or larger variance of the projected points.

To find the unit vectors where the variance probe $\psi(\mathbf{q})$ has extremal values, solve:

$\mathbf{Rq}=\lambda\mathbf{q}$

where $\lambda$ is a scaling factor, so this is an eigenvalue problem.
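
As a quick numerical check (a minimal sketch; the $2 \times 2$ matrix below is made up purely for illustration), the eigenvectors of $\mathbf{R}$ are exactly the unit vectors at which the variance probe is extremal:

```python
import numpy as np

# Hypothetical 2x2 covariance matrix, chosen only for illustration
R = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def psi(q):
    """Variance probe psi(q) = q^T R q for a unit vector q."""
    return q @ R @ q

eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
q1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
print(psi(q1), eigvals[-1])            # equal: psi(q1) = lambda_1

print(psi(np.array([1.0, 0.0])) <= psi(q1))   # any other unit vector gives less variance
```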

## PCA

With an $m \times m$ covariance matrix $\mathbf{R}$, we can get $m$ eigenvectors and $m$ eigenvalues.

$\mathbf{R}\mathbf{q}_j = \lambda_j\mathbf{q}_j, j=1,2,…,m$

• $\lambda_j$ are sorted from largest to smallest.
• The eigenvectors are arranged as $\mathbf{Q} = [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_m]$
• Then we can write $\mathbf{R}\mathbf{Q}=\mathbf{Q}\boldsymbol{\Lambda}$ where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1,\lambda_2,...,\lambda_m)$
• $\mathbf{Q}$ is orthogonal
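
A minimal numpy sketch of this eigendecomposition (the data matrix `X` below is synthetic and purely illustrative; later sketches reuse `X`, `Q`, and `eigvals`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic zero-mean data: 1000 samples, m = 3, with anisotropic variance
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])
X -= X.mean(axis=0)                     # enforce the zero-mean assumption

R = (X.T @ X) / len(X)                  # sample covariance matrix, m x m

eigvals, Q = np.linalg.eigh(R)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
eigvals, Q = eigvals[order], Q[:, order]

Lam = np.diag(eigvals)
print(np.allclose(R @ Q, Q @ Lam))      # R Q = Q Lambda -> True
print(np.allclose(Q.T @ Q, np.eye(3)))  # Q is orthogonal -> True
```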

### Summary

• Eigenvectors of the covariance matrix $\mathbf{R}$ of the zero-mean random input vector $\mathbf{X}$ define the principal directions $\mathbf{q}_j$, along which the variance of the projected inputs has extremal values.
• Associated eigenvalues define the extremal values of the variance probe.

### Usage

• Project: $\mathbf{a} = \mathbf{Q}^T\mathbf{x}$
• Recover: $\mathbf{x} = \mathbf{Q} \mathbf{a}$
• We may not need all $m$ principal directions, depending on how much variance is captured by the first few eigenvalues
• Can do dimensionality reduction
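
A tiny check of the project/recover pair above, reusing `X` and `Q` from the previous sketch:

```python
x = X[0]                        # one zero-mean sample
a = Q.T @ x                     # project: principal components of x
x_rec = Q @ a                   # recover: exact, since Q is orthogonal
print(np.allclose(x, x_rec))    # True
```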

### Dimensionality reduction

• Encoding: We can use the first $l$ eigenvectors to encode $\mathbf{x}$. $[a_1,a_2,…,a_l]^T=[\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l]^T\mathbf{x}$
• Only need to calculate $l$ projections
• Decoding: reconstruct the full $\mathbf{x}=[x_1,x_2,...,x_m]^T$ via $\mathbf{x}=\mathbf{Q}\mathbf{a} \approx [\mathbf{q}_1,\mathbf{q}_2,…,\mathbf{q}_l][a_1,a_2,…,a_l]^T=\hat{\mathbf{x}}$, or equivalently $\hat{\mathbf{x}}=\mathbf{Q}[a_1,a_2,…,a_l,0,0,…,0]^T$ with $m-l$ zeros.
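
A minimal sketch of the truncated encode/decode (the choice $l=2$ is arbitrary), continuing the example above:

```python
l = 2                           # keep only the first l principal directions
Ql = Q[:, :l]                   # m x l matrix [q_1, ..., q_l]

a_l = Ql.T @ x                  # encoding: only l projections are computed
x_hat = Ql @ a_l                # decoding: rank-l approximation of x

# Equivalent form: pad the code with m - l zeros and use the full Q
a_pad = np.concatenate([a_l, np.zeros(len(x) - l)])
print(np.allclose(x_hat, Q @ a_pad))    # True
```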

### Total variance

The total variance of the $m$ components of the data vector is:

$\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m}\lambda_j$

The truncated version with the first $l$ components is:

$\sum_{j=1}^{l} \sigma_j^2 = \sum_{j=1}^{l}\lambda_j$

The larger the fraction of the total variance retained by the truncated version, the more accurate the dimensionality reduction.
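
For instance, the fraction of total variance retained by the first $l$ components can be read directly off the eigenvalues (continuing the sketch above):

```python
retained = eigvals[:l].sum() / eigvals.sum()
print(f"first {l} components retain {retained:.1%} of the total variance")
```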

## Relation to NN: Hebbian-Based Maximum Eigenfilter

Oja (1982) showed that a single linear neuron with a Hebbian synapse can evolve into a filter for the first principal component of the input distribution.

• Activation: $y=\sum_{i=1}^{m} w_i x_i$
• Learning rule: $w_i(n+1)=\frac{w_i(n)+\eta y(n)x_i(n)}{\left(\sum_{i=1}^{m} [w_i(n)+\eta y(n) x_i(n)]^2\right)^{1/2}}$

### Hebbian-Based Maximum Eigenfilter

Expanding the denominator as a power series in $\eta$ and dropping the higher-order terms gives:

$w_i(n+1) = w_i(n) + \eta y(n)[x_i(n)-y(n)w_i(n)] +O(\eta^2)$

#### Algorithm

• Activation: $y(n)=\mathbf{x}^T(n)\mathbf{w}(n)=\mathbf{w}^T(n)\mathbf{x}(n)$
• Learning: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta y(n)[\mathbf{x}(n)-y(n)\mathbf{w}(n)]$
• Combining the above: $\mathbf{w}(n+1)=\mathbf{w}(n)+\eta[\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)-\mathbf{w}^T(n)\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{w}(n)\mathbf{w}(n)]$

This is a nonlinear stochastic difference equation, which is hard to analyze.
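
Despite that, the update itself is easy to simulate. A minimal sketch of the maximum eigenfilter (the learning rate and iteration count are arbitrary illustrative choices), reusing the synthetic `X` and the eigendecomposition from above:

```python
eta = 0.01
w = rng.normal(size=X.shape[1])
w /= np.linalg.norm(w)                     # start from a random unit vector

for n in range(20_000):
    x_n = X[rng.integers(len(X))]          # draw a random input sample
    y = w @ x_n                            # activation y(n) = w^T(n) x(n)
    w += eta * y * (x_n - y * w)           # Hebbian term minus stabilizing term

q1 = Q[:, 0]                               # first principal direction
print(abs(w @ q1), np.linalg.norm(w))      # both approach 1 (w -> +/- q1)
```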

### Asymptotic stability theorem

To ease the analysis, the learning rule is rewritten as:

$\mathbf{w}(n+1)=\mathbf{w}(n)+\eta(n)h(\mathbf{w}(n),\mathbf{x}(n))$

The goal is to associate a deterministic ODE with the stochastic equation.

Under certain reasonable conditions on $\eta$, $h(\cdot,\cdot)$, and $\mathbf{w}$ (listed below), the asymptotic stability theorem states that $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$ infinitely often with probability 1, where $\mathbf{q}_1$ is the normalized eigenvector associated with the largest eigenvalue $\lambda_1$ of the covariance matrix $\mathbf{R}$.

### Conditions for stability

1. $\eta(n)$ is a decreasing sequence of positive real numbers such that $\sum_{n=1}^{\infty}\eta(n)=\infty$, $\sum_{n=1}^{\infty}\eta^p(n)<\infty$ for $p>1$, and $\eta(n) \rightarrow 0$ as $n \rightarrow \infty$.
2. Sequence of parameter vectors $\mathbf{w}(\centerdot)$ is bounded with probability 1.
3. The update function $h(\mathbf{w},\mathbf{x})$ is continuously differentiable w.r.t. $\mathbf{w}$ and $\mathbf{x}$, and its derivatives are bounded in time.
4. The limit $\bar{h}(\mathbf{w})=\lim_{n \rightarrow \infty}E[h(\mathbf{w},\mathbf{X})]$ exists for each $\mathbf{w}$, where $\mathbf{X}$ is a random vector
5. There is a locally asymptotically stable solution to the ODE $\frac{d}{dt}\mathbf{w}(t)=\bar{h}(\mathbf{w}(t))$
6. Let $\mathbf{q}_1$ denote the locally asymptotically stable solution to the ODE above, with basin of attraction $\mathcal{B}(\mathbf{q}_1)$. The parameter vector $\mathbf{w}(n)$ enters the compact subset $\mathcal{A}$ of $\mathcal{B}(\mathbf{q}_1)$ infinitely often with probability 1.

### Summary

The Hebbian-based linear neuron converges with probability 1 to a fixed point:

• Variance of output: $\lim_{n \rightarrow \infty} \sigma^2(n) = \lim_{n \rightarrow \infty} E[Y^2(n)] = \lambda_1$
• Synaptic weight approaches: $\lim_{n \rightarrow \infty} \mathbf{w}(n) = \mathbf{q}_1$
• with $\lim_{n \rightarrow \infty} \Vert \mathbf{w}(n)\Vert = 1$
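
Continuing the sketch above (with `w` the weight vector after training), these limits can be checked numerically:

```python
y_all = X @ w                      # filter output over the whole synthetic data set
print(np.var(y_all), eigvals[0])   # output variance approaches lambda_1
print(np.linalg.norm(w))           # ||w|| approaches 1
```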

### Generalized Hebbian Algorithm for full PCA

• Sanger (1989) shows how to construct a feedforward network to learn all the eigenvectors of $\mathbf{R}$.
• Activation $y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n), j = 1,2,…,l$
• Learning $\Delta w_{ji}(n) = \eta \left[ y_j(n)x_i(n)-y_j(n)\sum_{k=1}^{j}w_{ki}(n)y_k(n)\right]; i= 1,2,…,m; j=1,2,…,l$
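
A minimal sketch of this rule in matrix form ($l$, $\eta$, and the iteration count are arbitrary illustrative choices; `X` and `Q` are reused from the sketches above):

```python
l, eta = 2, 0.01
W = 0.1 * rng.normal(size=(l, X.shape[1]))     # row j of W is w_j

for n in range(30_000):
    x_n = X[rng.integers(len(X))]
    y = W @ x_n                                 # y_j(n) = sum_i w_ji(n) x_i(n)
    # Sanger's rule: the lower-triangular part of y y^T implements the sum over k <= j
    W += eta * (np.outer(y, x_n) - np.tril(np.outer(y, y)) @ W)

# Rows of W converge (up to sign) to the first l eigenvectors q_1, ..., q_l
print(np.abs(W @ Q[:, :l]))                     # approximately the identity matrix
```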
