Motivation
Shannon (1948) developed information theory to describe the communication process.
- It is a deep mathematical theory concerned with the essence of the communication process.
- Provides a framework for studying the efficiency of information representation and the limits on reliable transmission of information over a communication channel.
- Gives bounds on the optimum representation and transmission of signals.
Motivation
- Maximum mutual information principle (Linsker 1988): Synaptic connections of a multilayered neural network develop in such a way as to maximize the amount of information preserved when signals are transformed at each processing stage of the network, subject to certain constraints.
- Redundancy reduction (Attneave 1954): “Major function of perceptual machinery is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors”. In other words, redundancy reduction = feature extraction.
Topics
- Entropy
- Mutual information
- Relative entropy
- Differential entropy of continuous random variables
Entropy
- Notation: random variable $X$, value $x_k$ of the random variable.
- $X$ can be quantized into a finite number of discrete levels, even though it may take continuous values: \[X=\{ x_k \mid k=0, \pm 1,\dots, \pm K\}\]
- The values $x_k$ occur with probability \[p_k = P(X=x_k)\] subject to the requirement \[0 \le p_k \le 1, \quad \sum_{k=-K}^{K} p_k=1\]
Uncertainty, Surprise, Information, and Entropy
- If $p_k = 1$ and $p_j = 0$ for all $j \ne k$, there is no surprise and no uncertainty when $X = x_k$ is observed.
- High probability leads to less surprise and less uncertainty.
- We gain information when going from a high-uncertainty state to a low-uncertainty state.
Entropy
Uncertainty measure for the event $X = x_k$ (i.e., $X$ assumes the value $x_k$ with probability $p_k$):
\[I(x_k) = \log(\frac{1}{p_k}) = - \log p_k\]
Entropy of a random variable: Average uncertainty
\[H(X) = E[I(x_k)] = - \sum_{k=-K}^{K} p_k\log p_k\]
Properties
- The higher the entropy $H(X)$, the higher the potential information you can gain through observation/measurement.
- Bounds on the entropy: \[0 \le H(X) \le \log(2K+1)\]
- $H(X) = 0$ when $p_k = 1$ for some $k$ and $p_j = 0$ for all $j \ne k$: no uncertainty.
- $H(X) = \log(2K+1)$ when $p_k = \frac{1}{2K+1}$ for all $k$: maximum uncertainty, attained when all events are equiprobable.
- For two probability distributions $\{p_k\}$ and $\{q_k\}$: \[\sum_k p_k \log \left( \frac{p_k}{q_k} \right) \ge 0\]
- Kullback-Leibler divergence (relative entropy): \[D_{p\Vert q} = \sum_{x \in \mathcal{X}} p_X(x) \log \left(\frac{p_X(x)}{q_X(x)}\right)\] measures how different two probability distributions are.
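A quick numeric check of these properties in Python (the helper functions and the example distributions below are illustrative choices, not from the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_k p_k log p_k (natural log, in nats)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # convention: 0 log 0 = 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """D(p || q) = sum_k p_k log(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

K = 2                                        # 2K + 1 = 5 outcomes
uniform = np.full(2 * K + 1, 1 / (2 * K + 1))
certain = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
skewed  = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

print(entropy(certain))                      # 0.0: no uncertainty
print(entropy(skewed))                       # strictly between the two bounds
print(entropy(uniform), np.log(2 * K + 1))   # both equal log(2K+1): maximum uncertainty
print(kl_divergence(skewed, uniform))        # >= 0, zero only for a perfect match
```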
Differential entropy
\[h(X)=-\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx = -E[\log f_X(x)]\]
In the limit $\delta x \rightarrow 0$, the entropy $H(X)$ of the quantized variable does not equal $h(X)$:
\[H(X)=h(X)-\lim_{\delta x \rightarrow 0}\log \delta x\]
Properties
- For a vector random variable $\mathbf{X}$ and an invertible matrix $\mathbf{A}$: \[h(\mathbf{AX}) = h(\mathbf{X}) + \log \vert \det(\mathbf{A}) \vert\]
Maximum entropy principle
Pick the probability distribution that maximizes the entropy, subject to constraints on the distribution.
1-D Gaussian distribution
- For a given variance $\sigma^2$, the Gaussian random variable has the largest differential entropy attainable by any random variable with that variance.
- The entropy of a Gaussian random variable $X$ is uniquely determined by the variance of $X$.
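As a hedged illustration of this maximum-entropy property, we can compare the closed-form differential entropies of three densities with the same variance; the uniform and Laplacian entropy formulas are standard results assumed here, not derived in these notes:

```python
import numpy as np

sigma2 = 1.0   # common variance shared by all three densities

# Closed-form differential entropies (in nats) of three zero-mean densities
# with the same variance sigma2:
h_gauss   = 0.5 * (1.0 + np.log(2.0 * np.pi * sigma2))    # ~1.419
h_uniform = np.log(np.sqrt(12.0 * sigma2))                # width sqrt(12*sigma2), ~1.242
h_laplace = 1.0 + np.log(2.0 * np.sqrt(sigma2 / 2.0))     # scale sqrt(sigma2/2), ~1.347

print(h_gauss, h_uniform, h_laplace)   # the Gaussian is the largest, as claimed
```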
Mutual Information
- Conditional entropy: the uncertainty remaining in $X$ after observing $Y$: \[H(X\vert Y) = H(X,Y)-H(Y)\]
- Joint-entropy is defined as: \[H(X,Y)=-\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\]
- Mutual information: how much uncertainty about $X$ is reduced when we observe $Y$? \[I(X;Y)=H(X)-H(X \vert Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]
Continuous random variable
\[I(X;Y)=h(X)-h(X \vert Y) \\ = h(Y)-h(Y \vert X) \\ = h(X)+ h(Y)-h(X , Y)\]
\[I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y) \log \left( \frac{f_X(x\vert y)}{f_X(x)}\right)dxdy\]
Properties of KL Divergence
- Always nonnegative; zero means there is a perfect match between the two distributions.
- Invariant with respect to:
- Permutation of the order in which the components of the vector random variable are arranged.
- Amplitude scaling.
- Monotonic nonlinear transformation.
- Related to mutual info \[I(\mathbf{X};\mathbf{Y})=D_{f_{\mathbf{X},\mathbf{Y}}\Vert f_\mathbf{X} f_\mathbf{Y}}\]
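The identity above can be checked numerically for a small discrete joint distribution; the 2x2 table below is a made-up example:

```python
import numpy as np

# A small made-up joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1)                   # marginal of X
p_y = p_xy.sum(axis=0)                   # marginal of Y

H = lambda p: -np.sum(p * np.log(p))     # entropy in nats (all entries > 0 here)

# I(X;Y) via entropies: H(X) + H(Y) - H(X,Y)
mi_entropies = H(p_x) + H(p_y) - H(p_xy)

# I(X;Y) as the KL divergence between the joint and the product of the marginals
mi_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

print(mi_entropies, mi_kl)               # identical, and >= 0
```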
Application
We can use mutual information as an objective function to be optimized when developing learning rules for neural networks.
- Maximize the mutual information between the input vector $\mathbf{X}$ and the output vector $\mathbf{Y}$.
- Maximize the mutual information between outputs $Y_a$ and $Y_b$ driven by nearby input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$ from a single image.
- Minimize the mutual information between outputs $Y_a$ and $Y_b$ driven by input vectors from different images.
- Minimize the statistical dependence among the components of the output vector $\mathbf{Y}$.
Maximum mutual info principle
- Infomax principle by Linsker: maximize $I(\mathbf{Y};\mathbf{X})$ for input vector $\mathbf{X}$ and output vector $\mathbf{Y}$.
- Basis for statistical signal processing.
- A mathematical framework for self-organization
- Relation to channel capacity, Shannon limit on the rate of info transmission through a communication channel.
Example: Single neuron
Single neuron with additive output noise:
\[Y=\left( \sum_{i=1}^{m} w_i X_i \right) + N\]
where:
- $Y$ is the output
- $w_i$ is the weight
- $X_i$ is the input
- $N$ is the processing noise
Assumptions:
- The output $Y$ is Gaussian with variance $\sigma_Y^2$
- The noise $N$ is also Gaussian with zero mean and variance $\sigma_N^2$
- The input and the noise are uncorrelated: $E[X_i N] = 0$ for all $i$
Entropy:
- Mutual info between input and output: \[I(Y;\mathbf{X})=h(Y)-h(Y\vert \mathbf{X})\]
- Since, given $\mathbf{X}=\mathbf{x}$, $Y = a + N$ where $a=\sum_{i=1}^{m} w_i x_i$ is a constant, \[h(Y\vert \mathbf{X}) = h(N)\]
- Given $\mathbf{X}$, what remains uncertain in $Y$ is just the noise $N$. So \[I(Y;\mathbf{X})=h(Y)-h(N)\]
For a Gaussian distribution:
\[h(Y)=\frac{1}{2}[1+\log (2\pi \sigma_Y^2)]\]
Finally:
\[I(Y;\mathbf{X}) = \frac{1}{2} \log \left(\frac{\sigma_Y^2}{\sigma_N^2}\right)\]
So the mutual information can be maximized simply by maximizing the output variance $\sigma_Y^2$ (for a fixed noise variance $\sigma_N^2$).
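A small simulation sketch of this output-noise model under the slide's Gaussian assumptions; the input dimension, noise level, and the two weight vectors compared below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 5, 100_000
sigma_N = 0.1                                    # output-noise standard deviation (arbitrary)

X = rng.normal(size=(n_samples, m))              # Gaussian inputs
noise = rng.normal(scale=sigma_N, size=n_samples)

def surrogate_mi(w):
    """I(Y;X) = 1/2 log(sigma_Y^2 / sigma_N^2) under the Gaussian assumptions above."""
    y = X @ w + noise
    return 0.5 * np.log(np.var(y) / sigma_N**2)

w_small = 0.1 * np.ones(m)
w_large = 1.0 * np.ones(m)
print(surrogate_mi(w_small), surrogate_mi(w_large))  # larger output variance -> larger I(Y;X)
```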
Single neuron with input noise:
\[Y= \sum_{i=1}^{m} w_i \left( X_i +N_i\right) \\ = \sum_{i=1}^{m} w_i X_i + \underbrace{\sum_{i=1}^{m} w_i N_i}_{\text{call this } N’}\]
$N'$ is also Gaussian, with variance: \[\sigma_{N'}^2=\sum_{i=1}^{m} w_i^2 \sigma_N^2\]
Finally:
\[I(Y;\mathbf{X})=h(Y)-h(N')=\frac{1}{2} \log\left(\frac{\sigma_Y^2}{\sigma_N^2\sum_{i=1}^{m} w_i^2}\right)\]
Lesson
By adopting a Gaussian noise model, we can invoke a “surrogate” mutual information that is computed easily: we only need to compute variances.
Noiseless Network
- Transform a random vector $\mathbf{X}$ of arbitrary distribution into a new random vector $\mathbf{Y}$ of a different distribution by a noiseless mapping with weight matrix $\mathbf{W}$.
- Mutual info in this case: \[I(\mathbf{Y};\mathbf{X})= H(\mathbf{Y}) - H(\mathbf{Y} \vert \mathbf{X})\]
- Gradient: \[\frac{\partial I(\mathbf{Y};\mathbf{X})}{\partial \mathbf{W}} = \frac{\partial H(\mathbf{Y})}{\partial \mathbf{W}}\]
- Maximizing the mutual information between input and output is equivalent to maximizing the entropy of the output, both with respect to the weight matrix $\mathbf{W}$.
Infomax
- In Shannon’s framework, Order and structure = Redundancy.
- Increase in the above reduces uncertainty.
- More redundancy in the signal implies less information conveyed.
- More information conveyed means less redundancy.
- Thus, the Infomax principle leads to reduced redundancy in the output $\mathbf{Y}$ compared to the input $\mathbf{X}$.
- When noise is present:
– Input noise: add redundancy in input to combat noise.
– Output noise: add more output components to combat noise.
– A high level of noise favors redundancy of representation.
– A low level of noise favors diversity of representation.
Principle of minimum redundancy
Notation:
- Sensory signal $\mathbf{S}$
- Noise $\mathbf{N}_1$ added at the input, noise $\mathbf{N}_2$ added at the output
- Recoding system $\mathbf{A}$
- Noisy output $\mathbf{Y}$
\[\mathbf{X} = \mathbf{S} + \mathbf{N}_1\]
\[\mathbf{Y} = \mathbf{AX} + \mathbf{N}_2\]
- Retinal input includes redundant information. Purpose of retinal coding is to reduce/eliminate the redundant bits of data due to correlations and noise, before sending the signal along the optic nerve.
- Redundancy measure (with channel capacity $C(\mathbf{Y})$): \[R=1-\frac{I(\mathbf{Y};\mathbf{S})}{C(\mathbf{Y})}\]
- Objective: find the recoding matrix $\mathbf{A}$ such that the redundancy $R$ is minimized, subject to the no-information-loss constraint: \[I(\mathbf{Y};\mathbf{X}) = I(\mathbf{X};\mathbf{X}) -\epsilon\]
- When $\mathbf{Y}$ and $\mathbf{X}$ have the same dimensionality and there is no noise, the principle of minimum redundancy is equivalent to the Infomax principle.
- Infomax on the input/output thus leads to redundancy reduction.
Spatially coherent features
- Infomax for unsupervised processing of the image of natural scenes (Becker and Hinton, 1992).
- Goal: design a self-organizing system that is capable of learning to encode complex scene information in a simpler form.
- Objective: extract higher-order features that exhibit simple coherence across space, so that the representation of one spatial region can be used to produce the representation of neighboring regions.
Let $S$ denote a signal component common to both outputs $Y_a$ and $Y_b$. Each output can be expressed in terms of $S$ plus some noise:
\[Y_a = S+N_a\]
\[Y_b = S+N_b\]
where $N_a$ and $N_b$ are assumed to be independent, zero-mean Gaussian noise terms. Also assume $S$ is Gaussian.
The mutual information is then:
\[I(Y_a;Y_b) = h(Y_a) + h(Y_b) - h(Y_a,Y_b)\]
With
\[h(Y_a) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_a^2\right)\right]\]
\[h(Y_b) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_b^2\right)\right]\]
\[h(Y_a,Y_b) = 1+ \log(2\pi) + \frac{1}{2} \log \vert \det(\Sigma) \vert\]
\[\Sigma = \left[ \begin{array}{cc} \sigma_a^2 & \rho_{ab}\sigma_a \sigma_b \\ \rho_{ab}\sigma_a \sigma_b & \sigma_b^2 \end{array}\right]\]
\[\rho_{ab} = \frac{E[(Y_a -E[Y_a])(Y_b-E[Y_b])]}{\sigma_a \sigma_b}\]
We get:
\[I(Y_a;Y_b) = - \frac{1}{2} \log \left( 1 - \rho_{ab}^2\right)\]
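A quick numeric check in Python that the entropy-based expression and this closed form agree; the variances chosen for $S$, $N_a$, and $N_b$ below are arbitrary:

```python
import numpy as np

var_S, var_Na, var_Nb = 1.0, 0.25, 0.25          # arbitrary variances for S, N_a, N_b

# Y_a = S + N_a, Y_b = S + N_b with independent zero-mean Gaussian terms.
var_a = var_S + var_Na
var_b = var_S + var_Nb
cov_ab = var_S                                   # covariance comes from the shared S
rho = cov_ab / np.sqrt(var_a * var_b)

# Mutual information from the individual and joint Gaussian entropies.
Sigma = np.array([[var_a, cov_ab], [cov_ab, var_b]])
h_a  = 0.5 * (1 + np.log(2 * np.pi * var_a))
h_b  = 0.5 * (1 + np.log(2 * np.pi * var_b))
h_ab = 1 + np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma))
mi_entropies = h_a + h_b - h_ab

mi_closed_form = -0.5 * np.log(1 - rho**2)
print(mi_entropies, mi_closed_form)              # the two expressions agree
```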
- Maximizing the mutual information $I(Y_a;Y_b)$ is equivalent to maximizing the correlation between $Y_a$ and $Y_b$.
- Relation to canonical correlation in statistics:
- Given random input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$,
- find two weight vectors $\mathbf{w}_a$ and $\mathbf{w}_b$ so that
- $Y_a = \mathbf{w}_a^T\mathbf{X}_a$ and $Y_b = \mathbf{w}_b^T\mathbf{X}_b$ have maximum correlation between them.
- Applications: stereo disparity extraction (Becker and Hinton, 1992)
- The two images are slightly different; find the correlated part.
- When the inputs come from two separate regions, we want to minimize the mutual information between the two outputs (Ukrainec and Haykin, 1992, 1996).
- Applications include imaging input sources with different polarizations of the signal: the mutual information between outputs driven by two orthogonal polarizations should be minimized.
Independent Components Analysis (ICA)
- Unknown random source vector $\mathbf{U}$: \[\mathbf{U} = [U_1,U_2,\dots,U_m]^T\] where the components are supplied by a set of independent sources. Note that we need a series of source vectors (many observations).
- $\mathbf{U}$ is transformed by an unknown mixing matrix $\mathbf{A}$: \[\mathbf{X}=\mathbf{A}\mathbf{U}\] where \[\mathbf{X} = [X_1,X_2,\dots,X_m]^T\]
- Both $\mathbf{A}$ and $\mathbf{U}$ are unknown.
- Task: find an estimate of the inverse of the mixing matrix (the demixing matrix $\mathbf{W}$) to recover the unknown source: \[\mathbf{Y}=\mathbf{W}\mathbf{X}\]
- Known as the blind source separation problem.
- A solution can be obtained by enforcing independence among the components of $\mathbf{Y}$ while adjusting $\mathbf{W}$.
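For intuition, here is a minimal blind-source-separation sketch using scikit-learn's FastICA, which is an off-the-shelf estimator rather than the specific algorithm developed in these notes; the source signals and mixing matrix are made up:

```python
import numpy as np
from sklearn.decomposition import FastICA    # one off-the-shelf ICA estimator

rng = np.random.default_rng(0)
n = 5000
t = np.linspace(0, 8, n)

# Two independent, non-Gaussian sources: a sinusoid and uniform noise.
U = np.column_stack([np.sin(2 * np.pi * t), rng.uniform(-1, 1, n)])

A = np.array([[1.0, 0.6],                    # unknown mixing matrix (chosen arbitrarily)
              [0.4, 1.0]])
X = U @ A.T                                  # observed mixtures X = A U

Y = FastICA(n_components=2, random_state=0).fit_transform(X)

# Y recovers the sources only up to permutation, sign, and scaling,
# which is exactly the ambiguity discussed on the next slide.
print(np.round(np.corrcoef(Y.T, U.T)[:2, 2:], 2))
```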
Ambiguities
Consider $\mathbf{X} = \mathbf{A}\mathbf{U}$ and $\mathbf{Y} = \mathbf{W}\mathbf{X}$.
- Permutation: $\mathbf{X} = \mathbf{A}\mathbf{P}^{-1}\mathbf{P}\mathbf{U}$, where $\mathbf{P}$ is a permutation matrix, so the sources can only be recovered up to a reordering.
- A permutation matrix $\mathbf{P}$ looks like \[\left[ \begin{array}{cccc} 0 & 1 & 0 &0 \\ 1&0&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{array}\right]\]
- Sign: the model is unaffected by multiplication of one of the sources by $-1$.
- Scaling (variance): scaling a source up and scaling the corresponding column of the estimated mixing matrix down gives the same $\mathbf{X}$.
Independence
- Two random variables $X$ and $Y$ are statistically independent when \[f_{X,Y}(x,y) = f_X(x)f_Y(y)\]
- Weaker form (uncorrelatedness): \[E[XY]=E[X]E[Y]\]
- Gaussian sources are bad: when the unknown sources are Gaussian, any orthogonal transformation of them results in the same joint Gaussian distribution, so the mixing cannot be identified.
Non-Gaussianity
- Non-Gaussianity can be used as a measure of independence.
- One component of $\mathbf{Y}$: \[Y_i = \underbrace{[W_{i1},W_{i2},\dots,W_{im}]\mathbf{A}}_{\text{call this } \mathbf{Z}_i^T}\mathbf{U}\]
- So $Y_i = \mathbf{Z}_i^T \mathbf{U}$ is a linear combination of the random variables $U_j$. By the central limit theorem, it is therefore more Gaussian than any individual $U_j$; Gaussianity is minimized when $Y_i$ equals one of the $U_j$.
ICA: Measures of Non-Gaussianity
- Kurtosis
- Negentropy
- etc.
Kurtosis
- Kurtosis is the fourth-order cumulant: \[Kurtosis(Y)=E[Y^4]-3\left(E[Y^2]\right)^2\]
- A Gaussian random variable has zero kurtosis.
- More peaked (super-Gaussian) distributions have positive kurtosis.
- Flatter (sub-Gaussian) distributions have negative kurtosis.
- Learning: start with a random $\mathbf{W}$, adjust it, and measure the change in kurtosis, using gradient-based methods.
- Drawback: kurtosis is sensitive to outliers, and thus not robust.
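A short sketch that estimates kurtosis from samples of a Gaussian, a Laplacian (peaked), and a uniform (flat) distribution; the simple sample-based estimator below is an assumption of convenience:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def kurtosis(y):
    """Sample estimate of E[Y^4] - 3 (E[Y^2])^2 after centering."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

gauss   = rng.normal(size=n)             # kurtosis ~ 0
laplace = rng.laplace(size=n)            # peaked (super-Gaussian): positive kurtosis
uniform = rng.uniform(-1, 1, size=n)     # flat (sub-Gaussian): negative kurtosis

print(kurtosis(gauss), kurtosis(laplace), kurtosis(uniform))
```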
Negentropy
- Negentropy is \[J(\mathbf{Y}) = H(\mathbf{Y}_{gauss})-H(\mathbf{Y})\] where $\mathbf{Y}_{gauss}$ is a Gaussian random vector that has the same covariance matrix as $\mathbf{Y}$.
- Negentropy is always nonnegative, and it is zero iff $\mathbf{Y}$ is Gaussian.
- Thus, maximizing negentropy amounts to maximizing non-Gaussianity.
- Problem: estimating negentropy is difficult, and requires the knowledge of the pdfs.
Approximation of Negentropy
- Classical method (not robust): \[J(Y) \approx \frac{1}{12}E[Y^3]^2+\frac{1}{48}\text{Kurtosis}(Y)^2\]
- Another approximation: \[J(Y) \approx \sum_{i=1}^{p}k_i\left(E[G_i(Y)]-E[G_i(N)]\right)^2\]
- $k_i$ are positive coefficients
- $G_i$ are nonquadratic functions
- $N$ is a zero-mean, unit-variance Gaussian random variable
- Further simplified: \[J(Y) \approx (E[G(Y)]-E[G(N)])^2\]
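A sketch of this simplified approximation with the common choice $G(u) = \log\cosh(u)$, which is one of several nonquadratic functions used in the ICA literature; the Monte Carlo stand-in for the Gaussian reference term is my simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    """J(Y) ~ (E[G(Y)] - E[G(N)])^2 for standardized Y and a standard Gaussian N."""
    y = (y - y.mean()) / y.std()                  # enforce zero mean, unit variance
    gauss_ref = rng.normal(size=len(y))           # Monte Carlo stand-in for E[G(N)]
    return (np.mean(G(y)) - np.mean(G(gauss_ref)))**2

print(negentropy_approx(rng.normal(size=n)))      # ~0 for a Gaussian
print(negentropy_approx(rng.laplace(size=n)))     # > 0: clearly non-Gaussian
print(negentropy_approx(rng.uniform(-1, 1, n)))   # > 0 as well
```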
ICA: Minimizing Mutual Information
- We can also aim to minimize the mutual information between the components $Y_i$ of the output.
- This turns out to be equivalent to maximizing negentropy (when the $Y_i$ are constrained to be uncorrelated and of unit variance): \[I(Y_1;Y_2;\dots;Y_m) = C - \sum_i J(Y_i)\] where $C$ is a constant that does not depend on $\mathbf{W}$.
Achieving independence
- Given the output vector $\mathbf{Y}$, we want the components $Y_i$ and $Y_j$ ($i \ne j$) to be statistically independent.
- Can be achieved by minimizing the mutual information between the components, as above.
- Another method: make the joint pdf of $\mathbf{Y}$ approach the factorial distribution: \[f_{\mathbf{Y}}(\mathbf{y},\mathbf{W}) = \prod_{i=1}^{m}\tilde{f}_{Y_i}(y_i,\mathbf{W})\] where $\tilde{f}_{Y_i}$ is the marginal pdf of $Y_i$. The mismatch can be measured by the KL divergence $D_{f \Vert \tilde{f}}$.
KL divergence is
\[D_{f\Vert\tilde{f}}(\mathbf{W}) = -h(\mathbf{Y})+\sum_{i=1}^{m}\tilde{h}(Y_i)\]
Output entropy:
\[h(\mathbf{Y})=h(\mathbf{WX})=h(\mathbf{X})+\log \vert \det (\mathbf{W})\vert\]
The marginal entropies $\tilde{h}(Y_i)$ are tricky to calculate; a polynomial activation function $\varphi$ is needed for the approximation.
Learning the Weights
- The learning objective is to minimize the KL divergence $D_{f\Vert\tilde{f}}(\mathbf{W})$.
- Gradient descent: \[\Delta w_{ik}=-\eta \frac{\partial}{\partial w_{ik}} D_{f\Vert\tilde{f}} \\ = \eta \left( (\mathbf{W}^{-T})_{ik}-\varphi(y_i)x_k\right)\]
- Final rule: \[\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n) \left[\mathbf{I} -\varphi(\mathbf{y}(n))\mathbf{y}^T(n)\right]\mathbf{W}^{-T}(n)\]
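To make the update concrete, here is a minimal batch sketch of this rule; $\varphi(y)=\tanh(y)$ is used as a stand-in score function (an assumption, since the lecture leaves $\varphi$ to a polynomial approximation of the marginal pdfs), and the Laplacian sources and mixing matrix are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20_000, 2

# Super-Gaussian (Laplacian) sources, linearly mixed by an arbitrary matrix A.
U = rng.laplace(size=(n, m))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = U @ A.T
X -= X.mean(axis=0)                        # center the observations

# Batch version of W <- W + eta [I - E[phi(y) y^T]] W^{-T},
# with phi(y) = tanh(y) as a stand-in score function (an assumption; the
# lecture obtains phi from a polynomial approximation of the marginal pdfs).
W = np.eye(m)
eta = 0.05
for _ in range(3000):
    Y = X @ W.T                            # y = W x for every sample (one row each)
    phiY = np.tanh(Y)
    grad = np.eye(m) - (phiY.T @ Y) / n    # I - E[phi(y) y^T]
    W += eta * grad @ np.linalg.inv(W).T   # ordinary-gradient form from the slide

# The recovered components should correlate strongly with the true sources,
# up to permutation, sign, and scaling.
print(np.round(np.corrcoef((X @ W.T).T, U.T)[:m, m:], 2))
```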