## Motivation

Shannon (1948) developed Information Theory concerning the communication process.

- It is a deep mathematical theory concerned with the essence of the communication process.
- Provides a framework for: efficiency of information representation, limitations in reliable transmission of information over a communication channel.
- Gives bounds on optimum representation and transmission of signals

### Motivation

**Maximum mutual information principle**(Linsker 1988): Synaptic connections of a multilayered neural network develop in such a way as to*maximize the amount of information preserved when signals are transformed at each processing stage of the network, subject to certain constraints.***Redundancy reduction**(Attneave 1954): “Major function of perceptual machinary is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors”. In other words,*redundancy reduction = feature extraction*.

## Topics

- Entropy
- Mutual information
- Relative entropy
- Differential entropy of continuous random variables

## Entropy

- Notation: random variable, value of random variable.
- can be quantized into a finite number of discrete level although it can take continuous values. \[X={ x_k \vert k=0, \pm 1,…, \pm K}\]
- occur with probability \[p_k = P(X=x_k)\] with requirement \[0 \le p_k \le 1, \sum_{k=-K}^{K} p_k=1\]

### Uncertainty, Surprise, Info and Entropy

- If , where is observed. There is
**no surprise**,**less uncertain**.- High probability leads to less surprise and less uncertain

- Gain
**information**when going from a high-uncertainty state to a low-uncertainty state.

### Entropy

Uncertainty measurement for event ( assumes :

\[I(x_k) = \log(\frac{1}{p_k}) = - \log p_k\]

**Entropy** of a random variable: Average uncertainty

\[H(X) = E[I(x_k)] = - \sum_{k=-K}^{K} p_k\log p_k\]

### Properties

- The Higher the , the higher the
**potential information**you can gain through observation/measurement. - Bounds on the entropy: \[0 \le H(X) \le \log(2K+1)\]
- when and for : No uncertainty
- when for all : Maximum uncertainty, when all events are equiprobable.

- For two probability distributions and : \[\sum_k p_k \log \left( \frac{p_k}{q_k} \right) \ge 0\]
- Kullback-Leibler divergence (relative entropy): \[D_{p\Vert q} = \sum_{x \in \mathcal{X}} p_X(x) \log \left(\frac{p_X(x)}{q_X(x)}\right)\] meaures how different two probability distributions are.

### Differential entropy

\[h(X)=-\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx = -E[\log f_X(x)]\]

in the limit does not equal :

\[H(X)=h(X)-\lim_{\delta x \rightarrow 0}\log \delta x\]

#### Properties

- For vector random variable \[h(\mathbf{AX}) = h(\mathbf{X}) + \log \vert \det(\mathbf{A}) \vert\]

### Maximum entropy principle

Pick the probability distribution that maximizes the entropy, subject to constraints on the distribution.

### 1-D Gaussian distribution

- For a given variance , Gaussian random variable has the largest differential entropy
- The entropy of a Gaussian random variable is uniquely determined by the variance of

## Mutual Information

**Conditional entropy**: entropy in after observing (How much uncertainty) \[H(X\vert Y) = H(X,Y)-H(Y)\]- Joint-entropy is defined as: \[H(X,Y)=-\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\]
**Mutual Info**: How much uncertainty is reduced in when we observe ? \[I(X;Y)=H(X)-H(X \vert Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]

### Continuous random variable

\[I(X;Y)=h(X)-h(X \vert Y) \\ = h(Y)-h(Y \vert X) \\ = h(X)+ h(Y)-h(X , Y)\]

\[I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y) \log \left( \frac{f_X(x\vert y)}{f_X(x)}\right)dxdy\]

### Properties of KL Divergence

- Always positive or zero. Zero means there is a perfect match between the two distributions.
- Invariant with regard to
- Permutation of the order in which the components of the vector random variable are arranged.
- Ampltitude scaling
- Monotonic nonlinear transformation

- Related to mutual info \[I(\mathbf{X};\mathbf{Y})=D_{f_{\mathbf{X},\mathbf{Y}}\Vert f_\mathbf{X} f_\mathbf{Y}}\]

### Application

We can use mutual info as an objective function to be optimized when developing learning rules for neural networks.

- Maximize mutual info between input vector and output vector
- Maximize mutual info between and driven by near-by input vectors and from a single image.
- Minimize information between and driven by input vectors from
*different*images. - Minimize statistical dependence between s.

### Maximum mutual info principle

**Infomax**principle by Linsker: Maximize for input vector and output vector .- Basis for statistical signmal processing
- A mathematical framework for self-organization
- Relation to
*channel capacity*, Shannon limit on the rate of info transmission through a communication channel.

### Example: Single neuron

#### Single neuron with additive output noise:

\[Y=\left( \sum_{i=1}^{m} w_i X_i \right) + N\]

where:

- is the output
- is the weight
- is the input
- is the processing noise

**Assumption**:

- Output is a Gaussian with variance
- Noise is also a Gaussian with and variance
- Input and noise are uncorrelated: for all

**Entropy**:

- Mutual info between input and output: \[I(Y;\mathbf{X})=h(Y)-h(Y\vert \mathbf{X})\]
- Since , where is a constant. \[h(Y\vert \mathbf{X}) = h(N)\]
- Given , what remains in is just noise . So \[I(Y;\mathbf{X})=h(Y)-h(N)\]

For a Gaussian distribution:

\[h(Y)=\frac{1}{2}[1+\log (2\pi \sigma_Y^2)]\]

Finally:

\[I(Y;\mathbf{X}) = \frac{1}{2} \log \left(\frac{\sigma_Y^2}{\sigma_N^2}\right)\]

So mutual information can be maximized simply by maximizing the output variance .

#### Single neuron with input noise:

\[Y= \sum_{i=1}^{m} w_i \left( X_i +N_i\right) \\ = \sum_{i=1}^{m} w_i X_i + \underbrace{\sum_{i=1}^{m} w_i N_i}_{\text{call this } N’}\]

is also a Gaussian distribution, with variance: \[\sigma_{N’}^2=\sum_{i=1}^{m} w_i^2 \sigma_N^2\]

Finally:

\[I(Y\vert \mathbf{X})=h(N’)=\frac{1}{2} \log\left(\frac{\sigma_Y^2}{\sigma_N^2\sum_{i=1}^{m} w_i^2}\right)\]

### Lesson

Adopting a Gaussian noise model, we can invoke a “surrogate” mutual information computed relatively easily, just compute variance.

### Noiseless Network

- Transform a random vector of arbitrary distribution to a new random vector of different distribution:
- Mutual info in this case: \[I(\mathbf{Y};\mathbf{X})= H(\mathbf{Y}) - H(\mathbf{Y} \vert \mathbf{X})\]
- Gradient: \[\frac{\partial I(\mathbf{Y};\mathbf{X})}{\partial \mathbf{W}} = \frac{\partial H(\mathbf{Y})}{\partial \mathbf{W}}\]
- Maximizing mutual info between input and output is equivalent to maximizing entropy in the output, both with respect to the weight vector

### Informax

- In Shannon’s framework, Order and structure = Redundancy.
- Increase in the above reduces uncertainty.
- More redundancy in the signal implies less information conveyed.
- More information conveyed means less redundancy.
- Thus, Infomax principle leads to reduced reduncancy in output compared to input .
- When noise is present:

– Input noise: add redundancy in input to combat noise.

– Output noise: add more output components to combat noise.

– Highlevelofnoisefavorsredundancyofrepresentation.

– Lowlevelofnoisefavorsdiversityofrepresentation.

### Principle of minimum redundancy

Notation:

- Sensory signal
- Noise input
- Recording system
- Noisy output

\[\mathbf{X} = \mathbf{S} + \mathbf{N}_1\]

\[\mathbf{Y} = \mathbf{AX} + \mathbf{N}_2\]

- Retinal input includes redundant information. Purpose of retinal coding is to reduce/eliminate the redundant bits of data due to correlations and noise, before sending the signal along the optic nerve.
- Redundancy measure (with channel capacity : \[R=1-\frac{I(\mathbf{Y};\mathbf{S})}{C(\mathbf{Y})}\]
- Objective: find recoder matrix such that is minimized, subject to the
*no info loss constraint*: \[I(\mathbf{Y};\mathbf{X}) = I(\mathbf{X};\mathbf{X}) -\epsilon\] - When and have the same dimensionality and there is no noise, principle of minimum redundancy is equivalent to the Infomax principle.
- Infomax on input/output lead to redundancy reduction

## Spatially coherent features

- Infomax for unsupervised processing of the image of natural scenes (Becker and Hinton, 1992).
**Goal**: design a self-organizing system that is capable of learning to encode complex scene information in a simpler form.**Objective**: extract higher-order features that exhibit simple coherence across space so that representation for one spatial region can be used to produce that of representation of neighboring regions.

Let denote a signal component common to both and . Output can be expressed in terms of and some noise:

\[Y_a = S+N_a\]

\[Y_b = S+N_b\]

where and are assumed to be independent and zero-mean Gaussian. Also assume is Gaussian.

The mutual info then:

\[I(Y_a;Y_b) = h(Y_a) + h(Y_b) - h(Y_a,Y_b)\]

With

\[h(Y_a) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_a^2\right)\right]\]

\[h(Y_b) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_b^2\right)\right]\]

\[h(Y_a,Y_b) = 1+ \log(2\pi) + \frac{1}{2} \log \vert \det(\Sigma) \vert\]

\[\Sigma = \left[ \begin{array}{cc} \sigma_a^2 & \rho_{ab}\sigma_a \sigma_b \\ \rho_{ab}\sigma_a \sigma_b & \sigma_b^2 \end{array}\right]\]

\[\rho_{ab} = \frac{E[(Y_a -E[Y_a])(Y_b-E[Y_b])]}{\sigma_a \sigma_b}\]

We get:

\[I(Y_a;Y_b) = - \frac{1}{2} \log \left( 1 - \rho_{ab}^2\right)\]

- Maximizing information is equivalent to maximizing correlation between and .
- Relation to
*canonical correlation*in statistics:- Given random input vectors and
- Find two weight vectors and so that
- and have
**maximum correlation**between them.

- Applications: stereo disparity extraction (Becker and Hinton, 1992)
- 2 images slightly different, find correlated part.

- When the inputs come from two separate regions, we want to minimize the mutual information between the two outputs (Ukrainec and Haykin, 1992, 1996).
- Applications include when input sources such as different polarizations of the signal are imaged: mutual information between outputs driven by two orthogonal polarizations should be minimized.

## Independent Components Analysis (ICA)

- Unknown random source vector : \[\mathbf{U} = [U_1,U_2,…,U_m]^T\] where components are supplied by a set of
*independent sources*. Note that we need a series of source vectors. -
is transformed by an unknown mixing matrix : \[\mathbf{X}=\mathbf{A}\mathbf{U}\] where \[\mathbf{X} = [X_1,X_2,…,X_m]^T\]

- Both and are
**unknown**. **Task**: find an estimate of the*inverse*of the mixing matrix(the**demixing matrix**) to recover the unknown source \[\mathbf{Y}=\mathbf{W}\mathbf{X}\]- Known as
**blind source separation**problem - Solution can be obtained by enforcing
**independence**among components of while adjusting

### Ambiguities

Consider and

- Permutation: , where is a permutation matrix.
- is like \[\left[ \begin{array}{cccc} 0 & 1 & 0 &0 \\ 1&0&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{array}\right]\]
- Sign: the model is unaffected by multiplication of one of the sources by
- Scaling (variance): estimate scaling up and scaling down will given the same

### Independence

- Two random variable and are
*statistically independent*when \[f_{X,Y}(x,y) = f_X(x)f_Y(y)\] - Weaker form: \[E[XY]=E[X]E[Y]\]
- Gaussian are bad: When the unknown source is Gaussian, any orthogonal transformation results in the same Gaussian distribution.

#### Non-Gaussianity

- Non-Gaussianity can be used as a measure of independence
- One component of : \[Y_i = \underbrace{[W_{i1},W_{i2},…,W_{im}]\mathbf{A}}_{\text{call this } \mathbf{Z}_i^T}\mathbf{U}\]
- So is a linear combination of random variable (). So it is more Gaussian than any individual . Gaussianity is minimized when equals one of .

### ICA Mearsures of Non-Gaussianity

- Kurtosis
- Negentropy
- etc.

#### Kurtosis

- Kurtosis is the fourth-order cumulant: \[Kurtosis(Y)=E[Y^4]-3\left(E[Y^2]\right)^2\]
- Gaussian have 0 kurtosis.
- More peaked distributions have positive kurtosis.
- More flatter distributions have negative kurtosis.
**Learning**: Start with random , adjust and mearsure change in kurtosis, using gradient-based methods.**Drawback**: Kurtosis is sensitive to outliers, and thus not robust.

#### Negentropy

- Negentropy is \[J(\mathbf{Y}) = H(\mathbf{Y}_{gauss})-H(\mathbf{Y})\] where is a Gaussian random variable that has the same covariance matrix as
- Negentropy is always non-zero, and it’s zero iff is Gaussian.
- Thus, maximizing negentropy is to maximize non-Gaussianity
**Problem**: estimating negentropy is difficult, and requires the knowledge of the pdfs.

### Approximation of Negentropy

- Classical method(not robust): \[J(Y) \approx \frac{1}{2}E[Y^3]^2+\frac{1}{48}\text{Kurtosis}(Y)^2\]
- Another: \[J(Y) \approx \sum_{k=1}^{p}k_i(E[G_i(Y)]-E[G_i(N)])^2\]
- are coefficients
- are nonquadratic functions
- is a zero-mean,unit-variance Gaussian

- Further simplified: \[J(Y) \approx (E[G(Y)]-E[G(N)])^2\]

### ICA Minimizing mutual info

- We can also aim to minimize mutual info between
- This turns out to be equivalent to maximizing negentropy (when have unit variance) \[I(Y_1;Y_2;…;Y_m) = C - \sum_i J(Y_i)\] where is a constant independent from

### Achieving independence

- Given output vector , we want and to be statistically independent.
- Can be achieved by
- Another method: make pdf to approach the
*factorial distribution*: \[f_{\mathbf{Y}}(\mathbf{y},\mathbf{W}) = \prod_{i=1}^{m}\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})\] where is the marginal pdf of . This can be measured by

**KL divergence** is

\[D_{f\Vert\tilde{f}}(\mathbf{W}) = -h(\mathbf{Y})+\sum_{i=1}^{m}\tilde{h}(Y_i)\]

Output entropy:

\[h(\mathbf{Y})=h(\mathbf{WX})=h(\mathbf{X})+\log \vert \det (\mathbf{W})\vert\]

Tricky to calculate , need a polynomial activation function

### Learning Weight

- Learning objective is to minimize the KL divergence
- Gradient descent: \[\Delta w_{ik}=-\eta \frac{\partial}{\partial w_{ik}} D_{f\Vert\tilde{f}} \\ = \eta \left( (\mathbf{W}^{-T})_{ik}-\varphi(y_i)x_k\right)\]
- Final rule: \[\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n) \left[\mathbf{I} -\varphi(\mathbf{y}(n))\mathbf{y}^T(n)\right]\mathbf{W}^{-T}(n)\]