2017-04-25

## Motivation

Shannon (1948) developed Information Theory concerning the communication process.

• It is a deep mathematical theory concerned with the essence of the communication process.
• Provides a framework for: efficiency of information representation, limitations in reliable transmission of information over a communication channel.
• Gives bounds on optimum representation and transmission of signals

### Motivation

• Maximum mutual information principle (Linsker 1988): Synaptic connections of a multilayered neural network develop in such a way as to maximize the amount of information preserved when signals are transformed at each processing stage of the network, subject to certain constraints.
• Redundancy reduction (Attneave 1954): “Major function of perceptual machinery is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors”. In other words, redundancy reduction = feature extraction.

## Topics

• Entropy
• Mutual information
• Relative entropy
• Differential entropy of continuous random variables

## Entropy

• Notation: $X$ random variable, $x$ value of random variable.
• $X$ can take continuous values, but we quantize it into a finite number of discrete levels: $X = \{ x_k \mid k=0, \pm 1,…, \pm K \}$
• $X = x_k$ occurs with probability $p_k = P(X=x_k)$, with the requirements $0 \le p_k \le 1$ and $\sum_{k=-K}^{K} p_k=1$

### Uncertainty, Surprise, Info and Entropy

• If $p_k = 1$, the outcome $X = x_k$ is certain: observing it brings no surprise and no uncertainty.
• Higher probability means less surprise and less uncertainty.
• Gain information when going from a high-uncertainty state to a low-uncertainty state.

### Entropy

Uncertainty measure for the event $X=x_k$ ($\log$ denotes $\log_2$):

$I(x_k) = \log(\frac{1}{p_k}) = - \log p_k$

Entropy of a random variable: Average uncertainty

$H(X) = E[I(x_k)] = - \sum_{k=-K}^{K} p_k\log p_k$

### Properties

• The higher the $H(X)$, the more information you can potentially gain through observation/measurement.
• Bounds on the entropy: $0 \le H(X) \le \log(2K+1)$
• $H(X)=0$ when $p_k=1$ and $p_j=0$ for $j \ne k$: No uncertainty
• $H(X) = \log(2K+1)$ when $p_k = 1/(2K+1)$ for all $k$: Maximum uncertainty, when all events are equiprobable.
• For two probability distributions $\{p_k\}$ and $\{q_k\}$: $\sum_k p_k \log \left( \frac{p_k}{q_k} \right) \ge 0$
• Kullback-Leibler divergence (relative entropy): $D_{p\Vert q} = \sum_{x \in \mathcal{X}} p_X(x) \log \left(\frac{p_X(x)}{q_X(x)}\right)$ measures how different two probability distributions are; the inequality above states that it is never negative.
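
As a quick illustration of these definitions, here is a minimal NumPy sketch (the function names and the example distributions are just for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_k p_k log2 p_k (terms with p_k = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """D(p || q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

uniform = np.ones(4) / 4                  # maximum-entropy distribution over 4 outcomes
skewed = np.array([0.7, 0.1, 0.1, 0.1])

print(entropy(uniform))                   # 2.0 bits = log2(4), the upper bound
print(entropy(skewed))                    # less than 2.0 bits
print(kl_divergence(skewed, uniform))     # >= 0, zero only when the two distributions match
```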

### Differential entropy

$h(X)=-\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx = -E[\log f_X(x)]$

In the limit of fine quantization, the discrete entropy $H(X)$ does not converge to $h(X)$; it differs by a term that diverges:

$H(X)=h(X)-\lim_{\delta x \rightarrow 0}\log \delta x$
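
A small numerical check of this relation for a quantized Gaussian (the bin widths, truncation range, and the use of natural logs are illustrative choices):

```python
import numpy as np
from math import erf, sqrt

sigma = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, sigma^2), in nats

for delta in [0.5, 0.1, 0.01]:
    # Quantize the Gaussian into bins of width delta and compute the discrete entropy
    edges = np.arange(-10.0, 10.0 + delta, delta)
    cdf = np.array([0.5 * (1 + erf(e / (sigma * sqrt(2)))) for e in edges])
    p = np.diff(cdf)
    p = p[p > 0]
    H = -np.sum(p * np.log(p))
    # H approaches h(X) - log(delta), which diverges as delta -> 0
    print(f"delta={delta}: H={H:.3f}, h - log(delta) = {h_gauss - np.log(delta):.3f}")
```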

#### Properties

• For a vector random variable $\mathbf{X}$: $h(\mathbf{AX}) = h(\mathbf{X}) + \log \vert \det(\mathbf{A}) \vert$
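
A quick closed-form check of this property, assuming $\mathbf{X}$ is Gaussian so its differential entropy is available in closed form (the particular covariance and transform below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                       # an arbitrary (almost surely nonsingular) transform
C = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])                   # covariance of X (Gaussian, for a closed-form entropy)

def h_gauss(cov):
    """Differential entropy of a multivariate Gaussian with covariance cov (nats)."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

h_X = h_gauss(C)
h_AX = h_gauss(A @ C @ A.T)                       # covariance of AX is A C A^T
print(h_AX, h_X + np.log(abs(np.linalg.det(A))))  # the two values agree
```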

### Maximum entropy principle

Pick the probability distribution that maximizes the entropy, subject to constraints on the distribution.

### 1-D Gaussian distribution

• For a given variance $\sigma^2$, Gaussian random variable has the largest differential entropy
• The entropy of a Gaussian random variable $X$ is uniquely determined by the variance of $X$
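
A small check using the closed-form differential entropies of three distributions matched to the same variance (the specific distributions compared are an illustrative choice):

```python
import numpy as np

sigma2 = 1.0  # common variance for all three distributions

# Closed-form differential entropies (in nats) at equal variance sigma2
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)   # ~1.419
h_uniform = 0.5 * np.log(12 * sigma2)                # uniform of width sqrt(12*sigma2), ~1.242
h_laplace = 1.0 + 0.5 * np.log(2 * sigma2)           # Laplace with scale sqrt(sigma2/2), ~1.347

print(h_gauss, h_uniform, h_laplace)                 # the Gaussian value is the largest
```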

## Mutual Information

• Conditional entropy: the uncertainty remaining in $X$ after observing $Y$: $H(X\vert Y) = H(X,Y)-H(Y)$
• Joint-entropy is defined as: $H(X,Y)=-\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$
• Mutual Info: How much uncertainty is reduced in $X$ when we observe $Y$? $I(X;Y)=H(X)-H(X \vert Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$
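
A minimal sketch computing $I(X;Y)$ from a joint pmf (the particular $2\times 2$ joint distribution is just an example):

```python
import numpy as np

# Joint pmf p(x, y) for two binary variables (rows: x, columns: y); an illustrative example
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))
print(mi)   # ~0.278 bits of uncertainty about X removed by observing Y
```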

### Continuous random variable

$I(X;Y)=h(X)-h(X \vert Y) \\ = h(Y)-h(Y \vert X) \\ = h(X)+ h(Y)-h(X , Y)$

$I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y) \log \left( \frac{f_X(x\vert y)}{f_X(x)}\right)dxdy$

### Properties of KL Divergence

• Always positive or zero. Zero means there is a perfect match between the two distributions.
• Invariant with respect to:
– Permutation of the order in which the components of the vector random variable $\mathbf{x}$ are arranged.
– Amplitude scaling
– Monotonic nonlinear transformation
• Related to mutual info $I(\mathbf{X};\mathbf{Y})=D_{f_{\mathbf{X},\mathbf{Y}}\Vert f_\mathbf{X} f_\mathbf{Y}}$

### Application

We can use mutual info as an objective function to be optimized when developing learning rules for neural networks.

• Maximize mutual info between input vector $\mathbf{X}$ and output vector $\mathbf{Y}$
• Maximize mutual info between $Y_a$ and $Y_b$ driven by nearby input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$ from a single image.
• Minimize information between $Y_a$ and $Y_b$ driven by input vectors from different images.
• Minimize statistical dependence between $Y_i$s.

### Maximum mutual info principle

• Infomax principle by Linsker: Maximize $I(\mathbf{Y};\mathbf{X})$ for input vector $\mathbf{X}$ and output vector $\mathbf{Y}$.
• Basis for statistical signal processing
• A mathematical framework for self-organization
• Relation to channel capacity, Shannon limit on the rate of info transmission through a communication channel.

### Example: Single neuron

#### Single neuron with additive output noise:

$Y=\left( \sum_{i=1}^{m} w_i X_i \right) + N$

where:

• $Y$ is the output
• $w_i$ is the weight
• $X_i$ is the input
• $N$ is the processing noise

Assumption:

• Output $Y$ is a Gaussian with variance $\sigma_Y^2$
• Noise $N$ is also a Gaussian with $\mu = 0$ and variance $\sigma_N^2$
• Input and noise are uncorrelated: $E[X_i N] = 0$ for all $i$

Entropy:

• Mutual info between input and output: $I(Y;\mathbf{X})=h(Y)-h(Y\vert \mathbf{X})$
• Given $\mathbf{X} = \mathbf{x}$, the output is $Y = c + N$, where $c = \sum_i w_i x_i$ is a constant, so $h(Y\vert \mathbf{X}) = h(N)$.
• In other words, once $\mathbf{X}$ is known, the only remaining uncertainty in $Y$ is the noise $N$. So $I(Y;\mathbf{X})=h(Y)-h(N)$

For a Gaussian distribution:

$h(Y)=\frac{1}{2}[1+\log (2\pi \sigma_Y^2)]$

Finally:

$I(Y;\mathbf{X}) = \frac{1}{2} \log \left(\frac{\sigma_Y^2}{\sigma_N^2}\right)$

So mutual information $I(Y;\mathbf{X})$ can be maximized simply by maximizing the output variance $\sigma_Y^2$.
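
A short simulation sketch of this result, assuming Gaussian inputs so that $Y$ is exactly Gaussian (the dimensions, noise level, and weight scales below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 5, 200_000
sigma_N = 0.5

X = rng.normal(size=(n_samples, m))            # Gaussian inputs (an assumption for this sketch)
N = rng.normal(scale=sigma_N, size=n_samples)  # additive output noise

for scale in [0.5, 1.0, 2.0]:                  # larger weights -> larger output variance
    w = scale * np.ones(m)
    Y = X @ w + N
    var_Y = Y.var()
    I = 0.5 * np.log(var_Y / sigma_N**2)       # I(Y;X) = h(Y) - h(N) for the Gaussian model, in nats
    print(f"scale={scale}: var_Y={var_Y:.2f}, I(Y;X)={I:.3f} nats")
```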

#### Single neuron with input noise:

$Y= \sum_{i=1}^{m} w_i \left( X_i +N_i\right) \\ = \sum_{i=1}^{m} w_i X_i + \underbrace{\sum_{i=1}^{m} w_i N_i}_{\text{call this } N'}$

$N'$ is also Gaussian, with variance: $\sigma_{N'}^2=\sum_{i=1}^{m} w_i^2 \sigma_N^2$

Finally:

$I(Y;\mathbf{X})=h(Y)-h(N')=\frac{1}{2} \log\left(\frac{\sigma_Y^2}{\sigma_N^2\sum_{i=1}^{m} w_i^2}\right)$

### Lesson

By adopting a Gaussian noise model, we can invoke a “surrogate” mutual information that is computed relatively easily: it reduces to computing variances.

### Noiseless Network

• Transform a random vector $\mathbf{X}$ of arbitrary distribution to a new random vector $\mathbf{Y}$ of different distribution: $\mathbf{Y}=\mathbf{W}\mathbf{X}$
• Mutual info in this case: $I(\mathbf{Y};\mathbf{X})= h(\mathbf{Y}) - h(\mathbf{Y} \vert \mathbf{X})$, where $h(\mathbf{Y} \vert \mathbf{X})$ does not depend on the weights $\mathbf{W}$.
• Gradient: $\frac{\partial I(\mathbf{Y};\mathbf{X})}{\partial \mathbf{W}} = \frac{\partial h(\mathbf{Y})}{\partial \mathbf{W}}$
• Maximizing mutual info between input and output is equivalent to maximizing entropy in the output, both with respect to the weight vector $\mathbf{W}$

### Infomax

• In Shannon’s framework, Order and structure = Redundancy.
• Increase in the above reduces uncertainty.
• More redundancy in the signal implies less information conveyed.
• Thus, the Infomax principle leads to reduced redundancy in output $Y$ compared to input $X$.
• When noise is present:
– Input noise: add redundancy in input to combat noise.
– Output noise: add more output components to combat noise.
– High level of noise favors redundancy of representation.
– Low level of noise favors diversity of representation.

### Principle of minimum redundancy

Notation:

• Sensory signal $\mathbf{S}$
• Noisy input $\mathbf{X}$
• Recoding system $\mathbf{A}$
• Noisy output $\mathbf{Y}$

$\mathbf{X} = \mathbf{S} + \mathbf{N}_1$

$\mathbf{Y} = \mathbf{AX} + \mathbf{N}_2$

• Retinal input includes redundant information. Purpose of retinal coding is to reduce/eliminate the redundant bits of data due to correlations and noise, before sending the signal along the optic nerve.
• Redundancy measure (with channel capacity $C(\centerdot)$): $R=1-\frac{I(\mathbf{Y};\mathbf{S})}{C(\mathbf{Y})}$
• Objective: find the recoder matrix $\mathbf{A}$ such that $R$ is minimized, subject to the no-information-loss constraint: $I(\mathbf{Y};\mathbf{X}) = I(\mathbf{X};\mathbf{X}) -\epsilon$
• When $\mathbf{S}$ and $\mathbf{Y}$ have the same dimensionality and there is no noise, principle of minimum redundancy is equivalent to the Infomax principle.
• Infomax on input/output leads to redundancy reduction

## Spatially coherent features

• Infomax for unsupervised processing of the image of natural scenes (Becker and Hinton, 1992).
• Goal: design a self-organizing system that is capable of learning to encode complex scene information in a simpler form.
• Objective: extract higher-order features that exhibit simple coherence across space, so that the representation of one spatial region can be used to predict the representation of neighboring regions.

Let $S$ denote a signal component common to both $Y_a$ and $Y_b$. The outputs can be expressed in terms of $S$ plus noise:

$Y_a = S+N_a$

$Y_b = S+N_b$

where $N_a$ and $N_b$ are assumed to be independent and zero-mean Gaussian. Also assume $S$ is Gaussian.

The mutual info then:

$I(Y_a;Y_b) = h(Y_a) + h(Y_b) - h(Y_a,Y_b)$

With

$h(Y_a) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_a^2\right)\right]$

$h(Y_b) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_b^2\right)\right]$

$h(Y_a,Y_b) = 1+ \log(2\pi) + \frac{1}{2} \log \vert \det(\Sigma) \vert$

$\Sigma = \left[ \begin{array}{cc} \sigma_a^2 & \rho_{ab}\sigma_a \sigma_b \\ \rho_{ab}\sigma_a \sigma_b & \sigma_b^2 \end{array}\right]$

$\rho_{ab} = \frac{E[(Y_a -E[Y_a])(Y_b-E[Y_b])]}{\sigma_a \sigma_b}$

We get:

$I(Y_a;Y_b) = - \frac{1}{2} \log \left( 1 - \rho_{ab}^2\right)$
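
A quick Monte Carlo check of this expression under the Gaussian model above (the signal variances below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

for sigma_S in [0.5, 1.0, 2.0]:             # stronger common signal -> higher correlation
    S = rng.normal(scale=sigma_S, size=n)
    Ya = S + rng.normal(size=n)              # Y_a = S + N_a
    Yb = S + rng.normal(size=n)              # Y_b = S + N_b
    rho = np.corrcoef(Ya, Yb)[0, 1]
    I = -0.5 * np.log(1 - rho**2)            # I(Y_a;Y_b) for jointly Gaussian outputs, in nats
    print(f"sigma_S={sigma_S}: rho={rho:.3f}, I={I:.3f} nats")
```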

• Maximizing information is equivalent to maximizing correlation between $Y_a$ and $Y_b$.
• Relation to canonical correlation in statistics:
• Given random input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$
• Find two weight vectors $\mathbf{w}_a$ and $\mathbf{w}_b$ so that
• $Y_a = \mathbf{w}_a^T\mathbf{X}_a$ and $Y_b = \mathbf{w}_b^T\mathbf{X}_b$ have maximum correlation between them.
• Applications: stereo disparity extraction (Becker and Hinton, 1992)
• The two images differ slightly; find the correlated part.
• When the inputs come from two separate regions, we want to minimize the mutual information between the two outputs (Ukrainec and Haykin, 1992, 1996).
• Example application: when input sources such as different polarizations of a signal are imaged, the mutual information between outputs driven by the two orthogonal polarizations should be minimized.

## Independent Components Analysis (ICA)

• Unknown random source vector $\mathbf{U}(n)$: $\mathbf{U} = [U_1,U_2,…,U_m]^T$ where $m$ components are supplied by a set of independent sources. Note that we need a series of source vectors.
• $\mathbf{U}$ is transformed by an unknown mixing matrix $\mathbf{A}$: $\mathbf{X}=\mathbf{A}\mathbf{U}$ where $\mathbf{X} = [X_1,X_2,…,X_m]^T$

• Both $\mathbf{A}$ and $\mathbf{U}$ are unknown.
• Task: find an estimate of the inverse of the mixing matrix (the demixing matrix $\mathbf{W}$) to recover the unknown source $\mathbf{U}$: $\mathbf{Y}=\mathbf{W}\mathbf{X}$
• Known as the blind source separation problem
• Solution can be obtained by enforcing independence among components of $\mathbf{Y}$ while adjusting $\mathbf{W}$

### Ambiguities

Consider $\mathbf{X}=\mathbf{A}\mathbf{U}$ and $\mathbf{Y}=\mathbf{W}\mathbf{X}$

• Permutation: $\mathbf{X}=\mathbf{A}\mathbf{P}^{-1}\mathbf{P}\mathbf{U}$, where $\mathbf{P}$ is a permutation matrix.
• $\mathbf{P}$ is like $\left[ \begin{array}{cccc} 0 & 1 & 0 &0 \\ 1&0&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{array}\right]$
• Sign: the model is unaffected by multiplication of one of the sources by $-1$
• Scaling (variance): scaling up $\mathbf{U}$ and scaling down $\mathbf{A}$ correspondingly gives the same $\mathbf{X}$, so the source variances cannot be recovered.

### Independence

• Two random variables $X$ and $Y$ are statistically independent when $f_{X,Y}(x,y) = f_X(x)f_Y(y)$
• A weaker condition is uncorrelatedness: $E[XY]=E[X]E[Y]$
• Gaussians are problematic: when the unknown sources are Gaussian, any orthogonal transformation $\mathbf{A}$ results in the same joint Gaussian distribution, so the sources cannot be identified.

#### Non-Gaussianity

• Non-Gaussianity can be used as a measure of independence
• One component of $\mathbf{Y}$: $Y_i = \underbrace{[W_{i1},W_{i2},…,W_{im}]\mathbf{A}}_{\text{call this } \mathbf{Z}_i^T}\mathbf{U}$
• So $Y_i$ is a linear combination of the random variables $U_k$ ($Y_i=\sum_{j=1}^{m}Z_{ij}U_j$). By the central limit theorem, such a sum is more Gaussian than any individual $U_k$; Gaussianity is minimized when $Y_i$ equals (a scaled copy of) one of the $U_k$.

### ICA: Measures of Non-Gaussianity

• Kurtosis
• Negentropy
• etc.

#### Kurtosis

• Kurtosis is the fourth-order cumulant: $Kurtosis(Y)=E[Y^4]-3\left(E[Y^2]\right)^2$
• Gaussians have zero kurtosis.
• More peaked (super-Gaussian) distributions have positive kurtosis.
• Flatter (sub-Gaussian) distributions have negative kurtosis.
• Learning: start with a random $\mathbf{W}$, then adjust $\mathbf{W}$ and measure the change in kurtosis, using gradient-based methods.
• Drawback: kurtosis is sensitive to outliers, and thus not robust.
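
A small sketch of the kurtosis measure and its outlier sensitivity (the sample sizes and the injected outlier are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def kurtosis(y):
    """Fourth-order cumulant E[y^4] - 3 (E[y^2])^2 of a (centered) sample."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

n = 100_000
print(kurtosis(rng.normal(size=n)))           # ~0   (Gaussian)
print(kurtosis(rng.laplace(size=n)))          # > 0  (peaked, heavy-tailed)
print(kurtosis(rng.uniform(-1, 1, size=n)))   # < 0  (flat)

# Sensitivity to outliers: a single extreme sample changes the estimate drastically
g = rng.normal(size=n)
g[0] = 50.0
print(kurtosis(g))
```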

#### Negentropy

• Negentropy $J$ is $J(\mathbf{Y}) = h(\mathbf{Y}_{gauss})-h(\mathbf{Y})$ where $\mathbf{Y}_{gauss}$ is a Gaussian random variable that has the same covariance matrix as $\mathbf{Y}$
• Negentropy is always non-negative, and it is zero iff $\mathbf{Y}$ is Gaussian.
• Thus, maximizing negentropy maximizes non-Gaussianity
• Problem: estimating negentropy is difficult, and requires the knowledge of the pdfs.

### Approximation of Negentropy

• Classical method (not robust): $J(Y) \approx \frac{1}{12}E[Y^3]^2+\frac{1}{48}\text{Kurtosis}(Y)^2$
• Another: $J(Y) \approx \sum_{i=1}^{p}k_i(E[G_i(Y)]-E[G_i(N)])^2$
• $k_i$ are coefficients
• $G_i(\centerdot)$ are nonquadratic functions
• $N$ is a zero-mean, unit-variance Gaussian
• Further simplified: $J(Y) \approx (E[G(Y)]-E[G(N)])^2$
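
A sketch of this simplified approximation, using the common choice $G(y)=\log\cosh(y)$ (an assumption; the notes do not fix a particular $G$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gauss = rng.normal(size=n)                  # reference N(0,1) samples

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    """J(Y) ~ (E[G(Y)] - E[G(N)])^2 for a standardized (zero-mean, unit-variance) Y."""
    y = (y - y.mean()) / y.std()
    return (np.mean(G(y)) - np.mean(G(gauss)))**2

print(negentropy_approx(rng.normal(size=n)))           # ~0 for a Gaussian
print(negentropy_approx(rng.laplace(size=n)))          # > 0 for a super-Gaussian source
print(negentropy_approx(rng.uniform(-1, 1, size=n)))   # > 0 for a sub-Gaussian source
```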

### ICA: Minimizing mutual info

• We can also aim to minimize the mutual info among the $Y_i$
• This turns out to be equivalent to maximizing negentropy (when the $Y_i$ have unit variance): $I(Y_1;Y_2;…;Y_m) = C - \sum_i J(Y_i)$ where $C$ is a constant independent of $\mathbf{W}$

### Achieving independence

• Given output vector $\mathbf{Y}$, we want $Y_i$ and $Y_j$ to be statistically independent.
• Can be achieved by $I(Y_i;Y_j)=0$
• Another method: make the pdf $f_{\mathbf{Y}}(\mathbf{y},\mathbf{W})$ approach the factorial distribution $f_{\mathbf{Y}}(\mathbf{y},\mathbf{W}) = \prod_{i=1}^{m}\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})$, where $\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})$ is the marginal pdf of $Y_i$. The mismatch can be measured by $D_{f\Vert\tilde{f}}(\mathbf{W})$

KL divergence is

$D_{f\Vert\tilde{f}}(\mathbf{W}) = -h(\mathbf{Y})+\sum_{i=1}^{m}\tilde{h}(Y_i)$

Output entropy:

$h(\mathbf{Y})=h(\mathbf{WX})=h(\mathbf{X})+\log \vert \det (\mathbf{W})\vert$

Calculating $\tilde{h}(Y_i)$ exactly is tricky; approximating the marginal pdf leads to a polynomial activation function $\varphi(y_i)$.

### Learning Weight

• Learning objective is to minimize the KL divergence $D_{f\Vert\tilde{f}}$
• Gradient descent: $\Delta w_{ik}=-\eta \frac{\partial}{\partial w_{ik}} D_{f\Vert\tilde{f}} \\ = \eta \left( (\mathbf{W}^{-T})_{ik}-\varphi(y_i)x_k\right)$
• Final rule: $\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n) \left[\mathbf{I} -\varphi(\mathbf{y}(n))\mathbf{y}^T(n)\right]\mathbf{W}^{-T}(n)$
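
A minimal NumPy sketch of this learning rule in batch form, using the equivalent expression $\mathbf{W}^{-T}-\varphi(\mathbf{y})\mathbf{x}^T$ from the gradient above; the choices $\varphi(y)=\tanh(y)$, Laplacian sources, the mixing matrix, and the learning-rate schedule are illustrative assumptions rather than part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 2

# Two independent super-Gaussian (Laplacian) sources; an illustrative choice
U = rng.laplace(size=(m, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                    # unknown mixing matrix
X = A @ U                                     # observed mixtures

W = np.eye(m)                                 # demixing matrix, initialized to the identity
eta = 0.05
phi = np.tanh                                 # stand-in for the activation phi(.)

for _ in range(500):
    Y = W @ X
    # Batch average of the update W <- W + eta * (W^{-T} - E[phi(y) x^T]),
    # which equals the rule W <- W + eta * [I - phi(y) y^T] W^{-T} since y = W x.
    W = W + eta * (np.linalg.inv(W).T - (phi(Y) @ X.T) / n)

Y = W @ X
# Up to permutation, sign, and scale, Y should recover U; inspect the cross-correlations,
# which should form a (signed, permuted) near-identity pattern.
print(np.round(np.corrcoef(np.vstack([Y, U]))[:m, m:], 2))
```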