TAMU Neural Network 10 Information-Theoretic Models

Motivation

Shannon (1948) developed Information Theory concerning the communication process.

It is a deep mathematical theory concerned with the essence of the communication process.
Provides a framework for: efficiency of information representation, limitations in reliable transmission of information over a communication channel.
Gives bounds on optimum representation and transmission of signals

Motivation

Maximum mutual information principle (Linsker 1988): Synaptic connections of a multilayered neural network develop in such a way as to maximize the amount of information preserved when signals are transformed at each processing stage of the network, subject to certain constraints.
Redundancy reduction (Attneave 1954): “Major function of perceptual machinary is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors”. In other words, redundancy reduction = feature extraction.

Topics

Entropy
Mutual information
Relative entropy
Differential entropy of continuous random variables

Entropy

Notation: $X$ random variable, $x$ value of random variable.
$X$ can be quantized into a finite number of discrete level although it can take continuous values. \[X={ x_k \vert k=0, \pm 1,…, \pm K}\]
$X = x_k$ occur with probability \[p_k = P(X=x_k)\] with requirement \[0 \le p_k \le 1, \sum_{k=-K}^{K} p_k=1\]

Uncertainty, Surprise, Info and Entropy

If , where is observed. There is no surprise, less uncertain.
- High probability leads to less surprise and less uncertain
Gain information when going from a high-uncertainty state to a low-uncertainty state.

Entropy

Uncertainty measurement for event $X=x_k$ ( $\log$ assumes $\log_2)$ :

\[I(x_k) = \log(\frac{1}{p_k}) = - \log p_k\]

Entropy of a random variable: Average uncertainty

\[H(X) = E[I(x_k)] = - \sum_{k=-K}^{K} p_k\log p_k\]

Properties

The Higher the $H(X)$ , the higher the potential information you can gain through observation/measurement.
Bounds on the entropy: \[0 \le H(X) \le \log(2K+1)\]
- $H(X)=0$ when $p_k=1$ and $p_j=0$ for $j \ne k$ : No uncertainty
- $H(K) = \log(2K+1)$ when $p_k = 1/(2K+1)$ for all $k$ : Maximum uncertainty, when all events are equiprobable.
For two probability distributions $\{p_k\}$ and $\{q_k\}$ : \[\sum_k p_k \log \left( \frac{p_k}{q_k} \right) \ge 0\]
Kullback-Leibler divergence (relative entropy): \[D_{p\Vert q} = \sum_{x \in \mathcal{X}} p_X(x) \log \left(\frac{p_X(x)}{q_X(x)}\right)\] meaures how different two probability distributions are.

Differential entropy

\[h(X)=-\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx = -E[\log f_X(x)]\]

$H(X)$ in the limit does not equal $h(X)$ :

\[H(X)=h(X)-\lim_{\delta x \rightarrow 0}\log \delta x\]

Properties

$h(X+c) = h(X)$
$h(aX)=h(X)+ \log \vert a \vert$
For vector random variable $\mathbf{X}$ \[h(\mathbf{AX}) = h(\mathbf{X}) + \log \vert \det(\mathbf{A}) \vert\]

Maximum entropy principle

Pick the probability distribution that maximizes the entropy, subject to constraints on the distribution.

1-D Gaussian distribution

For a given variance $\sigma^2$ , Gaussian random variable has the largest differential entropy
The entropy of a Gaussian random variable $X$ is uniquely determined by the variance of $X$

Mutual Information

Conditional entropy: entropy in $X$ after observing $Y$ (How much uncertainty) \[H(X\vert Y) = H(X,Y)-H(Y)\]
Joint-entropy is defined as: \[H(X,Y)=-\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\]
Mutual Info: How much uncertainty is reduced in $X$ when we observe $Y$ ? \[I(X;Y)=H(X)-H(X \vert Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]

Continuous random variable

\[I(X;Y)=h(X)-h(X \vert Y) \\ = h(Y)-h(Y \vert X) \\ = h(X)+ h(Y)-h(X , Y)\]

\[I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y) \log \left( \frac{f_X(x\vert y)}{f_X(x)}\right)dxdy\]

Properties of KL Divergence

Always positive or zero. Zero means there is a perfect match between the two distributions.
Invariant with regard to
- Permutation of the order in which the components of the vector random variable $\mathbf{x}$ are arranged.
- Ampltitude scaling
- Monotonic nonlinear transformation
Related to mutual info \[I(\mathbf{X};\mathbf{Y})=D_{f_{\mathbf{X},\mathbf{Y}}\Vert f_\mathbf{X} f_\mathbf{Y}}\]

Application

We can use mutual info as an objective function to be optimized when developing learning rules for neural networks.

Maximize mutual info between input vector $\mathbf{X}$ and output vector $\mathbf{Y}$
Maximize mutual info between $Y_a$ and $Y_b$ driven by near-by input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$ from a single image.
Minimize information between $Y_a$ and $Y_b$ driven by input vectors from different images.
Minimize statistical dependence between $Y_i$ s.

Maximum mutual info principle

Infomax principle by Linsker: Maximize $I(\mathbf{Y};\mathbf{X})$ for input vector $\mathbf{X}$ and output vector $\mathbf{Y}$ .
Basis for statistical signmal processing
A mathematical framework for self-organization
Relation to channel capacity, Shannon limit on the rate of info transmission through a communication channel.

Example: Single neuron

Single neuron with additive output noise:

\[Y=\left( \sum_{i=1}^{m} w_i X_i \right) + N\]

where:

$Y$ is the output
$w_i$ is the weight
$X_i$ is the input
$N$ is the processing noise

Assumption:

Output $Y$ is a Gaussian with variance $\sigma_Y^2$
Noise $N$ is also a Gaussian with $\mu = 0$ and variance $\sigma_N^2$
Input and noise are uncorrelated: $E[X_i N] = 0$ for all $i$

Entropy:

Mutual info between input and output: \[I(Y;\mathbf{X})=h(Y)-h(Y\vert \mathbf{X})\]
Since $P(Y \vert X)=c+P(N)$ , where $c$ is a constant. \[h(Y\vert \mathbf{X}) = h(N)\]
Given $\mathbf{X}$ , what remains in $Y$ is just noise $N$ . So \[I(Y;\mathbf{X})=h(Y)-h(N)\]

For a Gaussian distribution:

\[h(Y)=\frac{1}{2}[1+\log (2\pi \sigma_Y^2)]\]

Finally:

\[I(Y;\mathbf{X}) = \frac{1}{2} \log \left(\frac{\sigma_Y^2}{\sigma_N^2}\right)\]

So mutual information $I(Y;\mathbf{X})$ can be maximized simply by maximizing the output variance $\sigma_Y^2$ .

Single neuron with input noise:

\[Y= \sum_{i=1}^{m} w_i \left( X_i +N_i\right) \\ = \sum_{i=1}^{m} w_i X_i + \underbrace{\sum_{i=1}^{m} w_i N_i}_{\text{call this } N’}\]

$N'$ is also a Gaussian distribution, with variance: \[\sigma_{N’}^2=\sum_{i=1}^{m} w_i^2 \sigma_N^2\]

Finally:

\[I(Y\vert \mathbf{X})=h(N’)=\frac{1}{2} \log\left(\frac{\sigma_Y^2}{\sigma_N^2\sum_{i=1}^{m} w_i^2}\right)\]

Lesson

Adopting a Gaussian noise model, we can invoke a “surrogate” mutual information computed relatively easily, just compute variance.

Noiseless Network

Transform a random vector $\mathbf{X}$ of arbitrary distribution to a new random vector $\mathbf{Y}$ of different distribution: $\mathbf{Y}=\mathbf{W}\mathbf{X}$
Mutual info in this case: \[I(\mathbf{Y};\mathbf{X})= H(\mathbf{Y}) - H(\mathbf{Y} \vert \mathbf{X})\]
Gradient: \[\frac{\partial I(\mathbf{Y};\mathbf{X})}{\partial \mathbf{W}} = \frac{\partial H(\mathbf{Y})}{\partial \mathbf{W}}\]
Maximizing mutual info between input and output is equivalent to maximizing entropy in the output, both with respect to the weight vector $\mathbf{W}$

Informax

In Shannon’s framework, Order and structure = Redundancy.
Increase in the above reduces uncertainty.
More redundancy in the signal implies less information conveyed.
More information conveyed means less redundancy.
Thus, Infomax principle leads to reduced reduncancy in output $Y$ compared to input $X$ .
When noise is present:
– Input noise: add redundancy in input to combat noise.
– Output noise: add more output components to combat noise.
– Highlevelofnoisefavorsredundancyofrepresentation.
– Lowlevelofnoisefavorsdiversityofrepresentation.

Principle of minimum redundancy

Notation:

Sensory signal $\mathbf{S}$
Noise input $\mathbf{X}$
Recording system $\mathbf{A}$
Noisy output $\mathbf{Y}$

\[\mathbf{X} = \mathbf{S} + \mathbf{N}_1\]

\[\mathbf{Y} = \mathbf{AX} + \mathbf{N}_2\]

Retinal input includes redundant information. Purpose of retinal coding is to reduce/eliminate the redundant bits of data due to correlations and noise, before sending the signal along the optic nerve.
Redundancy measure (with channel capacity $C(\centerdot)$ : \[R=1-\frac{I(\mathbf{Y};\mathbf{S})}{C(\mathbf{Y})}\]
Objective: find recoder matrix $\mathbf{A}$ such that $R$ is minimized, subject to the no info loss constraint: \[I(\mathbf{Y};\mathbf{X}) = I(\mathbf{X};\mathbf{X}) -\epsilon\]
When $\mathbf{S}$ and $\mathbf{Y}$ have the same dimensionality and there is no noise, principle of minimum redundancy is equivalent to the Infomax principle.
Infomax on input/output lead to redundancy reduction

Spatially coherent features

Infomax for unsupervised processing of the image of natural scenes (Becker and Hinton, 1992).
Goal: design a self-organizing system that is capable of learning to encode complex scene information in a simpler form.
Objective: extract higher-order features that exhibit simple coherence across space so that representation for one spatial region can be used to produce that of representation of neighboring regions.

Let $S$ denote a signal component common to both $Y_a$ and $Y_b$ . Output can be expressed in terms of $S$ and some noise:

\[Y_a = S+N_a\]

\[Y_b = S+N_b\]

where $N_a$ and $N_b$ are assumed to be independent and zero-mean Gaussian. Also assume $S$ is Gaussian.

The mutual info then:

\[I(Y_a;Y_b) = h(Y_a) + h(Y_b) - h(Y_a,Y_b)\]

With

\[h(Y_a) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_a^2\right)\right]\]

\[h(Y_b) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_b^2\right)\right]\]

\[h(Y_a,Y_b) = 1+ \log(2\pi) + \frac{1}{2} \log \vert \det(\Sigma) \vert\]

\[\Sigma = \left[ \begin{array}{cc} \sigma_a^2 & \rho_{ab}\sigma_a \sigma_b \\ \rho_{ab}\sigma_a \sigma_b & \sigma_b^2 \end{array}\right]\]

\[\rho_{ab} = \frac{E[(Y_a -E[Y_a])(Y_b-E[Y_b])]}{\sigma_a \sigma_b}\]

We get:

\[I(Y_a;Y_b) = - \frac{1}{2} \log \left( 1 - \rho_{ab}^2\right)\]

Maximizing information is equivalent to maximizing correlation between $Y_a$ and $Y_b$ .
Relation to canonical correlation in statistics:
- Given random input vectors $\mathbf{X}_a$ and $\mathbf{X}_b$
- Find two weight vectors $\mathbf{w}_a$ and $\mathbf{w}_b$ so that
- $Y_a = \mathbf{w}_a^T\mathbf{X}_a$ and $Y_a = \mathbf{w}_b^T\mathbf{X}_b$ have maximum correlation between them.
Applications: stereo disparity extraction (Becker and Hinton, 1992)
- 2 images slightly different, find correlated part.
When the inputs come from two separate regions, we want to minimize the mutual information between the two outputs (Ukrainec and Haykin, 1992, 1996).
Applications include when input sources such as different polarizations of the signal are imaged: mutual information between outputs driven by two orthogonal polarizations should be minimized.

Independent Components Analysis (ICA)

Unknown random source vector $\mathbf{U}(n)$ : \[\mathbf{U} = [U_1,U_2,…,U_m]^T\] where $m$ components are supplied by a set of independent sources. Note that we need a series of source vectors.
$\mathbf{U}$ is transformed by an unknown mixing matrix $\mathbf{A}$ : \[\mathbf{X}=\mathbf{A}\mathbf{U}\] where \[\mathbf{X} = [X_1,X_2,…,X_m]^T\]
Both $\mathbf{A}$ and $\mathbf{U}$ are unknown.
Task: find an estimate of the inverse of the mixing matrix(the demixing matrix $\mathbf{W}$ ) to recover the unknown source $\mathbf{U}$ \[\mathbf{Y}=\mathbf{W}\mathbf{X}\]
Known as blind source separation problem
Solution can be obtained by enforcing independence among components of $\mathbf{Y}$ while adjusting $\mathbf{W}$

Ambiguities

Consider $\mathbf{X}=\mathbf{A}\mathbf{U}$ and $\mathbf{Y}=\mathbf{W}\mathbf{X}$

Permutation: $\mathbf{X}=\mathbf{A}\mathbf{P}^{-1}\mathbf{P}\mathbf{U}$ , where $\mathbf{P}$ is a permutation matrix.
$\mathbf{P}$ is like \[\left[ \begin{array}{cccc} 0 & 1 & 0 &0 \\ 1&0&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{array}\right]\]
Sign: the model is unaffected by multiplication of one of the sources by $-1$
Scaling (variance): estimate scaling up $\mathbf{U}$ and scaling down $\mathbf{A}$ will given the same $\mathbf{X}$

Independence

Two random variable $X$ and $Y$ are statistically independent when \[f_{X,Y}(x,y) = f_X(x)f_Y(y)\]
Weaker form: \[E[XY]=E[X]E[Y]\]
Gaussian are bad: When the unknown source is Gaussian, any orthogonal transformation $A$ results in the same Gaussian distribution.

Non-Gaussianity

Non-Gaussianity can be used as a measure of independence
One component of $\mathbf{Y}$ : \[Y_i = \underbrace{[W_{i1},W_{i2},…,W_{im}]\mathbf{A}}_{\text{call this } \mathbf{Z}_i^T}\mathbf{U}\]
So $Y_i$ is a linear combination of random variable $U_k$ ( $Y_i=\sum_{j=1}^{m}Z_{ij}U_j$ ). So it is more Gaussian than any individual $U_k$ . Gaussianity is minimized when $Y_i$ equals one of $U_k$ .

ICA Mearsures of Non-Gaussianity

Kurtosis
Negentropy
etc.

Kurtosis

Kurtosis is the fourth-order cumulant: \[Kurtosis(Y)=E[Y^4]-3\left(E[Y^2]\right)^2\]
Gaussian have 0 kurtosis.
More peaked distributions have positive kurtosis.
More flatter distributions have negative kurtosis.
Learning: Start with random $\mathbf{W}$ , adjust $\mathbf{W}$ and mearsure change in kurtosis, using gradient-based methods.
Drawback: Kurtosis is sensitive to outliers, and thus not robust.

Negentropy

Negentropy $J$ is \[J(\mathbf{Y}) = H(\mathbf{Y}_{gauss})-H(\mathbf{Y})\] where $\mathbf{Y}_{gauss}$ is a Gaussian random variable that has the same covariance matrix as $\mathbf{Y}$
Negentropy is always non-zero, and it’s zero iff $\mathbf{Y}$ is Gaussian.
Thus, maximizing negentropy is to maximize non-Gaussianity
Problem: estimating negentropy is difficult, and requires the knowledge of the pdfs.

Approximation of Negentropy

Classical method(not robust): \[J(Y) \approx \frac{1}{2}E[Y^3]^2+\frac{1}{48}\text{Kurtosis}(Y)^2\]
Another: \[J(Y) \approx \sum_{k=1}^{p}k_i(E[G_i(Y)]-E[G_i(N)])^2\]
- $k_i$ are coefficients
- $G_i(\centerdot)$ are nonquadratic functions
- $N$ is a zero-mean,unit-variance Gaussian
Further simplified: \[J(Y) \approx (E[G(Y)]-E[G(N)])^2\]
- $G_1(Y) = \frac{1}{a_1} \log \cosh a_1 Y$
- $G_2(Y) = - \exp(-Y^2/2)$

ICA Minimizing mutual info

We can also aim to minimize mutual info between $Y_i$
This turns out to be equivalent to maximizing negentropy (when $Y_i$ have unit variance) \[I(Y_1;Y_2;…;Y_m) = C - \sum_i J(Y_i)\] where $C$ is a constant independent from $\mathbf{W}$

Achieving independence

Given output vector $\mathbf{Y}$ , we want $Y_i$ and $Y_j$ to be statistically independent.
Can be achieved by $I(Y_i;Y_j)=0$
Another method: make pdf $f_{\mathbf{Y}}(\mathbf{y},\mathbf{W})$ to approach the factorial distribution: \[f_{\mathbf{Y}}(\mathbf{y},\mathbf{W}) = \prod_{i=1}^{m}\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})\] where $\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})$ is the marginal pdf of $Y_i$ . This can be measured by $D_{f\Vert\tilde{f}}(\mathbf{W})$

KL divergence is

\[D_{f\Vert\tilde{f}}(\mathbf{W}) = -h(\mathbf{Y})+\sum_{i=1}^{m}\tilde{h}(Y_i)\]

Output entropy:

\[h(\mathbf{Y})=h(\mathbf{WX})=h(\mathbf{X})+\log \vert \det (\mathbf{W})\vert\]

Tricky to calculate $\tilde{h}(Y_i)$ , need a polynomial activation function $\varphi(y_i)$

Learning Weight

Learning objective is to minimize the KL divergence $D_{f\Vert\tilde{f}}$
Gradient descent: \[\Delta w_{ik}=-\eta \frac{\partial}{\partial w_{ik}} D_{f\Vert\tilde{f}} \\ = \eta \left( (\mathbf{W}^{-T})_{ik}-\varphi(y_i)x_k\right)\]
Final rule: \[\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n) \left[\mathbf{I} -\varphi(\mathbf{y}(n))\mathbf{y}^T(n)\right]\mathbf{W}^{-T}(n)\]

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 17

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 17

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 17

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1