Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Neural Network 10 Information-Theoretic Models



Shannon (1948) developed Information Theory concerning the communication process.

  • It is a deep mathematical theory concerned with the essence of the communication process.
  • Provides a framework for: efficiency of information representation, limitations in reliable transmission of information over a communication channel.
  • Gives bounds on optimum representation and transmission of signals


  • Maximum mutual information principle (Linsker 1988): Synaptic connections of a multilayered neural network develop in such a way as to maximize the amount of information preserved when signals are transformed at each processing stage of the network, subject to certain constraints.
  • Redundancy reduction (Attneave 1954): “Major function of perceptual machinary is to strip away some of the redundancy of stimulation, to describe or encode information in a form more economical than that in which it impinges on the receptors”. In other words, redundancy reduction = feature extraction.


  • Entropy
  • Mutual information
  • Relative entropy
  • Differential entropy of continuous random variables


  • Notation: random variable, value of random variable.
  • can be quantized into a finite number of discrete level although it can take continuous values. \[X={ x_k \vert k=0, \pm 1,…, \pm K}\]
  • occur with probability \[p_k = P(X=x_k)\] with requirement \[0 \le p_k \le 1, \sum_{k=-K}^{K} p_k=1\]

Uncertainty, Surprise, Info and Entropy

  • If , where is observed. There is no surprise, less uncertain.
    • High probability leads to less surprise and less uncertain
  • Gain information when going from a high-uncertainty state to a low-uncertainty state.


Uncertainty measurement for event ( assumes :

\[I(x_k) = \log(\frac{1}{p_k}) = - \log p_k\]

Entropy of a random variable: Average uncertainty

\[H(X) = E[I(x_k)] = - \sum_{k=-K}^{K} p_k\log p_k\]


  • The Higher the , the higher the potential information you can gain through observation/measurement.
  • Bounds on the entropy: \[0 \le H(X) \le \log(2K+1)\]
    • when and for : No uncertainty
    • when for all : Maximum uncertainty, when all events are equiprobable.
  • For two probability distributions and : \[\sum_k p_k \log \left( \frac{p_k}{q_k} \right) \ge 0\]
  • Kullback-Leibler divergence (relative entropy): \[D_{p\Vert q} = \sum_{x \in \mathcal{X}} p_X(x) \log \left(\frac{p_X(x)}{q_X(x)}\right)\] meaures how different two probability distributions are.

Differential entropy

\[h(X)=-\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx = -E[\log f_X(x)]\]

in the limit does not equal :

\[H(X)=h(X)-\lim_{\delta x \rightarrow 0}\log \delta x\]


  • For vector random variable \[h(\mathbf{AX}) = h(\mathbf{X}) + \log \vert \det(\mathbf{A}) \vert\]

Maximum entropy principle

Pick the probability distribution that maximizes the entropy, subject to constraints on the distribution.

1-D Gaussian distribution

  • For a given variance , Gaussian random variable has the largest differential entropy
  • The entropy of a Gaussian random variable is uniquely determined by the variance of

Mutual Information

  • Conditional entropy: entropy in after observing (How much uncertainty) \[H(X\vert Y) = H(X,Y)-H(Y)\]
  • Joint-entropy is defined as: \[H(X,Y)=-\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)\]
  • Mutual Info: How much uncertainty is reduced in when we observe ? \[I(X;Y)=H(X)-H(X \vert Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\]

Continuous random variable

\[I(X;Y)=h(X)-h(X \vert Y) \\ = h(Y)-h(Y \vert X) \\ = h(X)+ h(Y)-h(X , Y)\]

\[I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y) \log \left( \frac{f_X(x\vert y)}{f_X(x)}\right)dxdy\]

Properties of KL Divergence

  • Always positive or zero. Zero means there is a perfect match between the two distributions.
  • Invariant with regard to
    • Permutation of the order in which the components of the vector random variable are arranged.
    • Ampltitude scaling
    • Monotonic nonlinear transformation
  • Related to mutual info \[I(\mathbf{X};\mathbf{Y})=D_{f_{\mathbf{X},\mathbf{Y}}\Vert f_\mathbf{X} f_\mathbf{Y}}\]


We can use mutual info as an objective function to be optimized when developing learning rules for neural networks.

  • Maximize mutual info between input vector and output vector
  • Maximize mutual info between and driven by near-by input vectors and from a single image.
  • Minimize information between and driven by input vectors from different images.
  • Minimize statistical dependence between s.

Maximum mutual info principle

  • Infomax principle by Linsker: Maximize for input vector and output vector .
  • Basis for statistical signmal processing
  • A mathematical framework for self-organization
  • Relation to channel capacity, Shannon limit on the rate of info transmission through a communication channel.

Example: Single neuron

Single neuron with additive output noise:

\[Y=\left( \sum_{i=1}^{m} w_i X_i \right) + N\]


  • is the output
  • is the weight
  • is the input
  • is the processing noise


  • Output is a Gaussian with variance
  • Noise is also a Gaussian with and variance
  • Input and noise are uncorrelated: for all


  • Mutual info between input and output: \[I(Y;\mathbf{X})=h(Y)-h(Y\vert \mathbf{X})\]
  • Since , where is a constant. \[h(Y\vert \mathbf{X}) = h(N)\]
  • Given , what remains in is just noise . So \[I(Y;\mathbf{X})=h(Y)-h(N)\]

For a Gaussian distribution:

\[h(Y)=\frac{1}{2}[1+\log (2\pi \sigma_Y^2)]\]


\[I(Y;\mathbf{X}) = \frac{1}{2} \log \left(\frac{\sigma_Y^2}{\sigma_N^2}\right)\]

So mutual information can be maximized simply by maximizing the output variance .

Single neuron with input noise:

\[Y= \sum_{i=1}^{m} w_i \left( X_i +N_i\right) \\ = \sum_{i=1}^{m} w_i X_i + \underbrace{\sum_{i=1}^{m} w_i N_i}_{\text{call this } N’}\]

is also a Gaussian distribution, with variance: \[\sigma_{N’}^2=\sum_{i=1}^{m} w_i^2 \sigma_N^2\]


\[I(Y\vert \mathbf{X})=h(N’)=\frac{1}{2} \log\left(\frac{\sigma_Y^2}{\sigma_N^2\sum_{i=1}^{m} w_i^2}\right)\]


Adopting a Gaussian noise model, we can invoke a “surrogate” mutual information computed relatively easily, just compute variance.

Noiseless Network

  • Transform a random vector of arbitrary distribution to a new random vector of different distribution:
  • Mutual info in this case: \[I(\mathbf{Y};\mathbf{X})= H(\mathbf{Y}) - H(\mathbf{Y} \vert \mathbf{X})\]
  • Gradient: \[\frac{\partial I(\mathbf{Y};\mathbf{X})}{\partial \mathbf{W}} = \frac{\partial H(\mathbf{Y})}{\partial \mathbf{W}}\]
  • Maximizing mutual info between input and output is equivalent to maximizing entropy in the output, both with respect to the weight vector


  • In Shannon’s framework, Order and structure = Redundancy.
  • Increase in the above reduces uncertainty.
  • More redundancy in the signal implies less information conveyed.
  • More information conveyed means less redundancy.
  • Thus, Infomax principle leads to reduced reduncancy in output compared to input .
  • When noise is present:
    – Input noise: add redundancy in input to combat noise.
    – Output noise: add more output components to combat noise.
    – Highlevelofnoisefavorsredundancyofrepresentation.
    – Lowlevelofnoisefavorsdiversityofrepresentation.

Principle of minimum redundancy


  • Sensory signal
  • Noise input
  • Recording system
  • Noisy output

\[\mathbf{X} = \mathbf{S} + \mathbf{N}_1\]

\[\mathbf{Y} = \mathbf{AX} + \mathbf{N}_2\]

  • Retinal input includes redundant information. Purpose of retinal coding is to reduce/eliminate the redundant bits of data due to correlations and noise, before sending the signal along the optic nerve.
  • Redundancy measure (with channel capacity : \[R=1-\frac{I(\mathbf{Y};\mathbf{S})}{C(\mathbf{Y})}\]
  • Objective: find recoder matrix such that is minimized, subject to the no info loss constraint: \[I(\mathbf{Y};\mathbf{X}) = I(\mathbf{X};\mathbf{X}) -\epsilon\]
  • When and have the same dimensionality and there is no noise, principle of minimum redundancy is equivalent to the Infomax principle.
  • Infomax on input/output lead to redundancy reduction

Spatially coherent features

  • Infomax for unsupervised processing of the image of natural scenes (Becker and Hinton, 1992).
  • Goal: design a self-organizing system that is capable of learning to encode complex scene information in a simpler form.
  • Objective: extract higher-order features that exhibit simple coherence across space so that representation for one spatial region can be used to produce that of representation of neighboring regions.

Let denote a signal component common to both and . Output can be expressed in terms of and some noise:

\[Y_a = S+N_a\]

\[Y_b = S+N_b\]

where and are assumed to be independent and zero-mean Gaussian. Also assume is Gaussian.

The mutual info then:

\[I(Y_a;Y_b) = h(Y_a) + h(Y_b) - h(Y_a,Y_b)\]


\[h(Y_a) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_a^2\right)\right]\]

\[h(Y_b) = \frac{1}{2}\left[ 1+ \log \left( 2 \pi \sigma_b^2\right)\right]\]

\[h(Y_a,Y_b) = 1+ \log(2\pi) + \frac{1}{2} \log \vert \det(\Sigma) \vert\]

\[\Sigma = \left[ \begin{array}{cc} \sigma_a^2 & \rho_{ab}\sigma_a \sigma_b \\ \rho_{ab}\sigma_a \sigma_b & \sigma_b^2 \end{array}\right]\]

\[\rho_{ab} = \frac{E[(Y_a -E[Y_a])(Y_b-E[Y_b])]}{\sigma_a \sigma_b}\]

We get:

\[I(Y_a;Y_b) = - \frac{1}{2} \log \left( 1 - \rho_{ab}^2\right)\]

  • Maximizing information is equivalent to maximizing correlation between and .
  • Relation to canonical correlation in statistics:
    • Given random input vectors and
    • Find two weight vectors and so that
    • and have maximum correlation between them.
  • Applications: stereo disparity extraction (Becker and Hinton, 1992)
    • 2 images slightly different, find correlated part.
  • When the inputs come from two separate regions, we want to minimize the mutual information between the two outputs (Ukrainec and Haykin, 1992, 1996).
  • Applications include when input sources such as different polarizations of the signal are imaged: mutual information between outputs driven by two orthogonal polarizations should be minimized.

Independent Components Analysis (ICA)

  • Unknown random source vector : \[\mathbf{U} = [U_1,U_2,…,U_m]^T\] where components are supplied by a set of independent sources. Note that we need a series of source vectors.
  • is transformed by an unknown mixing matrix : \[\mathbf{X}=\mathbf{A}\mathbf{U}\] where \[\mathbf{X} = [X_1,X_2,…,X_m]^T\]

  • Both and are unknown.
  • Task: find an estimate of the inverse of the mixing matrix(the demixing matrix ) to recover the unknown source \[\mathbf{Y}=\mathbf{W}\mathbf{X}\]
  • Known as blind source separation problem
  • Solution can be obtained by enforcing independence among components of while adjusting


Consider and

  • Permutation: , where is a permutation matrix.
  • is like \[\left[ \begin{array}{cccc} 0 & 1 & 0 &0 \\ 1&0&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{array}\right]\]
  • Sign: the model is unaffected by multiplication of one of the sources by
  • Scaling (variance): estimate scaling up and scaling down will given the same


  • Two random variable and are statistically independent when \[f_{X,Y}(x,y) = f_X(x)f_Y(y)\]
  • Weaker form: \[E[XY]=E[X]E[Y]\]
  • Gaussian are bad: When the unknown source is Gaussian, any orthogonal transformation results in the same Gaussian distribution.


  • Non-Gaussianity can be used as a measure of independence
  • One component of : \[Y_i = \underbrace{[W_{i1},W_{i2},…,W_{im}]\mathbf{A}}_{\text{call this } \mathbf{Z}_i^T}\mathbf{U}\]
  • So is a linear combination of random variable (). So it is more Gaussian than any individual . Gaussianity is minimized when equals one of .

ICA Mearsures of Non-Gaussianity

  • Kurtosis
  • Negentropy
  • etc.


  • Kurtosis is the fourth-order cumulant: \[Kurtosis(Y)=E[Y^4]-3\left(E[Y^2]\right)^2\]
  • Gaussian have 0 kurtosis.
  • More peaked distributions have positive kurtosis.
  • More flatter distributions have negative kurtosis.
  • Learning: Start with random , adjust and mearsure change in kurtosis, using gradient-based methods.
  • Drawback: Kurtosis is sensitive to outliers, and thus not robust.


  • Negentropy is \[J(\mathbf{Y}) = H(\mathbf{Y}_{gauss})-H(\mathbf{Y})\] where is a Gaussian random variable that has the same covariance matrix as
  • Negentropy is always non-zero, and it’s zero iff is Gaussian.
  • Thus, maximizing negentropy is to maximize non-Gaussianity
  • Problem: estimating negentropy is difficult, and requires the knowledge of the pdfs.

Approximation of Negentropy

  • Classical method(not robust): \[J(Y) \approx \frac{1}{2}E[Y^3]^2+\frac{1}{48}\text{Kurtosis}(Y)^2\]
  • Another: \[J(Y) \approx \sum_{k=1}^{p}k_i(E[G_i(Y)]-E[G_i(N)])^2\]
    • are coefficients
    • are nonquadratic functions
    • is a zero-mean,unit-variance Gaussian
  • Further simplified: \[J(Y) \approx (E[G(Y)]-E[G(N)])^2\]

ICA Minimizing mutual info

  • We can also aim to minimize mutual info between
  • This turns out to be equivalent to maximizing negentropy (when have unit variance) \[I(Y_1;Y_2;…;Y_m) = C - \sum_i J(Y_i)\] where is a constant independent from

Achieving independence

  • Given output vector , we want and to be statistically independent.
  • Can be achieved by
  • Another method: make pdf to approach the factorial distribution: \[f_{\mathbf{Y}}(\mathbf{y},\mathbf{W}) = \prod_{i=1}^{m}\tilde{f}_{\mathbf{Y}}(y_i,\mathbf{W})\] where is the marginal pdf of . This can be measured by

KL divergence is

\[D_{f\Vert\tilde{f}}(\mathbf{W}) = -h(\mathbf{Y})+\sum_{i=1}^{m}\tilde{h}(Y_i)\]

Output entropy:

\[h(\mathbf{Y})=h(\mathbf{WX})=h(\mathbf{X})+\log \vert \det (\mathbf{W})\vert\]

Tricky to calculate , need a polynomial activation function

Learning Weight

  • Learning objective is to minimize the KL divergence
  • Gradient descent: \[\Delta w_{ik}=-\eta \frac{\partial}{\partial w_{ik}} D_{f\Vert\tilde{f}} \\ = \eta \left( (\mathbf{W}^{-T})_{ik}-\varphi(y_i)x_k\right)\]
  • Final rule: \[\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n) \left[\mathbf{I} -\varphi(\mathbf{y}(n))\mathbf{y}^T(n)\right]\mathbf{W}^{-T}(n)\]

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.