2017-02-01

Clustering

K-Means Algorithm

Input:

• K (number of clusters)
• Training set $\{ x^{(1)},x^{(2)},...x^{(m)}\}$

$x^{(i)} \in \mathbb{R}^n$ (drop $x_0=1$ convention)

Algorithm

Randomly initialize K cluster centroids $\mu_1, \mu_2, ..., \mu_K \in \mathbb{R}^n$

Repeat {
for $i=1 \mbox{ to } m$
$c^{(i)} :=$ index (from 1 to K) of cluster centroid closest to $x^{(i)}$; i.e. $c^{(i)} := \arg\min_k \vert \vert x^{(i)}-\mu_k \vert \vert^2$
for $k=1 \mbox{ to } K$
$\mu_k :=$ average (mean) of points assigned to cluster k
}
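
As a concrete illustration, here is a minimal Octave sketch of the two steps inside the loop (the variable names X, Mu, c and maxIters are my own; X is m-by-n with one example per row, Mu is K-by-n with one centroid per row):

c = zeros(m, 1);                        % c(i) = index of the centroid assigned to x(i)
for iter = 1:maxIters
  % Cluster assignment step: assign each example to its closest centroid
  for i = 1:m
    dists = sum((Mu - X(i,:)).^2, 2);   % squared distances to all K centroids
    [~, c(i)] = min(dists);
  end
  % Move-centroid step: move each centroid to the mean of its assigned points
  for k = 1:K
    Mu(k,:) = mean(X(c == k, :), 1);    % (in practice, also handle empty clusters)
  end
end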

optimization objective

$c^{(i)}$ = index of cluster (1,2,…,K) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid k ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

Optimization objective:
Minimize:
$J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)=\frac{1}{m} \sum_{i=1}^{m} \vert \vert x^{(i)}-\mu_{c^{(i)}} \vert \vert^2$

The cluster-assignment step minimizes $J$ with respect to $c^{(1)},...,c^{(m)}$ (holding the centroids fixed), and the move-centroid step minimizes $J$ with respect to $\mu_1,...,\mu_K$ (holding the assignments fixed).

random initialization

Should have $K < m$ (fewer clusters than training examples)

Randomly pick K training examples

Set $\mu_1,...,\mu_K$ equal to these K examples

For $i=1 \mbox{ to } 100$ (usually 50–1000) {
Randomly initialize K-means.
Run K-means. Get $c^{(1)},...,c^{(m)},\mu_1,...,\mu_K$.
Compute cost function (distortion) $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$
}

Pick the clustering that gave the lowest cost $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$
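
A minimal Octave sketch of this procedure, assuming a hypothetical helper runKMeans(X, K) that performs one random initialization as above (picking K distinct training examples as the initial centroids, e.g. Mu_init = X(randperm(m, K), :)), runs the K-means loop sketched earlier, and returns the assignments, centroids and distortion:

bestJ = Inf;
for t = 1:100                            % number of random restarts
  [c, Mu, J] = runKMeans(X, K);          % hypothetical helper, see lead-in above
  if J < bestJ                           % keep the clustering with the lowest distortion
    bestJ = J;  bestC = c;  bestMu = Mu;
  end
end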

choose the number of clusters

Often chosen by hand (human insight), based on the purpose the clusters will serve downstream.
The elbow method can also be used: run K-means for a range of values of K, plot the distortion J against K, and pick the K at the "elbow" where J stops decreasing rapidly (often there is no clear elbow); a sketch follows.
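
A sketch of the elbow method, reusing the same hypothetical runKMeans helper from above:

Js = zeros(10, 1);
for K = 1:10
  [~, ~, Js(K)] = runKMeans(X, K);       % distortion for each choice of K
end
plot(1:10, Js, '-o');                    % look for the "elbow" where J stops dropping quickly
xlabel('K (number of clusters)'); ylabel('distortion J');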

Motivation

data compression

Reduce data from 2D to 1D: project onto a line
Reduce data from 3D to 2D: project onto a plane

Principal Component Analysis

problem formulation

PCA is the most popular method for dimensionality reduction.

Try to find a lower-dimensional surface onto which to project the data so as to minimize the projection error (the orthogonal distance between each point and the surface).

Reduce from n-dimension to k-dimension: Find k vectors $u^{(1)},u^{(2)},...,u^{(k)}$ onto which to project the data, so as to minimize the projection error.

This is different from linear regression: linear regression minimizes the vertical distances between the points and the fitted line (predicting y from x), whereas PCA minimizes the orthogonal projection distances and treats all features symmetrically.

Before applying PCA:
perform mean normalization and feature scaling so that the features have zero mean and comparable ranges.

algorithm

data preprocessing

Training set: $x^{(1)},x^{(2)},...,x^{(m)}$
Preprocessing (feature scaling/mean normalization):
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
Scale features to have a comparable range of values, e.g. replace $x_j^{(i)}$ with $\frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or range) of feature j; a short sketch follows.
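
A small Octave sketch of this preprocessing (variable names are my own; X is the m-by-n matrix of training inputs):

mu = mean(X);               % 1-by-n vector of feature means
s  = std(X);                % per-feature scale (standard deviations; ranges also work)
Xnorm = (X - mu) ./ s;      % zero-mean features with comparable ranges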

Compute the “covariance matrix”:
$\Sigma = \frac{1}{m} \sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T$
This matrix is symmetric and positive semidefinite.
Compute “eigenvectors” of matrix $\Sigma$:

% X is the m-by-n design matrix (one normalized example per row)
Sigma = (1/m) * X' * X;        % n-by-n covariance matrix
[U, S, V] = svd(Sigma);        % columns of U are the eigenvectors / principal components
% Alternatively, [U, D] = eig(Sigma) gives the same eigenvectors (up to ordering
% and sign) because Sigma is symmetric and positive semidefinite.
Ureduce = U(:, 1:k);           % keep the first k principal components
z = Ureduce' * x;              % project one example x (n-by-1) down to k dimensions


Take the first k columns of the U matrix to form $U_{reduced}$
$z=U_{reduced}^T x$
$z \in \mathbb{R}^k$

applying PCA

reconstruction from compressed representation

The approximate value on the projection surface:
$x_{approx}^{(i)} = U_{reduced} z^{(i)}$
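
In Octave this is a one-liner, reusing Ureduce from the code above (a sketch):

x_approx = Ureduce * z;     % n-by-1 point lying on the k-dimensional projection surface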

choosing the number of principal components

Choosing k

Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2$

Choosing k to be the smallest value so that
$\frac{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2}{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2} \le 0.01$
99% of variance is retained.

Algorithm:
Try PCA with $k = 1,2,...$
Compute $U_{reduced}, z^{(1)}, z^{(2)},...,z^{(m)}, x_{approx}^{(1)},...,x_{approx}^{(m)}$
Check if the ratio is less than or equal to 0.01

[U,S,V] = svd(Sigma);   % S is diagonal, with singular values S(i,i) in decreasing order


For a given k, the ratio can be computed from the diagonal entries of S:
$1-\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$
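
A minimal Octave sketch of this check, using the diagonal of S from the svd call above:

s = diag(S);                         % singular values, sorted in decreasing order
retained = cumsum(s) / sum(s);       % fraction of variance retained for k = 1..n
k = find(retained >= 0.99, 1);       % smallest k keeping at least 99% of the variance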

supervised learning speedup

Extract inputs:

Unlabeled data set $x^{(1)},x^{(2)},...,x^{(m)} \in \mathbb{R}^{10000}$ $\rightarrow$
$z^{(1)},z^{(2)},...,z^{(m)} \in \mathbb{R}^{1000}$
New training set: $(z^{(1)},y^{(1)}),(z^{(2)},y^{(2)}),...,(z^{(m)},y^{(m)})$

The mapping $x^{(i)}\rightarrow z^{(i)}$ should be defined by running PCA only on the training set; the same mapping can then be applied to the cross validation and test sets.
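
A sketch of this pipeline in Octave (Xtrain and Xtest are my own names; Xtest must be normalized with the training-set mean and scale before projecting):

Sigma = (1/m) * Xtrain' * Xtrain;    % covariance of the (normalized) training inputs only
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = Xtrain * Ureduce;           % compressed inputs for the supervised learner
Ztest  = Xtest  * Ureduce;           % apply the SAME mapping to CV / test inputs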

It is not a good idea to reduce the number of features with PCA in order to prevent overfitting; use regularization instead.

Before implementing PCA, first try running the learning algorithm on the raw data; only use PCA if that does not do what you want (e.g. it is too slow or uses too much memory/disk).