2017-02-01

# Clustering

## K-Means Algorithm

### Input:

• K (number of clusters)
• Training set $\{ x^{(1)},x^{(2)},...x^{(m)}\}$

$x^{(i)} \in \mathbb{R}^n$ (drop $x_0=1$ convention)

### Algorithm

Randomly initialize K cluster centroids $\mu_1, \mu_2, ..., \mu_K \in \mathbb{R}^n$

Repeat {
for $i=1 \mbox{ to } m$
$c^{(i)} :=$ index (from 1 to K) of the cluster centroid closest to $x^{(i)}$, i.e. the k that minimizes $\vert \vert x^{(i)}-\mu_k \vert \vert^2$
for $k=1 \mbox{ to } K$
$\mu_k :=$ average (mean) of points assigned to cluster k
}
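
The loop above can be sketched in pure Python (a minimal illustration, not from the notes; the helper names and toy data below are hypothetical):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(X, K, iters=10):
    """One run of K-means; X is a list of points (tuples of floats)."""
    mu = random.sample(X, K)  # initialize centroids to K distinct training examples
    c = [0] * len(X)
    for _ in range(iters):
        # cluster assignment step: c[i] = index of the centroid closest to X[i]
        c = [min(range(K), key=lambda k: dist2(x, mu[k])) for x in X]
        # move centroid step: mu[k] = mean of the points assigned to cluster k
        for k in range(K):
            pts = [x for x, ci in zip(X, c) if ci == k]
            if pts:  # keep the old centroid if a cluster ends up empty
                dim = len(pts[0])
                mu[k] = tuple(sum(p[j] for p in pts) / len(pts) for j in range(dim))
    return c, mu
```

On well-separated blobs this recovers the blobs; a single run can still converge to a local optimum, which is why multiple random initializations are used.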

## optimization objective

$c^{(i)}$ = index of cluster (1,2,…,K) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid k ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

Optimization objective:
Minimize:
$J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)=\frac{1}{m} \sum_{i=1}^{m} \vert \vert x^{(i)}-\mu_{c^{(i)}} \vert \vert^2$

## random initialization

Should have $K < m$

Randomly pick K training examples

Set $\mu_1,...,\mu_K$ equal to these K examples

For $i=1 \mbox{ to } 100$ (typically 50–1000) {
Randomly initialize K-means.
Run K-means. Get $c^{(1)},...,c^{(m)},\mu_1,...,\mu_K$.
Compute cost function (distortion) $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$
}

Pick the clustering that gave the lowest cost $J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)$
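
This multiple-restart loop can be sketched in Python on 1-D data (illustrative only; the helper names and data are hypothetical):

```python
import random

def run_kmeans(X, K, iters=10):
    """One K-means run on 1-D data; returns (distortion J, assignments, centroids)."""
    mu = random.sample(X, K)  # random initialization: K distinct training examples
    c = [0] * len(X)
    for _ in range(iters):
        c = [min(range(K), key=lambda k: (x - mu[k]) ** 2) for x in X]
        for k in range(K):
            pts = [x for x, ci in zip(X, c) if ci == k]
            if pts:
                mu[k] = sum(pts) / len(pts)
    J = sum((x - mu[ci]) ** 2 for x, ci in zip(X, c)) / len(X)  # distortion
    return J, c, mu

def best_of(X, K, restarts=100):
    """Run K-means with many random initializations; keep the lowest-cost run."""
    return min((run_kmeans(X, K) for _ in range(restarts)), key=lambda r: r[0])
```

Keeping the minimum over many runs makes it far more likely that at least one initialization escapes the bad local optima.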

## choose the number of clusters

Usually chosen by hand (human insight) based on the downstream purpose.
The elbow method can also be used: plot the cost $J$ as a function of K and pick the K at the "elbow" of the curve.

# Motivation

## data compression

Reduce data from 2D to 1D: project the data onto a line
Reduce data from 3D to 2D: project the data onto a plane

# Principal Component Analysis

## problem formulation

PCA is the most popular method for dimensionality reduction.

Try to find a lower-dimensional surface onto which to project the data so as to minimize the projection error (the orthogonal distance between each point and the surface).

Reduce from n-dimension to k-dimension: Find k vectors $u^{(1)},u^{(2)},...,u^{(k)}$ onto which to project the data, so as to minimize the projection error.

PCA is different from linear regression: linear regression minimizes the vertical distances to a line in order to predict $y$, whereas PCA minimizes the orthogonal projection distances and treats all features equally (there is no $y$).

Before applying PCA, perform mean normalization and feature scaling so that the features have zero mean and comparable ranges.

## algorithm

### data preprocessing

Training set: $x^{(1)},x^{(2)},...,x^{(m)}$
Preprocessing (feature scaling/mean normalization):
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
Scale features to have comparable range of values

Compute the “covariance matrix”:
$\Sigma = \frac{1}{m} \sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T$
This matrix is symmetric positive semidefinite.
Compute the “eigenvectors” of matrix $\Sigma$:

```matlab
% X is the m-by-n design matrix (one training example per row)
Sigma = 1/m * X' * X;       % n-by-n covariance matrix
[U, S, V] = svd(Sigma);     % columns of U are the eigenvectors of Sigma
% eig(Sigma) returns the same eigenvectors (up to ordering and sign),
% since Sigma is symmetric positive semidefinite
Ureduce = U(:, 1:k);        % keep the first k eigenvectors
z = Ureduce' * x;           % project x from n down to k dimensions
```

Take the first k columns of the U matrix to form $U_{reduced}$
$z=U_{reduced}^T x$
$z \in \mathbb{R}^k$

# applying PCA

## reconstruction from compressed representation

The approximate value on the projection surface:
$x_{approx}^{(i)} = U_{reduced} z^{(i)}$
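
As a concrete end-to-end illustration, here is a small pure-Python sketch of PCA on 2-D data reduced to $k=1$, including the reconstruction step (the data values are made up; for a symmetric 2×2 covariance matrix the leading eigenvector has a closed form, so no linear-algebra library is needed):

```python
import math

# hypothetical 2-D data lying almost along the line y = x
X = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
m = len(X)

# mean normalization
mu = tuple(sum(x[j] for x in X) / m for j in range(2))
Xc = [(x[0] - mu[0], x[1] - mu[1]) for x in X]

# covariance matrix Sigma = (1/m) * sum_i x^(i) x^(i)^T  ->  [[a, b], [b, c]]
a = sum(x[0] * x[0] for x in Xc) / m
b = sum(x[0] * x[1] for x in Xc) / m
c = sum(x[1] * x[1] for x in Xc) / m

# leading eigenvalue/eigenvector of the symmetric 2x2 matrix (closed form)
lam1 = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
ux, uy = b, lam1 - a          # eigenvector direction (valid when b != 0)
norm = math.hypot(ux, uy)
u = (ux / norm, uy / norm)    # u^(1), the first principal component

# project to 1-D (z = u^T x) and reconstruct (x_approx = u * z)
z = [u[0] * x[0] + u[1] * x[1] for x in Xc]
X_approx = [(u[0] * zi, u[1] * zi) for zi in z]
```

The reconstructions in `X_approx` lie exactly on the line spanned by `u`; for this near-collinear data they are close to the original (mean-normalized) points.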

## choosing the number of principal components

Choosing k

Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2$

Choose k to be the smallest value such that
$\frac{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2}{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2} \le 0.01$
99% of variance is retained.

Algorithm:
Try PCA with $k = 1,2,...$
Compute $U_{reduced}, z^{(1)}, z^{(2)},...,z^{(m)}, x_{approx}^{(1)},...,x_{approx}^{(m)}$
Check whether the ratio above is less than or equal to 0.01; if not, increase k

A more efficient alternative is to call svd once and read off the singular values:

```matlab
[U, S, V] = svd(Sigma);   % S is a diagonal matrix of singular values
```


For a given k, the ratio can be computed from the diagonal entries of $S$ as:
$1-\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$
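
A small numeric check of this criterion (the covariance entries are hypothetical, chosen to mimic nearly collinear data; for a symmetric matrix the diagonal of $S$ from svd equals its eigenvalues):

```python
import math

# hypothetical 2x2 covariance matrix [[a, b], [b, c]] of nearly collinear data
a, b, c = 2.0, 1.96, 1.928

# eigenvalues of a symmetric 2x2 matrix in closed form (these are the S_ii values)
d = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam = [(a + c) / 2 + d, (a + c) / 2 - d]    # lam[0] >= lam[1] >= 0

# for k = 1: fraction of the variance NOT retained
ratio = 1 - lam[0] / (lam[0] + lam[1])
keep_k1 = ratio <= 0.01    # k = 1 is acceptable iff this holds (99% variance retained)
```

If `keep_k1` were false, one would move on to k = 2 and recheck.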

## supervised learning speedup

Extract inputs:

Unlabeled data set $x^{(1)},x^{(2)},...,x^{(m)} \in \mathbb{R}^{10000}$ $\rightarrow$
$z^{(1)},z^{(2)},...,z^{(m)} \in \mathbb{R}^{1000}$
New training set: $(z^{(1)},y^{(1)}),(z^{(2)},y^{(2)}),...,(z^{(m)},y^{(m)})$

Mapping $x^{(i)}\rightarrow z^{(i)}$ should be defined by running PCA only on the training set. This mapping can be applied as well to the cross validation and test sets.

Using PCA to reduce the number of features is not a good way to prevent overfitting: PCA throws away information without looking at the labels $y$. Use regularization instead.

Before implementing PCA, first try running the learning algorithm on the original raw data; only introduce PCA if that does not work (e.g. too slow, or too much memory/disk).