# Clustering

## Unsupervised Learning Introduction

In unsupervised learning we are given an unlabeled training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ and ask the algorithm to find structure in the data; clustering, which groups related examples together, is one such task.

## K-Means Algorithm

### Input:

- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$

(drop the $x_0 = 1$ convention)

### Algorithm

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$

Repeat {

for $i$ = 1 to $m$:

$c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$, i.e. the $k$ that minimizes $\vert \vert x^{(i)} - \mu_k \vert \vert^2$ (cluster assignment step)

for $k$ = 1 to $K$:

$\mu_k$ := average (mean) of the points assigned to cluster $k$ (move centroid step)

}
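A minimal Octave sketch of the loop above (the function name `run_kmeans` and the fixed iteration count are my own, not from the notes; a real implementation would also check for convergence):

```
% run_kmeans.m: hypothetical helper; X is m-by-n, K clusters, iters passes
function [c, mu] = run_kmeans(X, K, iters)
  m = size(X, 1);
  mu = X(randperm(m, K), :);               % centroids start at K random examples
  for t = 1:iters
    % Cluster assignment step: c(i) = index of centroid closest to x^(i)
    d = zeros(m, K);
    for k = 1:K
      d(:, k) = sum((X - mu(k, :)).^2, 2); % squared distances (broadcasting)
    end
    [~, c] = min(d, [], 2);
    % Move centroid step: mu_k = mean of points assigned to cluster k
    for k = 1:K
      if any(c == k)
        mu(k, :) = mean(X(c == k, :), 1);
      end
    end
  end
end
```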

## optimization objective

$c^{(i)}$ = index of cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)

$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Optimization objective (the distortion cost function):

Minimize over $c^{(1)}, \dots, c^{(m)}$ and $\mu_1, \dots, \mu_K$:

\[J(c^{(1)},…,c^{(m)},\mu_1,…,\mu_K)=\frac{1}{m} \sum_{i=1}^{m} \vert \vert x^{(i)}-\mu_{c^{(i)}} \vert \vert^2\]

The cluster assignment step minimizes $J$ with respect to the $c^{(i)}$ (holding the centroids fixed), and the move centroid step minimizes $J$ with respect to the $\mu_k$, so $J$ never increases.
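Given an assignment `c` and centroids `mu` from the sketch above, the distortion can be computed in one vectorized line (same hypothetical variable names):

```
% mu(c, :) picks out mu_{c^(i)} for each example, so row i of the
% difference is x^(i) - mu_{c^(i)}
J = mean(sum((X - mu(c, :)).^2, 2));
```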

## random initialization

Should have $K < m$.

Randomly pick $K$ training examples.

Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples.

For $i$ = 1 to $N$ (usually 50-1000 iterations) {

Randomly initialize K-means.

Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.

Compute the cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$.

}

Pick the clustering that gave the lowest cost $J$.
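A sketch of this loop, reusing the hypothetical `run_kmeans` helper from above (the counts 100 and 10 are placeholders):

```
bestJ = Inf;
for i = 1:100                            % usually 50-1000 initializations
  [c, mu] = run_kmeans(X, K, 10);        % each call re-initializes randomly
  J = mean(sum((X - mu(c, :)).^2, 2));   % distortion for this run
  if J < bestJ
    bestJ = J; bestC = c; bestMu = mu;   % keep the lowest-cost clustering
  end
end
```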

## choose the number of clusters

Usually chosen by hand (human insight), based on the purpose the clusters will serve downstream.

Can also use the elbow method: plot the distortion $J$ as a function of $K$ and pick the $K$ at the "elbow" where the curve flattens; in practice the elbow is often ambiguous, as in the sketch below.
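A sketch of the elbow method using the same hypothetical helper (the range 1 to 10 is a placeholder):

```
Js = zeros(1, 10);
for K = 1:10
  [c, mu] = run_kmeans(X, K, 10);
  Js(K) = mean(sum((X - mu(c, :)).^2, 2));   % distortion for this K
end
plot(1:10, Js, '-o'); xlabel('K'); ylabel('distortion J');
```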

# Motivation

## data compression

Reduce data from 2D to 1D: project the points onto a line.

Reduce data from 3D to 2D: project the points onto a plane.

## visualization

Reduce data to 2 or 3 dimensions so that it can be plotted and inspected.

# Principal Component Analysis

## problem formulation

PCA is the most popular method for dimensionality reduction.

It tries to find a lower-dimensional surface onto which to project the data so as to minimize the projection error (the orthogonal distance between each point and the surface).

Reduce from $n$-dimensions to $k$-dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.

This is different from linear regression: linear regression minimizes the vertical squared error between $y$ and the prediction, whereas PCA has no $y$ to predict and minimizes the orthogonal projection error, treating all features symmetrically.

Before running PCA, perform mean normalization and feature scaling so that the features have zero mean and comparable ranges.

## algorithm

### data preprocessing

Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$

Preprocessing (feature scaling/mean normalization):

Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$, where $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$.

If features are on different scales, also scale them to have a comparable range of values (e.g. divide by the standard deviation $s_j$).

Compute the “covariance matrix”:

\[\Sigma = \frac{1}{m} \sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T \]

This matrix is symmetric positive semidefinite.

Compute the “eigenvectors” of matrix $\Sigma$:

```
% X is an m-by-n matrix (one preprocessed training example per row)
Sigma = (1/m) * X' * X;   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);   % columns of U are the eigenvectors of Sigma
% [U, D] = eig(Sigma) would give the same eigenvectors, since Sigma is
% symmetric positive semidefinite, but svd is numerically more stable
Ureduce = U(:, 1:k);      % keep the first k columns of U
z = Ureduce' * x;         % k-dimensional representation of example x
```

Take the first $k$ columns of the $U$ matrix to form $U_{reduce} \in \mathbb{R}^{n \times k}$:

\[z=U_{reduce}^T x\]

$z$ will be a $k$-dimensional vector ($z \in \mathbb{R}^k$).
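The whole training set can also be projected at once; a one-line sketch, assuming the preprocessed design matrix `X` and `Ureduce` from above:

```
Z = X * Ureduce;   % m-by-k matrix whose i-th row is (Ureduce' * x^(i))'
```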

# applying PCA

## reconstruction from compressed representation

The approximate value on the projection surface:

\[x_{approx}^{(i)} = U_{reduce} z^{(i)}\]

$x_{approx}^{(i)} \approx x^{(i)}$ whenever the projection error is small.
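In matrix form, a sketch using the `Z` computed above:

```
X_approx = Z * Ureduce';   % m-by-n; row i approximately reconstructs x^(i)
```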

## choosing the number of principal components

Choosing $k$ = number of principal components.

Average squared projection error: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2$

Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2$

Choose $k$ to be the smallest value such that

\[\frac{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2}{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2} \le 0.01\]

That is, 99% of the variance is retained.

Algorithm:

Try PCA with $k = 1, 2, 3, \dots$

Compute $U_{reduce}$, $z^{(1)}, \dots, z^{(m)}$, $x_{approx}^{(1)}, \dots, x_{approx}^{(m)}$.

Check whether the ratio is less than or equal to 0.01.

Rerunning PCA for every candidate $k$ is inefficient; instead, call svd once and read the answer off the singular values in $S$:

```
[U, S, V] = svd(Sigma);   % S is diagonal, with S(1,1) >= S(2,2) >= ... >= S(n,n)
```

For a given $k$, the ratio can be computed as:

\[1-\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01\]
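A sketch of picking the smallest such $k$ from the diagonal of `S`:

```
s = diag(S);                      % singular values of Sigma
retained = cumsum(s) / sum(s);    % fraction of variance retained for k = 1..n
k = find(retained >= 0.99, 1);    % smallest k retaining 99% of the variance
```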

## advice for applying PCA

### supervised learning speedup

Extract inputs: from the labeled data set $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$, take the unlabeled data set $x^{(1)}, \dots, x^{(m)}$ and run PCA on it to get $z^{(1)}, \dots, z^{(m)}$.

New training set: $(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$.

The mapping $x^{(i)} \mapsto z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the cross-validation and test sets, as in the sketch below.
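A sketch of that workflow (the variable names `Xtrain`, `Xtest`, and `k` are assumptions for illustration):

```
mu = mean(Xtrain, 1);                    % per-feature means from training data only
Xc = Xtrain - mu;                        % mean-normalize (broadcasting)
Sigma = (1/size(Xc, 1)) * (Xc' * Xc);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = Xc * Ureduce;                   % inputs for the new training set
Ztest = (Xtest - mu) * Ureduce;          % apply the SAME mu and Ureduce to test data
```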

Reducing the number of features with PCA is not a good way to prevent overfitting: PCA throws away information without looking at the labels $y$. Use regularization instead.

Before implementing PCA, first try running the learning algorithm on the original data; only bring in PCA if that does not do what you want (e.g. it is too slow or uses too much memory or disk space).