Clustering
Unsupervised Learning Introduction
K-Means Algorithm
Input:
- K (number of clusters)
- Training set {x^(1), x^(2), …, x^(m)}
(drop the x_0 = 1 convention)
Algorithm
Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n
Repeat {
for i = 1 to m
c^(i) := index (from 1 to K) of cluster centroid closest to x^(i), i.e. the k that minimizes ||x^(i) - μ_k||^2
for k = 1 to K
μ_k := average (mean) of points assigned to cluster k
}
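A minimal Octave sketch of the two alternating steps above; X (an m-by-n data matrix) and centroids (a K-by-n matrix of current centroids) are assumed placeholder names, and implicit broadcasting is assumed:
% one K-means iteration: assign points, then move centroids
m = size(X, 1);
K = size(centroids, 1);
c = zeros(m, 1);
for i = 1:m
  dists = sum((centroids - X(i, :)).^2, 2);  % squared distance to every centroid
  [~, c(i)] = min(dists);                    % c(i) := index of the closest centroid
end
for k = 1:K
  centroids(k, :) = mean(X(c == k, :), 1);   % move centroid k to the mean of its points
end
(A real implementation would also handle clusters that end up with no assigned points.)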
optimization objective
c^(i) = index of cluster (1, 2, …, K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ ℝ^n)
μ_c^(i) = cluster centroid of the cluster to which example x^(i) has been assigned
Optimization objective:
Minimize:
\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)=\frac{1}{m} \sum_{i=1}^{m} \vert \vert x^{(i)}-\mu_{c^{(i)}} \vert \vert^2\]
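Using the assignments c and the centroids matrix from the sketch above (same assumed variable names), the distortion J can be computed directly from this definition:
% distortion: mean squared distance between each point and its assigned centroid
J = mean(sum((X - centroids(c, :)).^2, 2));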
random initialization
Should have K < m
Randomly pick K training examples
Set μ_1, …, μ_K equal to these K examples
For i = 1 to 100 (usually 50 - 1000 random initializations) {
Randomly initialize K-means.
Run K-means. Get c^(1), …, c^(m), μ_1, …, μ_K.
Compute cost function (distortion) J(c^(1), …, c^(m), μ_1, …, μ_K)
}
Pick the clustering that gave the lowest cost J
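A sketch of this loop, assuming a hypothetical helper run_kmeans(X, K) that performs one full K-means run (random initialization included) and returns the assignments, centroids, and final distortion:
% run K-means many times and keep the solution with the lowest distortion
best_J = Inf;
for t = 1:100                               % 50 - 1000 runs is typical
  [c, centroids, J] = run_kmeans(X, K);     % hypothetical helper: one full run
  if J < best_J
    best_J = J;
    best_c = c;
    best_centroids = centroids;
  end
end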
choose the number of clusters
Chosen by hand (human insight) based on the purpose.
Can use the elbow method: plot the distortion J against the number of clusters K and pick the K at the "elbow", where the decrease levels off (see the sketch below); in practice the elbow is often ambiguous.
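A sketch of the elbow plot, reusing the hypothetical run_kmeans helper from the previous section:
% plot distortion J as a function of the number of clusters K
Ks = 1:10;
Js = zeros(size(Ks));
for K = Ks
  [~, ~, Js(K)] = run_kmeans(X, K);   % hypothetical helper, as above
end
plot(Ks, Js, '-o');
xlabel('K (number of clusters)');
ylabel('distortion J');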
Motivation
data compression
Reduce data from 2D to 1D: project onto a line
Reduce data from 3D to 2D: project onto a plane
visualization
Principal Component Analysis
problem formulation
The most popular method for dimensionality reduction.
Try to find a surface onto which to project the data so as to minimize the projection error (the orthogonal distance between each point and the surface).
Reduce from n dimensions to k dimensions: find k vectors u^(1), …, u^(k) onto which to project the data, so as to minimize the projection error.
Different from linear regression: linear regression minimizes the vertical distance between points and the fitted line (the error in predicting y), whereas PCA minimizes the orthogonal projection distance and treats all features symmetrically; there is no variable y to predict.
Before applying PCA:
perform mean normalization and feature scaling so that features have zero mean and comparable ranges.
algorithm
data preprocessing
Training set: x^(1), x^(2), …, x^(m)
Preprocessing (feature scaling/mean normalization):
Replace each x_j^(i) with x_j^(i) - μ_j, where μ_j = (1/m) Σ_i x_j^(i) is the mean of feature j
Scale features to have a comparable range of values (e.g. divide each feature by its standard deviation)
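A sketch of this preprocessing step, assuming X is the m-by-n design matrix with one example per row and that implicit broadcasting is available:
% mean normalization and feature scaling
mu = mean(X);               % 1-by-n vector of feature means
s = std(X);                 % 1-by-n vector of feature standard deviations
X_norm = (X - mu) ./ s;     % zero mean, comparable ranges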
Compute “covariance matrix”:
\[\Sigma = \frac{1}{m} \sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T \]
This matrix is symmetric positive semidefinite.
Compute “eigenvectors” of matrix Σ:
% X is the m-by-n design matrix (one training example per row)
Sigma = (1/m) * X' * X;
[U, S, V] = svd(Sigma);
% alternatively, because Sigma is symmetric positive semidefinite,
% eig gives the same eigenvectors (up to ordering and sign):
[U_eig, D] = eig(Sigma);
% take the first k columns of U and project x onto them
Ureduce = U(:, 1:k);
z = Ureduce' * x;
Take the first k columns of the U matrix to form U_reduced (an n × k matrix):
\[z=U_{reduced}^T x\]
z will be a k-dimensional vector (z ∈ ℝ^k)
applying PCA
reconstruction from compressed representation
The approximate value on the projection surface:
\[x_{approx}^{(i)} = U_{reduced} z^{(i)}\]
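In Octave the reconstruction is the mirror image of the projection code above, with Ureduce and z as defined there:
% map a compressed representation back to the original space
x_approx = Ureduce * z;   % n-by-1 approximation of the original x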
choosing the number of principal components
Choosing k
Average squared projection error: (1/m) Σ_i ||x^(i) - x_approx^(i)||^2
Total variation in the data: (1/m) Σ_i ||x^(i)||^2
Choose k to be the smallest value such that
\[\frac{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} - x_{approx}^{(i)} \vert \vert^2}{\frac{1}{m}\sum_{i=1}^{m} \vert \vert x^{(i)} \vert \vert^2} \le 0.01\]
99% of variance is retained.
Algorithm:
Try PCA with k = 1, 2, 3, …
Compute U_reduced, z^(1), …, z^(m), x_approx^(1), …, x_approx^(m)
Check whether the ratio is less than or equal to 0.01
More efficiently, call [U,S,V] = svd(Sigma) just once.
For a given k, the ratio can then be computed from the diagonal entries of S:
\[1-\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01\]
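A sketch of this check using the diagonal of S returned by svd (names as in the code above):
% fraction of variance retained for each candidate k
sv = diag(S);                       % singular values of Sigma
retained = cumsum(sv) / sum(sv);    % variance retained for k = 1, 2, ..., n
k = find(retained >= 0.99, 1);      % smallest k keeping at least 99% of the variance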
advice for applying PCA
supervised learning speedup
Extract inputs from the labeled training set (x^(1), y^(1)), …, (x^(m), y^(m)):
Unlabeled data set x^(1), …, x^(m); run PCA to get z^(1), …, z^(m)
New training set: (z^(1), y^(1)), …, (z^(m), y^(m))
The mapping x^(i) → z^(i) should be defined by running PCA only on the training set. The same mapping can then be applied to the cross-validation and test sets.
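A sketch of that pipeline: PCA is fit on the training inputs only and the resulting Ureduce is reused for the other sets. Xtrain, Xcv, and Xtest are assumed names for already mean-normalized and scaled design matrices (one example per row):
% fit PCA on the training inputs only, then reuse the same mapping
Sigma = (1 / size(Xtrain, 1)) * (Xtrain' * Xtrain);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = Xtrain * Ureduce;   % new training inputs, one z^(i) per row
Zcv    = Xcv    * Ureduce;   % same mapping applied to the cross-validation set
Ztest  = Xtest  * Ureduce;   % ... and to the test set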
It is not a good idea to use PCA to reduce the number of features in order to prevent overfitting; use regularization instead.
Before using PCA, first try running the system on the original (raw) data; only switch to PCA if that does not do what you want.