Large Margin Classification
optimization objective
Modify each logistic-regression cost curve into two straight line segments, one of which is horizontal at 0; these give the SVM costs \(cost_1(z)\) (used when \(y=1\)) and \(cost_0(z)\) (used when \(y=0\)).
Support vector machine:
cost function:
\[min_{\theta}: C \sum_{i=1}^{m} \left[ y^{(i)}cost_1\left(\theta^T x^{(i)} \right) + \left(1-y^{(i)} \right)cost_0 \left(\theta^T x^{(i)} \right)\right] + \frac{1}{2}\sum_{j=1}^n \theta_j^2 \]
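To make the two piecewise-linear costs concrete, here is a minimal numpy sketch of the objective above. The unit-slope hinge form of \(cost_1\) and \(cost_0\) is an assumption (the notes only say each curve is two straight segments, one flat at 0); the exact slopes do not affect the argument.

```python
import numpy as np

def cost1(z):
    # cost_1(z): 0 once z >= 1, then grows linearly (segment used when y = 1)
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # cost_0(z): 0 once z <= -1, then grows linearly (segment used when y = 0)
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """SVM objective from the notes.
    X: (m, n+1) with a leading column of ones, y: (m,) in {0, 1}, theta: (n+1,)."""
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # theta_0 is not regularized
    return data_term + reg_term
```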
large margin intuition
If \(y=1\), we want \(\theta^T x \ge 1\) (not just \(\ge 0\)); whenever \(\theta^T x \ge 1\), \(cost_1(\theta^T x) = 0\).
If \(y=0\), we want \(\theta^T x \le -1\) (not just \(< 0\)); whenever \(\theta^T x \le -1\), \(cost_0(\theta^T x) = 0\).
The distance between the decision boundary and the nearest training examples is the margin of the support vector machine; this is why the SVM is also called a large margin classifier.
When C is set very large, the algorithm picks the decision boundary that separates the positive and negative examples with the largest possible margin.
Using a C that is not too large lets the algorithm ignore a few outliers.
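A quick way to see the effect of C is to fit a linear SVM twice on data containing one outlier. This sketch assumes scikit-learn is available (any SVM package would do); `LinearSVC`'s C plays the same role as the C above.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([2.0, 2.0], 0.3, size=(20, 2)),   # positive cluster
    rng.normal([0.0, 0.0], 0.3, size=(20, 2)),   # negative cluster
    [[0.5, 0.8]],                                # a single positive outlier near the negatives
])
y = np.array([1] * 20 + [0] * 20 + [1])

for C in (100.0, 1.0):
    clf = LinearSVC(C=C).fit(X, y)
    # Very large C: the boundary shifts to accommodate the outlier.
    # Moderate C: the outlier is largely ignored and the margin stays wide.
    print(f"C={C}: w={clf.coef_.ravel()}, b={clf.intercept_[0]:.3f}")
```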
mathematics behind large margin classification
norm: \(\|u\| = \sqrt{u_1^2 + u_2^2}\) (length of the vector \(u\)); inner product \(u^T v = p \cdot \|u\|\), where \(p\) is the signed projection of \(v\) onto \(u\)
SVM decision boundary
The vector \(\theta\) is perpendicular to the decision boundary \(\theta^T x = 0\).
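Written out, the projection argument for why minimizing \(\|\theta\|\) yields a large margin:
\[\theta^T x^{(i)} = p^{(i)} \cdot \|\theta\|, \quad \text{where } p^{(i)} \text{ is the signed projection of } x^{(i)} \text{ onto } \theta\]
\[\text{constraints: } p^{(i)} \|\theta\| \ge 1 \text{ if } y^{(i)}=1, \qquad p^{(i)} \|\theta\| \le -1 \text{ if } y^{(i)}=0\]
\[\text{objective: } \min_{\theta} \tfrac{1}{2}\sum_j \theta_j^2 = \tfrac{1}{2}\|\theta\|^2 \;\Rightarrow\; \|\theta\| \text{ small} \;\Rightarrow\; |p^{(i)}| \text{ must be large, i.e. a large margin}\]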
Kernels
Given \(x\), compute new features \(f_1, f_2, f_3\) depending on proximity to landmarks \(l^{(1)}, l^{(2)}, l^{(3)}\). Landmarks \(l^{(j)}\) are chosen points in the same space as \(x\).
Hypothesis: predict “1” when \(\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0\)
feature \(f_1\):
\[f_1 = \text{similarity}(x, l^{(1)}) = \exp\left( - \frac{\|x-l^{(1)}\|^2}{2\sigma^2}\right)\]
The similarity function is called a kernel (the formula above is the Gaussian kernel).
If \(x \approx l^{(1)}\), then \(f_1 \approx \exp(0) = 1\).
If \(x\) is far from \(l^{(1)}\), then \(f_1 \approx 0\).
As \(\sigma^2\) decreases, the similarity function becomes sharper: it falls to 0 more quickly as \(x\) moves away from \(l^{(1)}\).
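A small Python sketch of the Gaussian similarity above, just to confirm the near/far behaviour (numpy assumed):

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    """f = exp(-||x - l||^2 / (2 * sigma^2))"""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

print(gaussian_similarity([1.0, 2.0], [1.0, 2.0], sigma=1.0))  # x == l^(1)       -> 1.0
print(gaussian_similarity([1.0, 2.0], [9.0, 9.0], sigma=1.0))  # x far from l^(1) -> ~0.0
print(gaussian_similarity([1.0, 2.0], [2.0, 3.0], sigma=0.3))  # small sigma^2    -> falls off much faster
```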
choosing the landmarks
Where do we get the landmarks \(l^{(1)}, \dots, l^{(m)}\)?
Use the training examples as landmarks: \(l^{(i)} = x^{(i)}\). So we will have m landmarks.
For training example \(x^{(i)}\), compute the m features:
\[f_j^{(i)}=sim(x^{(i)},l^{(j)})\]
\[x^{(i)} \in \mathbb{R}^{n+1}\]
\[f^{(i)}=[f_0^{(i)},f_1^{(i)},f_2^{(i)},\dots,f_m^{(i)}]^T\]
where \(f_0^{(i)} = 1\) and \(f_i^{(i)} = sim(x^{(i)}, l^{(i)}) = 1\).
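A sketch of this feature construction with the landmarks set to the training examples themselves (numpy assumed):

```python
import numpy as np

def feature_map(X, landmarks, sigma):
    """Map each x^(i) to f^(i) = [1, sim(x^(i), l^(1)), ..., sim(x^(i), l^(m))]^T.
    X: (m, n) without the bias column; landmarks: (m, n)."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)  # (m, m)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))                             # f_j^(i)
    return np.hstack([np.ones((X.shape[0], 1)), F])                       # prepend f_0^(i) = 1

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0]])
F = feature_map(X, landmarks=X, sigma=1.0)   # landmarks = training examples
print(F.shape)                               # (3, 4): m examples, m+1 features; f_i^(i) = 1
```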
SVM with kernels
Hypothesis: given \(x\), compute features \(f \in \mathbb{R}^{m+1}\)
Predict “\(y=1\)” if \(\theta^T f \ge 0\), where \(\theta \in \mathbb{R}^{m+1}\)
Training:
\[min_{\theta}: C\sum_{i=1}^{m} \left[ y^{(i)} cost_1(\theta^T f^{(i)})+(1-y^{(i)}) cost_0(\theta^T f^{(i)}) \right]+ \frac{1}{2}\sum_{j=1}^{m} \theta_j^2\]
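The same objective in code, reusing `cost1`, `cost0`, and `feature_map` from the sketches above (so \(f^{(i)}\) simply replaces \(x^{(i)}\)); real packages replace the plain \(\sum_j \theta_j^2\) with a kernel-dependent rescaling for efficiency, which is omitted here:

```python
def svm_kernel_objective(theta, F, y, C):
    """Kernelized objective: F is the (m, m+1) matrix of f^(i) rows from feature_map,
    theta has m+1 entries, y is in {0, 1}."""
    z = F @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # packages actually use a rescaled theta^T M theta here
    return data_term + reg_term
```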
SVM parameters
\(C\) (plays a role similar to \(1/\lambda\) in regularized logistic regression):
Large \(C\): lower bias, higher variance
Small \(C\): higher bias, lower variance
\(\sigma^2\):
Large \(\sigma^2\): features \(f_i\) vary more smoothly; higher bias, lower variance
Small \(\sigma^2\): features \(f_i\) vary less smoothly; lower bias, higher variance
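If the package in use is scikit-learn (an assumption; the notes name liblinear/libsvm below), its RBF kernel is written \(\exp(-\gamma \|x - l\|^2)\), so the \(\sigma^2\) above corresponds to \(\gamma = 1/(2\sigma^2)\): large \(\sigma^2\) (higher bias) means small gamma.

```python
from sklearn.svm import SVC

sigma = 0.5
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))  # gamma = 1 / (2 sigma^2)
```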
SVMs in practice
Use an SVM software package (e.g. liblinear, libsvm, …) to solve for the parameters \(\theta\).
Need to specify:
- choice of parameter \(C\)
- choice of kernel (similarity function):
- e.g. no kernel (“linear kernel”): predict “y=1” if \(\theta^T x \ge 0\) (for large n, small m)
- Gaussian kernel (for small n, large m): need to choose \(\sigma^2\), and perform feature scaling beforehand (see the sketch below)
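A sketch of this workflow using scikit-learn's wrappers around liblinear/libsvm (the class and parameter names here are sklearn's, not from the notes):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import cross_val_score

# No kernel ("linear kernel"): large n, small m
linear_clf = LinearSVC(C=1.0)

# Gaussian (RBF) kernel: small n, large m; scale features first
rbf_clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma="scale"))

# Choose C (and gamma / sigma^2) by cross-validation, e.g.:
# scores = cross_val_score(rbf_clf, X_train, y_train, cv=5)
```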
Not all similarity functions make valid kernels. A kernel needs to satisfy “Mercer's Theorem” so that SVM packages' optimizations run correctly and do not diverge.
Many off-the-shelf kernels available:
- polynomial kernel: e.g. \((x^T l)^2\), \((x^T l)^3\), \((x^T l + 1)^3\); generally \((x^T l + \text{constant})^{\text{degree}}\)
- More esoteric: String kernel, chi-square kernel, histogram intersection kernel,…
multi-class classification
Many SVM packages have built-in multi-class classification
Otherwise, use the one-vs.-all method: train K SVMs, one to distinguish \(y=i\) from the rest for \(i = 1, \dots, K\), getting \(\theta^{(1)}, \dots, \theta^{(K)}\). Pick the class \(i\) with the largest \((\theta^{(i)})^T x\).
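A sketch of the one-vs.-all recipe (scikit-learn assumed; its SVC already handles multi-class internally, so this only mirrors the notes; classes are assumed to be labeled 0..K-1):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, K, C=1.0):
    """Train K binary SVMs; the i-th one distinguishes class i from the rest."""
    return [SVC(C=C, kernel="rbf", gamma="scale").fit(X, (y == i).astype(int)) for i in range(K)]

def predict_one_vs_all(models, X):
    # Pick the class whose classifier returns the largest decision value ((theta^(i))^T x analogue)
    scores = np.column_stack([m.decision_function(X) for m in models])
    return scores.argmax(axis=1)
```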
Logistic regression vs. SVMs
- If n is large relative to m (e.g. n = 10,000, m = 10–1,000): use logistic regression or SVM without a kernel
- If n is small and m is intermediate (e.g. n = 1–1,000, m = 10–10,000): use SVM with a Gaussian kernel
- If n is small and m is large (e.g. n = 1–1,000, m > 50,000): create/add more features, then use logistic regression or SVM without a kernel
A neural network is likely to work well in most of these settings, but may be slower to train.