2017-01-20

# Large Margin Classification

## optimization objective

modify the cost function curves of logistic regression into two straight line segments each: one sloped segment and one horizontal segment at 0

Support vector machine:

cost function:
$min_{\theta}: C \sum_{i=1}^{m} \left[ y^{(i)}cost_1\left(\theta^T x^{(i)} \right) + \left(1-y^{(i)} \right)cost_0 \left(\theta^T x^{(i)} \right)\right] + \frac{1}{2}\sum_{j=1}^n \theta_j^2$
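The piecewise-linear costs and the objective above can be sketched in NumPy; the function names `cost_1`, `cost_0`, and `svm_cost` are my own for illustration:

```python
import numpy as np

def cost_1(z):
    # Surrogate cost for y = 1: zero once z >= 1,
    # a straight line with slope -1 below that.
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    # Surrogate cost for y = 0: zero once z <= -1,
    # a straight line with slope +1 above that.
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    # SVM objective from the notes; X has a leading column of ones
    # and theta[0] (the intercept) is not regularized.
    z = X @ theta
    data_term = C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)
    return data_term + reg_term
```

For example, with `theta = [0, 1]` and examples `([1, 2], y=1)` and `([1, -2], y=0)`, both data terms are zero and only the regularization term `0.5` remains.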

## large margin intuition

If $y=1$, we want $\theta^Tx \ge 1$ (not just $\ge 0$). Whenever $\theta^Tx \ge 1$, the $cost_1$ function returns 0.
If $y=0$, we want $\theta^Tx \le -1$ (not just $< 0$). Whenever $\theta^Tx \le -1$, the $cost_0$ function returns 0.

The resulting decision boundary has a large margin, which is why the SVM is also called a large margin classifier.

When $C$ is set very large, the algorithm separates the positive and negative examples with the largest possible margin.
Using a not-too-large $C$ lets the boundary ignore a few outliers.

## mathematics behind large margin classification

norm: $\vert \vert u \vert \vert = \sqrt{u^Tu}$

### SVM decision boundary

vector $\theta$ is perpendicular to the decision boundary
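A tiny numeric check of this fact (values chosen here for illustration): for a boundary $\theta^T x = 0$ with $\theta_0 = 0$, any direction $v$ along the boundary satisfies $\theta \cdot v = 0$:

```python
import numpy as np

# Boundary: theta_1*x_1 + theta_2*x_2 = 0, a line through the origin.
theta = np.array([2.0, 1.0])
v = np.array([1.0, -2.0])   # a direction lying along the boundary
print(np.dot(theta, v))     # 0.0: theta is perpendicular to the boundary
```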

# Kernels

Given $x$, compute new features depending on proximity to landmarks $l^{(1)},l^{(2)},l^{(3)}$. Landmarks are points in the input space (chosen below from the training examples).

Hypothesis: predict “1” when $\theta_0 +\theta_1 f_1 +...+\theta_n f_n \ge 0$

features $f_1,f_2,f_3$:
$f_1 = \text{similarity}(x,l^{(1)}) = \exp\left( - \frac{\|x-l^{(1)}\|^2}{2\sigma^2}\right)$

The similarity function is called a kernel (the formula above is the Gaussian kernel).

If $x \approx l^{(1)}$, $f_1 \approx 1$
If $x$ is far from $l^{(1)}$, $f_1 \approx 0$
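The Gaussian kernel above is a one-liner in NumPy (the function name `gaussian_kernel` is my own):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

It returns exactly 1 when `x` equals the landmark and decays toward 0 as `x` moves away; a larger `sigma` makes the decay slower.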

As $\sigma^2$ increases, the feature $f_1$ falls off more slowly as $x$ moves away from $l^{(1)}$ (the similarity function becomes smoother); as $\sigma^2$ decreases, it falls off more sharply.

## choosing the landmarks

Where to get landmarks $l^{(1)},l^{(2)},l^{(3)},...$?

Use the training examples as landmarks, so we have $m$ landmarks.
For each training example $(x^{(i)},y^{(i)})$, compute the $m$ features:
$f_j^{(i)}=sim(x^{(i)},l^{(j)})$

$x^{(i)} \in \mathbb{R}^{n+1}$
$f^{(i)}=[f_0^{(i)},f_1^{(i)},f_2^{(i)},\dots,f_m^{(i)}]^T \in \mathbb{R}^{m+1}$
where $f_0^{(i)}=1$
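The landmark-to-feature mapping can be sketched as follows (the helper name `kernel_features` is my own):

```python
import numpy as np

def kernel_features(x, landmarks, sigma=1.0):
    # Map x to f = [1, f_1, ..., f_m] using the Gaussian kernel
    # against each landmark (the landmarks are the training inputs).
    x = np.asarray(x, dtype=float)
    L = np.asarray(landmarks, dtype=float)
    sq_dists = np.sum((L - x) ** 2, axis=1)   # ||x - l^{(j)}||^2 for each j
    f = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.concatenate(([1.0], f))          # prepend f_0 = 1
```

With $m$ landmarks the result has $m+1$ entries, and $f_j = 1$ exactly when $x$ coincides with landmark $l^{(j)}$.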

## SVM with kernels

Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$
Predict “$y=1$” if $\theta^T f \ge 0$ where $\theta \in \mathbb{R}^{m+1}$
Training:
$min_{\theta}: C\sum_{i=1}^{m} \left[ y^{(i)} cost_1(\theta^T f^{(i)})+(1-y^{(i)}) cost_0(\theta^T f^{(i)}) \right]+ \frac{1}{2}\sum_{j=1}^{m} \theta_j^2$

## SVM parameters

$C = 1/\lambda$:

Large $C$: lower bias, higher variance
Small $C$: higher bias, lower variance

$\sigma^2$:

Large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance
Small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance

# SVMs in practice

Use SVM software package (e.g. liblinear, libsvm, …) to solve for parameters $\theta$.

Need to specify:

• choice of parameter $C$
• choice of kernel (similarity function):
• e.g. no kernel (“linear kernel”): predict “$y=1$” if $\theta^Tx \ge 0$ (for large $n$, small $m$)
• Gaussian kernel (for small $n$, large $m$): need to choose $\sigma^2$, and need to perform feature scaling beforehand.
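A minimal sketch of this workflow, assuming scikit-learn is available; its RBF kernel matches the Gaussian kernel with `gamma = 1 / (2 * sigma**2)`, and the toy data here is my own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A tiny XOR-like dataset: not linearly separable.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 0, 1, 1])

# Feature scaling before using the Gaussian kernel, as the notes advise.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# SVC's 'rbf' kernel is the Gaussian kernel; translate sigma into gamma.
sigma = 0.5
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X_scaled, y)
```

The linear-kernel case would instead use `SVC(kernel="linear")` (or liblinear directly), with no $\sigma^2$ to tune.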

Not all similarity functions make valid kernels. Need to satisfy “Mercer’s Theorem” to make sure SVM packages’ optimizations run correctly and do not diverge.

Many off-the-shelf kernels available:

• polynomial kernel: e.g. $k(x,l)=(x^T l+a)^b$
• More esoteric: String kernel, chi-square kernel, histogram intersection kernel,…

## multi-class classification

Many SVM packages have built-in multi-class classification.
Otherwise, use the one-vs-all method: train $K$ SVMs, one to distinguish $y=i$ from the rest for $i=1,2,...,K$, obtaining $\theta^{(1)},\theta^{(2)},...,\theta^{(K)}$; then pick the class $i$ with the largest $(\theta^{(i)})^Tx$.
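The one-vs-all prediction step can be sketched as follows (the function name `one_vs_all_predict` and the example values are my own):

```python
import numpy as np

def one_vs_all_predict(Theta, X):
    # Theta: (K, n+1) matrix whose rows are theta^{(1)}, ..., theta^{(K)}
    # from K one-vs-rest SVMs; X: (m, n+1) with a leading column of ones.
    # Pick, for each example, the class whose score (theta^{(i)})^T x is largest.
    scores = X @ Theta.T                    # (m, K) matrix of scores
    return np.argmax(scores, axis=1) + 1    # classes numbered 1..K as in the text
```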

## Logistic regression vs. SVMs

1. If n is large relative to m (n=10000, m =10-1000), use logistic regression or SVM without a kernel
2. If n is small, m is intermediate (n=1-1000, m=10-10000), use SVM with Gaussian kernel
3. If n is small, m is large (n=1-1000, m >50000), create/add more features, then use logistic regression or SVM without a kernel

A neural network is likely to work well in most of these settings, but may be slower to train.