
Stanford ML 7 Support Vector Machine

2017-01-20

Large Margin Classification

optimization objective

Modify the logistic regression cost curve into two straight line segments, one of which is horizontal at 0: \(cost_1\) is flat at 0 for \(\theta^T x \ge 1\), and \(cost_0\) is flat at 0 for \(\theta^T x \le -1\).

Support vector machine:

cost function:
\[\min_{\theta}: C \sum_{i=1}^{m} \left[ y^{(i)}cost_1\left(\theta^T x^{(i)} \right) + \left(1-y^{(i)} \right)cost_0 \left(\theta^T x^{(i)} \right)\right] + \frac{1}{2}\sum_{j=1}^n \theta_j^2 \]
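
A minimal sketch in Python of the two piecewise-linear costs described above (function names are mine; the exact slope of the non-flat segment is a convention, only the hinge points at \(\pm 1\) matter):

```python
import numpy as np

def cost1(z):
    """Cost used when y = 1: zero for z >= 1, a straight line rising as z falls below 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost used when y = 0: zero for z <= -1, a straight line rising as z grows above -1."""
    return np.maximum(0.0, 1.0 + z)

print(cost1(2.0), cost1(0.0))   # 0.0 1.0
print(cost0(-2.0), cost0(0.0))  # 0.0 1.0
```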

large margin intuition

If \(y=1\), we want \(\theta^T x \ge 1\) (not just \(\ge 0\)). So if \(\theta^T x \ge 1\), then \(cost_1(\theta^T x) = 0\).
If \(y=0\), we want \(\theta^T x \le -1\) (not just \(\le 0\)).

The gap between the decision boundary and the nearest training examples is the margin of the support vector machine; this is why it is also called a large margin classifier.

When C is set very large, the algorithm chooses the boundary with the largest margin separating the positive and negative examples.
Using a not-too-large C lets the classifier ignore some outliers.
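
One way to make the large-C claim concrete (a standard restatement, not verbatim from the lecture): when C is very large, any violated constraint is heavily penalized, so the optimizer drives the bracketed cost term to zero and the problem effectively becomes

\[\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.}\quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)}=1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)}=0.\]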

mathematics behind large margin classification

norm: \(\lVert u \rVert = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2}\); the inner product satisfies \(u^T v = p \cdot \lVert u \rVert\), where \(p\) is the signed projection of \(v\) onto \(u\).

SVM decision boundary

The parameter vector \(\theta\) is perpendicular to the decision boundary.
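
A short reconstruction of the projection argument (the standard derivation, added for completeness): writing \(p^{(i)}\) for the signed projection of \(x^{(i)}\) onto \(\theta\),

\[\theta^T x^{(i)} = p^{(i)}\,\lVert\theta\rVert,\]

so the constraints \(\theta^T x^{(i)} \ge 1\) and \(\theta^T x^{(i)} \le -1\) become \(p^{(i)}\lVert\theta\rVert \ge 1\) and \(p^{(i)}\lVert\theta\rVert \le -1\). Since the objective minimizes \(\lVert\theta\rVert\), the constraints can only be satisfied by making the projections \(|p^{(i)}|\) large, i.e. by placing the boundary with a large margin.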

Kernels

Given \(x\), compute new features \(f_1, f_2, f_3\) depending on proximity to landmarks \(l^{(1)}, l^{(2)}, l^{(3)}\). Landmarks are chosen points in the space of \(x\).

Hypothesis: predict “1” when \(\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0\)

feature \(f_1\):
\[f_1 = \text{similarity}(x,l^{(1)}) = \exp\left( - \frac{\lVert x-l^{(1)}\rVert^2}{2\sigma^2}\right)\]

The similarity function is called a kernel (the formula above is the Gaussian kernel).

If \(x \approx l^{(1)}\), then \(f_1 \approx 1\).
If \(x\) is far from \(l^{(1)}\), then \(f_1 \approx 0\).

As \(\sigma^2\) decreases, the similarity function becomes sharper (more narrowly peaked around the landmark).
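
A minimal sketch of the Gaussian similarity in Python (the function name and example numbers are mine; only the formula comes from above):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 * sigma^2)): close to 1 near the landmark, close to 0 far away."""
    diff = np.asarray(x, dtype=float) - np.asarray(landmark, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

l1 = np.array([3.0, 5.0])
print(gaussian_similarity([3.0, 5.0], l1, sigma=1.0))  # 1.0 (x sits on the landmark)
print(gaussian_similarity([9.0, 0.0], l1, sigma=1.0))  # ~0  (x is far from the landmark)
print(gaussian_similarity([4.0, 5.0], l1, sigma=0.3))  # smaller sigma -> similarity falls off faster
```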

choosing the landmarks

Where do we get the landmarks \(l^{(j)}\)?

Use the training examples themselves as landmarks: \(l^{(i)} = x^{(i)}\). So we will have \(m\) landmarks.
For training example \(x^{(i)}\), we compute \(m\) features:
\[f_j^{(i)}=sim(x^{(i)},l^{(j)})\]

Instead of representing an example by \(x^{(i)} \in \mathbb{R}^{n+1}\), represent it by the feature vector
\[f^{(i)}=[f_0^{(i)},f_1^{(i)},f_2^{(i)},\dots,f_m^{(i)}]^T \in \mathbb{R}^{m+1},\]
where \(f_0^{(i)} = 1\).
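
A sketch of the landmark-feature construction under the choices above (function and variable names are mine): every example is mapped to the \((m+1)\)-vector of Gaussian similarities to all \(m\) training examples, with a leading \(f_0 = 1\).

```python
import numpy as np

def kernel_features(X, landmarks, sigma):
    """Map each row x of X (shape (m_x, n)) to [1, sim(x, l^(1)), ..., sim(x, l^(m))]."""
    # Squared Euclidean distance between every example and every landmark, shape (m_x, m).
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])  # prepend f_0 = 1

# Landmarks are the training examples themselves, so F has shape (m, m + 1).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
F = kernel_features(X_train, landmarks=X_train, sigma=0.5)
print(F.shape)  # (3, 4)
```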

SVM with kernels

Hypothesis: given \(x\), compute features \(f \in \mathbb{R}^{m+1}\).
Predict “1” if \(\theta^T f \ge 0\), where \(\theta \in \mathbb{R}^{m+1}\).
Training:
\[min_{\theta}: C\sum_{i=1}^{m} \left[ y^{(i)} cost_1(\theta^T f^{(i)})+(1-y^{(i)}) cost_0(\theta^T f^{(i)}) \right]+ \frac{1}{2}\sum_{j=1}^{m} \theta_j^2\]
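
Putting the pieces together, a self-contained sketch of this training objective in Python (the hinge-shaped costs from the earlier snippet are inlined; the regularization deliberately skips \(\theta_0\)):

```python
import numpy as np

def svm_kernel_objective(theta, F, y, C):
    """C * sum_i [y_i*cost1(theta^T f_i) + (1-y_i)*cost0(theta^T f_i)] + (1/2) * sum_{j>=1} theta_j^2.
    F: (m, m+1) kernel-feature matrix with f_0 = 1; theta: (m+1,); y: (m,) labels in {0, 1}."""
    z = F @ theta
    hinge = y * np.maximum(0.0, 1.0 - z) + (1.0 - y) * np.maximum(0.0, 1.0 + z)
    return C * np.sum(hinge) + 0.5 * np.sum(theta[1:] ** 2)
```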

SVM parameters

\(C\) (which plays the role of \(1/\lambda\)):

Large \(C\): lower bias, higher variance
Small \(C\): higher bias, lower variance

\(\sigma^2\):

Large \(\sigma^2\): features \(f_i\) vary more smoothly; higher bias, lower variance
Small \(\sigma^2\): features \(f_i\) vary less smoothly; lower bias, higher variance

SVMs in practice

Use an SVM software package (e.g. liblinear, libsvm, …) to solve for the parameters \(\theta\).

Need to specify:

  • choice of parameter \(C\)
  • choice of kernel (similarity function):
    • e.g. no kernel (“linear kernel”): predict “y=1” if \(\theta^T x \ge 0\) (use when n is large and m is small)
    • Gaussian kernel (use when n is small and m is large): need to choose \(\sigma^2\); need to perform feature scaling beforehand (see the sketch after this list).
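
A hedged sketch of the practical recipe using scikit-learn (my choice of package; the notes only name liblinear/libsvm). In scikit-learn's SVC, kernel="rbf" is the Gaussian kernel and gamma plays the role of \(1/(2\sigma^2)\):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 < 1.0).astype(int)  # toy non-linear labels

scaler = StandardScaler()            # feature scaling before using the Gaussian kernel
X_scaled = scaler.fit_transform(X_train)

# kernel="rbf" is the Gaussian kernel; gamma corresponds to 1 / (2 * sigma^2).
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X_scaled, y_train)
print(clf.predict(scaler.transform([[0.1, 0.2]])))  # a point near the origin -> "inside" class
```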

Not all similarity functions make valid kernels. Need to satisfy “Mercer’s Theorem” to make sure SVM packages’ optimizations run correctly and do not diverge.

Many off-the-shelf kernels available:

  • polynomial kernel: e.g. \((x^T l + \text{constant})^{\text{degree}}\)
  • More esoteric: String kernel, chi-square kernel, histogram intersection kernel,…

multi-class classification

Many SVM packages have built-in multi-class classification
Otherwise, use the one-vs.-all method: train \(K\) SVMs, one to distinguish \(y=i\) from the rest for \(i=1,\dots,K\), getting \(\theta^{(1)},\dots,\theta^{(K)}\). Pick the class \(i\) with the largest \((\theta^{(i)})^T x\).
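
If a package does not handle multi-class directly, a sketch of the one-vs.-all route (again using scikit-learn names as an assumption, with toy random data):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)      # K = 3 classes

# Trains one linear SVM per class (class i vs. the rest) and predicts the class
# whose decision value (theta^(i))^T x is largest.
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```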

Logistic regression vs. SVMs

  1. If n is large relative to m (e.g. n = 10,000, m = 10–1,000), use logistic regression or SVM without a kernel
  2. If n is small and m is intermediate (e.g. n = 1–1,000, m = 10–10,000), use SVM with a Gaussian kernel
  3. If n is small and m is large (e.g. n = 1–1,000, m = 50,000+), create/add more features, then use logistic regression or SVM without a kernel

A neural network is likely to work well in most of these settings, but it may be slower to train.

