
Stanford ML 3 Classification Problems

2016-12-27

classification and representation

classification (binary)

\(y\in \{0,1\}\); “negative class”: 0; “positive class”: 1.
logistic regression: \(0 \le h_{\theta}(x) \le 1\)

hypothesis representation

logistic regression model

Want \(0 \le h_{\theta}(x) \le 1\)

  • \(h_{\theta}(x)=g(\theta^Tx)\)
  • sigmoid(logistic) function: \(g(z)=\frac{1}{1+e^{-z}}\)
  • \(g'(z) = g(z)(1-g(z))\)
  • \(h_{\theta}(x)\) = estimated probability that \(y=1\) on input x
  • \(h_{\theta}(x)=P(y=1 \vert x;\theta)\): probability that \(y=1\), given x, parameterized by \(\theta\)
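
A minimal Octave sketch of the hypothesis. The helper name sigmoid and the design matrix X (one training example per row) are illustrative assumptions, not code from the course:

function g = sigmoid(z)
  % element-wise logistic function; works on scalars, vectors and matrices
  g = 1 ./ (1 + exp(-z));
end

h = sigmoid(X * theta);   % m-by-1 vector of estimated probabilities P(y=1 | x; theta)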

decision boundary

Suppose we predict “\(y=1\)” if \(h_{\theta}(x) \ge 0.5\) and “\(y=0\)” if \(h_{\theta}(x) \lt 0.5\).
For the sigmoid function, \(g(z) \ge 0.5\) exactly when \(z \ge 0\).
So “\(y=1\)” is predicted whenever \(\theta^Tx \ge 0\), since then \(h_{\theta}(x)=g(\theta^Tx) \ge 0.5\).

definition

The decision boundary is \(\theta^Tx = 0\)
It’s a property of the hypothesis and its parameters, not of the training set
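
As a worked example (the parameter values here are just an illustration), take \(\theta=(-3,1,1)^T\). The boundary is the straight line
\[\theta^Tx = -3 + x_1 + x_2 = 0 \iff x_1 + x_2 = 3\]
and “\(y=1\)” is predicted in the half-plane where \(x_1 + x_2 \ge 3\).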

non-linear decision boundaries

e.g.: \(h_{\theta}(x)=g(-1+x_1^2+x_2^2)\), whose decision boundary \(x_1^2+x_2^2=1\) is the unit circle; “\(y=1\)” is predicted outside the circle.

logistic regression model

cost function

Linear regression uses \(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
with the squared-error cost term \(Cost(h_{\theta}(x),y)=\frac{1}{2}(h_{\theta}(x)-y)^2\).
With the logistic hypothesis, this \(J({\theta})\) would be non-convex, with many local minima, so gradient descent is not guaranteed to find the global minimum.

logistic regression cost function

\(y \in \{0,1\}\)

properties (for \(y=1\)):

  • convex
  • Cost = 0 if \(y=1\) and \(h_{\theta}(x)=1\)
  • As \(h_{\theta}(x) \rightarrow 0\), Cost \(\rightarrow \infty\)
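
For reference, the piecewise cost behind these properties is:
\[Cost(h_{\theta}(x),y)=\begin{cases}-\log(h_{\theta}(x)) & \text{if } y=1 \\ -\log(1-h_{\theta}(x)) & \text{if } y=0\end{cases}\]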

simplified cost function and gradient descent

Using the maximum likelihood method, the two cases combine into a single expression:
\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)

gradient descent

\(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
\(\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)
We want to minimize \(J(\theta)\):
Repeat {
\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\]
} (simultaneously update all \(\theta_j\))
Vectorized version: \(\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\overrightarrow{y})\)
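
A vectorized Octave sketch of this loop, assuming the sigmoid helper from the sketch above and illustrative names for the design matrix X, label vector y, learning rate alpha and iteration count num_iters:

function theta = gradientdescent(X, y, theta, alpha, num_iters)
  % batch gradient descent for logistic regression, vectorized over all m examples
  m = length(y);
  for iter = 1:num_iters
    h = sigmoid(X * theta);                                % m-by-1 vector of predictions
    J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % current cost, useful for monitoring convergence
    theta = theta - (alpha/m) * X' * (h - y);              % simultaneous update of all theta_j
  end
end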

feature scaling

Feature scaling is also needed here, just as for linear regression, to help gradient descent converge faster.

advanced optimization

optimization algorithm

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

Comments on the last three algorithms

They have a clever inner loop (a line search) that automatically tries out and picks a good learning rate \(\alpha\) on each iteration, so \(\alpha\) can differ between iterations.

advantages:

  • no need to manually pick \(\alpha\)
  • often faster than gradient descent

disadvantage:

  • more complex

code implementation

function [jval, gradient] = costfunc(theta)
  jval = [code to compute J(theta)];
  gradient(1) = [code to compute the partial derivative w.r.t. theta(1)];
  % ...
  gradient(n) = [code to compute the partial derivative w.r.t. theta(n)];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);   % use our gradient; cap at 100 iterations
initialtheta = zeros(2, 1);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, initialtheta, options);

multiclass classification

e.g. weather: sunny, cloudy, rain, snow (\(y\) = 1, 2, 3, 4)

one-vs-all(one-vs-rest)

For every class \(i\), train a logistic regression classifier \(h_\theta^{(i)}(x)=P(y=i \vert x;\theta)\) on the training set to predict the probability that \(y=i\). To classify a new input \(x\), pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), as in the sketch below.
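
A minimal Octave sketch of the prediction step, assuming the K classifiers' parameters are stacked into a K-by-(n+1) matrix all_theta, one row per classifier (the variable names are illustrative):

probs = sigmoid(X * all_theta');          % m-by-K matrix: probs(i,k) = estimated P(y=k) for example i
[~, predictions] = max(probs, [], 2);     % predicted class = classifier with the highest probability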

solving the problem of overfitting

the problem of overfitting

Problem:

  • underfit: too few parameters, high bias
  • good fit
  • overfit: too many parameters, high variance

If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.
Regularization will reduce the overfitting problem.

Options:

  • reduce number of features
    • manually select which features to keep
    • model selection algorithm
  • regularization
    • keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\)
    • regularization works well when we have a lot of slightly useful features

cost function

Intuition: suppose we penalize \(\theta_3\) and \(\theta_4\) in the cost function so that they end up really small; the features they multiply then contribute little, and the hypothesis becomes effectively simpler.

Regularization: keep the parameters \(\theta_1,\theta_2,\dots,\theta_n\) small (by convention \(\theta_0\) is not penalized). This gives:

  • “simpler” hypothesis
  • less prone to overfitting

Modify the cost function as below. \(\lambda\) is the regularization parameter; it trades off fitting the training data against keeping the parameters small. If \(\lambda\) is too large, every \(\theta_j\) with \(j \ge 1\) is driven toward zero and the hypothesis underfits, approaching the horizontal line \(h_{\theta}(x)=\theta_0\).
\[J(\theta)=\frac{1}{2m}\left[ \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^{n}\theta_j^2 \right]\]

regularized linear regression

Gradient descent

Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\]
}

The shrinkage factor \((1-\alpha\frac{\lambda}{m})\) is slightly less than 1, so each update first shrinks \(\theta_j\) a little and then applies the usual gradient step.
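
A single regularized update in Octave, sketched under the assumption that X, y, theta, alpha and lambda are already defined (the names are illustrative):

m = length(y);
grad = (1/m) * X' * (X * theta - y);      % unregularized gradient of the squared-error cost
reg = (lambda/m) * theta;
reg(1) = 0;                               % theta_0 (theta(1) in Octave) is not penalized
theta = theta - alpha * (grad + reg);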

normal equation

\[\theta=\left( X^TX+\lambda \centerdot L \right)^{-1}X^Ty\]

\(L\) is an \((n+1)\times(n+1)\) identity matrix, except that \(L(1,1)\) equals 0.

non-invertibility

Suppose \(m \le n\); then \(X^TX\) will be non-invertible (singular).
pinv() in Octave still gives a usable solution. Moreover, if \(\lambda \gt 0\), the matrix \((X^TX+\lambda \centerdot L)\) is guaranteed to be invertible, so regularization also takes care of this issue.
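
The regularized normal equation translates directly into Octave; a sketch, assuming X, y and lambda are defined:

L = eye(size(X, 2));                      % (n+1)-by-(n+1) identity
L(1, 1) = 0;                              % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;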

regularized logistic regression

\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)

gradient descent

\[J({\theta})=-\left[ \frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\]

Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha \left[ \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \right]\]
\(j=1,2,3,…,n\)
}
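
A sketch of a regularized cost function that could be plugged into the fminunc setup from the earlier section. The function name regcostfunc and its signature are assumptions for illustration; sigmoid is the helper defined above:

function [jval, gradient] = regcostfunc(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  % regularized cost; theta(1), i.e. theta_0, is excluded from the penalty
  jval = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda/(2*m)) * sum(theta(2:end).^2);
  % regularized gradient
  gradient = (1/m) * X' * (h - y);
  gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialtheta = zeros(size(X, 2), 1);
finaltheta = fminunc(@(t) regcostfunc(t, X, y, lambda), initialtheta, options);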

