classification and representation
classification (binary)
\(y\in \{0,1\}\); “negative class”: 0; “positive class”: 1.
logistic regression: \(0 \le h_{\theta}(x) \le 1\)
hypothesis representation
logistic regression model
Want \(0 \le h_{\theta}(x) \le 1\)
- \(h_{\theta}(x)=g(\theta^Tx)\)
- sigmoid(logistic) function: \(g(z)=\frac{1}{1+e^{-z}}\)
- \(g'(z) = g(z)(1-g(z))\)
- \(h_{\theta}(x)\) = estimated probability that \(y=1\) on input x
- \(h_{\theta}(x)=P(y=1 \vert x;\theta)\): probability that \(y=1\), given x, parameterized by \(\theta\)
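A minimal Octave sketch of the sigmoid and the hypothesis (the function name sigmoid and the design matrix X with a leading column of ones are assumptions, not fixed by the notes):
function g = sigmoid(z)
  % element-wise logistic function; works for scalars, vectors, and matrices
  g = 1 ./ (1 + exp(-z));
end
% hypothesis for all m examples at once: X is m x (n+1), theta is (n+1) x 1
h = sigmoid(X * theta);   % m x 1 vector of estimated probabilities P(y=1 | x; theta)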
decision boundary
Suppose we predict “\(y=1\)” if \(h_{\theta}(x) \ge 0.5\) and “\(y=0\)” if \(h_{\theta}(x) \lt 0.5\)
For sigmoid function, \(g(z) \ge 0.5\) when \(z \ge 0\)
So “\(y=1\)” will be predicted whenever \(\theta^Tx \ge 0\), since then \(h_{\theta}(x)=g(\theta^Tx) \ge 0.5\)
definition
The decision boundary is \(\theta^Tx = 0\)
It’s a property of the hypothesis and its parameters, not of the training set
non-linear decision boundaries
e.g.: \(h_{\theta}(x)=g(-1+x_1^2+x_2^2)\)
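Worked out for this example: \(y=1\) is predicted when the argument of \(g\) is non-negative,
\[-1+x_1^2+x_2^2 \ge 0 \iff x_1^2+x_2^2 \ge 1,\]
so the decision boundary is the unit circle \(x_1^2+x_2^2=1\); points outside it are classified as \(y=1\), points inside as \(y=0\).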
logistic regression model
cost function
Linear regression: \(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
Cost function: \(Cost(h_{\theta}(x),y)=\frac{1}{2}(h_{\theta}(x)-y)^2\)
For logistic regression, plugging the sigmoid hypothesis into this squared-error cost makes \(J({\theta})\) a non-convex function with many local minima
logistic regression cost function
\(y \in \{0,1\}\)
\[Cost(h_{\theta}(x),y)=\begin{cases}-\log(h_{\theta}(x)) & \text{if } y=1 \\ -\log(1-h_{\theta}(x)) & \text{if } y=0\end{cases}\]
properties (for \(y=1\)):
- convex
- Cost = 0 if \(y=1\) and \(h_{\theta}(x)=1\)
- As \(h_{\theta}(x) \rightarrow 0\), Cost \(\rightarrow \infty\)
simplified cost function and gradient descent
Derived using the maximum likelihood method:
\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)
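A vectorized Octave sketch of the resulting cost \(J(\theta)\) (variable names are illustrative; sigmoid as above):
function J = logisticCost(theta, X, y)
  % X: m x (n+1) design matrix, y: m x 1 labels in {0,1}
  m = length(y);
  h = sigmoid(X * theta);                              % h_theta(x) for every example
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end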
gradient descent
\(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
\(\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)
Want to minimize \(J(\theta)\):
Repeat {
\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\]
} (simultaneously update all \(\theta_j\))
Vectorized version: \(\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\overrightarrow{y})\)
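A sketch of the vectorized update loop, assuming a fixed number of iterations numiters and a chosen learning rate alpha:
for iter = 1:numiters
  h = sigmoid(X * theta);           % m x 1 vector of predictions g(X*theta)
  grad = (1/m) * X' * (h - y);      % (n+1) x 1 gradient of J(theta)
  theta = theta - alpha * grad;     % simultaneous update of all theta_j
end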
feature scaling
Feature scaling also helps gradient descent converge faster for logistic regression, just as it does for linear regression
advanced optimization
optimization algorithm
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Comments on the last 3 algorithms:
they contain an inner loop (a line search) that automatically picks a good learning rate \(\alpha\) on each iteration
advantages:
- no need to manually pick \(\alpha\)
- often faster than gradient descent
disadvantage:
- more complex
code implementation
function [jval, gradient] = costfunc(theta)
  jval = [code to compute J(theta)];
  gradient(1) = [code to compute the partial derivative of J(theta) w.r.t. theta(1)];
  %...
  gradient(n) = [code to compute the partial derivative of J(theta) w.r.t. theta(n)];
end
options = optimset('GradObj', 'on', 'MaxIter', 100);  % 'GradObj','on': costfunc also returns the gradient
inittheta = zeros(2, 1);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, inittheta, options);
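As a concrete, runnable illustration only (the toy objective \(J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2\) stands in for the real cost, so fminunc should return \(\theta \approx (5,5)\)):
function [jval, gradient] = costfunc(theta)
  jval = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % J(theta)
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);             % partial derivative w.r.t. theta(1)
  gradient(2) = 2 * (theta(2) - 5);             % partial derivative w.r.t. theta(2)
end
options = optimset('GradObj', 'on', 'MaxIter', 100);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, zeros(2, 1), options);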
multiclass classification
e.g. weather: sunny, cloudy, rain, snow (\(y=1,2,3,4\))
one-vs-all(one-vs-rest)
For every class \(i\), train a logistic regression classifier \(h_\theta^{(i)}(x)=P(y=i \vert x;\theta)\) on the training set to predict the probability that \(y=i\). To classify a new input \(x\), pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\)
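A prediction sketch, assuming the K trained parameter vectors are stacked as rows of a K x (n+1) matrix alltheta (names are illustrative; sigmoid as above):
probs = sigmoid(X * alltheta');             % m x K matrix of estimated P(y=i | x; theta)
[maxprob, prediction] = max(probs, [], 2);  % per example: index i of the class with the highest probability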
solving the problem of overfitting
the problem of overfitting
Problem:
- underfit: too few features/parameters, high bias
- good fit
- overfit: too many features/parameters, high variance
If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.
Regularization will reduce the overfitting problem.
Options:
- reduce number of features
- manually select which features to keep
- model selection algorithm
- regularization
- keep all features, but reduce the magnitude/values of the parameters \(\theta_j\)
- regularization works well when we have a lot of slightly useful features
cost function
Intuition: suppose we penalize \(\theta_3\) and \(\theta_4\) in the cost function so that they end up really small; their higher-order terms then barely affect the hypothesis.
Regularization.
Small values for the parameters \(\theta_1,\theta_2,\dots,\theta_n\) (by convention \(\theta_0\) is not penalized)
- “simpler” hypothesis
- less prone to overfitting
Modify cost function:
\(\lambda\) is the regularization parameter: it trades off fitting the training set well against keeping the parameters small to avoid overfitting; if \(\lambda\) is too large, all \(\theta_j\) (for \(j \ge 1\)) are driven toward zero and the hypothesis underfits, approaching the horizontal line \(h_{\theta}(x)=\theta_0\)
\[J(\theta)=\frac{1}{2m}\left[ \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^{n}\theta_j^2 \right]\]
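A vectorized Octave sketch of this regularized cost for linear regression (names illustrative; theta(1) corresponds to \(\theta_0\)):
m = length(y);
h = X * theta;                                   % linear regression hypothesis
J = (1/(2*m)) * sum((h - y).^2) ...
    + (lambda/(2*m)) * sum(theta(2:end).^2);     % penalty skips theta(1), i.e. theta_0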
regularized linear regression
Gradient descent
Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\]
}
The factor \((1-\alpha\frac{\lambda}{m})\) is slightly less than 1, so each iteration shrinks \(\theta_j\) a little before applying the usual gradient step
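The same update, vectorized in Octave (a sketch; theta(1) corresponds to \(\theta_0\) and is not shrunk):
grad = (1/m) * X' * (X * theta - y);                      % unregularized gradient
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);    % add the regularization term, skipping theta_0
theta = theta - alpha * grad;                             % simultaneous update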
normal equation
\[\theta=\left( X^TX+\lambda \centerdot L \right)^{-1}X^Ty\]
\(L\) is the \((n+1)\times(n+1)\) identity matrix with \(L_{1,1}\) set to 0, so that \(\theta_0\) is not regularized
non-invertibility
If \(m \le n\), \(X^TX\) will be non-invertible/singular; pinv() in Octave can still be used to compute a pseudo-inverse. However, if \(\lambda \gt 0\), the matrix \(X^TX+\lambda \centerdot L\) is guaranteed to be invertible.
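A sketch of the regularized normal equation in Octave (n is the number of features; names illustrative):
L = eye(n + 1);
L(1, 1) = 0;                                    % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * (X' * y);   % pinv also guards against numerical issues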
regularized logistic regression
\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)
gradient descent
\[J({\theta})=-\left[ \frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\]
Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha \left[ \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \right]\]
\(j=1,2,3,…,n\)
}
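A sketch of the regularized logistic cost and gradient in the [jval, gradient] form expected by fminunc (names illustrative; sigmoid as above):
function [J, grad] = costfunctionreg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda/(2*m)) * sum(theta(2:end).^2);            % penalty skips theta_0
  grad = (1/m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);  % regularize all but theta_0
end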