# classification and representation

## classification (binary)

\(y\in \{0,1\}\); “negative class”: 0; “positive class”: 1.

logistic regression: \(0 \le h_{\theta}(x) \le 1\)

## hypothesis representation

### logistic regression model

Want \(0 \le h_{\theta}(x) \le 1\)

- \(h_{\theta}(x)=g(\theta^Tx)\)
- sigmoid(logistic) function: \(g(z)=\frac{1}{1+e^{-z}}\)
- \(g'(z) = g(z)(1-g(z))\)
- \(h_{\theta}(x)\) = estimated probability that \(y=1\) on input x
- \(h_{\theta}(x)=P(y=1 \vert x;\theta)\): probability that \(y=1\), given x, parameterized by \(\theta\)
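A minimal NumPy sketch of the sigmoid and the hypothesis (function names are my own):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x): estimated P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)
```

The derivative identity \(g'(z)=g(z)(1-g(z))\) can be checked numerically against a central difference.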

## decision boundary

Suppose predict “\(y=1\)” if \(h_{\theta}(x) \ge 0.5\); “\(y=0\)” if \(h_{\theta}(x) \lt 0.5\)

For sigmoid function, \(g(z) \ge 0.5\) when \(z \ge 0\)

So “\(y=1\)” is predicted whenever \(\theta^Tx \ge 0\), since then \(h_{\theta}(x)=g(\theta^Tx) \ge 0.5\)

### definition

The decision boundary is \(\theta^Tx = 0\)

It’s a property of the hypothesis and its parameters, not of the training set

### non-linear decision boundaries

e.g.: \(h_{\theta}(x)=g(-1+x_1^2+x_2^2)\)
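For this example the boundary \(-1+x_1^2+x_2^2=0\) is the unit circle; a tiny sketch of the resulting prediction rule (function name is my own):

```python
def predict_circle(x1, x2):
    # h_theta(x) = g(-1 + x1^2 + x2^2); predict y = 1 when the argument of g
    # is >= 0, i.e. on or outside the unit circle x1^2 + x2^2 = 1.
    return int(-1.0 + x1 ** 2 + x2 ** 2 >= 0)
```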

# logistic regression model

## cost function

Linear regression: \(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)

Cost function: \(Cost(h_{\theta}(x),y)=\frac{1}{2}(h_{\theta}(x)-y)^2\)

For logistic regression, this squared-error cost makes \(J({\theta})\) non-convex, with many local minima

### logistic regression cost function

\(y \in \{0,1\}\)

\[Cost(h_{\theta}(x),y)=\begin{cases}-\log(h_{\theta}(x)) & \text{if } y=1\\-\log(1-h_{\theta}(x)) & \text{if } y=0\end{cases}\]

properties (for \(y=1\)):

- convex
- Cost = 0 if \(y=1\) and \(h_{\theta}(x)=1\)
- As \(h_{\theta}(x) \rightarrow 0\), Cost \(\rightarrow \infty\)

## simplified cost function and gradient descent

The two cases combine into a single expression (this form can also be derived via maximum likelihood estimation):

\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)
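A direct translation of the per-example cost (function name is my own; \(h_{\theta}(x)\) must lie strictly in \((0,1)\) to avoid \(\log 0\)):

```python
import numpy as np

def cost(h_x, y):
    """Per-example logistic cost: -y*log(h) - (1-y)*log(1-h).

    h_x must be strictly inside (0, 1); y is 0 or 1.
    """
    return -y * np.log(h_x) - (1 - y) * np.log(1 - h_x)
```

With \(y=1\) the second term vanishes and the cost reduces to \(-\log(h_{\theta}(x))\), matching the piecewise definition above.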

### gradient descent

\(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)

\(\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)

We want to minimize \(J(\theta)\):

Repeat {

\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\]

} (simultaneously update all \(\theta_j\))

Vectorized version: \(\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\overrightarrow{y})\)
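The vectorized update can be sketched in NumPy as follows (function names, default \(\alpha\), and iteration count are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, iters=2000):
    """Batch gradient descent for logistic regression.

    X is m x (n+1) with a leading column of ones; y is a length-m 0/1 vector.
    Implements theta := theta - (alpha/m) * X^T (g(X theta) - y).
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```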

### feature scaling

also needed here, just as for linear regression, so that gradient descent converges faster

## advanced optimization

Optimization algorithms:

- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

### comments on the last three algorithms

They use an inner line search to automatically pick a good learning rate \(\alpha\) on each iteration

advantages:

- no need to manually pick \(\alpha\)
- often faster than gradient descent

disadvantage:

- more complex

## code implementation

```
% costfunc.m -- returns the cost and its gradient for a given theta
function [jval, gradient] = costfunc(theta)
  jval = [code];        % J(theta)
  gradient(1) = [code]; % partial derivative w.r.t. theta(1)
  %...
  gradient(n) = [code];
end

% tell fminunc that costfunc also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialtheta = zeros(2, 1);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, initialtheta, options);
```
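A rough Python analogue of the Octave pattern above uses `scipy.optimize.minimize`, which, like `fminunc`, accepts a function returning both cost and gradient when `jac=True`. The quadratic cost here is a toy stand-in (its minimum at \((5,5)\) is my own choice), not a real \(J(\theta)\):

```python
import numpy as np
from scipy.optimize import minimize

def costfunc(theta):
    # Toy convex cost with minimum at theta = (5, 5); stands in for J(theta).
    # Returns (jval, gradient), mirroring the two-output Octave function.
    jval = np.sum((theta - 5.0) ** 2)
    gradient = 2.0 * (theta - 5.0)
    return jval, gradient

initial_theta = np.zeros(2)
result = minimize(costfunc, initial_theta, jac=True,
                  method='BFGS', options={'maxiter': 100})
final_theta = result.x
```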

# multiclass classification

e.g. weather: sunny, cloudy, rain, snow (\(y = 1, 2, 3, 4\))

## one-vs-all (one-vs-rest)

For every class \(i\), train a logistic regression classifier \(h_\theta^{(i)}(x)=P(y=i \vert x;\theta)\) on the training set to predict the probability that \(y=i\). On a new input \(x\), predict the class \(i\) that maximizes \(h_\theta^{(i)}(x)\)
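A sketch of one-vs-all on top of plain gradient descent (function names, learning rate, and iteration count are my own; a real implementation would use an optimizer like `fminunc`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.5, iters=3000):
    """Train one logistic-regression classifier per class.

    Classifier i is fit to the binary labels (y == i), so h^{(i)}(x)
    estimates P(y = i | x; theta^{(i)}).
    """
    m = X.shape[0]
    all_theta = np.zeros((num_classes, X.shape[1]))
    for i in range(num_classes):
        yi = (y == i).astype(float)
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - yi))
        all_theta[i] = theta
    return all_theta

def predict_one_vs_all(all_theta, X):
    # For each example, pick the class whose classifier is most confident.
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)
```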

# solving the problem of overfitting

## the problem of overfitting

Problem:

- underfit: too few parameters, high bias
- good fit
- overfit: too many parameters, high variance

If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.

Regularization will reduce the overfitting problem.

Options:

- reduce the number of features
  - manually select which features to keep
  - model selection algorithm
- regularization
  - keep all features, but reduce the magnitude/values of the parameters \(\theta_j\)
  - works well when we have a lot of slightly useful features

## cost function

Intuition: suppose we penalize \(\theta_3\) and \(\theta_4\) so they end up really small; the high-order terms then contribute almost nothing, and the hypothesis behaves like a simpler one.

Regularization: small values for the parameters \(\theta_1,\theta_2,\ldots,\theta_n\), but not \(\theta_0\)

- “simpler” hypothesis
- less prone to overfitting

Modify cost function:

\(\lambda\) is the regularization parameter: it controls the trade-off between fitting the training data well and keeping the parameters small. Too large a \(\lambda\) causes underfitting (the hypothesis degenerates toward the horizontal line \(h_\theta(x)=\theta_0\))

\[J(\theta)=\frac{1}{2m}\left[ \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^{n}\theta_j^2 \right]\]
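A direct translation of this cost (function name is my own; note \(\theta_0\), i.e. `theta[0]`, is excluded from the penalty):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum of squared errors + lambda * sum_{j>=1} theta_j^2 ].

    X is m x (n+1) with a leading column of ones; theta[0] is not regularized.
    """
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (residual @ residual + penalty) / (2 * m)
```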

## regularized linear regression

### Gradient descent

Repeat {

\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]

\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\]

}

The shrinkage factor \((1-\alpha\frac{\lambda}{m})\) is slightly less than 1, so each iteration shrinks \(\theta_j\) a little before applying the usual gradient update

### normal equation

\[\theta=\left( X^TX+\lambda \cdot L \right)^{-1}X^Ty\]

\(L\) is the \((n+1)\times(n+1)\) identity matrix except that \(L(1,1)\) equals 0, so that \(\theta_0\) is not regularized
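The regularized normal equation can be sketched as follows (function name is my own; solving the linear system is preferable to forming an explicit inverse):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y,
    where L is the identity with L[0, 0] = 0 so theta_0 is not penalized.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    # Solve the system rather than inverting the matrix explicitly.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

With \(\lambda = 0\) this reduces to ordinary least squares; increasing \(\lambda\) shrinks the non-intercept coefficients.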

### non-invertibility

Suppose \(m \le n\). Then \(X^TX\) will be non-invertible/singular; use `pinv()` in Octave to handle it. However, if \(\lambda \gt 0\), \(X^TX+\lambda \cdot L\) is always invertible.

## regularized logistic regression

\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)

### gradient descent

\[J({\theta})=-\left[ \frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\]

Repeat {

\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]

\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha \left[ \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \right]\]

\(j=1,2,3,…,n\)

}
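One update of this loop can be sketched in NumPy (function name is my own; note the equivalent form \(\theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\,\text{grad}_j = \theta_j - \alpha\,\text{grad}_j - \alpha\frac{\lambda}{m}\theta_j\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression.

    theta[0] gets the plain update; theta[j] for j >= 1 is additionally
    shrunk by alpha * lam / m * theta[j], matching the factor (1 - alpha*lam/m).
    """
    m = len(y)
    grad = (X.T @ (sigmoid(X @ theta) - y)) / m
    new_theta = theta - alpha * grad
    new_theta[1:] -= alpha * (lam / m) * theta[1:]
    return new_theta
```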