classification and representation
classification (binary)
\(y\in \{0,1\}\); “negative class”: 0; “positive class”: 1.
logistic regression: \(0 \le h_{\theta}(x) \le 1\)
hypothesis representation
logistic regression model
Want \(0 \le h_{\theta}(x) \le 1\)
- \(h_{\theta}(x)=g(\theta^Tx)\)
- sigmoid(logistic) function: \(g(z)=\frac{1}{1+e^{-z}}\)
- \(g'(z) = g(z)(1-g(z))\)
- \(h_{\theta}(x)\) = estimated probability that \(y=1\) on input x
- \(h_{\theta}(x)=P(y=1 \vert x;\theta)\): probability that \(y=1\), given x, parameterized by \(\theta\)
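A minimal Octave sketch of the sigmoid and the hypothesis (the function name sigmoid and the design matrix X with a leading column of ones are assumptions, not fixed by the notes):
function g = sigmoid(z)
  % element-wise logistic function; works for scalars, vectors, and matrices
  g = 1 ./ (1 + exp(-z));
end
% hypothesis for all m examples at once: X is m x (n+1), theta is (n+1) x 1
h = sigmoid(X * theta);   % m x 1 vector of estimated probabilities P(y=1 | x; theta)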
decision boundary
Suppose we predict “\(y=1\)” if \(h_{\theta}(x) \ge 0.5\) and “\(y=0\)” if \(h_{\theta}(x) \lt 0.5\)
For sigmoid function, \(g(z) \ge 0.5\) when \(z \ge 0\)
So “\(y=1\)” will be predicted whenever \(\theta^Tx \ge 0\), since then \(h_{\theta}(x)=g(\theta^Tx) \ge 0.5\)
definition
The decision boundary is \(\theta^Tx = 0\)
It’s a property of the hypothesis and its parameters, not of the training set
non-linear decision boundaries
e.g.: \(h_{\theta}(x)=g(-1+x_1^2+x_2^2)\)
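Worked out for this example: \(y=1\) is predicted when the argument of \(g\) is non-negative,
\[-1+x_1^2+x_2^2 \ge 0 \iff x_1^2+x_2^2 \ge 1,\]
so the decision boundary is the unit circle \(x_1^2+x_2^2=1\); points outside it are classified as \(y=1\), points inside as \(y=0\).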
logistic regression model
cost function
Linear regression: \(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
Cost function: \(Cost(h_{\theta}(x),y)=\frac{1}{2}(h_{\theta}(x)-y)^2\)
For logistic regression, plugging the sigmoid hypothesis into this squared-error cost makes \(J({\theta})\) a non-convex function with many local minima
logistic regression cost function
\(y \in \{0,1\}\)
\[Cost(h_{\theta}(x),y)=\begin{cases}-\log(h_{\theta}(x)) & \text{if } y=1 \\ -\log(1-h_{\theta}(x)) & \text{if } y=0\end{cases}\]
properties (for \(y=1\)):
- convex
- Cost = 0 if \(y=1\) and \(h_{\theta}(x)=1\)
- As \(h_{\theta}(x) \rightarrow 0\), Cost \(\rightarrow \infty\)
simplified cost function and gradient descent
Derived using the maximum likelihood method:
\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)
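A vectorized Octave sketch of the resulting cost \(J(\theta)\) (variable names are illustrative; sigmoid as above):
function J = logisticCost(theta, X, y)
  % X: m x (n+1) design matrix, y: m x 1 labels in {0,1}
  m = length(y);
  h = sigmoid(X * theta);                              % h_theta(x) for every example
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end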
gradient descent
\(J({\theta})=\frac{1}{m}\sum_{i=1}^mCost(h_{\theta}(x^{(i)}),y^{(i)})\)
\(\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\)
Want to minimize \(J(\theta)\):
Repeat {
\[\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\]
} (simultaneously update all \(\theta_j\))
Vectorized version: \(\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\overrightarrow{y})\)
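A sketch of the vectorized update loop, assuming a fixed number of iterations numiters and a chosen learning rate alpha:
for iter = 1:numiters
  h = sigmoid(X * theta);           % m x 1 vector of predictions g(X*theta)
  grad = (1/m) * X' * (h - y);      % (n+1) x 1 gradient of J(theta)
  theta = theta - alpha * grad;     % simultaneous update of all theta_j
end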
feature scaling
Feature scaling also helps gradient descent converge faster for logistic regression, just as it does for linear regression
advanced optimization
optimization algorithm
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Comments on the last 3 algorithms:
they contain an inner loop (a line search) that automatically picks a good learning rate \(\alpha\) on each iteration
advantages:
- no need to manually pick \(\alpha\)
- often faster than gradient descent
disadvantage:
- more complex
code implementation
function [jval, gradient] = costfunc(theta)
  jval = [code to compute J(theta)];
  gradient(1) = [code to compute the partial derivative of J(theta) w.r.t. theta(1)];
  %...
  gradient(n) = [code to compute the partial derivative of J(theta) w.r.t. theta(n)];
end
options = optimset('GradObj', 'on', 'MaxIter', 100);  % 'GradObj','on': costfunc also returns the gradient
inittheta = zeros(2, 1);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, inittheta, options);
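As a concrete, runnable illustration only (the toy objective \(J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2\) stands in for the real cost, so fminunc should return \(\theta \approx (5,5)\)):
function [jval, gradient] = costfunc(theta)
  jval = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % J(theta)
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);             % partial derivative w.r.t. theta(1)
  gradient(2) = 2 * (theta(2) - 5);             % partial derivative w.r.t. theta(2)
end
options = optimset('GradObj', 'on', 'MaxIter', 100);
[finaltheta, funcvalue, exitflag] = fminunc(@costfunc, zeros(2, 1), options);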
multiclass classification
e.g. weather: sunny, cloudy, rain, snow (\(y=1,2,3,4\))
one-vs-all(one-vs-rest)
For every class \(i\), train a logistic regression classifier \(h_\theta^{(i)}(x)=P(y=i \vert x;\theta)\) on the training set to predict the probability that \(y=i\). To classify a new input \(x\), pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\)
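A prediction sketch, assuming the K trained parameter vectors are stacked as rows of a K x (n+1) matrix alltheta (names are illustrative; sigmoid as above):
probs = sigmoid(X * alltheta');             % m x K matrix of estimated P(y=i | x; theta)
[maxprob, prediction] = max(probs, [], 2);  % per example: index i of the class with the highest probability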
solving the problem of overfitting
the problem of overfitting
Problem:
- underfit: too few features/parameters, high bias
- good fit
- overfit: too many features/parameters, high variance
If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.
Regularization will reduce the overfitting problem.
Options:
- reduce number of features
- manually select which features to keep
- model selection algorithm
- regularization
- keep all features, but reduce the magnitude/values of the parameters \(\theta_j\)
- regularization works well when we have a lot of slightly useful features
cost function
Intuition: suppose we penalize \(\theta_3\) and \(\theta_4\) in the cost function so that they end up really small; their higher-order terms then barely affect the hypothesis.
Regularization.
Small values for the parameters \(\theta_1,\theta_2,\dots,\theta_n\) (by convention \(\theta_0\) is not penalized)
- “simpler” hypothesis
- less prone to overfitting
Modify cost function:
\(\lambda\) is the regularization parameter: it trades off fitting the training set well against keeping the parameters small to avoid overfitting; if \(\lambda\) is too large, all \(\theta_j\) (for \(j \ge 1\)) are driven toward zero and the hypothesis underfits, approaching the horizontal line \(h_{\theta}(x)=\theta_0\)
\[J(\theta)=\frac{1}{2m}\left[ \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{j=1}^{n}\theta_j^2 \right]\]
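A vectorized Octave sketch of this regularized cost for linear regression (names illustrative; theta(1) corresponds to \(\theta_0\)):
m = length(y);
h = X * theta;                                   % linear regression hypothesis
J = (1/(2*m)) * sum((h - y).^2) ...
    + (lambda/(2*m)) * sum(theta(2:end).^2);     % penalty skips theta(1), i.e. theta_0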
regularized linear regression
Gradient descent
Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}\]
}
The factor \((1-\alpha\frac{\lambda}{m})\) is slightly less than 1, so each iteration shrinks \(\theta_j\) a little before applying the usual gradient step
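The same update, vectorized in Octave (a sketch; theta(1) corresponds to \(\theta_0\) and is not shrunk):
grad = (1/m) * X' * (X * theta - y);                      % unregularized gradient
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);    % add the regularization term, skipping theta_0
theta = theta - alpha * grad;                             % simultaneous update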
normal equation
\[\theta=\left( X^TX+\lambda \centerdot L \right)^{-1}X^Ty\]
\(L\) is the \((n+1)\times(n+1)\) identity matrix with \(L_{1,1}\) set to 0, so that \(\theta_0\) is not regularized
non-invertibility
If \(m \le n\), \(X^TX\) will be non-invertible/singular; pinv() in Octave can still be used to compute a pseudo-inverse. However, if \(\lambda \gt 0\), the matrix \(X^TX+\lambda \centerdot L\) is guaranteed to be invertible.
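A sketch of the regularized normal equation in Octave (n is the number of features; names illustrative):
L = eye(n + 1);
L(1, 1) = 0;                                    % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * (X' * y);   % pinv also guards against numerical issues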
regularized logistic regression
\(Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\)
gradient descent
\[J({\theta})=-\left[ \frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\]
Repeat {
\[\theta_0 := \theta_0 -\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{0}^{(i)}\]
\[\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha \left[ \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \right]\]
\(j=1,2,3,…,n\)
}
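A sketch of the regularized logistic cost and gradient in the [jval, gradient] form expected by fminunc (names illustrative; sigmoid as above):
function [J, grad] = costfunctionreg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda/(2*m)) * sum(theta(2:end).^2);            % penalty skips theta_0
  grad = (1/m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);  % regularize all but theta_0
end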