2016-12-23

# introduction

## supervised learning

the idea is that there is a relationship between the input and the output

### regression

continuous-valued output

### classification

discrete-valued output (0, 1, 2, 3, …)

## unsupervised learning

derive structure from data where the effect of the variables is not necessarily known
derive this structure by clustering the data based on relationships among the variables
no feedback is given based on the prediction results

• cocktail party problem (source separation), solvable in Octave with one line:

```octave
[w, s, v] = svd((repmat(sum(x .* x, 1), size(x, 1), 1) .* x) * x');
```
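The Octave one-liner above can be sketched in numpy as follows. This is a direct translation of the expression only; `x` is assumed to hold the microphone recordings, one recording per row, and the toy signal here is random data just to show the shapes:

```python
import numpy as np

# toy "two microphones" matrix: one recording per row (assumed layout)
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))

# numpy translation of: svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x')
col_norms = np.sum(x * x, axis=0, keepdims=True)        # sum(x.*x,1): 1 x n
m_mat = (np.tile(col_norms, (x.shape[0], 1)) * x) @ x.T  # repmat(...).*x times x'
w, s, vt = np.linalg.svd(m_mat)                          # w: separating directions
```

Note that `np.linalg.svd` returns `v` transposed (`vt`), unlike Octave's `svd`.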


# model and cost function

## model representation

m = number of training examples
x = input variable / feature
y = output variable / "target" variable
$(x,y)$ = one training example
$(x^{(i)},y^{(i)})$ = the $i$th training example
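A tiny illustration of this notation in Python; the dataset values are made up for the example:

```python
# hypothetical training set: sizes (x) and prices (y)
x = [2104, 1416, 1534, 852]   # input variable / feature
y = [460, 232, 315, 178]      # output / "target" variable

m = len(x)                    # number of training examples: 4

# the i-th training example (x^(i), y^(i)), 1-indexed as in the notes
i = 1
example = (x[i - 1], y[i - 1])  # (2104, 460)
```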

### procedure

training set $\rightarrow$ learning algorithm $\rightarrow$ $h$ (hypothesis)
$x \rightarrow h \rightarrow y$
$h$ maps from the $x$'s to the $y$'s

### hypothesis

linear regression with one variable, also called univariate linear regression
$h_{\theta}(x)={\theta}_0 + {\theta}_1x$
$h_{\theta}(x)$ can be simplified as $h(x)$
${\theta}_{i}'s$: parameters
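The hypothesis above as a minimal Python function; the parameter values in the example are chosen arbitrarily for illustration:

```python
def h(theta0, theta1, x):
    """Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# with theta0 = 1.0 and theta1 = 0.5, the input x = 4 maps to 1.0 + 0.5*4 = 3.0
print(h(1.0, 0.5, 4))  # -> 3.0
```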

## cost function

goal: minimize $\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$ by adjusting ${\theta}_0$ and ${\theta}_1$
hypothesis: $h_{\theta}(x)={\theta}_0 + {\theta}_1x$
cost function(squared error function): $J({\theta}_{0},{\theta}_{1})$
parameters: ${\theta}_{i}'s$
$J({\theta}_{0},{\theta}_{1})=\frac{1}{2m}\sum_{i=1}^m(\hat{y}_{i}-y^{(i)})^2=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$
the factor $\frac{1}{2}$ is included for convenience: it cancels against the 2 produced by differentiating the square when computing the gradient for gradient descent
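The cost function as a minimal sketch in plain Python; the toy dataset is an assumption, chosen to lie exactly on $y = 1 + 2x$ so the cost at $(\theta_0, \theta_1) = (1, 2)$ is zero:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# toy data lying exactly on y = 1 + 2x
xs, ys = [0, 1, 2], [1, 3, 5]
print(compute_cost(1, 2, xs, ys))  # -> 0.0 (perfect fit)
print(compute_cost(0, 0, xs, ys))  # (1 + 9 + 25) / 6, roughly 5.833
```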

# parameter learning

## gradient descent

have some function $J(\theta_{0},\theta_{1})$
want to minimize $J$ by adjusting $\theta_{0},\theta_{1}$

### outline

start with some $\theta_{0},\theta_{1}$
keep changing $\theta_{0},\theta_{1}$ to reduce $J(\theta_{0},\theta_{1})$ until ending up at a minimum

### algorithm

repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}) \text{ for } j=0 \text{ and } j=1$
}
correct implementation: simultaneous update
$temp0:=\theta_{0}-\alpha\frac{\partial}{\partial\theta_{0}}J(\theta_{0},\theta_{1})$
$temp1:=\theta_{1}-\alpha\frac{\partial}{\partial\theta_{1}}J(\theta_{0},\theta_{1})$
$\theta_{0} := temp0$
$\theta_{1} := temp1$

$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}) \text{ simultaneously update } j=0 \text{ and } j=1$
$\alpha$ is the learning rate, which controls the size of each step
$\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})$ is the derivative or gradient or slope with respect to $\theta_{j}$

as we approach a local minimum, gradient descent will automatically take smaller steps (the gradient becomes smaller), so no need to decrease $\alpha$ over time.
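The simultaneous-update rule can be sketched on a toy cost: here $J(\theta_0, \theta_1) = \theta_0^2 + \theta_1^2$, whose partial derivatives are $2\theta_0$ and $2\theta_1$ (this toy $J$ is an assumption for illustration only, not the regression cost):

```python
def gradient_descent_step(theta0, theta1, alpha):
    # toy cost J = theta0^2 + theta1^2, so dJ/dtheta0 = 2*theta0, dJ/dtheta1 = 2*theta1
    temp0 = theta0 - alpha * 2 * theta0   # compute both updates first...
    temp1 = theta1 - alpha * 2 * theta1
    return temp0, temp1                   # ...then assign simultaneously

theta0, theta1 = 4.0, -2.0
for _ in range(100):
    theta0, theta1 = gradient_descent_step(theta0, theta1, alpha=0.1)
# both parameters shrink toward the minimum at (0, 0)
```

Computing `temp0` and `temp1` before assigning is exactly the "simultaneous update" requirement: updating $\theta_0$ in place first would change the cost that $\theta_1$'s gradient is evaluated on.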

## gradient descent for linear regression

apply gradient descent to minimize cost function $J({\theta}_{0},{\theta}_{1})$
repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}) \text{ for } j=0 \text{ and } j=1$
}
linear regression
$h_{\theta}(x)={\theta}_0 + {\theta}_1x$
$J({\theta}_{0},{\theta}_{1})=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$
a convex function is a bowl-shaped function with a single minimum, which is therefore the global minimum; the linear regression cost $J$ is convex, so gradient descent (with a suitable $\alpha$) converges to the global minimum
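Putting the pieces together, a minimal batch gradient descent for univariate linear regression in plain Python; the dataset and hyperparameters are illustrative (data generated from $y = 1 + 2x$, so the algorithm should recover $\theta_0 \approx 1$, $\theta_1 \approx 2$):

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent on the squared error cost for h(x) = theta0 + theta1*x."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        # partial derivatives of J(theta0, theta1)
        grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# data generated from y = 1 + 2x
xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
t0, t1 = gradient_descent(xs, ys)
# t0 is close to 1, t1 is close to 2
```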