introduction
supervised learning
we already have the idea that there is a relationship between the input and the output
regression
predict a continuous-valued output
classification
predict a discrete-valued output (0 or 1, 2, 3, …)
unsupervised learning
Derive structure from data where the effect of the variables is not necessarily known  
Derive this structure by clustering the data based on the relationships among the variables  
No feedback based on the prediction results
- cocktail party problem
% Octave one-liner from the lecture: separates the mixed sources in x via SVD (ICA demo)
[w,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
model and cost function
model representation
m = number of training examples  
x = input variable / features  
y = output variable / "target" variable  
$(x, y)$ = one training example  
$(x^{(i)}, y^{(i)})$ = the $i$-th training example
procedure
training set → learning algorithm → h (hypothesis)  
x → h → y  
h maps from x’s to y’s
hypothesis
linear regression with one variable.  
univariate (one variable) linear regression  
$h_{\theta}(x) = \theta_{0} + \theta_{1}x$  
$h_{\theta}(x)$ can be simplified as $h(x)$  
$\theta_{0}, \theta_{1}$: parameters
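As a quick illustration, the hypothesis can be written as an Octave anonymous function (the parameter values below are assumed for the example):

```octave
% minimal sketch of the hypothesis as an anonymous function
theta0 = 1; theta1 = 2;        % example parameter values (assumed)
h = @(x) theta0 + theta1 * x;  % h(x) = theta0 + theta1 * x
h(3)                           % ans = 7
```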
cost function
goal: minimize $J(\theta_{0}, \theta_{1})$ by adjusting $\theta_{0}$ and $\theta_{1}$  
hypothesis: $h_{\theta}(x) = \theta_{0} + \theta_{1}x$  
cost function (squared error function): $J(\theta_{0}, \theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$  
parameters: $\theta_{0}, \theta_{1}$  
goal: $\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1})$
the $\frac{1}{2}$ is a convenience for the computation of the gradient descent: it cancels the factor of 2 that appears when differentiating the squared term
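A minimal Octave sketch of evaluating this cost function (the data and parameter values are assumed for illustration):

```octave
% minimal sketch: the squared error cost on toy data
x = [1; 2; 3];                 % toy inputs (assumed)
y = [2; 4; 6];                 % toy targets (assumed)
m = length(y);
theta0 = 0; theta1 = 2;        % example parameter values (assumed)
h = theta0 + theta1 * x;       % hypothesis on every training example
J = (1 / (2 * m)) * sum((h - y).^2)   % J = 0, since h matches y exactly
```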
parameter learning
gradient descent
have some function $J(\theta_{0}, \theta_{1})$  
want to minimize this function $J$ by adjusting $\theta_{0}, \theta_{1}$
outline
start with some $\theta_{0}, \theta_{1}$  
keep changing $\theta_{0}, \theta_{1}$ to reduce $J(\theta_{0}, \theta_{1})$ until ending up at a minimum
algorithm
repeat until convergence {  
\[\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}) \text{ for } j=0 \text{ and } j=1\]  
}  
correct: simultaneous update  
\[\text{temp0} := \theta_{0}-\alpha\frac{\partial}{\partial\theta_{0}}J(\theta_{0},\theta_{1})\]  
\[\text{temp1} := \theta_{1}-\alpha\frac{\partial}{\partial\theta_{1}}J(\theta_{0},\theta_{1})\]  
\[\theta_{0} := \text{temp0}, \quad \theta_{1} := \text{temp1}\]
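A minimal sketch of one simultaneous update step, using the toy function $J(\theta_{0}, \theta_{1}) = \theta_{0}^{2} + \theta_{1}^{2}$ (the function, starting point, and learning rate are assumed for illustration):

```octave
% one simultaneous update step on J(theta0, theta1) = theta0^2 + theta1^2 (assumed)
theta0 = 3; theta1 = 4; alpha = 0.1;    % starting point and learning rate (assumed)
temp0 = theta0 - alpha * (2 * theta0);  % dJ/dtheta0 = 2 * theta0
temp1 = theta1 - alpha * (2 * theta1);  % dJ/dtheta1 = 2 * theta1
theta0 = temp0;   % assign only after both temporaries are computed,
theta1 = temp1;   % so theta1's update does not see the new theta0
```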
gradient descent intuition
\[\theta_{1}:=\theta_{1}-\alpha\frac{d}{d\theta_{1}}J(\theta_{1})\]  
$\alpha$ is the learning rate, the coefficient controlling the length of a step  
$\frac{d}{d\theta_{1}}J(\theta_{1})$ is the derivative (gradient, slope) with respect to $\theta_{1}$
as we approach a local minimum, gradient descent will automatically take smaller steps (the gradient becomes smaller), so there is no need to decrease $\alpha$ over time.
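A toy illustration of this, using $J(\theta) = \theta^{2}$ (the function, starting point, and learning rate are assumed):

```octave
% on J(theta) = theta^2 the steps shrink on their own, because the
% derivative 2*theta shrinks as theta approaches the minimum at 0
theta = 10; alpha = 0.1;       % starting point and learning rate (assumed)
for iter = 1:5
    step = alpha * 2 * theta;  % alpha times dJ/dtheta
    theta = theta - step;
    printf("step %d: size %.4f, theta %.4f\n", iter, step, theta);
end
```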
gradient descent for linear regression
apply gradient descent to minimize the cost function $J(\theta_{0}, \theta_{1})$  
gradient descent  
repeat until convergence {  
\[\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}) \text{ for } j=0 \text{ and } j=1\]  
}  
linear regression  
\[\theta_{0}:=\theta_{0}-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)\]  
\[\theta_{1}:=\theta_{1}-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x^{(i)}\]  
the cost function for linear regression is a convex function: a bowl-shaped function with only one local minimum, which is therefore the global minimum
“Batch” gradient descent
each step of gradient descent uses all m training examples, as in the sketch below
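A minimal Octave sketch of batch gradient descent for univariate linear regression (the toy data, learning rate, and iteration count are assumed for illustration):

```octave
% batch gradient descent for h_theta(x) = theta0 + theta1 * x
x = [1; 2; 3; 4; 5];           % toy inputs (assumed)
y = [3; 5; 7; 9; 11];          % toy targets (assumed): y = 1 + 2x
m = length(y);
X = [ones(m, 1), x];           % prepend a column of ones for theta0
theta = zeros(2, 1);           % [theta0; theta1]
alpha = 0.05;                  % learning rate (assumed)
for iter = 1:2000
    h = X * theta;                                  % uses all m examples
    theta = theta - alpha * (1 / m) * X' * (h - y); % simultaneous update
end
theta                          % approaches [1; 2]
```

The vectorized update `X' * (h - y)` computes both partial derivatives at once over the whole training set, which is exactly what makes this the "batch" variant.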

