2016-12-26

# Multivariate linear regression

## multiple features

### notation:

• m = number of training examples
• n = number of features
• $x^{(i)}$ = input (features) of the $i^{th}$ training example
• $x_{j}^{(i)}$ = value of feature j in the $i^{th}$ training example

### Hypothesis

hypothesis: $h_{\theta}(x)=\theta^{T}x$
define:

• $x_{0}^{(i)}=1$;
• $\theta=[\theta_{0},\theta_{1},...,\theta_{n}]^{T}$;
• $x=[x_{0},x_{1},...,x_{n}]^{T}$; the training examples are stored row-wise in the design matrix $X$
• for a case with 3 examples and 2 features: $X=\begin{bmatrix}x_{0}^{(1)} & x_{1}^{(1)} & x_{2}^{(1)} \\ x_{0}^{(2)} & x_{1}^{(2)} & x_{2}^{(2)} \\ x_{0}^{(3)} & x_{1}^{(3)} & x_{2}^{(3)}\end{bmatrix}$
• over all examples the hypothesis is then $h_{\theta}(X)=X\theta$
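The vectorized hypothesis can be sketched in NumPy (the dataset below is a made-up example, not from the notes):

```python
import numpy as np

# Hypothetical tiny dataset: 3 examples, 2 features, plus the x0 = 1 column.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0]])   # shape (m, n+1) = (3, 3)
theta = np.array([10.0, 0.5, 20.0])  # shape (n+1,), arbitrary parameters

# Hypothesis for all m examples at once: h = X @ theta, shape (m,)
h = X @ theta
```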

## gradient descent for multiple variables

parameters: $\theta$, an (n+1)-dimensional vector
cost function: $J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$
repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}...\theta_{n})$
} (simultaneously update for every j = 0,…, n)

new algorithm:
repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
} (simultaneously update $\theta_j$ for every j = 0,…, n)
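The update rule above can be implemented in a few lines of NumPy; the toy dataset fitting $y = 1 + 2x$ is made up for illustration:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with an x0 = 1 column; y: (m,) targets.
    The vectorized step updates every theta_j simultaneously.
    """
    m, n1 = X.shape
    theta = np.zeros(n1)
    for _ in range(num_iters):
        # gradient_j = (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
        grad = (X.T @ (X @ theta - y)) / m
        theta = theta - alpha * grad
    return theta

# Toy fit of y = 1 + 2x (hypothetical data)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```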

## gradient descent in practice - feature scaling

### feature scaling

idea: make sure all features are on a similar scale, e.g. by normalizing them
get every feature into approximately a $-1 \le x_{i} \le 1$ range

### mean normalization

replace $x_i$ with $x_i-\mu_i$ so features have approximately zero mean (do not apply to $x_0=1$)
or use $x_i:=\frac{x_i-\mu_i}{s_i}$, where $s_i$ is the range (max − min) or the standard deviation
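A minimal NumPy sketch of mean normalization (the helper name and demo matrix are mine, not from the notes):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly zero mean and unit scale.

    Implements x := (x - mu) / s with s = standard deviation
    (the range max - min also works). Do not apply to the x0 = 1 column.
    Returns mu and s so the same scaling can be applied to new inputs.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s

# Hypothetical feature matrix (no x0 column here)
Xn, mu, s = mean_normalize(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```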

## gradient descent in practice - learning rate

### “debugging”

how to make sure gradient descent is working correctly?
plot $J(\theta)$ vs. the number of iterations
$J(\theta)$ should decrease after every iteration
declare convergence if $J(\theta)$ decreases by less than a small threshold (e.g. $10^{-3}$) in one iteration
if $J(\theta)$ is increasing, use a smaller $\alpha$
if $\alpha$ is too small, convergence will be slow

### learning rate

how to choose the learning rate $\alpha$?
try values roughly 3× apart: …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
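The sweep can be sketched as a short gradient-descent burst per candidate $\alpha$, recording the cost curve each time (function names and toy data are mine):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * ||X theta - y||^2"""
    m = X.shape[0]
    r = X @ theta - y
    return (r @ r) / (2 * m)

def try_learning_rates(X, y,
                       alphas=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0),
                       num_iters=50):
    """Record J(theta) per iteration for each alpha.

    A diverging alpha shows up as an increasing cost curve;
    a too-small alpha shows up as a barely-moving one.
    """
    m, n1 = X.shape
    histories = {}
    for alpha in alphas:
        theta = np.zeros(n1)
        hist = []
        for _ in range(num_iters):
            theta = theta - alpha * (X.T @ (X @ theta - y)) / m
            hist.append(cost(X, y, theta))
        histories[alpha] = hist
    return histories

# Hypothetical toy data: y = x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
histories = try_learning_rates(X, y)
```

Plotting each curve (e.g. with matplotlib) makes the good range visible at a glance.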

## features and polynomial regression

to fit a cubic polynomial, create new features $x_j$ from powers of the original feature:
$h_{\theta}(x)=\theta_{0}+\theta_{1}x_1+\theta_{2}x_2+\theta_{3}x_3 \\ \hspace{1.1cm} = \theta_{0}+\theta_{1}(size)+\theta_{2}(size^2)+\theta_{3}(size^3)$

the $x_j$ can be an arbitrary function of the original feature, like $\sqrt[]{size}$
apply mean normalization and feature scaling carefully, since the new features have very different ranges
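A sketch of building normalized cubic features in NumPy (the helper is hypothetical; scaling is essential because $size$, $size^2$, $size^3$ differ by orders of magnitude):

```python
import numpy as np

def cubic_features(size):
    """Build rows [1, size, size^2, size^3] from a 1-D array of sizes,
    then mean-normalize every column except the x0 = 1 column."""
    size = np.asarray(size, dtype=float)
    X = np.column_stack([np.ones_like(size), size, size**2, size**3])
    mu = X[:, 1:].mean(axis=0)
    s = X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mu) / s
    return X

# Hypothetical sizes
X = cubic_features([1.0, 2.0, 3.0, 4.0])
```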

# computing parameters analytically

## normal equation

method to solve for $\theta$ analytically
$\theta \in \mathbb{R}^{n+1}$; set every partial derivative to zero: $\frac{\partial}{\partial \theta_j}J(\theta)=0$ for each $j$
solve for $\theta_0, \theta_1,...,\theta_n$
feature scaling is not required when using normal equation
$\\$
$X$ is the $m \times (n+1)$ design matrix containing all input vectors; $y$ is the $m \times 1$ vector of targets
$\theta=(X^TX)^{-1}X^Ty$
$(X^TX)^{-1}X^T$ is called the pseudoinverse of $X$
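The closed-form solve is one line in NumPy; the toy data fitting $y = 1 + 2x$ is invented for illustration:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y in closed form.

    np.linalg.pinv (Moore-Penrose pseudoinverse) is used instead of a
    plain inverse so the solve also works when X^T X is singular.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical data: y = 1 + 2x, no feature scaling needed
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```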

### deduction

$J({\theta})=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2 \\ \hspace{0.9 cm} = \frac{1}{2m}\sum_{i=1}^m((x^{(i)})^T\theta-y^{(i)})^2 \\ \hspace{0.9 cm} = \frac{1}{2m} \| X\theta-y \|^2 \\ \hspace{0.9 cm} = \frac{1}{2m} (X\theta-y)^T(X\theta-y) \\ \hspace{0.9 cm} = \frac{1}{2m}\left((X\theta)^T(X\theta)- (X\theta)^Ty - y^TX\theta +y^Ty\right) \\ \hspace{0.9 cm} = \frac{1}{2m}\left((X\theta)^T(X\theta)- 2(X\theta)^Ty+y^Ty\right)$
(the two cross terms merge because $(X\theta)^Ty$ and $y^TX\theta$ are equal scalars)
$\\$
$\frac{dJ(\theta)}{d\theta} = \frac{1}{m} (X^TX\theta-X^Ty) = 0$
$\rightarrow$ $\theta =(X^TX)^{-1}X^Ty$

#### derivative of matrix

$\frac{\partial \beta^TX}{\partial X} = \beta$
$\frac{\partial X^TX}{\partial X} = 2X$
$\frac{\partial X^TAX}{\partial X} = (A+A^T)X$

for more derivative rules, see the matrix calculus article on Wikipedia
trace: $tr(A)=\sum_{i=1}^nA_{ii}$ (see the trace article on Wikipedia)

| gradient descent | normal equation |
| --- | --- |
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no need to iterate |
| works well even when n is large | needs to compute $(X^TX)^{-1}$, an $n \times n$ matrix |
| N/A | slow if n is very large |
| $O(kn^2)$ | $O(n^3)$ |
| use when n > 10000 | use for smaller n |

## normal equation noninvertibility

what if $X^TX$ is non-invertible (singular)?
in Octave, two functions compute the inverse: inv() and pinv(); pinv() computes the pseudoinverse and still works when $X^TX$ is singular

reasons for non-invertibility:

• redundant features (linearly dependent)
• e.g. size in square meters and size in square feet
• too many features (e.g. $m \le n$)
• fix: delete some features, or use regularization
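The redundant-features case can be demonstrated with NumPy's pinv, which plays the role of Octave's pinv() here (the data is a contrived example with one feature an exact multiple of another):

```python
import numpy as np

# Hypothetical redundant features: x2 = 2 * x1 exactly (think the same
# size recorded in two different units), so X^T X is singular.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])   # y = 1 + 2 * x1

# A plain inverse of X^T X would fail; pinv still returns a
# (minimum-norm) least-squares solution.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```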