Multivariate linear regression

multiple features

notation:

n = number of features
$x^{(i)}$ = input (features) of $i^{th}$ training example
$x_{j}^{(i)}$ = value of feature j in $i^{th}$ training example

Hypothesis

hypothesis: $h_{\theta}(X)=\theta^{T}X$
define:

$x_{0}^{(i)}=1$ ;
$\theta=[\theta_{0},\theta_{0},\theta_{1}...\theta_{n}]^{T}$ ;
; (the training examples are stored in X row-wise
- for a case with 3 examples and 2 features: $\left[ \begin{array}{ccc} x_{0}^{(1)} & x_{1}^{(1)} \\ x_{0}^{(2)} & x_{1}^{(2)} \\ x_{0}^{(3)} & x_{1}^{(3)} \end{array} \right]$
- the hypothesis is given as $h_{\theta}(X)=X\theta$

gradient descent for multiple variables

parameters: $\theta$ , a n+1 dimensional vector
cost function: $J({\theta}_{0},{\theta}_{1})=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2$
gradient descent:
repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1}...\theta_{n})$
} (simutaneously update for every j = 0,…, n)

new algorithm:
repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
} (simutaneously update $\theta_j$ for every j = 0,…, n)

gradient descent in practice - feature scaling

make gradient descent work faster

feature scaling

idea: make sure features are on a similar scale like normalize all features
get every feature into approximately a $-1 \le x_{i} \le 1$ range

mean normalization

replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$ )
or use $x_i=\frac{x_i-\mu_i}{s_i}$ ( $s_i$ is the range or std)

gradient descent in practice - learning rate

“debugging”

how to make sure gradient descent is working correctly?
plot $J(\theta)$ vs no. of iterations
$J(\theta)$ should decrease after every iteration
declare convergence if $J(\theta)$ decreases by less than a small value like $10^{-3}$ in one iteration
if $J(\theta)$ is increasing, try to use smaller $\alpha$
if $\alpha$ is too small, convergence will be slow

learning rate

how to choose learning rate $\alpha$
try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1,…

features and polynomial regression

if fit the cubic polynomial, create new features $x_j$ :
$h_{\theta}(x)=\theta_{0}+\theta_{1}x_1+\theta_{2}x_2+\theta_{3}x_3 \\ \hspace{1.1cm} = \theta_{0}+\theta_{1}(size)+\theta_{2}(size^2)+\theta_{3}(size^3)$

the $x_j$ can be arbitrary function, like $\sqrt[]{size}$
do the mean normalization and scaling carefully

computing parameters analytically

normal equation

method to solve for $\theta$ analytically
$\theta \in \mathbb{R}^{n+1}$ $\frac{\partial}{\partial \theta_j}J(\theta)=...=0$
solve for $\theta_0, \theta_1,...,\theta_n$
feature scaling is not required when using normal equation
$\\$
$X$ is a m by n+1 design matrix containing all input vectors; $y$ is the m by 1 vector to predict
$\theta=(X^TX)^{-1}X^Ty$
$(X^TX)^{-1}X^T$ is called Pseudoinverse of $X$

deduction

$J({\theta})=\frac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2 \\ \hspace{0.9 cm} = \frac{1}{2m}\sum_{i=1}^m((x^{(i)})^T\theta-y^{(i)})^2 \\ \hspace{0.9 cm} = \frac{1}{2m} \| X\theta-y \|^2 \\ \hspace{0.9 cm} = \frac{1}{2m} (X\theta-y)^T(X\theta-y) \\ \hspace{0.9 cm} = \frac{1}{2m}（X\theta)^T(X\theta)- （X\theta)^Ty - y^TX\theta +y^Ty \\ \hspace{0.9 cm} = \frac{1}{2m}（X\theta)^T(X\theta)- 2（X\theta)^Ty+y^Ty$
$\\$
$\frac{dJ(\theta)}{d\theta} = \frac{1}{m} (X^TX\theta-X^Ty) = 0$
$\rightarrow$ $\theta =(X^TX)^{-1}X^Ty$

derivative of matrix

$\frac{\partial \beta^TX}{\partial X} = \beta$
$\frac{\partial X^TX}{\partial X} = X$
$\frac{\partial X^TAX}{\partial X} = (A+A^T)X$

derivative:
matrix calculus wiki
trace: $tr(A)=\sum_{i=1}^nA_{ii}$ trace wiki

compare with gradient descent

gradient descent	normal equation
need to choose $\alpha$	no need to choose $\alpha$
need many iterations	don’t need to iterate
works well even when n is large	compute n*n matrix
N/A	slow if n is very large
$O（kn^2)$	$O（n^3)$
n>10000	smaller n

normal equation noninvertibility

what if $X^TX$ is non-invertible? (singular)
in Octave, two function do the inversion: inv() and pinv(); pinv() will solve the problem

reason of noninvertibility:

redundant features(linearly dependent)
- like size in square meter and size in square feet
too many features(e.g. )
- delete some features or use regularization

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 17

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 17

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 17

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1

Stanford ML 2 Multivariate Regression and Normal Equation