2017-02-07

# Gradient Descent with Large Datasets

## Learning with Large Datasets

Before investing effort in a very large dataset, plot $J_{train}$ and $J_{CV}$ against training set size (a learning curve) to diagnose whether the model suffers from high bias or high variance; extra data helps mainly in the high-variance case.
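A minimal sketch of computing such a learning curve, assuming linear regression fit by least squares at each training-set size (the function name and the held-out set `Xcv`, `ycv` are illustrative, not from the notes):

```python
import numpy as np

def learning_curve(X, y, Xcv, ycv, sizes):
    """J_train and J_CV as a function of training-set size.

    X, Xcv: design matrices whose first column is ones (intercept term);
    at each size m, fits linear regression on the first m examples."""
    j_train, j_cv = [], []
    for m in sizes:
        theta, *_ = np.linalg.lstsq(X[:m], y[:m], rcond=None)
        j_train.append(np.mean((X[:m] @ theta - y[:m]) ** 2) / 2)
        j_cv.append(np.mean((Xcv @ theta - ycv) ** 2) / 2)
    return j_train, j_cv
```

A large gap between a low $J_{train}$ and a high $J_{CV}$ suggests high variance; both high and close together suggests high bias.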

## Stochastic Gradient Descent

Use 1 example in each iteration:

1. Randomly shuffle dataset
2. Repeat {
for $i = 1,2,...,m$ {
$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}$
(for $j=0,1,...,n$)
}
}
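The two steps above can be sketched for linear regression (a minimal sketch; the function name is illustrative, and `X` is assumed to carry the intercept column $x_0 = 1$):

```python
import numpy as np

def sgd(X, y, alpha=0.1, epochs=50, seed=0):
    """Stochastic gradient descent for linear regression:
    shuffle, then update theta from one example at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # 1. randomly shuffle dataset
            err = X[i] @ theta - y[i]       # h_theta(x_i) - y_i
            theta -= alpha * err * X[i]     # 2. update all j = 0..n at once
    return theta
```

Each update uses a single example, so the cost is not guaranteed to decrease on every step; it only trends downward on average.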

## Mini-Batch Gradient Descent

Use $b$ examples in each iteration, where $b$ is the mini-batch size (typically 2-100). Unlike the single-example update, the batch update can be vectorized.

Say $b=10, m = 1000$

1. Randomly shuffle dataset
2. Repeat {
for $i = 1,11,21,...,991$ {
$\theta_j := \theta_j - \alpha \frac{1}{10}\sum_{k=i}^{i+9}(h_{\theta}(x^{(k)})-y^{(k)})x_j^{(k)}$
(for $j=0,1,...,n$)
}
}
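The same loop with $b$-example batches, vectorized with NumPy (a sketch under the same linear-regression assumptions as before; function name illustrative):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.2, b=10, epochs=50, seed=0):
    """Mini-batch gradient descent: average the gradient over
    b examples per update; the inner step is fully vectorized."""
    rng = np.random.default_rng(seed)
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(m)            # 1. randomly shuffle dataset
        for start in range(0, m, b):          # 2. step through mini-batches
            idx = order[start:start + b]
            err = X[idx] @ theta - y[idx]     # residuals for the whole batch
            theta -= alpha / len(idx) * (X[idx].T @ err)
    return theta
```

With a good vectorized implementation this can outperform both batch and stochastic gradient descent, since each update amortizes over $b$ examples.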

## Stochastic Gradient Descent Convergence

During learning, compute $cost(\theta,(x^{(i)},y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2$ before updating $\theta$ using $(x^{(i)},y^{(i)})$.
Every 1000 iterations (say), plot this cost averaged over the last 1000 examples processed by the algorithm.
To improve convergence, $\alpha$ can be slowly decreased over time, e.g.:
$\alpha = \frac{const1}{iterationNumber + const2}$
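A sketch combining both ideas: the per-example cost is recorded just before each update and averaged over a sliding window, while $\alpha$ decays with the iteration number (`const1`, `const2`, and the function name are illustrative tuning choices, not values from the notes):

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=1.0, const2=10.0,
                        window=1000, epochs=2, seed=0):
    """SGD that logs cost(theta, (x_i, y_i)) before each update,
    averages it over every `window` examples, and decays alpha."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    avg_costs, recent, t = [], [], 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            err = X[i] @ theta - y[i]
            recent.append(0.5 * err ** 2)     # cost BEFORE the update
            t += 1
            alpha = const1 / (t + const2)     # slowly decreasing alpha
            theta -= alpha * err * X[i]
            if len(recent) == window:
                avg_costs.append(np.mean(recent))
                recent = []
    return theta, avg_costs
```

If the plotted averages trend downward, the algorithm is converging; if they oscillate or rise, a smaller $\alpha$ (or larger averaging window) is usually needed.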

## Online Learning

### Algorithm:

Repeat forever {
Get $(x,y)$ corresponding to a user
Update $\theta$ using $(x,y)$:
$\theta_j := \theta_j - \alpha (h_{\theta}(x)-y)x_j \mbox{ for } j=0,1,...,n$
}

Each example is used once and then discarded, so the parameters can adapt to changing user preferences over time.
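A single online step applied to a stream of examples (a sketch; the function name is illustrative, and `x` is assumed to include the intercept term $x_0 = 1$):

```python
import numpy as np

def online_step(theta, x, y, alpha=0.1):
    """One online-learning update from a freshly observed (x, y);
    the example is discarded after this update."""
    return theta - alpha * (x @ theta - y) * x
```

There is no fixed training set here: each arriving example updates $\theta$ once, which suits settings with a continuous stream of users.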