Gradient Descent with Large Datasets
Learning with Large Datasets
Plot \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) vs. training set size to estimate the type of problem (high bias or high variance) before investing effort in collecting more data
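As a quick illustration of how such a plot could be produced (the linear-regression fit via the normal equation and the squared-error cost are assumptions for the sketch, not prescribed by the notes):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curves(X_train, y_train, X_cv, y_cv, sizes):
    """Plot training and cross-validation error vs. training set size (sketch)."""
    j_train, j_cv = [], []
    for m in sizes:
        Xm, ym = X_train[:m], y_train[:m]
        # fit linear regression on the first m examples (normal equation / pseudoinverse)
        theta = np.linalg.pinv(Xm.T @ Xm) @ Xm.T @ ym
        j_train.append(np.mean((Xm @ theta - ym) ** 2) / 2)   # J_train on the m examples used
        j_cv.append(np.mean((X_cv @ theta - y_cv) ** 2) / 2)  # J_cv on the full CV set
    plt.plot(sizes, j_train, label="J_train")
    plt.plot(sizes, j_cv, label="J_cv")
    plt.xlabel("training set size m")
    plt.ylabel("error")
    plt.legend()
    plt.show()
```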
Stochastic Gradient Descent
Use 1 example in each iteration
- Randomly shuffle dataset
- Repeat {
for i := 1, …, m {
\[\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\]
(for j = 0, …, n)
}
}
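A minimal Python sketch of this loop for linear regression (the function name, learning rate, and epoch count are illustrative choices, not from the notes):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=10):
    """Stochastic gradient descent for linear regression (illustrative sketch).

    X is assumed to already include the bias column x_0 = 1.
    """
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):              # randomly shuffle the dataset
            error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]  # update every theta_j using one example
    return theta
```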
Mini-Batch Gradient Descent
Use b examples in each iteration
b: mini-batch size (2-100)
Can be vectorized
Say b = 10:
- Randomly shuffle dataset
- Repeat {
for i := 1, 11, 21, …, m-9 {
\[\theta_j := \theta_j - \alpha \frac{1}{10}\sum_{k=i}^{i+9}(h_{\theta}(x^{(k)})-y^{(k)})x_j^{(k)}\]
(for j = 0, …, n)
}
}
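A sketch of the same idea with mini-batches, where each update is vectorized over the b examples in the batch (names and defaults are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=10):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)               # randomly shuffle the dataset
        X, y = X[order], y[order]
        for i in range(0, m, b):
            Xb, yb = X[i:i + b], y[i:i + b]      # next mini-batch of (up to) b examples
            # vectorized gradient over the mini-batch: (1/b) * Xb^T (Xb theta - yb)
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)
            theta = theta - alpha * grad
    return theta
```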
Stochastic Gradient Descent Convergence
During learning, compute \(cost(\theta,(x^{(i)},y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2\) before updating \(\theta\) using \((x^{(i)},y^{(i)})\).
Every 1000 iterations (say), plot \(cost(\theta,(x^{(i)},y^{(i)}))\) averaged over the last 1000 examples processed by the algorithm.
Can slowly decrease \(\alpha\) over time to improve convergence, like:
\[\alpha = \frac{\mathrm{const1}}{\mathrm{iterationNumber} + \mathrm{const2}}\]
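Putting the monitoring and the decaying learning rate together, a hedged sketch (const1, const2, the window size, and the single pass over the data are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_monitoring(X, y, const1=10.0, const2=1000.0, window=1000):
    """SGD that records cost(theta,(x,y)) before each update (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)
    costs, averages = [], []
    rng = np.random.default_rng(0)
    for t, i in enumerate(rng.permutation(m), start=1):
        error = X[i] @ theta - y[i]
        costs.append(0.5 * error ** 2)           # cost computed BEFORE updating theta
        alpha = const1 / (t + const2)            # slowly decreasing learning rate
        theta = theta - alpha * error * X[i]
        if t % window == 0:                      # every 1000 iterations (say)
            averages.append(np.mean(costs[-window:]))
    plt.plot(averages)
    plt.xlabel("iteration / 1000")
    plt.ylabel("cost averaged over last 1000 examples")
    plt.show()
    return theta
```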
Advanced Topics
Online Learning
Adapt to changing user preferences
Algorithm:
Repeat forever {
Get example \((x, y)\) corresponding to the current user
Update \(\theta\) using \((x, y)\):
\[\theta_j := \theta_j - \alpha (h_{\theta}(x)-y)x_j \mbox{ for } (j=0,1,…,n)\]
Discard this example
}
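A sketch of one such update, shown here for logistic regression (the sigmoid hypothesis and learning rate are assumptions; the update has the same form as the formula above, and the example is used once and then discarded):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """One online-learning step on a single freshly observed example (sketch).

    x: feature vector for the current user (with x_0 = 1), y: observed outcome (0/1).
    """
    error = sigmoid(x @ theta) - y      # h_theta(x) - y
    return theta - alpha * error * x    # theta_j := theta_j - alpha*(h_theta(x)-y)*x_j
```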
Map Reduce and Data Parallelism
For each iteration, split the training set into several portions; have different machines compute the partial sums of the gradient terms over their portions in parallel, then combine the partial results on a central server to update the parameters.
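A local sketch of the idea, simulating the "machines" with worker processes: each worker computes the partial sum over its portion, and the combined result drives one batch-gradient-descent step (the function names, worker count, and use of a process pool are illustrative assumptions):

```python
import numpy as np
from multiprocessing import Pool

def partial_sum(args):
    """Runs on one machine/core: sum of (h_theta(x)-y)*x over its portion of the data."""
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def distributed_gradient_step(X, y, theta, alpha, workers=4):
    """One batch-gradient-descent step with the summation split across workers (sketch)."""
    chunks = list(zip(np.array_split(X, workers), np.array_split(y, workers)))
    with Pool(workers) as pool:
        sums = pool.map(partial_sum, [(Xc, yc, theta) for Xc, yc in chunks])
    grad = sum(sums) / len(y)            # "central server" combines the partial results
    return theta - alpha * grad
```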