
Stanford ML 2 Multivariate Regression and Normal Equation


Multivariate linear regression

multiple features


  • n = number of features
  • \(x^{(i)}\) = input (features) of the i-th training example
  • \(x_j^{(i)}\) = value of feature j in the i-th training example



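  • \(x_0^{(i)} = 1\) for every example, so each input is an n+1 dimensional vector;
  • \(h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x\);
  • \(X\) is the m by n+1 matrix of inputs, where m is the number of training examples (the training examples are stored in X row-wise)
    • for a case with 3 examples and 2 features: \(X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} \\ x_0^{(3)} & x_1^{(3)} & x_2^{(3)} \end{bmatrix}\), \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}\)
    • the hypothesis is given as \(h_\theta(X) = X\theta\)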
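As a small illustration of the row-wise layout, the sketch below computes \(h_\theta(X) = X\theta\) for 3 examples and 2 features. The data values and the helper name `predict` are made up for the example; plain Python lists stand in for matrices.

```python
def predict(X, theta):
    """Return h_theta(X) = X * theta, one prediction per row of X."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]

# Hypothetical data: each row is [x_0, x_1, x_2] with x_0 = 1.
X = [
    [1.0, 2.0, 3.0],
    [1.0, 4.0, 5.0],
    [1.0, 6.0, 7.0],
]
theta = [0.5, 1.0, 2.0]

predictions = predict(X, theta)
print(predictions)  # [8.5, 14.5, 20.5]
```

Adding the constant column \(x_0 = 1\) is what lets the intercept \(\theta_0\) be handled by the same dot product as the other parameters.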

gradient descent for multiple variables

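parameters: \(\theta\), an n+1 dimensional vector
cost function: \(J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\)
gradient descent:
repeat until convergence {
\(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)
} (simultaneously update for every j = 0,…, n)

new algorithm (partial derivative written out):
repeat until convergence {
\(\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\)
} (simultaneously update for every j = 0,…, n)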
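The update rule can be sketched in plain Python as follows. This is a minimal batch gradient descent, assuming the squared-error cost \(J(\theta) = \frac{1}{2m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2\); the toy data, the fixed iteration count (instead of a convergence test), and the function names are assumptions for illustration.

```python
def predict(X, theta):
    """h_theta(X) = X * theta for a row-wise design matrix X."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]

def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression, starting from theta = 0."""
    m, size = len(X), len(X[0])
    theta = [0.0] * size
    for _ in range(iterations):
        errors = [h - t for h, t in zip(predict(X, theta), y)]
        # Compute every partial derivative before touching theta,
        # so that all theta_j are updated simultaneously.
        gradient = [
            sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(size)
        ]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (first column is x_0 = 1).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]

theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
print([round(t, 4) for t in theta])
```

On this noise-free data the result should approach the generating parameters 1, 2 and 3.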

gradient descent in practice - feature scaling

make gradient descent work faster

feature scaling

idea: make sure features are on a similar scale, e.g. by normalizing all features
get every feature into approximately a \(-1 \le x_j \le 1\) range

mean normalization

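replace \(x_j\) with \(x_j - \mu_j\) to make features have approximately zero mean (do not apply to \(x_0 = 1\))
or use \(x_j := \frac{x_j - \mu_j}{s_j}\) (\(s_j\) is the range, i.e. max minus min, or the standard deviation)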
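A minimal sketch of mean normalization combined with scaling by the standard deviation, using only the Python standard library. The feature values (house size and number of bedrooms) and the function name `scale_features` are hypothetical; the constant column \(x_0\) is assumed to be added after scaling, and a feature with zero standard deviation would need special handling that is not shown.

```python
from statistics import mean, pstdev

def scale_features(X):
    """Return (X_scaled, mus, sigmas); X holds the features without x_0."""
    columns = list(zip(*X))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # population standard deviation
    X_scaled = [
        [(value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas)]
        for row in X
    ]
    return X_scaled, mus, sigmas

# Hypothetical raw features: [size, number of bedrooms].
X_raw = [[2104, 3], [1416, 2], [1534, 3], [852, 2]]
X_scaled, mus, sigmas = scale_features(X_raw)

# Prepend x_0 = 1 only after scaling, so the bias column stays untouched.
X_design = [[1.0] + row for row in X_scaled]
print(mus, sigmas)
```

The returned `mus` and `sigmas` have to be kept, because a new input must be scaled with the same values before a prediction is made.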

gradient descent in practice - learning rate


how to make sure gradient descent is working correctly?
plot \(J(\theta)\) vs no. of iterations
\(J(\theta)\) should decrease after every iteration
declare convergence if \(J(\theta)\) decreases by less than a small value like \(10^{-3}\) in one iteration
if \(J(\theta)\) is increasing, try to use a smaller \(\alpha\)
if \(\alpha\) is too small, convergence will be slow

learning rate

how to choose the learning rate \(\alpha\)
try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1,… (each candidate is roughly 3 times the previous one)
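The following sketch records \(J(\theta)\) after every iteration for each candidate learning rate, which is the numerical equivalent of plotting the cost against the number of iterations. The toy data, the 50-iteration budget, and the helper names are assumptions for illustration only.

```python
def predict(X, theta):
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]

def cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum of squared errors."""
    m = len(y)
    return sum((h - t) ** 2 for h, t in zip(predict(X, theta), y)) / (2 * m)

def cost_history(X, y, alpha, iterations):
    """Run gradient descent from theta = 0 and return J before and after each step."""
    m, size = len(X), len(X[0])
    theta = [0.0] * size
    history = [cost(X, y, theta)]
    for _ in range(iterations):
        errors = [h - t for h, t in zip(predict(X, theta), y)]
        gradient = [
            sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(size)
        ]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
        history.append(cost(X, y, theta))
    return history

# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (first column is x_0 = 1).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]

report = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    history = cost_history(X, y, alpha, iterations=50)
    decreasing = all(b <= a for a, b in zip(history, history[1:]))
    report[alpha] = (decreasing, history[-1])
    print(f"alpha={alpha:<6} J always decreasing: {decreasing}  final J: {history[-1]:.4g}")
```

On this data the small rates reduce the cost slowly, while the largest candidate makes the cost grow, which is the signal to fall back to a smaller \(\alpha\).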

features and polynomial regression

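to fit a cubic polynomial \(h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3\), create new features \(x_1 = x\), \(x_2 = x^2\), \(x_3 = x^3\)

the new features can be arbitrary functions of the original one, like \(\sqrt{x}\)
apply mean normalization and scaling carefully, because \(x\), \(x^2\) and \(x^3\) have very different ranges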
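A short sketch of building polynomial features from a single input, which also shows why scaling matters here. The function name `polynomial_features` and the sample values are hypothetical.

```python
def polynomial_features(x, degree=3):
    """Map each scalar value v to the feature row [v, v^2, ..., v^degree]."""
    return [[v ** p for p in range(1, degree + 1)] for v in x]

sizes = [10.0, 100.0, 1000.0]  # hypothetical raw feature values
features = polynomial_features(sizes)

# The largest value in each column shows how far apart the scales are.
column_max = [max(col) for col in zip(*features)]
print(column_max)  # [1000.0, 1000000.0, 1000000000.0]
```

With columns spanning six orders of magnitude, gradient descent on the unscaled features would need a very small learning rate, so the new features are usually normalized before training.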

computing parameters analytically

normal equation

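method to solve for \(\theta\) analytically

solve \(\frac{\partial}{\partial \theta_j} J(\theta) = 0\) for every j, which gives \(\theta = (X^TX)^{-1}X^Ty\)
feature scaling is not required when using the normal equation

\(X\) is an m by n+1 design matrix containing all input vectors; \(y\) is the m by 1 vector of values to predict

\((X^TX)^{-1}X^T\) is called the pseudoinverse of \(X\) (when \(X^TX\) is invertible)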
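A minimal sketch of the normal equation \(\theta = (X^TX)^{-1}X^Ty\) in plain Python. Instead of forming the inverse explicitly, it solves the linear system \(X^TX\theta = X^Ty\) by Gaussian elimination with partial pivoting, which gives the same \(\theta\) when \(X^TX\) is invertible. The helper names, the pivot tolerance, and the toy data are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    columns = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    Xty = [sum(a * t for a, t in zip(ci, y)) for ci in columns]
    return solve(XtX, Xty)

# Toy data generated from y = 1 + 2*x_1 + 3*x_2 (first column is x_0 = 1).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]

theta = normal_equation(X, y)
print([round(t, 6) for t in theta])
```

No learning rate and no iteration count appear anywhere; the cost is the elimination step, which grows cubically with the number of features.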


derivative of matrix

references: the Wikipedia articles on matrix calculus and on the trace of a matrix
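As a sketch of where the normal equation comes from, write the cost in matrix form and set its gradient to zero. The derivation uses the identities \(\nabla_\theta(\theta^T A \theta) = 2A\theta\) for symmetric \(A\) and \(\nabla_\theta(\theta^T b) = b\), and the last step assumes \(X^TX\) is invertible.

\[
\begin{aligned}
J(\theta) &= \frac{1}{2m}(X\theta - y)^T(X\theta - y) \\
&= \frac{1}{2m}\left(\theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y\right) \\
\nabla_\theta J(\theta) &= \frac{1}{m}\left(X^T X \theta - X^T y\right) = 0 \\
X^T X \theta &= X^T y \\
\theta &= (X^T X)^{-1} X^T y
\end{aligned}
\]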

compare with gradient descent

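| gradient descent | normal equation |
| --- | --- |
| need to choose \(\alpha\) | no need to choose \(\alpha\) |
| need many iterations | don’t need to iterate |
| works well even when n is large | need to compute \((X^TX)^{-1}\), an n by n matrix |
| N/A | slow if n is very large |
| use when n > 10000 | use for smaller n |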

normal equation noninvertibility

what if \(X^TX\) is non-invertible? (singular)
in Octave, two functions do the inversion: inv() and pinv(); pinv() computes the pseudoinverse and still returns a usable \(\theta\) when \(X^TX\) is singular

causes of noninvertibility:

  • redundant features (linearly dependent)
    • e.g. size in square meters and size in square feet
  • too many features (e.g. \(m \le n\))
    • delete some features or use regularization
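The redundant-feature case can be checked numerically. The sketch below builds a design matrix with the same size expressed in square meters and in square feet, and shows that the determinant of \(X^TX\) is exactly zero, so the inverse does not exist. Exact `Fraction` arithmetic is used to avoid floating-point noise; the conversion factor is an approximation and the sample sizes and helper names are made up.

```python
from fractions import Fraction

def gram(X):
    """Return X^T X for a row-wise design matrix X."""
    columns = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det2(A):
    (a, b), (c, d) = A
    return a * d - b * c

def det3(A):
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

SQFT_PER_SQM = Fraction("10.7639")  # approximate conversion factor
sizes_m2 = [50, 80, 120, 200]       # hypothetical house sizes

# Columns: x_0 = 1, size in square meters, size in square feet.
X_redundant = [[1, s, s * SQFT_PER_SQM] for s in sizes_m2]
# Same data with the redundant column deleted.
X_reduced = [[1, s] for s in sizes_m2]

det_redundant = det3(gram(X_redundant))
det_reduced = det2(gram(X_reduced))
print(det_redundant, det_reduced)  # 0 50700
```

Deleting one of the two linearly dependent columns makes the determinant nonzero again, which matches the advice to remove redundant features.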

Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.