# Density Estimation

## Problem Motivation

Given a dataset \(\{x^{(1)},\dots,x^{(m)}\}\) of normal examples.

Is a new example \(x_{\text{test}}\) anomalous?

Build a probability model \(p(x)\) based on the dataset.

Given a test example \(x_{\text{test}}\), flag an anomaly if \(p(x_{\text{test}}) < \varepsilon\).

## Gaussian Distribution

\[x \sim N(\mu,\sigma^2)\]

\[p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}\]

\[\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\]

\[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2\]

## Algorithm

Training dataset \(\{x^{(1)},\dots,x^{(m)}\}\)

Each example is \(x \in \mathbb{R}^n\); model each feature as Gaussian:

\[x_j \sim N(\mu_j,\sigma_j^2)\]

\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]

Algorithm:

- Choose features that might be indicative of anomalous examples
- Fit parameters

\[\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}\]

\[\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2\]

- Given a new example \(x\), compute \(p(x)\). Flag an anomaly if \(p(x) < \varepsilon\).

\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2 \pi}\,\sigma_j} \exp{\left(-\frac{(x_j-\mu_j)^2}{2 \sigma_j^2}\right)}\]
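The algorithm above can be sketched in a few lines of NumPy. The dataset, the threshold \(\varepsilon\), and the function names here are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def fit_gaussian(X):
    """Fit per-feature parameters mu_j, sigma_j^2 (maximum likelihood, 1/m)."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2), one value per row of X."""
    pj = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(pj, axis=1)

# Illustrative data: a tight cluster of normal examples.
X_train = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.05]])
mu, sigma2 = fit_gaussian(X_train)

# One in-distribution test point, one obvious outlier.
p_test = density(np.array([[1.0, 2.0], [5.0, -3.0]]), mu, sigma2)
is_anomaly = p_test < 1e-3          # flag anomaly if p(x) < epsilon
```

The outlier gets a density of essentially zero, so only it is flagged.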

# Building an Anomaly Detection System

## Developing and Evaluating an Anomaly Detection System

Fit model \(p(x)\) on the training set \(\{x^{(1)},\dots,x^{(m)}\}\) (unlabeled, assumed-normal, i.e. negative \(y=0\) examples)

On a cross-validation/test example \(x\), predict \(y = 1\) (anomaly) if \(p(x) < \varepsilon\), else \(y = 0\) (normal)

Possible evaluation metrics:

- True positive, false positive, false negative, true negative
- Precision/Recall
- F1-score

Can use the cross-validation set to choose the parameter \(\varepsilon\)
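Choosing \(\varepsilon\) on the CV set can be done by scanning candidate thresholds and keeping the one with the best F1-score. A minimal sketch; the candidate grid, the toy CV data, and the function name are assumptions for illustration:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Choose epsilon by maximizing F1 on a labeled CV set.
    p_cv: p(x) for each CV example; y_cv: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                      # predict anomaly when p(x) < eps
        tp = np.sum(pred & (y_cv == 1))
        fp = np.sum(pred & (y_cv == 0))
        fn = np.sum(~pred & (y_cv == 1))
        if tp == 0:
            continue                           # F1 undefined without true positives
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

p_cv = np.array([0.9, 0.8, 0.7, 0.01, 0.02])   # densities on CV examples
y_cv = np.array([0, 0, 0, 1, 1])               # true labels
best_eps, best_f1 = select_epsilon(p_cv, y_cv)
```

Here the anomalies are cleanly separated, so some threshold between 0.02 and 0.7 achieves F1 = 1.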

## Anomaly Detection vs. Supervised Learning

Anomaly detection:

- Very small number of positive examples (y=1). (0-20 is common).
- Large number of negative examples (y=0).
- Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.

Supervised learning:

- Large number of positive and negative examples.
- Enough positive examples for algorithm to get a sense of what positive examples are like.

## Choosing what Features to Use

Plot a histogram of the data and check whether each feature looks Gaussian. If not, apply a transformation (e.g. \(\log(x)\), \(\log(x+c)\), \(\sqrt{x}\)) to make it more Gaussian.
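A quick sketch of the effect of such a transformation, using synthetic right-skewed data (the lognormal sample and the skewness check are my illustration, not from the notes):

```python
import numpy as np

def skewness(v):
    """Sample skewness; roughly 0 for Gaussian-looking data."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10000)   # heavily right-skewed
x_log = np.log(x)                                    # much closer to Gaussian
```

The raw feature has large positive skewness; after `log` it is near zero, which is what the histogram check is looking for.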

### Error analysis

Want \(p(x)\) large for normal examples; \(p(x)\) small for anomalous examples.

Most common problem: \(p(x)\) is comparably large for both. Can choose or create additional features (e.g. a ratio such as \(x_4/x_5\)) that take unusual values on anomalies.

# Multivariate Gaussian Distribution

\(x \in \mathbb{R}^n\). Don’t model \(p(x_1)\), \(p(x_2)\), etc. separately. Model \(p(x)\) all in one go.

Parameters: \(\mu \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n \times n}\) (covariance matrix)

\[p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}\]

## Anomaly detection

Parameter fitting:

\[\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}\]

\[\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T\]

Flag an anomaly if \(p(x) < \varepsilon\)

- Automatically captures correlations between features.
- Computationally more expensive.
- Must have \(m > n\), or else \(\Sigma\) is non-invertible. Better to have \(m \ge 10n\)
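The parameter fitting and density formula above translate directly to NumPy. A minimal sketch; the toy data is an assumption for illustration:

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """MLE of mu and Sigma (1/m normalization); needs m > n for invertible Sigma."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / m
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density, one value per row of X."""
    n = mu.size
    inv = np.linalg.inv(Sigma)
    det = np.linalg.det(Sigma)
    D = X - mu
    expo = -0.5 * np.sum((D @ inv) * D, axis=1)
    return np.exp(expo) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))

# Correlated toy data (x2 roughly 2 * x1): Sigma captures the correlation
# automatically, which the per-feature model cannot.
X_data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [0.5, 1.1]])
mu_hat, Sigma_hat = fit_multivariate_gaussian(X_data)   # m = 5 > n = 2

# Sanity check: standard normal density at the origin is 1 / (2*pi).
density_at_origin = p_multivariate(np.zeros((1, 2)), np.zeros(2), np.eye(2))[0]
```

The off-diagonal entry of `Sigma_hat` comes out positive, reflecting the correlation between the two features.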

# Predicting Movie Ratings

## Problem Formulation

\(n_u\): no. of users

\(n_m\): no. of movies

\(r(i,j) = 1\) if user \(j\) has rated movie \(i\)

\(y^{(i,j)}\): rating given by user \(j\) to movie \(i\) (0-5), defined only if \(r(i,j) = 1\)

## Content Based Recommendations

Have parameters \(\theta^{(j)} \in \mathbb{R}^{n+1}\) for each user \(j\), with manually defined movie features \(x^{(i)}\); predict rating \((\theta^{(j)})^T x^{(i)}\)

Minimize the cost function

\[\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2\]

Use gradient descent, conjugate gradient, or L-BFGS

# Collaborative Filtering

Collaborative filtering learns what features \(x^{(i)}\) to use by itself.

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):

\[\min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2\]

Guess \(\theta\), learn \(x\), learn \(\theta\), learn \(x\), …; this alternation eventually converges.

## Algorithm

Minimizing over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:

\[J(x^{(1)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2\]

Here \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\); the bias term \(x_0 = 1\) is not used.

- Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values
- Minimize \(J\) using gradient descent (or an advanced optimization algorithm). E.g. for every \(j = 1,\dots,n_u\), \(i = 1,\dots,n_m\):

\[x_k^{(i)} =: x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)\]

\[\theta_k^{(j)} =: \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)\]
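The two updates above vectorize naturally over all movies and users at once. A minimal NumPy sketch; the toy ratings matrix, learning rate, and \(\lambda\) value are illustrative assumptions:

```python
import numpy as np

def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One simultaneous gradient-descent step on J for the movie features X
    (n_m x n) and user parameters Theta (n_u x n). R[i, j] = 1 iff r(i, j) = 1."""
    E = (X @ Theta.T - Y) * R                  # errors, zeroed where unrated
    X_new = X - alpha * (E @ Theta + lam * X)
    Theta_new = Theta - alpha * (E.T @ X + lam * Theta)
    return X_new, Theta_new

# Illustrative toy data: 5 movies, 4 users, n = 3 latent features, all rated.
rng = np.random.default_rng(1)
Y = rng.integers(0, 6, size=(5, 4)).astype(float)
R = np.ones((5, 4))
X = 0.1 * rng.standard_normal((5, 3))          # small random initialization
Theta = 0.1 * rng.standard_normal((4, 3))

def cost(X, Theta, lam=0.1):
    E = (X @ Theta.T - Y) * R
    return 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))

c0 = cost(X, Theta)
for _ in range(200):
    X, Theta = cofi_gradient_step(X, Theta, Y, R, lam=0.1, alpha=0.01)
c1 = cost(X, Theta)                            # decreases toward a local minimum
```

Note the step computes both gradients from the same error matrix `E` before updating, so the update is genuinely simultaneous.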

# Low Rank Matrix Factorization

## Vectorization

Group all ratings into a matrix \(Y \in \mathbb{R}^{n_m \times n_u}\)

Predicted rating for user \(j\) on movie \(i\): \((\theta^{(j)})^T x^{(i)}\)

Let \(X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}\) and \(\Theta = \begin{bmatrix} (\theta^{(1)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}\)

\[Y=X\Theta^T\]

Similar movies to movie \(i\): those \(j\) with the smallest \(\Vert x^{(i)} - x^{(j)} \Vert\)
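Both the vectorized prediction and the similarity search are one-liners in NumPy. The feature values, user parameters, and the "romance/action" interpretation below are hypothetical:

```python
import numpy as np

# Hypothetical learned movie features (rows: movies) and user parameters.
X = np.array([[0.9, 0.1],       # movie 0: mostly "romance"
              [1.0, 0.0],       # movie 1: pure "romance"
              [0.1, 1.0]])      # movie 2: mostly "action"
Theta = np.array([[5.0, 0.0],   # user 0 likes romance
                  [0.0, 5.0]])  # user 1 likes action

Y_pred = X @ Theta.T            # all predicted ratings at once: Y = X Theta^T

# Movies most similar to movie 0: smallest ||x^(0) - x^(j)||.
dists = np.linalg.norm(X - X[0], axis=1)
most_similar = int(np.argsort(dists)[1])   # index 0 is movie 0 itself
```

As expected, the romance-heavy movie 1 comes out as the nearest neighbor of movie 0.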

## Implementation Detail: Mean Normalization

For a new user who hasn’t rated any movie, regularization drives \(\theta^{(new)}\) to 0, so every predicted rating would be 0.

Calculate the row mean \(\mu_i\) of \(Y\) (over rated entries only), then subtract \(\mu_i\) from the rated entries of \(Y\).

Use this mean-normalized \(Y\) to learn \(\theta^{(j)}\) and \(x^{(i)}\)

For user \(j\) on movie \(i\), predict:

\[(\theta^{(j)})^T(x^{(i)}) +\mu_i\]
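A small sketch of the normalization and the fallback behavior for a brand-new user; the ratings matrix and placeholder features are illustrative assumptions:

```python
import numpy as np

Y = np.array([[5.0, 4.0, 0.0],
              [0.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
R = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])            # which entries are actual ratings

# Per-movie mean over rated entries only.
mu = (Y * R).sum(axis=1) / R.sum(axis=1)
Y_norm = (Y - mu[:, None]) * R       # subtract the mean only where rated

# A brand-new user ends up with theta ~ 0 (regularization), so the
# prediction (theta^(j))^T x^(i) + mu_i falls back to the movie's mean rating.
theta_new = np.zeros(2)
x_placeholder = np.ones((3, 2))      # stand-in for learned movie features
pred_new = x_placeholder @ theta_new + mu
```

So mean normalization makes the sensible default prediction for a new user the average rating of each movie, rather than 0.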