Density Estimation
Problem Motivation
Given a dataset \(\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}\) of normal examples.
Is a new example \(x_{\text{test}}\) anomalous?
Build a probability model \(p(x)\) from the dataset.
Given a test example \(x_{\text{test}}\), flag an anomaly if \(p(x_{\text{test}}) < \epsilon\).
Gaussian Distribution
\[x \sim N(\mu,\sigma^2)\]
\[p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}\]
\[\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\]
\[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2\]
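As a quick sketch, the density and the parameter fits translate directly to NumPy (function names here are illustrative):

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood mu and sigma^2 for a 1-D array of samples."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # 1/m, matching the formula above
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) as defined above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```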
Algorithm
Training set \(\{x^{(1)}, \dots, x^{(m)}\}\), with each example \(x \in \mathbb{R}^n\).
Model each feature as Gaussian:
\[x_j \sim N(\mu_j,\sigma_j^2)\]
\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]
Algorithm:
- Choose features that might be indicative of anomalous examples
- Fit parameters
\[\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}\]
\[\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2\]
- Given a new example \(x\), compute
\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]
Flag an anomaly if \(p(x) < \epsilon\).
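A minimal end-to-end sketch of the algorithm, assuming rows of `X_train` are normal examples and the `epsilon` value stands in for a threshold tuned on a cross-validation set:

```python
import numpy as np

def fit_params(X):
    """Per-feature mu_j and sigma_j^2 (features are columns of X)."""
    return X.mean(axis=0), X.var(axis=0)   # np.var divides by m by default

def p(X, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2) for each row of X."""
    d = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))            # stand-in for normal data
mu, sigma2 = fit_params(X_train)
epsilon = 1e-3                                  # would be tuned on a CV set
print(p(np.array([[0.1, 0.0], [5.0, -4.0]]), mu, sigma2) < epsilon)
# -> [False  True]: the far-away point is flagged as an anomaly
```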
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
Fit the model \(p(x)\) on the training set (use only negative, i.e. normal, examples).
On a cross-validation/test example \(x\), predict \(y = 1\) (anomaly) if \(p(x) < \epsilon\), and \(y = 0\) otherwise.
Possible evaluation metrics:
- True positive, false positive, false negative, true negative
- Precision/Recall
- F1-score
Can use the cross-validation set to choose the parameter \(\epsilon\).
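A sketch of that selection, assuming `p_cv` holds \(p(x)\) for each cross-validation example and `y_cv` its labels (1 = anomaly):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)    # predict anomaly when p(x) < eps
        tp = np.sum((preds == 1) & (y_cv == 1))
        fp = np.sum((preds == 1) & (y_cv == 0))
        fn = np.sum((preds == 0) & (y_cv == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```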
Anomaly Detection vs. Supervised Learning
Anomaly detection:
- Very small number of positive examples (y=1). (0-20 is common).
- Large number of negative examples (y=0).
- Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.
Supervised learning:
- Large number of positive and negative examples.
- Enough positive examples for the algorithm to get a sense of what positive examples are like.
Choosing what Features to Use
Plot a histogram of the data and check whether it looks Gaussian. If not, apply a transformation (e.g. \(\log(x + c)\) or \(x^{1/2}\)) to make it more Gaussian.
Error analysis
Want \(p(x)\) large for normal examples and small for anomalous examples.
Most common problem: \(p(x)\) is comparably large for both. Choose or create new features (e.g. a ratio such as \(x_4/x_5\)) to separate the anomalies; a sketch follows.
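A sketch of both steps, with an assumed skewed toy feature: eyeball histograms of candidate transforms, and derive a new ratio feature for error analysis:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x4 = rng.gamma(shape=1.5, scale=2.0, size=1000)   # skewed toy feature
x5 = rng.gamma(shape=2.0, scale=1.0, size=1000)

# Check Gaussianity: plot candidate transforms side by side.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, (label, xt) in zip(axes, [("x4", x4),
                                  ("log(x4 + 1)", np.log1p(x4)),
                                  ("x4^0.5", x4 ** 0.5)]):
    ax.hist(xt, bins=50)
    ax.set_title(label)
plt.show()

x_new = x4 / x5   # candidate ratio feature, as suggested above
```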
Multivariate Gaussian Distribution
\(x \in \mathbb{R}^n\). Don't model \(p(x_1), p(x_2), \dots\) separately; model \(p(x)\) in one go.
Parameters: \(\mu \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n \times n}\) (covariance matrix)
\[p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}\]
Anomaly detection with the multivariate Gaussian
Parameter fitting:
\[\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}\]
\[\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T\]
Flag an anomaly if \(p(x) < \epsilon\); see the sketch after the list below.
- Automatically captures correlations between features.
- Computationally more expensive.
- Must have \(m > n\), or else \(\Sigma\) is non-invertible. Better to have \(m \geq 10n\).
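A sketch using SciPy, assuming rows of `X` are training examples (the toy data and epsilon value are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_multivariate(X):
    """mu and Sigma via the 1/m maximum-likelihood formulas above."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]   # np.cov would use 1/(m-1)
    return mu, Sigma

X = np.random.default_rng(0).normal(size=(500, 3))   # toy data with m >> n
mu, Sigma = fit_multivariate(X)
p = multivariate_normal(mean=mu, cov=Sigma).pdf(X)   # density of each row
anomalies = p < 1e-4                                 # epsilon tuned on a CV set
```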
Predicting Movie Ratings
Problem Formulation
\(n_u\): no. of users
\(n_m\): no. of movies
\(r(i,j) = 1\) if user \(j\) has rated movie \(i\)
\(y^{(i,j)}\): rating user \(j\) gave movie \(i\) (0–5), defined only if \(r(i,j) = 1\)
Content Based Recommendations
Have parameters \(\theta^{(j)}\) for each user \(j\), with predefined features \(x^{(i)}\) for each movie; predict rating \((\theta^{(j)})^T x^{(i)}\).
Minimize the cost function
\[\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2\]
Use gradient descent, conjugate gradient, or L-BFGS.
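Since each user's fit is an independent regularized linear regression, here is a closed-form sketch via the normal equations instead of the iterative solvers named above (all names are illustrative):

```python
import numpy as np

def fit_user_theta(X, y, rated, lam=1.0):
    """Ridge fit of one user's theta given fixed movie features.
    X: (n_m, n) movie features, y: (n_m,) ratings, rated: (n_m,) bool mask."""
    Xr, yr = X[rated], y[rated]
    n = X.shape[1]
    # Regularized normal equations: (Xr^T Xr + lam I) theta = Xr^T yr
    # (for simplicity this regularizes every component of theta)
    return np.linalg.solve(Xr.T @ Xr + lam * np.eye(n), Xr.T @ yr)
```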
Collaborative Filtering
Collaborative filtering learns which features to use by itself.
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):
\[\min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2\]
Guess \(\theta\) initially, then iterate \(\theta \to x \to \theta \to x \to \dots\); this converges.
Algorithm
Minimize over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:
\[J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2\]
Here \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\); the bias term \(x_0 = 1\) is not used.
- Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values
- Minimize \(J\) using gradient descent (or an advanced optimization algorithm). E.g. for every \(j = 1,\dots,n_u\) and \(i = 1,\dots,n_m\):
\[x_k^{(i)} := x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)\]
\[\theta_k^{(j)} := \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)\]
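A vectorized sketch of these updates, assuming `Y` (\(n_m \times n_u\)) holds the ratings and `R` the \(r(i,j)\) indicators; the masked matrix products implement the sums over \(r(i,j)=1\):

```python
import numpy as np

def gradient_step(X, Theta, Y, R, lam, alpha):
    """One simultaneous update of all x^(i) and theta^(j).
    X: (n_m, n), Theta: (n_u, n), Y: (n_m, n_u), R: 0/1 mask like Y."""
    E = (X @ Theta.T - Y) * R            # errors, zeroed where r(i,j) = 0
    X_grad = E @ Theta + lam * X         # dJ / dx^(i)
    Theta_grad = E.T @ X + lam * Theta   # dJ / dtheta^(j)
    return X - alpha * X_grad, Theta - alpha * Theta_grad

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 3
X = rng.normal(scale=0.1, size=(n_m, n))        # small random init
Theta = rng.normal(scale=0.1, size=(n_u, n))
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)   # toy 0-5 ratings
R = rng.integers(0, 2, size=(n_m, n_u))                 # toy rated mask
for _ in range(200):
    X, Theta = gradient_step(X, Theta, Y, R, lam=1.0, alpha=0.01)
```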
Low Rank Matrix Factorization
Vectorization
Group all ratings into a matrix \(Y\) of size \(n_m \times n_u\).
The predicted rating for user \(j\) on movie \(i\) is \((\theta^{(j)})^T x^{(i)}\).
Let \(X\) stack the movie feature vectors \((x^{(i)})^T\) as rows and \(\Theta\) stack the user parameter vectors \((\theta^{(j)})^T\) as rows; then all predictions at once are
\[Y=X\Theta^T\]
Movies similar to movie \(i\) are those \(j\) with the smallest \(\Vert x^{(i)} - x^{(j)} \Vert\).
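A sketch of both operations, assuming `X` and `Theta` are the learned matrices described above:

```python
import numpy as np

def predict_all(X, Theta):
    """All predicted ratings at once: Y_hat = X @ Theta.T, shape (n_m, n_u)."""
    return X @ Theta.T

def most_similar(X, i, k=5):
    """Indices of the k movies whose learned features are closest to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    return np.argsort(d)[1:k + 1]          # skip movie i itself (distance 0)
```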
Implementation Detail: Mean Normalization
For a new user who hasn't rated any movies:
Compute each movie's mean rating \(\mu_i\) over the users who rated it, and subtract it from the corresponding row of \(Y\).
Use this mean-normalized \(Y\) to learn \(\theta^{(j)}\) and \(x^{(i)}\).
For user \(j\) on movie \(i\), predict:
\[(\theta^{(j)})^T(x^{(i)}) +\mu_i\]
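A sketch of the normalization, assuming `Y` and `R` as before; a brand-new user's parameters stay near zero under regularization, so their prediction falls back to the movie mean \(\mu_i\):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)   # avoid 0/0
    Y_norm = (Y - mu[:, None]) * R        # keep unrated entries at 0
    return Y_norm, mu

# Learn X, Theta on Y_norm, then predict for user j on movie i as
#   X[i] @ Theta[j] + mu[i]
```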