# Density Estimation

## Problem Motivation

Given a dataset \(\{x^{(1)},\dots,x^{(m)}\}\) of normal examples.

Is a new example \(x_{\text{test}}\) anomalous?

Build a probability model \(p(x)\) based on the dataset.

Given a test example \(x_{\text{test}}\), flag an anomaly if \(p(x_{\text{test}}) < \varepsilon\).

## Gaussian Distribution

\[x \sim N(\mu,\sigma^2)\]

\[p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}\]

\[\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\]

\[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2\]

## Algorithm

Training dataset \(\{x^{(1)},\dots,x^{(m)}\}\)

Each example is \(x \in \mathbb{R}^n\); model each feature as Gaussian:

\[x_j \sim N(\mu_j,\sigma_j^2)\]

\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]

Algorithm:

- Choose features that might be indicative of anomalous examples
- Fit parameters

\[\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}\]

\[\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2\]

- Given a new example \(x\), compute \(p(x)\). Flag an anomaly if \(p(x) < \varepsilon\).

\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2 \pi}\,\sigma_j} \exp{\left(-\frac{(x_j-\mu_j)^2}{2 \sigma_j^2}\right)}\]
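The algorithm above can be sketched in a few lines of NumPy. The dataset, the threshold \(\varepsilon\), and the function names here are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def fit_gaussian(X):
    """Fit per-feature parameters mu_j, sigma_j^2 (maximum likelihood, 1/m)."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2), one value per row of X."""
    pj = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(pj, axis=1)

# Illustrative data: a tight cluster of normal examples.
X_train = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.05]])
mu, sigma2 = fit_gaussian(X_train)

# One in-distribution test point, one obvious outlier.
p_test = density(np.array([[1.0, 2.0], [5.0, -3.0]]), mu, sigma2)
is_anomaly = p_test < 1e-3          # flag anomaly if p(x) < epsilon
```

The outlier gets a density of essentially zero, so only it is flagged.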

# Building an Anomaly Detection System

## Developing and Evaluating an Anomaly Detection System

Fit model \(p(x)\) on the training set \(\{x^{(1)},\dots,x^{(m)}\}\) (unlabeled, assumed-normal, i.e. negative \(y=0\) examples)

On a cross-validation/test example \(x\), predict \(y = 1\) (anomaly) if \(p(x) < \varepsilon\), else \(y = 0\) (normal)

Possible evaluation metrics:

- True positive, false positive, false negative, true negative
- Precision/Recall
- F1-score

Can use the cross-validation set to choose the parameter \(\varepsilon\)
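Choosing \(\varepsilon\) on the CV set can be done by scanning candidate thresholds and keeping the one with the best F1-score. A minimal sketch; the candidate grid, the toy CV data, and the function name are assumptions for illustration:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Choose epsilon by maximizing F1 on a labeled CV set.
    p_cv: p(x) for each CV example; y_cv: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                      # predict anomaly when p(x) < eps
        tp = np.sum(pred & (y_cv == 1))
        fp = np.sum(pred & (y_cv == 0))
        fn = np.sum(~pred & (y_cv == 1))
        if tp == 0:
            continue                           # F1 undefined without true positives
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

p_cv = np.array([0.9, 0.8, 0.7, 0.01, 0.02])   # densities on CV examples
y_cv = np.array([0, 0, 0, 1, 1])               # true labels
best_eps, best_f1 = select_epsilon(p_cv, y_cv)
```

Here the anomalies are cleanly separated, so some threshold between 0.02 and 0.7 achieves F1 = 1.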

## Anomaly Detection vs. Supervised Learning

Anomaly detection:

- Very small number of positive examples (y=1). (0-20 is common).
- Large number of negative examples (y=0).
- Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.

Supervised learning:

- Large number of positive and negative examples.
- Enough positive examples for algorithm to get a sense of what positive examples are like.

## Choosing what Features to Use

Plot a histogram of the data and check whether each feature looks Gaussian. If not, apply a transformation (e.g. \(\log(x)\), \(\log(x+c)\), \(\sqrt{x}\)) to make it more Gaussian.
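A quick sketch of the effect of such a transformation, using synthetic right-skewed data (the lognormal sample and the skewness check are my illustration, not from the notes):

```python
import numpy as np

def skewness(v):
    """Sample skewness; roughly 0 for Gaussian-looking data."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10000)   # heavily right-skewed
x_log = np.log(x)                                    # much closer to Gaussian
```

The raw feature has large positive skewness; after `log` it is near zero, which is what the histogram check is looking for.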

### Error analysis

Want \(p(x)\) large for normal examples; \(p(x)\) small for anomalous examples.

Most common problem: \(p(x)\) is comparably large for both. Can choose or create additional features (e.g. a ratio such as \(x_4/x_5\)) that take unusual values on anomalies.

# Multivariate Gaussian Distribution

\(x \in \mathbb{R}^n\). Don’t model \(p(x_1)\), \(p(x_2)\), etc. separately. Model \(p(x)\) all in one go.

Parameters: \(\mu \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n \times n}\) (covariance matrix)

\[p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}\]

## Anomaly detection

Parameter fitting:

\[\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}\]

\[\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T\]

Flag an anomaly if \(p(x) < \varepsilon\)

- Automatically captures correlations between features.
- Computationally more expensive.
- Must have \(m > n\), or else \(\Sigma\) is non-invertible. Better to have \(m \ge 10n\)
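The parameter fitting and density formula above translate directly to NumPy. A minimal sketch; the toy data is an assumption for illustration:

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """MLE of mu and Sigma (1/m normalization); needs m > n for invertible Sigma."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / m
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density, one value per row of X."""
    n = mu.size
    inv = np.linalg.inv(Sigma)
    det = np.linalg.det(Sigma)
    D = X - mu
    expo = -0.5 * np.sum((D @ inv) * D, axis=1)
    return np.exp(expo) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))

# Correlated toy data (x2 roughly 2 * x1): Sigma captures the correlation
# automatically, which the per-feature model cannot.
X_data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [0.5, 1.1]])
mu_hat, Sigma_hat = fit_multivariate_gaussian(X_data)   # m = 5 > n = 2

# Sanity check: standard normal density at the origin is 1 / (2*pi).
density_at_origin = p_multivariate(np.zeros((1, 2)), np.zeros(2), np.eye(2))[0]
```

The off-diagonal entry of `Sigma_hat` comes out positive, reflecting the correlation between the two features.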

# Predicting Movie Ratings

## Problem Formulation

\(n_u\): no. of users

\(n_m\): no. of movies

\(r(i,j) = 1\) if user \(j\) has rated movie \(i\)

\(y^{(i,j)}\): rating given by user \(j\) to movie \(i\) (0-5), defined only if \(r(i,j) = 1\)

## Content Based Recommendations

Have parameters \(\theta^{(j)} \in \mathbb{R}^{n+1}\) for each user \(j\), with manually defined movie features \(x^{(i)}\); predict rating \((\theta^{(j)})^T x^{(i)}\)

Minimize the cost function

\[\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2\]

Use gradient descent, conjugate gradient, or L-BFGS

# Collaborative Filtering

Collaborative filtering learns what features \(x^{(i)}\) to use by itself.

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):

\[\min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2\]

Guess \(\theta\), learn \(x\), learn \(\theta\), learn \(x\), …; this alternation eventually converges.

## Algorithm

Minimizing over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:

\[J(x^{(1)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2\]

Here \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\); the bias term \(x_0 = 1\) is not used.

- Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values
- Minimize \(J\) using gradient descent (or an advanced optimization algorithm). E.g. for every \(j = 1,\dots,n_u\), \(i = 1,\dots,n_m\):

\[x_k^{(i)} =: x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)\]

\[\theta_k^{(j)} =: \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)\]
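The two updates above vectorize naturally over all movies and users at once. A minimal NumPy sketch; the toy ratings matrix, learning rate, and \(\lambda\) value are illustrative assumptions:

```python
import numpy as np

def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One simultaneous gradient-descent step on J for the movie features X
    (n_m x n) and user parameters Theta (n_u x n). R[i, j] = 1 iff r(i, j) = 1."""
    E = (X @ Theta.T - Y) * R                  # errors, zeroed where unrated
    X_new = X - alpha * (E @ Theta + lam * X)
    Theta_new = Theta - alpha * (E.T @ X + lam * Theta)
    return X_new, Theta_new

# Illustrative toy data: 5 movies, 4 users, n = 3 latent features, all rated.
rng = np.random.default_rng(1)
Y = rng.integers(0, 6, size=(5, 4)).astype(float)
R = np.ones((5, 4))
X = 0.1 * rng.standard_normal((5, 3))          # small random initialization
Theta = 0.1 * rng.standard_normal((4, 3))

def cost(X, Theta, lam=0.1):
    E = (X @ Theta.T - Y) * R
    return 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))

c0 = cost(X, Theta)
for _ in range(200):
    X, Theta = cofi_gradient_step(X, Theta, Y, R, lam=0.1, alpha=0.01)
c1 = cost(X, Theta)                            # decreases toward a local minimum
```

Note the step computes both gradients from the same error matrix `E` before updating, so the update is genuinely simultaneous.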

# Low Rank Matrix Factorization

## Vectorization

Group all ratings into a matrix \(Y \in \mathbb{R}^{n_m \times n_u}\)

Predicted rating for user \(j\) on movie \(i\): \((\theta^{(j)})^T x^{(i)}\)

Let \(X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}\) and \(\Theta = \begin{bmatrix} (\theta^{(1)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}\)

\[Y=X\Theta^T\]

Similar movies to movie \(i\): those \(j\) with the smallest \(\Vert x^{(i)} - x^{(j)} \Vert\)
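Both the vectorized prediction and the similarity search are one-liners in NumPy. The feature values, user parameters, and the "romance/action" interpretation below are hypothetical:

```python
import numpy as np

# Hypothetical learned movie features (rows: movies) and user parameters.
X = np.array([[0.9, 0.1],       # movie 0: mostly "romance"
              [1.0, 0.0],       # movie 1: pure "romance"
              [0.1, 1.0]])      # movie 2: mostly "action"
Theta = np.array([[5.0, 0.0],   # user 0 likes romance
                  [0.0, 5.0]])  # user 1 likes action

Y_pred = X @ Theta.T            # all predicted ratings at once: Y = X Theta^T

# Movies most similar to movie 0: smallest ||x^(0) - x^(j)||.
dists = np.linalg.norm(X - X[0], axis=1)
most_similar = int(np.argsort(dists)[1])   # index 0 is movie 0 itself
```

As expected, the romance-heavy movie 1 comes out as the nearest neighbor of movie 0.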

## Implementation Detail: Mean Normalization

For a new user who hasn’t rated any movie, regularization drives \(\theta^{(new)}\) to 0, so every predicted rating would be 0.

Calculate the row mean \(\mu_i\) of \(Y\) (over rated entries only), then subtract \(\mu_i\) from the rated entries of \(Y\).

Use this mean-normalized \(Y\) to learn \(\theta^{(j)}\) and \(x^{(i)}\)

For user \(j\) on movie \(i\), predict:

\[(\theta^{(j)})^T(x^{(i)}) +\mu_i\]
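A small sketch of the normalization and the fallback behavior for a brand-new user; the ratings matrix and placeholder features are illustrative assumptions:

```python
import numpy as np

Y = np.array([[5.0, 4.0, 0.0],
              [0.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
R = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])            # which entries are actual ratings

# Per-movie mean over rated entries only.
mu = (Y * R).sum(axis=1) / R.sum(axis=1)
Y_norm = (Y - mu[:, None]) * R       # subtract the mean only where rated

# A brand-new user ends up with theta ~ 0 (regularization), so the
# prediction (theta^(j))^T x^(i) + mu_i falls back to the movie's mean rating.
theta_new = np.zeros(2)
x_placeholder = np.ones((3, 2))      # stand-in for learned movie features
pred_new = x_placeholder @ theta_new + mu
```

So mean normalization makes the sensible default prediction for a new user the average rating of each movie, rather than 0.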