2017-02-05

# Density Estimation

## Problem Motivation

Given dataset $\{x^{(1)},x^{(2)},...,x^{(m)}\}$
Is $x_{test}$ anomalous?

Build a probability model $p(x)$ based on dataset.
Given a test data, if $p(x_{test}) \lt \epsilon$ $\rightarrow$ flag anomaly

## Gaussian Distribution

$x \sim N(\mu,\sigma^2)$

$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}$

$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$

$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2$

## Algorithm

Training dataset $\{x^{(1)},x^{(2)},...,x^{(m)}\}$
Each example is $x \in \mathbb{R}^n$

$x_i \sim N(\mu_i,\sigma_i^2)$

$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$

Algorithm:

1. Choose features $x_i$ that might be indicative of anomalous examples
2. Fit parameters $\mu_1,...,\mu_n,\sigma_1^2,...,\sigma_n^2$
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2$
3. Given new example x, compute $p(x)$. Anomaly if $p(x) \lt \epsilon$
$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$

# Building an Anomaly Detection System

## Developing and Evaluating an Anomaly Detection System

Fit model on training set(use negative examples)
On a cross validation/test example, do the prediction

Possible evaluation metrics:

• True positive, false positive
• Precision/Recall
• F1-score

Can use cross validation set to choose parameter $\epsilon$

## Anomaly Detection vs. Supervised Learning

Anomaly detection:

• Very small number of positive examples (y=1). (0-20 is common).
• Large number of negative examples (y=0).
• Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.

Supervised learning:

• large number of positive and negative examples.
• Enough positive examples for algorithm to get a sense of what positive examples are like.

## Choosing what Features to Use

Plot the histogram of data, check if Gaussian. If not, perform some transformation to be Gaussian.

### Error analysis

Want $p(x)$ large for normal examples; small for anomalous examples.

Most common problem: both are large. Can choose or create (e.g. x4/x5) more features to solve.

# Multivariate Gaussian Distribution

$x \in \mathbb{R}^n$. Don’t model $p(x_1),p(x_2),...$ etc. separately. Model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix)

$p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}$

## Anomaly detection

Parameter fitting:

$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$

$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T$

Flag an anomaly if $p(x) \lt \epsilon$

1. Automatically captures correlations between features.
2. Computationally more expensive.
3. Must have $m \gt n$, or else $\Sigma$ is non-invertible. Better have $m \ge 10n$

# Predicting Movie Ratings

## Problem Formulation

$n_u$: no. users
$n_m$: no. movies
$r(i,j)=1$: if user j has rated movie i
$y^{(i,j)}$: rating (0-5)

## Content Based Recommendations

Have parameters for each users for defined features
Minimize cost function

# Collaborative Filtering

learn what features to use by itself.

Given $\theta^{(1)},...,\theta^{(n_u)}$, to learn $x^{(1)},x^{(2)},...,x^{(n_m)}$:
$min_{x^{(i)},…,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2$

Guess $\theta \rightarrow x \rightarrow \theta \rightarrow x...$, finally converges.

## Algorithm

Minimizing $\theta^{(1)},...,\theta^{(n_u)}$ and $x^{(1)},x^{(2)},...,x^{(n_m)}$ simultaneously:

$J(x^{(1)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2$

$x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$, bias term 0 is not used.

1. Initialize $x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$ to small random values
2. Minimize $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j = 1,...,n_u,i=1,...,n_m$:
$x_k^{(i)} =: x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)$
$\theta_k^{(j)} =: \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)$

# Low Rank Matrix Factorization

## Vectorization

Grouping all data $y(i,j)$ into a matrix $Y$

Predict $(\theta^{(j)})^T x^{(i)}$ for $(i,j)$

Let $X = [(x^{(1)})^T;...;(x^{(n_m)})^T]$ and $\Theta = [(\theta^{(1)})^T;...;(\theta^{(n_u)})^T]$

$Y=X\Theta^T$

Similar movies given by smallest $\vert \vert x^{(i)} - x^{(j)} \vert \vert$

## Implementation Detail: Mean Normalization

For a new user who haven’t rated any movie.

Calculate row mean $\mu$ for $Y$ then subtract $Y$ with $\mu$.

Use this $Y$ to learn $\theta^{(j)},x^{(i)}$

For user $j$ on movie $i$ predict:
$(\theta^{(j)})^T(x^{(i)}) +\mu_i$