2017-02-05

# Density Estimation

## Problem Motivation

Given dataset $\{x^{(1)},x^{(2)},...,x^{(m)}\}$
Is $x_{test}$ anomalous?

Build a probability model $p(x)$ based on dataset.
Given a test example, if $p(x_{test}) \lt \epsilon$ $\rightarrow$ flag anomaly

## Gaussian Distribution

$x \sim N(\mu,\sigma^2)$

$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}$

$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$

$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2$
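As a minimal sketch, the maximum-likelihood estimates above and the Gaussian density can be computed directly (the data values here are made up for illustration):

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood estimates of mu and sigma^2 for 1-D data."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # divides by m, not m-1, as in the formulas above
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([1.0, 2.0, 3.0, 4.0])
mu, sigma2 = fit_gaussian(x)  # mu = 2.5, sigma2 = 1.25
```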

## Algorithm

Training dataset $\{x^{(1)},x^{(2)},...,x^{(m)}\}$
Each example is $x \in \mathbb{R}^n$

Assume the features are independent, with $x_j \sim N(\mu_j,\sigma_j^2)$

$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$

Algorithm:

1. Choose features $x_i$ that might be indicative of anomalous examples
2. Fit parameters $\mu_1,...,\mu_n,\sigma_1^2,...,\sigma_n^2$
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2$
3. Given new example x, compute $p(x)$. Anomaly if $p(x) \lt \epsilon$
$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$
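The three steps above can be sketched as follows (a NumPy sketch; the training matrix and the threshold $\epsilon$ are hypothetical values — in practice $\epsilon$ is chosen on a labeled cross-validation set):

```python
import numpy as np

def fit_params(X):
    """Step 2: per-feature mu_j and sigma_j^2 over an (m, n) training matrix X."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # divides by m, matching the formulas above
    return mu, sigma2

def p(x, mu, sigma2):
    """Step 3: product of the n univariate Gaussian densities."""
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])  # made-up training data
mu, sigma2 = fit_params(X)
epsilon = 1e-3  # hypothetical threshold
is_anomaly = p(np.array([9.0, 30.0]), mu, sigma2) < epsilon  # far from the data
```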

# Building an Anomaly Detection System

## Developing and Evaluating an Anomaly Detection System

Fit the model $p(x)$ on the training set (normal, i.e. negative, examples).
On a cross validation/test example $x$, predict $y=1$ (anomaly) if $p(x) \lt \epsilon$, else $y=0$.

Possible evaluation metrics:

• True positive, false positive, false negative, true negative
• Precision/Recall
• F1-score

Can use cross validation set to choose parameter $\epsilon$
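One way to do this is to scan candidate thresholds and keep the one that maximizes F1 on the cross-validation set. A minimal sketch (function names are my own):

```python
import numpy as np

def f1_for_threshold(p_cv, y_cv, epsilon):
    """F1 score when flagging p(x) < epsilon as anomalous (y = 1)."""
    pred = (p_cv < epsilon).astype(int)
    tp = np.sum((pred == 1) & (y_cv == 1))
    fp = np.sum((pred == 1) & (y_cv == 0))
    fn = np.sum((pred == 0) & (y_cv == 1))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def choose_epsilon(p_cv, y_cv):
    """Scan candidate epsilons between min and max of p(x_cv), keep the best F1."""
    best_eps, best_f1 = 0.0, -1.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        f1 = f1_for_threshold(p_cv, y_cv, eps)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1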

## Anomaly Detection vs. Supervised Learning

Anomaly detection:

• Very small number of positive examples (y=1). (0-20 is common).
• Large number of negative examples (y=0).
• Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.

Supervised learning:

• Large number of positive and negative examples.
• Enough positive examples for algorithm to get a sense of what positive examples are like.

## Choosing what Features to Use

Plot the histogram of the data and check whether it looks Gaussian. If not, apply a transformation to make it more Gaussian.
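For example, a right-skewed feature can often be made more Gaussian with a log or power transform (illustrative values only; the constant $c$ and the exponent are tuning knobs, not fixed choices):

```python
import numpy as np

# Hypothetical right-skewed feature values.
x = np.array([0.1, 0.2, 0.5, 1.0, 3.0, 9.0, 27.0])

# Candidate transforms to try until the histogram looks roughly Gaussian:
x_log = np.log(x + 1)   # log(x + c), with c chosen so the argument stays positive
x_sqrt = x ** 0.5       # x^(1/2); other exponents (x^0.1, x^0.3, ...) also work
```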

### Error analysis

Want $p(x)$ large for normal examples; small for anomalous examples.

Most common problem: $p(x)$ is comparably large for both normal and anomalous examples. Fix by choosing or creating new features (e.g. a ratio like $x_4/x_5$) that take unusually large or small values on anomalies.

# Multivariate Gaussian Distribution

$x \in \mathbb{R}^n$. Don’t model $p(x_1),p(x_2),...$ etc. separately. Model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix)

$p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}$

## Anomaly detection

Parameter fitting:

$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$

$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T$

Flag an anomaly if $p(x) \lt \epsilon$

1. Automatically captures correlations between features.
2. Computationally more expensive.
3. Must have $m \gt n$, or else $\Sigma$ is non-invertible; better to have $m \ge 10n$.
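The parameter fitting and density above can be sketched as follows (a minimal NumPy version; `np.linalg.solve` is used rather than explicitly inverting $\Sigma$):

```python
import numpy as np

def fit_multivariate(X):
    """Fit mu and the full covariance Sigma (dividing by m, as above)."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]
    return mu, Sigma

def multivariate_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma)."""
    n = mu.shape[0]
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(exponent) / norm

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])  # made-up data
mu, Sigma = fit_multivariate(X)
```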

# Predicting Movie Ratings

## Problem Formulation

$n_u$: number of users
$n_m$: number of movies
$r(i,j)=1$ if user $j$ has rated movie $i$
$y^{(i,j)}$: rating given by user $j$ to movie $i$ (0-5), defined only if $r(i,j)=1$

## Content Based Recommendations

Given predefined features $x^{(i)}$ for each movie, learn a parameter vector $\theta^{(j)}$ for each user.
Minimize the cost function over the $\theta^{(j)}$.
Use gradient descent, conjugate gradient, or L-BFGS.

# Collaborative Filtering

The algorithm learns what features to use by itself.

Given $\theta^{(1)},...,\theta^{(n_u)}$, to learn $x^{(1)},x^{(2)},...,x^{(n_m)}$:
$\min_{x^{(1)},...,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2$

Alternate: guess $\theta \rightarrow$ learn $x \rightarrow$ learn $\theta \rightarrow x \rightarrow ...$; this eventually converges.

## Algorithm

Minimizing $\theta^{(1)},...,\theta^{(n_u)}$ and $x^{(1)},x^{(2)},...,x^{(n_m)}$ simultaneously:

$J(x^{(1)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2$

$x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$; the bias term $x_0 = 1$ is not used here.

1. Initialize $x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$ to small random values
2. Minimize $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j = 1,...,n_u,i=1,...,n_m$:
$x_k^{(i)} := x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)$
$\theta_k^{(j)} := \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)$
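The cost $J$ and the sums inside the update rules can be vectorized; a sketch under the notation above (the matrix shapes are my own convention):

```python
import numpy as np

def cofi_cost_grad(X, Theta, Y, R, lam):
    """Collaborative filtering cost J and gradients.

    X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) indicator r(i, j), lam: lambda.
    """
    err = (X @ Theta.T - Y) * R          # zero out unrated entries
    J = 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X       # sums over j with r(i, j) = 1
    Theta_grad = err.T @ X + lam * Theta  # sums over i with r(i, j) = 1
    return J, X_grad, Theta_grad
```

A gradient-descent step is then `X -= alpha * X_grad; Theta -= alpha * Theta_grad`.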

# Low Rank Matrix Factorization

## Vectorization

Group all ratings $y^{(i,j)}$ into a matrix $Y$

Predict $(\theta^{(j)})^T x^{(i)}$ for $(i,j)$

Let $X = [(x^{(1)})^T;...;(x^{(n_m)})^T]$ and $\Theta = [(\theta^{(1)})^T;...;(\theta^{(n_u)})^T]$

The matrix of predicted ratings is $X\Theta^T$, a low-rank factorization approximating $Y$

Movies similar to movie $i$ are those $j$ with the smallest $\vert \vert x^{(i)} - x^{(j)} \vert \vert$
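A sketch of finding similar movies by this distance (the function name is my own):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Indices of the k movies whose feature vectors are closest to x^(i)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for every j
    order = np.argsort(dists)
    return [j for j in order if j != i][:k]   # exclude movie i itself
```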

## Implementation Detail: Mean Normalization

For a new user who hasn't rated any movies, regularization drives the learned $\theta$ to zero, so every raw prediction would be 0.

Compute the mean $\mu_i$ of each row of $Y$ (over rated entries only), then subtract $\mu_i$ from the rated entries of row $i$.

Use this normalized $Y$ to learn $\theta^{(j)},x^{(i)}$

For user $j$ on movie $i$ predict:
$(\theta^{(j)})^T(x^{(i)}) +\mu_i$
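A sketch of the normalization step, where only rated entries ($r(i,j)=1$) enter each row mean:

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    m = Y.shape[0]
    mu = np.zeros(m)
    Y_norm = np.zeros_like(Y, dtype=float)
    for i in range(m):
        rated = R[i] == 1
        if rated.any():
            mu[i] = Y[i, rated].mean()
            Y_norm[i, rated] = Y[i, rated] - mu[i]
    return Y_norm, mu
```

For a user with no ratings, the prediction then falls back to the movie's mean $\mu_i$.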

Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.