
Stanford ML 9 Anomaly Detection and Recommender Systems

2017-02-05

Density Estimation

Problem Motivation

Given a dataset \(\{x^{(1)},\dots,x^{(m)}\}\) of normal examples, is a new example \(x_{\text{test}}\) anomalous?

Build a probability model \(p(x)\) from the dataset.
Given a test example \(x\), flag an anomaly if \(p(x) < \varepsilon\).

Gaussian Distribution

\[x \sim N(\mu,\sigma^2)\]

\[p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp{\left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)}\]

\[\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\]

\[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu)^2\]

Algorithm

Training dataset \(\{x^{(1)},\dots,x^{(m)}\}\), where each example is \(x \in \mathbb{R}^n\). Model each feature with its own Gaussian:

\[x_j \sim N(\mu_j,\sigma_j^2)\]

\[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]

Algorithm:

  1. Choose features \(x_j\) that might be indicative of anomalous examples.
  2. Fit parameters \(\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2\):
    \[\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}\]
    \[\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)}-\mu_j)^2\]
  3. Given a new example \(x\), compute \(p(x)\) and flag an anomaly if \(p(x) < \varepsilon\) (see the sketch after this list):
    \[p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)\]
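
As a concrete illustration, here is a minimal NumPy sketch of this algorithm (the function names and the \(\varepsilon\) threshold variable are my own; the course itself uses Octave):

```python
import numpy as np

def fit_gaussian_params(X):
    """Fit per-feature mean and variance on the (normal-only) training set.

    X: (m, n) array, one training example per row.
    """
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # 1/m * sum((x_j - mu_j)^2), matching the notes
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2) for each row of X."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    per_feature = coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return per_feature.prod(axis=1)

# Usage (epsilon is the threshold, chosen on a cross-validation set):
# mu, sigma2 = fit_gaussian_params(X_train)
# anomalies = density(X_test, mu, sigma2) < epsilon
```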

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Detection System

Fit the model \(p(x)\) on the training set (normal, i.e. negative, examples only).
On each cross-validation/test example \(x\), predict \(y=1\) (anomaly) if \(p(x) < \varepsilon\), else \(y=0\).

Possible evaluation metrics:

  • True positive, false positive, false negative, true negative counts
  • Precision/Recall
  • F1-score

Can use the cross-validation set to choose the parameter \(\varepsilon\).
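
A sketch of choosing \(\varepsilon\) by maximizing F1 on the cross-validation set (the function name and the 1000-step grid are my own choices):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Return the epsilon maximizing F1 on the cross-validation set.

    p_cv: p(x) for each CV example; y_cv: 1 = anomaly, 0 = normal.
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                 # predict anomaly when p(x) < eps
        tp = np.sum(pred & (y_cv == 1))
        fp = np.sum(pred & (y_cv == 0))
        fn = np.sum(~pred & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```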

Anomaly Detection vs. Supervised Learning

Anomaly detection:

  • Very small number of positive examples (y=1). (0-20 is common).
  • Large number of negative examples (y=0).
  • Many different types of anomalies. Future anomalies may look nothing like any of the previous anomalous examples.

Supervised learning:

  • Large number of positive and negative examples.
  • Enough positive examples for algorithm to get a sense of what positive examples are like.

Choosing what Features to Use

Plot a histogram of the data and check whether it looks Gaussian. If not, apply a transformation (e.g. \(\log(x)\), \(\log(x+c)\), \(\sqrt{x}\), \(x^{1/3}\)) so that it becomes more Gaussian.
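
For instance, a quick check of this kind might look as follows (assumes matplotlib; the lognormal data here is synthetic, standing in for a skewed real feature):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic skewed feature, standing in for a real training-set column
x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x, bins=50)          # heavily right-skewed: not Gaussian
ax1.set_title("raw feature")
ax2.hist(np.log(x), bins=50)  # log transform recovers a bell shape
ax2.set_title("log-transformed")
plt.show()
```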

Error analysis

Want \(p(x)\) large for normal examples and small for anomalous examples.

Most common problem: \(p(x)\) is large for both. Examine the misclassified anomalies and choose or create new features (e.g. a ratio such as \(x_4/x_5\)) that separate them.

Multivariate Gaussian Distribution

\(x \in \mathbb{R}^n\). Don't model \(p(x_1), p(x_2),\) etc. separately; model \(p(x)\) all in one go.
Parameters: \(\mu \in \mathbb{R}^n\), \(\Sigma \in \mathbb{R}^{n \times n}\) (covariance matrix)

\[p(x;\mu,\Sigma)=\frac{1}{(2 \pi)^{n/2} \vert \Sigma \vert^{1/2}} \exp{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}\]

Anomaly detection

Parameter fitting:

\[\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}\]

\[\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\mu)(x^{(i)}-\mu)^T\]

Flag an anomaly if \(p(x) < \varepsilon\).

  1. Automatically captures correlations between features.
  2. Computationally more expensive.
  3. Must have \(m > n\), or else \(\Sigma\) is non-invertible. Better to have \(m \ge 10n\).
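
A minimal NumPy sketch of the multivariate version, using the same \(\frac{1}{m}\) normalization as the formulas above (function names are my own):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Fit mu and the full covariance Sigma (1/m normalization, as above)."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = (D.T @ D) / m
    return mu, Sigma

def density(X, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) for each row of X."""
    n = mu.size
    D = X - mu
    inv = np.linalg.inv(Sigma)            # fails unless m > n (see item 3)
    coef = 1.0 / ((2.0 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = np.sum((D @ inv) * D, axis=1)  # (x-mu)^T Sigma^{-1} (x-mu), per row
    return coef * np.exp(-0.5 * quad)
```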

Predicting Movie Ratings

Problem Formulation

\(n_u\): no. of users
\(n_m\): no. of movies
\(r(i,j)=1\) if user \(j\) has rated movie \(i\)
\(y^{(i,j)}\): rating given by user \(j\) to movie \(i\) (0-5), defined only if \(r(i,j)=1\)

Content Based Recommendations

Have a parameter vector \(\theta^{(j)}\) for each user \(j\); the movie features \(x^{(i)}\) are predefined. Predict rating \((\theta^{(j)})^T x^{(i)}\).
Minimize a regularized squared-error cost function over the \(\theta^{(j)}\) (essentially linear regression per user).
Use gradient descent, conjugate gradient, or L-BFGS.

Collaborative Filtering

Collaborative filtering learns which features to use by itself.

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):
\[\min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2\]

Guess \(\theta\) initially, then alternate: learn \(x\) given \(\theta\), then \(\theta\) given \(x\), and so on; the estimates eventually converge.

Algorithm

Minimizing over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:

\[J(x^{(1)},…,x^{(n_m)},\theta^{(1)},…,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n}(x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n}(\theta_k^{(j)})^2\]

Here \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\); the bias term \(x_0 = 1\) is not used.

  1. Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values.
  2. Minimize \(J\) using gradient descent (or an advanced optimization algorithm). E.g., for every \(j=1,\dots,n_u\) and \(i=1,\dots,n_m\):
    \[x_k^{(i)} := x_k^{(i)} -\alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) \theta_k^{(j)}+\lambda x_k^{(i)} \right)\]
    \[\theta_k^{(j)} := \theta_k^{(j)} -\alpha \left( \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}) x_k^{(i)}+\lambda \theta_k^{(j)} \right)\]
  3. For a user with parameters \(\theta\) and a movie with learned features \(x\), predict a rating of \(\theta^T x\). A vectorized sketch follows below.
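
A vectorized NumPy sketch of one such gradient step, assuming a 0/1 mask matrix R with R[i, j] = r(i, j) (the function name and hyperparameter values are my own):

```python
import numpy as np

def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    """One batch gradient-descent step on the joint cost J.

    X: (n_m, n) movie features; Theta: (n_u, n) user parameters;
    Y: (n_m, n_u) ratings; R: (n_m, n_u) mask with R[i, j] = r(i, j).
    """
    E = (X @ Theta.T - Y) * R            # errors, zeroed where r(i, j) = 0
    X_grad = E @ Theta + lam * X         # dJ/dx^(i), all movies at once
    Theta_grad = E.T @ X + lam * Theta   # dJ/dtheta^(j), all users at once
    return X - alpha * X_grad, Theta - alpha * Theta_grad

# Step 1: initialize to small random values, then iterate step 2:
# rng = np.random.default_rng(0)
# X = 0.01 * rng.standard_normal((n_m, n))
# Theta = 0.01 * rng.standard_normal((n_u, n))
# for _ in range(500):
#     X, Theta = cofi_gradient_step(X, Theta, Y, R, lam=1.0, alpha=0.005)
```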

Low Rank Matrix Factorization

Vectorization

Group all ratings into a matrix \(Y \in \mathbb{R}^{n_m \times n_u}\), with entry \((i,j)\) equal to \(y^{(i,j)}\) where \(r(i,j)=1\).

The predicted rating for user \(j\) on movie \(i\) is \((\theta^{(j)})^T x^{(i)}\).

Let \(X\) stack the movie feature vectors \((x^{(i)})^T\) as rows and \(\Theta\) stack the user parameter vectors \((\theta^{(j)})^T\) as rows. Then all predictions at once are

\[Y=X\Theta^T\]

a low-rank factorization of the rating matrix.

Movies similar to movie \(i\) are those \(j\) with the smallest \(\lVert x^{(i)} - x^{(j)} \rVert\).
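
In matrix form these two operations are one line each; a small sketch (function names are my own):

```python
import numpy as np

def predict_all(X, Theta):
    """Entry (i, j) of the result is the predicted rating (theta^(j))^T x^(i)."""
    return X @ Theta.T                   # (n_m, n_u) low-rank product

def most_similar_movies(X, i, k=5):
    """Indices of the k movies whose learned features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                    # exclude movie i itself
    return np.argsort(dists)[:k]
```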

Implementation Detail: Mean Normalization

For a new user who hasn't rated any movies, minimizing \(J\) drives their \(\theta^{(j)}\) toward zero, so every predicted rating would be 0. Mean normalization fixes this.

Compute the row means \(\mu_i\) of \(Y\) (the average rating of movie \(i\) over the users who rated it), then subtract them from \(Y\) row by row.

Use this mean-normalized \(Y\) to learn \(\theta^{(j)}\) and \(x^{(i)}\).

For user \(j\) on movie \(i\), predict:
\[(\theta^{(j)})^T(x^{(i)}) +\mu_i\]
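
A sketch of mean normalization plus the final prediction (the function name is my own; movies nobody has rated get mean 0):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating (over users who rated it) from Y."""
    counts = R.sum(axis=1).astype(float)
    sums = (Y * R).sum(axis=1).astype(float)
    mu = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return (Y - mu[:, None]) * R, mu     # only observed entries matter

# After learning X, Theta on the normalized Y, predict for user j, movie i:
# rating = X[i] @ Theta[j] + mu[i]
# A brand-new user (theta ~ 0) then sensibly gets mu[i], the movie's mean.
```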

