Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems

All chapters

Reliable, Scalable, and Maintainable Applications

Reliability

Work correctly even with hardware, software or human faults.

Hardware: adding redundancy; or using software fault-tolerance techniques (rolling upgrade)
Software: making wrong assumptions about its environment
- Solution: rethinking; testing; process isolation; allowing processes to crash and restart; measuring; monitoring and analyzing
Human
- Minimize opportunities for error, good API
- Decouple the places where people make the most mistakes from the places where they can cause failures
- Test thoroughly at all levels, unit, integration and manual
- Allow quick and easy recovery from human errors
- Set up detailed and clear monitoring, such as performance metrics and error rates
- Good management practices and training

Scalability

Deal with system growth.

System’s ability to cope with increased load
Describe the load
- load parameters:
  - rps to a web server
  - ratio of reads to writes in a DB
  - number of active users in chat room
  - hit rate of a cache
  - etc.
Describe performance
- batch processing
  - throughput (number of records processed per second)
  - total time to run a job on a dataset of a certain size
- online systems
  - service’s response time (pretty random)
    - better use percentiles (99.9% p999 important but hard to optimize)
    - should be measured on client side
      - send all requests independently
    - head-of-line blocking: a small number of slow requests can block all subsequent ones on the server
Coping with load
- rethink your architecture on every order of magnitude load increase
- scale up or out
- scale elastically or manually
- system is highly specific at large scale

Twitter example

Twitter has two operations:

post tweet (4.6k rps avg., 12k at peak, 2012)
home timeline(300k rps)

The company uses a hybrid of two methods for their system.

maintain a global collections of tweets
- post: insert into the collection
- home timeline: look up all people a user follows
maintain a cache for each user’s timeline
- post: look up the followers and insert the tweet into their time line
  - might be very expensive if too many followers
    - use method 1 for celebrities
- home timeline: cheap

Maintainability

Many people can work on it productively to maintain.

Operability
- Make it easy for operations teams to keep the system running smoothly
- Having good visibility into the system’s health, and having effective ways of managing it
Simplicity
- Make it easy for new engineers to understand the system
- Good abstractions can help reduce complexity
Evolvability
- Make it easy for engineers to make changes to the system in the future

Data Models and Query Languages

Data models

Relational model
- Query optimizer decides automatically the execution order, instead of to be handled by application codes
- better support for joins, many-to-one and many-to-many relations
Document model
- necessary because of: greater scalability, free & open source, specialized queries, schema flexibility, better performance due to locality, closer to application data structures
- suitable for storing objects
- Joins are not needed for one-to-many tree structures, and support for joins is often weak.
  - may need multiple queries in application code

Use which one?

Document model
- good when application data has a document-like structure
  - Shredding in relational model (split structure by column into multiple tables) is bad for this case.
- performs bad if refer a deeply nested item within a document
- poor if needs joins (can be done with multiple queries)
Relational model
- supports many-to-many relations
Graph model
- highly interconnected data
Schema flexibility
- schema on read, like dynamic type checking
Data locality for queries
- A document is usually stored as a single continuous string.
- good if need to read entire or large part of a document
- recommend to keep the document size small
  - on update, the document needs to be rewritten (unless same size before and after)
- Can be used by other model
  - Google’s Spanner DB, nested with parent
  - Oracle, multi-table index cluster tables
  - column-family used by Cassandra and HBase

Relational DB and Document DB become more and more similar. A good route in the future may be a hybrid of both.

Most relational DB (other than MySQL) support XML, similar to document DB. PostgreSQL, MySQL and IBM DB2 support JSON.
RethinkDB and MongoDB support relational-like joins and resolve DB references automatically in their queries.

Query languages for data

Imperative code
- tells the computer to perform certain operations in a certain order
- many programming languages
Declarative code
- just specify the pattern of data you want, but not how to achieve that goal
  - decided by query optimizer for SQL
- SQL, relational algebra; web: CSS
- possible to improve performance without changing queries
- parallel execution

MapReduce

Process large amounts of data in bulk across many machines.
Limited support by some NoSQL, like MongoDB (aggregation pipeline), CouchDB.
A thing between imperative and declarative models.

Graph-like data model

Good support on many-to-many relationships
Vertices and edges
- types of vertices can be different
Algorithms
- PageRank on web graph
- shortest path
Not CODASYL or network model

Different types of graph models:

Property graphs
- each vertex consists of
  - unique ID
  - outgoing edges
  - incoming edges
  - properties (k-v)
- each edge:
  - unique ID
  - start vertex
  - end vertex
  - label for the relationship between two vertices
  - properties (k-v)
Triple-store model
- mostly equivalent to property graph
- all info stored in 3-part statements: (subject, predicate, object)
  - e.g., (Jim, likes, bananas)
- subject is vertex
- object is either
  - a string, number, thus (predicate, object) is a property
  - or another vertex, thus an edge
- semantic web: the internet of machine-readable data
  - RDF: Resource Description Framework
    - a mechanism for different web-sites to publish data in a consistent format

Query languages:

Cypher QL
- declarative QL for property graph
Graph Queries in SQL
- query the graph with SQL if we put graph data in a relational structure
- difficulty: number of joins is not fixed in advance
  - may need to traverse a number of edges to find the vertex
  - e.g., query all people living in country 1: a person -lives in-> street 1 -in-> city 1 -in-> state 1 -in-> country 1
SPARQL
- a query language for triple-stores using the RDF data model
Datalog: old foundation

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 16

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 16

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 16

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1