All chapters
- Reliable, Scalable, and Maintainable Applications
- Data Models and Query Languages
- Storage and Retrieval
- Encoding and Evolution
- Replication
- Partitioning
- Transactions
- Distributed System Troubles
- Consistency and Consensus
- Batch Processing
- Stream Processing
- The Future of Data Systems
Reliable, Scalable, and Maintainable Applications
Reliability
Work correctly even with hardware, software or human faults.
- Hardware: adding redundancy; or using software fault-tolerance techniques (rolling upgrade)
- Software: making wrong assumptions about its environment
- Solution: rethinking; testing; process isolation; allowing processes to crash and restart; measuring; monitoring and analyzing
- Human
- Minimize opportunities for error, good API
- Decouple the places where people make the most mistakes from the places where they can cause failures
- Test thoroughly at all levels, unit, integration and manual
- Allow quick and easy recovery from human errors
- Set up detailed and clear monitoring, such as performance metrics and error rates
- Good management practices and training
Scalability
Deal with system growth.
- System’s ability to cope with increased load
- Describe the load
- load parameters:
- rps to a web server
- ratio of reads to writes in a DB
- number of active users in chat room
- hit rate of a cache
- etc.
- load parameters:
- Describe performance
- batch processing
- throughput (number of records processed per second)
- total time to run a job on a dataset of a certain size
- online systems
- service’s response time (pretty random)
- better use percentiles (99.9% p999 important but hard to optimize)
- should be measured on client side
- send all requests independently
- head-of-line blocking: a small number of slow requests can block all subsequent ones on the server
- service’s response time (pretty random)
- batch processing
- Coping with load
- rethink your architecture on every order of magnitude load increase
- scale up or out
- scale elastically or manually
- system is highly specific at large scale
Twitter example
Twitter has two operations:
- post tweet (4.6k rps avg., 12k at peak, 2012)
- home timeline(300k rps)
The company uses a hybrid of two methods for their system.
- maintain a global collections of tweets
- post: insert into the collection
- home timeline: look up all people a user follows
- maintain a cache for each user’s timeline
- post: look up the followers and insert the tweet into their time line
- might be very expensive if too many followers
- use method 1 for celebrities
- might be very expensive if too many followers
- home timeline: cheap
- post: look up the followers and insert the tweet into their time line
Maintainability
Many people can work on it productively to maintain.
- Operability
- Make it easy for operations teams to keep the system running smoothly
- Having good visibility into the system’s health, and having effective ways of managing it
- Simplicity
- Make it easy for new engineers to understand the system
- Good abstractions can help reduce complexity
- Evolvability
- Make it easy for engineers to make changes to the system in the future
Data Models and Query Languages
Data models
- Relational model
- Query optimizer decides automatically the execution order, instead of to be handled by application codes
- better support for joins, many-to-one and many-to-many relations
- Document model
- necessary because of: greater scalability, free & open source, specialized queries, schema flexibility, better performance due to locality, closer to application data structures
- suitable for storing objects
- Joins are not needed for one-to-many tree structures, and support for joins is often weak.
- may need multiple queries in application code
Use which one?
- Document model
- good when application data has a document-like structure
- Shredding in relational model (split structure by column into multiple tables) is bad for this case.
- performs bad if refer a deeply nested item within a document
- poor if needs joins (can be done with multiple queries)
- good when application data has a document-like structure
- Relational model
- supports many-to-many relations
- Graph model
- highly interconnected data
- Schema flexibility
- schema on read, like dynamic type checking
- Data locality for queries
- A document is usually stored as a single continuous string.
- good if need to read entire or large part of a document
- recommend to keep the document size small
- on update, the document needs to be rewritten (unless same size before and after)
- Can be used by other model
- Google’s Spanner DB, nested with parent
- Oracle, multi-table index cluster tables
- column-family used by Cassandra and HBase
Relational DB and Document DB become more and more similar. A good route in the future may be a hybrid of both.
- Most relational DB (other than MySQL) support XML, similar to document DB. PostgreSQL, MySQL and IBM DB2 support JSON.
- RethinkDB and MongoDB support relational-like joins and resolve DB references automatically in their queries.
Query languages for data
- Imperative code
- tells the computer to perform certain operations in a certain order
- many programming languages
- Declarative code
- just specify the pattern of data you want, but not how to achieve that goal
- decided by query optimizer for SQL
- SQL, relational algebra; web: CSS
- possible to improve performance without changing queries
- parallel execution
- just specify the pattern of data you want, but not how to achieve that goal
MapReduce
- Process large amounts of data in bulk across many machines.
- Limited support by some NoSQL, like MongoDB (aggregation pipeline), CouchDB.
- A thing between imperative and declarative models.
Graph-like data model
- Good support on many-to-many relationships
- Vertices and edges
- types of vertices can be different
- Algorithms
- PageRank on web graph
- shortest path
- Not CODASYL or network model
Different types of graph models:
- Property graphs
- each vertex consists of
- unique ID
- outgoing edges
- incoming edges
- properties (k-v)
- each edge:
- unique ID
- start vertex
- end vertex
- label for the relationship between two vertices
- properties (k-v)
- each vertex consists of
- Triple-store model
- mostly equivalent to property graph
- all info stored in 3-part statements: (subject, predicate, object)
- e.g., (Jim, likes, bananas)
- subject is vertex
- object is either
- a string, number, thus (predicate, object) is a property
- or another vertex, thus an edge
- semantic web: the internet of machine-readable data
- RDF: Resource Description Framework
- a mechanism for different web-sites to publish data in a consistent format
- RDF: Resource Description Framework
Query languages:
- Cypher QL
- declarative QL for property graph
- Graph Queries in SQL
- query the graph with SQL if we put graph data in a relational structure
- difficulty: number of joins is not fixed in advance
- may need to traverse a number of edges to find the vertex
- e.g., query all people living in country 1:
a person -lives in-> street 1 -in-> city 1 -in-> state 1 -in-> country 1
- SPARQL
- a query language for triple-stores using the RDF data model
- Datalog: old foundation