Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems


All chapters

  1. Reliable, Scalable, and Maintainable Applications
  2. Data Models and Query Languages
  3. Storage and Retrieval
  4. Encoding and Evolution
  5. Replication
  6. Partitioning
  7. Transactions
  8. Distributed System Troubles
  9. Consistency and Consensus
  10. Batch Processing
  11. Stream Processing
  12. The Future of Data Systems

Reliable, Scalable, and Maintainable Applications


Work correctly even with hardware, software or human faults.

  • Hardware: adding redundancy; or using software fault-tolerance techniques (rolling upgrade)
  • Software: making wrong assumptions about its environment
    • Solution: rethinking; testing; process isolation; allowing processes to crash and restart; measuring; monitoring and analyzing
  • Human
    • Minimize opportunities for error, good API
    • Decouple the places where people make the most mistakes from the places where they can cause failures
    • Test thoroughly at all levels, unit, integration and manual
    • Allow quick and easy recovery from human errors
    • Set up detailed and clear monitoring, such as performance metrics and error rates
    • Good management practices and training


Deal with system growth.

  • System’s ability to cope with increased load
  • Describe the load
    • load parameters:
      • rps to a web server
      • ratio of reads to writes in a DB
      • number of active users in chat room
      • hit rate of a cache
      • etc.
  • Describe performance
    • batch processing
      • throughput (number of records processed per second)
      • total time to run a job on a dataset of a certain size
    • online systems
      • service’s response time (pretty random)
        • better use percentiles (99.9% p999 important but hard to optimize)
        • should be measured on client side
          • send all requests independently
        • head-of-line blocking: a small number of slow requests can block all subsequent ones on the server
  • Coping with load
    • rethink your architecture on every order of magnitude load increase
    • scale up or out
    • scale elastically or manually
    • system is highly specific at large scale

Twitter example

Twitter has two operations:

  1. post tweet (4.6k rps avg., 12k at peak, 2012)
  2. home timeline(300k rps)

The company uses a hybrid of two methods for their system.

  • maintain a global collections of tweets
    • post: insert into the collection
    • home timeline: look up all people a user follows
  • maintain a cache for each user’s timeline
    • post: look up the followers and insert the tweet into their time line
      • might be very expensive if too many followers
        • use method 1 for celebrities
    • home timeline: cheap


Many people can work on it productively to maintain.

  • Operability
    • Make it easy for operations teams to keep the system running smoothly
    • Having good visibility into the system’s health, and having effective ways of managing it
  • Simplicity
    • Make it easy for new engineers to understand the system
    • Good abstractions can help reduce complexity
  • Evolvability
    • Make it easy for engineers to make changes to the system in the future

Data Models and Query Languages

Data models

  • Relational model
    • Query optimizer decides automatically the execution order, instead of to be handled by application codes
    • better support for joins, many-to-one and many-to-many relations
  • Document model
    • necessary because of: greater scalability, free & open source, specialized queries, schema flexibility, better performance due to locality, closer to application data structures
    • suitable for storing objects
    • Joins are not needed for one-to-many tree structures, and support for joins is often weak.
      • may need multiple queries in application code

Use which one?

  • Document model
    • good when application data has a document-like structure
      • Shredding in relational model (split structure by column into multiple tables) is bad for this case.
    • performs bad if refer a deeply nested item within a document
    • poor if needs joins (can be done with multiple queries)
  • Relational model
    • supports many-to-many relations
  • Graph model
    • highly interconnected data
  • Schema flexibility
    • schema on read, like dynamic type checking
  • Data locality for queries
    • A document is usually stored as a single continuous string.
    • good if need to read entire or large part of a document
    • recommend to keep the document size small
      • on update, the document needs to be rewritten (unless same size before and after)
    • Can be used by other model
      • Google’s Spanner DB, nested with parent
      • Oracle, multi-table index cluster tables
      • column-family used by Cassandra and HBase

Relational DB and Document DB become more and more similar. A good route in the future may be a hybrid of both.

  • Most relational DB (other than MySQL) support XML, similar to document DB. PostgreSQL, MySQL and IBM DB2 support JSON.
  • RethinkDB and MongoDB support relational-like joins and resolve DB references automatically in their queries.

Query languages for data

  • Imperative code
    • tells the computer to perform certain operations in a certain order
    • many programming languages
  • Declarative code
    • just specify the pattern of data you want, but not how to achieve that goal
      • decided by query optimizer for SQL
    • SQL, relational algebra; web: CSS
    • possible to improve performance without changing queries
    • parallel execution


  • Process large amounts of data in bulk across many machines.
  • Limited support by some NoSQL, like MongoDB (aggregation pipeline), CouchDB.
  • A thing between imperative and declarative models.

Graph-like data model

  • Good support on many-to-many relationships
  • Vertices and edges
    • types of vertices can be different
  • Algorithms
    • PageRank on web graph
    • shortest path
  • Not CODASYL or network model

Different types of graph models:

  • Property graphs
    • each vertex consists of
      • unique ID
      • outgoing edges
      • incoming edges
      • properties (k-v)
    • each edge:
      • unique ID
      • start vertex
      • end vertex
      • label for the relationship between two vertices
      • properties (k-v)
  • Triple-store model
    • mostly equivalent to property graph
    • all info stored in 3-part statements: (subject, predicate, object)
      • e.g., (Jim, likes, bananas)
    • subject is vertex
    • object is either
      • a string, number, thus (predicate, object) is a property
      • or another vertex, thus an edge
    • semantic web: the internet of machine-readable data
      • RDF: Resource Description Framework
        • a mechanism for different web-sites to publish data in a consistent format

Query languages:

  • Cypher QL
    • declarative QL for property graph
  • Graph Queries in SQL
    • query the graph with SQL if we put graph data in a relational structure
    • difficulty: number of joins is not fixed in advance
      • may need to traverse a number of edges to find the vertex
      • e.g., query all people living in country 1: a person -lives in-> street 1 -in-> city 1 -in-> state 1 -in-> country 1
    • a query language for triple-stores using the RDF data model
  • Datalog: old foundation

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.