Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Designing Data-intensive applications" 6, Partitioning


All chapters

  1. Reliable, Scalable, and Maintainable Applications
  2. Data Models and Query Languages
  3. Storage and Retrieval
  4. Encoding and Evolution
  5. Replication
  6. Partitioning
  7. Transactions
  8. Distributed System Troubles
  9. Consistency and Consensus
  10. Batch Processing
  11. Stream Processing
  12. The Future of Data Systems


For very large datasets or very high query throughput, we need to break the data up into partitions.

  • A partition contains several records/rows.
  • A large dataset can be distributed across many nodes.
  • Query load can be distributed across many processors.


  • partition
  • shard in MongoDB, Elasticsearch, SolrCloud
  • region in HBase
  • tablet in Bigtable
  • vnode in Cassandra, Riak
  • vBucket in Couchbase

Partitioning and replication

  • Copies of one partition may be stored on multiple nodes.
  • A node may store more than one partitions.

Partitioning of key-value data

  • Skewed: The partition is unfair, so that some partitions have more data or queries than others.
  • Hot spot: A partition with disproportionately high load.

Partitioning by key range

  • Assign a continuous range of keys to each partition.
  • Problem: The ranges of keys are not necessarily evenly spaced.
    • Because the data may not be evenly distributed.
  • Within each partition, keys are in sorted order.
    • Easy to do range scans.
  • Applications: Bigtable, HBase, RethinkDB, early MongoDB
  • Downsides
    • Certain access patterns can lead to hot spots.
      • Heavy writes if using timestamp as key.
        • All writes for today go to the same partition.
      • Can be improve
        • Use something other than the timestamp as the first element of the key.
          • e.g., Device ID

Partitioning by hash of key

  • A good hash function takes skewed data and makes it uniformly distributed.
  • The hash function need not to be cryptographically strong.
    • MD5 by Cassandra, MongoDB
    • Fowler-Noll-Vo function used by Volemort
    • Built-in hash function in languages may not be good.
      • The same key may have different hash values.
  • Range queries are not supported well.
    • Cassandra achieves a compromise between the two partitioning strategies.
      • A table in Cassandra can be declared with a compound primary key consisting of several columns.
      • Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
      • Fix the first column value, it can perform an efficient range scan over the other columns of the key.

Skewed workloads and relieving hot spots

  • Unusual workload: a celebrity user with millions of followers may cause a storm of activities, where hash doesn’t help.
  • Need to handle such skewed cases in application codes.
    • If one key is very hot, add a random number to the beginning or end of the key.
    • Read needs more work.
    • Keep track of which keys are being split.

Partitioning and secondary indexes

  • Secondary indexes
    • Basic feature in relational database.
    • Common in document databases too.
      • HBase and Voldemort avoid it.
      • Riak started adding it.
  • Problem is secondary indexes don’t map neatly to partitions.
  • Two approaches to partitioning with secondary indexes:
    • Document-based
    • Term-based

Partitioning secondary indexes by document

  • Example: a car listing for sale website
    • Primary key/document ID: car ID
    • Secondary keys: color, make
  • Partition the database by document ID.
  • Local write
    • Only need to deal with the partition with corresponding document ID.
  • Need to query all partitions for reads.
  • e.g., MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB

Partitioning secondary indexes by term

  • Use a global index to cover data in all partitions.
    • Like color:red
    • Must be partitioned across nodes too.
  • Advantage
    • Efficient reads
  • Downside
    • Complicated and slower writes
    • A distributed transaction across partitions is needed for writes.
      • Not supported in all databases.
  • e.g., Dynamo, Riak’s search feature, Oracle data warehouse

Re-balancing partitions

  • Re-balancing: moving load from one node to another.
  • Requirements:
    • After re-balancing, load should be fairly distributed.
    • While re-balancing, database should continue accepting reads and writes.
    • Only necessary data is moved to minimize network and disk I/O.

Strategies for re-balancing

  • Avoid hash mod N
    • Make re-balancing expensive while changing N value
  • Fixed number of partitions
    • Create many more partitions than nodes and assign several partitions to each node.
    • e.g., 1000 partitions on 10 nodes
    • To add a node, the new node steals a few partitions from every old node.
    • Old partition assignment is used while the transfer is in progress.
    • Force powerful nodes to take more load.
    • e.g., Riak, Elasticsearch, Couchbase, Voldemort
    • Number of partitions N is usually fixed for the entire process, thus simpler.
    • Choose the right N is hard while N is fixed and dataset size varies.
      • High enough to have small-size partitions and accommodate future growth.
      • Not too high to cause much management overhead.
  • Dynamic partitioning
    • Key range-partitioned databases create partitions dynamically.
    • Split large partitions and merge small partitions.
    • Transfer one split sub-partition to another node to balance the load.
    • e.g., HBase, RethinkDB
    • Advantage:
      • Number of partitions adapts to the total data volume, thus overhead is small.
    • Caveat
      • An initially empty database starts off with a single partition.
      • All loads go to a single node before first splitting with all other nodes idle.
      • Solution: Pre-splitting, HBase and MongoDB allow an initial set of partitions to be configured.
        • Pre-splitting requires that you already know what the key distribution is going to look like.
    • Can also support hash partitioning.
  • Partitioning proportionally to nodes
    • Make the number of partitions proportional to the number of nodes.
      • Fixed number of partitions per node.
    • When a new node joins, randomly choose a fixed number of existing partitions to split and take away one half and leave the other half in place.
    • e.g., Cassandra, Ketama

Operations: automatic or manual re-balancing

  • Fully automated re-balancing can be convenient.
    • But can be unpredictable and dangerous.
  • Good to have a human in the loop.

Request routing

For a client, which node to connect to get the partition?

  • An instance of the problem called service discovery.
  • Approaches
    • Clients can contact any node.
      • The node owning the partition can handle the request.
      • Otherwise, forward the request to the appropriate node; receive the reply and pass the reply to the client.
    • Send all client requests to a routing tier first.
    • Require clients to be aware of the assignment of partitions.
  • Difficulty: How does the routing component learn about changes in partition assignment?
    • All participants must agree on the new changes.
      • Consensus in a distributed system, hard to implement.
  • Many distributed data systems rely on a separate coordination service such as ZooKeeper.
    • Keep track of this cluster metadata.
    • Nodes register themselves in ZooKeeper.
    • ZooKeeper maintains the authoritative mapping of partitions to nodes.
    • Other components can subscribe information in ZooKeeper.
    • Whenever assignment changes, ZooKeeper notifies those components.
    • e,g.
      • Espresso uses Helix (on ZooKeeper).
      • HBase, SolrCloud and Kafka use ZooKeeper.
      • MongoDB uses its own service.
      • Cassandra and Riak use gossip protocol.
  • Use DNS to find the IP address of routing tier or random nodes.

Parallel query execution

  • Massively parallel processing (MPP) uses much more complex queries comparing with single-key queries.
  • MPP query can be broken into a number of executions and partitions, which may be executed in parallel on different nodes.
    • By MPP query optimizer.

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2020. All rights reserved by melonskin. Powered by Jekyll.