All chapters
- Reliable, Scalable, and Maintainable Applications
- Data Models and Query Languages
- Storage and Retrieval
- Encoding and Evolution
- Replication
- Partitioning
- Transactions
- The Trouble with Distributed Systems
- Consistency and Consensus
- Batch Processing
- Stream Processing
- The Future of Data Systems
Partitioning
For very large datasets or very high query throughput, we need to break the data up into partitions.
- A partition contains several records/rows.
- A large dataset can be distributed across many nodes.
- Query load can be distributed across many processors.
Synonyms:
- partition
- shard in MongoDB, Elasticsearch, SolrCloud
- region in HBase
- tablet in Bigtable
- vnode in Cassandra, Riak
- vBucket in Couchbase
Partitioning and replication
- Copies of one partition may be stored on multiple nodes.
- A node may store more than one partition.
Partitioning of key-value data
- Skewed: the partitioning is unfair, so that some partitions have more data or queries than others.
- Hot spot: A partition with disproportionately high load.
Partitioning by key range
- Assign a continuous range of keys to each partition.
- Problem: The ranges of keys are not necessarily evenly spaced.
- Because the data may not be evenly distributed.
- Within each partition, keys are in sorted order.
- Easy to do range scans (a partition-lookup sketch appears at the end of this section).
- Applications: Bigtable, HBase, RethinkDB, early MongoDB
- Downsides
- Certain access patterns can lead to hot spots.
- Heavy writes if using timestamp as key.
- All writes for today go to the same partition.
- Can be improved:
- Use something other than the timestamp as the first element of the key.
- e.g., Device ID
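A minimal sketch of key-range partition lookup, assuming made-up partition boundaries; real systems such as HBase keep these boundaries in cluster metadata:

```python
import bisect

# Made-up lower bounds of each partition's key range:
# partition 0: [a, f), 1: [f, m), 2: [m, t), 3: [t, ...)
BOUNDARIES = ["a", "f", "m", "t"]

def partition_for_key(key: str) -> int:
    # bisect_right finds the insertion point; the owning partition is
    # the one whose lower bound lies just to the left of it.
    return bisect.bisect_right(BOUNDARIES, key) - 1

assert partition_for_key("audi") == 0
assert partition_for_key("mercedes") == 2
# Range scans are cheap: all keys in ["f", "m") live in partition 1.
```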
Partitioning by hash of key
- A good hash function takes skewed data and makes it uniformly distributed.
- The hash function need not be cryptographically strong.
- MD5 is used by Cassandra and MongoDB.
- The Fowler–Noll–Vo function is used by Voldemort.
- Built-in hash functions in programming languages may not be suitable: the same key may have a different hash value in different processes (e.g., Java's Object.hashCode()); see the sketch at the end of this section.
- Downside: range queries are not supported well, since adjacent keys are scattered across the partitions.
- Cassandra achieves a compromise between the two partitioning strategies.
- A table in Cassandra can be declared with a compound primary key consisting of several columns.
- Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
- With the first column fixed, queries can perform an efficient range scan over the other columns of the key.
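A minimal sketch of hash partitioning, using MD5 as Cassandra and MongoDB do; the partition count of 8 is made up. Unlike Python's built-in hash(), which is randomized per process, MD5 gives every node the same answer for the same key:

```python
import hashlib

N_PARTITIONS = 8  # made-up; fixed when the cluster is set up

def partition_for_key(key: str) -> int:
    # MD5 need not be cryptographically strong here; we only need
    # skewed keys to spread uniformly over the hash range.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Modulo over a fixed number of partitions (not over the number
    # of nodes) is fine; see the re-balancing section below.
    return int.from_bytes(digest[:8], "big") % N_PARTITIONS
```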
Skewed workloads and relieving hot spots
- Unusual workload: a celebrity user with millions of followers may cause a storm of activity on the same key, where hashing doesn't help, since identical keys hash to the same partition.
- Such skewed cases need to be handled in application code.
- If one key is known to be very hot, add a random number to the beginning or end of the key to spread writes across multiple keys (see the sketch below).
- Reads need more work: they must fetch the data from all split keys and combine it.
- Need to keep track of which keys are being split.
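A minimal sketch of splitting a hot key by appending a random number, with made-up key names and split factor; the set of hot keys must be tracked by the application:

```python
import random

SPLIT_FACTOR = 10                  # made-up
HOT_KEYS = {"celebrity_user_123"}  # keys the application tracks as hot

def key_for_write(key: str) -> str:
    # Writes to a hot key are spread over SPLIT_FACTOR sub-keys,
    # which hash to different partitions.
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SPLIT_FACTOR)}"
    return key

def keys_for_read(key: str) -> list[str]:
    # Reads pay the price: every sub-key must be fetched and combined.
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SPLIT_FACTOR)]
    return [key]
```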
Partitioning and secondary indexes
- Secondary indexes
- A basic feature in relational databases.
- Common in document databases too.
- HBase and Voldemort avoid them.
- Riak has started adding them.
- The problem is that secondary indexes don't map neatly to partitions.
- Two approaches to partitioning with secondary indexes:
- Document-based
- Term-based
Partitioning secondary indexes by document
- Example: a website for listing used cars for sale
- Primary key/document ID: car ID
- Secondary keys: color, make
- Partition the database by document ID.
- Local index: a write only needs to deal with the partition containing the corresponding document ID.
- Reads need to query all partitions and combine the results (scatter/gather), which is expensive; a sketch follows this list.
- e.g., MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB
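A minimal sketch of a scatter/gather read over document-partitioned (local) indexes; the in-memory dicts stand in for each partition's local secondary index, with made-up car IDs:

```python
# Each partition indexes only its own documents.
partition_local_indexes = [
    {("color", "red"): [191], ("make", "Honda"): [191]},
    {("color", "red"): [306, 768], ("make", "Ford"): [306]},
]

def find_by_field(field: str, value: str) -> list[int]:
    # Scatter the query to every partition (parallel RPCs in practice),
    # then gather and merge the partial results.
    results: list[int] = []
    for local_index in partition_local_indexes:
        results.extend(local_index.get((field, value), []))
    return results

print(find_by_field("color", "red"))  # [191, 306, 768]
```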
Partitioning secondary indexes by term
- Use a global index that covers data in all partitions.
- e.g., the entry color:red lists red cars from all partitions.
- The global index itself must be partitioned across nodes too.
- e.g., by the term itself (colors starting with a to r on one partition, s to z on another) or by a hash of the term.
- Advantage
- Efficient reads
- Downside
- Complicated and slower writes (sketched after this list)
- A distributed transaction across partitions is needed for writes.
- Not supported in all databases.
- e.g., DynamoDB, Riak's search feature, Oracle data warehouse
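A minimal sketch of a write against a term-partitioned (global) index; the two in-memory dicts stand in for index partitions, split by the first letter of the term (a to r vs. s to z), and the field names are made up. Note that a single document write touches more than one index partition:

```python
index_partitions = [dict(), dict()]  # terms a-r, terms s-z

def index_partition_for(term: str) -> int:
    return 0 if term[0] <= "r" else 1

def add_listing(doc_id: int, color: str, year: int) -> None:
    # One write may update several index partitions; keeping this
    # atomic would require a distributed transaction, so in practice
    # such index updates are often applied asynchronously.
    for term in (f"color:{color}", f"year:{year}"):
        part = index_partitions[index_partition_for(term)]
        part.setdefault(term, []).append(doc_id)

add_listing(306, "red", 2014)  # updates partition 0 (color:red) and 1 (year:2014)
```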
Re-balancing partitions
- Re-balancing: moving load from one node to another.
- Requirements:
- After re-balancing, load should be fairly distributed.
- While re-balancing, database should continue accepting reads and writes.
- Only necessary data is moved to minimize network and disk I/O.
Strategies for re-balancing
- Avoid hash mod N.
- If the number of nodes N changes, most keys have to move from one node to another, making re-balancing excessively expensive (see the sketch below).
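A small sketch showing why hash mod N re-balancing is expensive: when N goes from 10 to 11, about 90% of keys change nodes. The integers stand in for key hashes:

```python
def moved_fraction(n_keys: int, old_n: int, new_n: int) -> float:
    # A key (here: its hash) stays put only if its residue is unchanged.
    moved = sum(1 for h in range(n_keys) if h % old_n != h % new_n)
    return moved / n_keys

print(moved_fraction(100_000, 10, 11))  # ~0.91: most keys must move
```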
- Fixed number of partitions
- Create many more partitions than nodes and assign several partitions to each node.
- e.g., 1000 partitions on 10 nodes
- To add a node, the new node steals a few partitions from every old node.
- Old partition assignment is used while the transfer is in progress.
- Can account for mismatched hardware by assigning more partitions to more powerful nodes.
- e.g., Riak, Elasticsearch, Couchbase, Voldemort
- The number of partitions N is usually fixed for the lifetime of the database, thus operationally simpler (a re-balancing sketch follows this list).
- Choosing the right N is hard when N is fixed while the dataset size varies:
- High enough to keep individual partitions small and accommodate future growth.
- Not so high as to cause too much management overhead.
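A minimal sketch of re-balancing with a fixed number of partitions, with made-up node names: the new node steals whole partitions from the most-loaded nodes until the load is even; individual keys never move between partitions:

```python
from collections import Counter

def add_node(assignment: dict[int, str], new_node: str) -> None:
    # assignment maps partition id -> node name.
    counts = Counter(assignment.values())
    target = len(assignment) // (len(counts) + 1)  # fair share per node
    stolen = 0
    for partition, node in sorted(assignment.items()):
        if stolen == target:
            break
        if counts[node] > target:
            assignment[partition] = new_node  # move the whole partition
            counts[node] -= 1
            stolen += 1

nodes = [f"node{i}" for i in range(10)]
assignment = {p: nodes[p % 10] for p in range(1000)}  # 100 partitions each
add_node(assignment, "node10")
print(Counter(assignment.values()))  # node10 now holds 90 partitions
```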
- Dynamic partitioning
- Key range-partitioned databases create partitions dynamically.
- Split large partitions and merge small partitions.
- Transfer one split sub-partition to another node to balance the load.
- e.g., HBase, RethinkDB
- Advantage:
- Number of partitions adapts to the total data volume, thus overhead is small.
- Caveat
- An initially empty database starts off with a single partition.
- All load goes to a single node before the first split, while all other nodes sit idle.
- Solution: pre-splitting; HBase and MongoDB allow an initial set of partitions to be configured.
- Pre-splitting requires that you already know what the key distribution is going to look like.
- Dynamic partitioning can equally be used with hash-partitioned data (a split sketch follows).
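A minimal sketch of a dynamic split, modeling each partition as a sorted list of keys with a made-up size threshold (HBase's default region size is around 10 GB):

```python
MAX_KEYS_PER_PARTITION = 4  # made-up threshold

def maybe_split(partitions: list[list[str]], i: int) -> None:
    # Once partition i exceeds the threshold, split it in half around
    # its median key; one half can then move to another node.
    keys = sorted(partitions[i])
    if len(keys) > MAX_KEYS_PER_PARTITION:
        mid = len(keys) // 2
        partitions[i] = keys[:mid]
        partitions.insert(i + 1, keys[mid:])

parts = [["a", "b", "c", "d", "e"]]
maybe_split(parts, 0)
print(parts)  # [['a', 'b'], ['c', 'd', 'e']]
```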
- Partitioning proportionally to nodes
- Make the number of partitions proportional to the number of nodes.
- Fixed number of partitions per node.
- When a new node joins, it randomly chooses a fixed number of existing partitions to split, taking ownership of one half of each split partition and leaving the other half in place.
- e.g., Cassandra, Ketama
Operations: automatic or manual re-balancing
- Fully automated re-balancing can be convenient.
- But it can be unpredictable and dangerous, e.g., overloading nodes or the network, or causing cascading failures when combined with automatic failure detection.
- Good to have a human in the loop.
Request routing
When a client makes a request, which node should it connect to in order to reach the right partition?
- An instance of the problem called service discovery.
- Approaches
- Clients can contact any node.
- The node owning the partition can handle the request.
- Otherwise, forward the request to the appropriate node; receive the reply and pass the reply to the client.
- Send all client requests to a routing tier first.
- Require clients to be aware of the assignment of partitions.
- Difficulty: How does the routing component learn about changes in partition assignment?
- All participants must agree on the changes.
- This is a consensus problem in distributed systems, which is hard to implement.
- Many distributed data systems rely on a separate coordination service such as ZooKeeper.
- Keep track of this cluster metadata.
- Nodes register themselves in ZooKeeper.
- ZooKeeper maintains the authoritative mapping of partitions to nodes.
- Other components can subscribe to this information in ZooKeeper.
- Whenever assignment changes, ZooKeeper notifies those components.
- e.g.,
- Espresso uses Helix (on ZooKeeper).
- HBase, SolrCloud and Kafka use ZooKeeper.
- MongoDB uses its own config server implementation.
- Cassandra and Riak use a gossip protocol among the nodes, with no external coordination service.
- Clients then use DNS to find the IP address of the routing tier or of a random node (a subscription sketch follows).
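A hedged sketch of a routing tier subscribing to partition-assignment changes, using the kazoo ZooKeeper client for Python; the ensemble addresses, znode path, and JSON encoding of the mapping are made up for illustration:

```python
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")  # made-up ensemble addresses
zk.start()

routing_table: dict[str, str] = {}  # partition id -> node address

@zk.DataWatch("/cluster/partition-assignment")  # hypothetical znode path
def on_assignment_change(data, stat):
    # kazoo re-arms the watch after each change, so ZooKeeper pushes
    # every new partition-to-node mapping to this routing tier.
    if data is not None:
        routing_table.clear()
        routing_table.update(json.loads(data))  # assumes a JSON-encoded mapping
```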
Parallel query execution
- Massively parallel processing (MPP) relational databases support much more complex queries than single-key lookups.
- A typical query can be broken into a number of execution stages and partitions, many of which can be executed in parallel on different nodes (see the sketch below).
- This breakdown is done by the MPP query optimizer.
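A toy sketch of the idea, assuming a made-up table of (color, price) rows spread over four partitions: stage one filters and partially aggregates on each partition in parallel, stage two combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up table of (color, price) rows, spread over four partitions.
partitions = [
    [("red", 1), ("blue", 2)],
    [("red", 3)],
    [("blue", 4)],
    [("red", 5)],
]

def partial_sum(rows: list[tuple[str, int]], color: str) -> int:
    # Stage 1: runs on each partition's node -- filter, then pre-aggregate.
    return sum(price for c, price in rows if c == color)

with ThreadPoolExecutor() as pool:
    partials = pool.map(lambda rows: partial_sum(rows, "red"), partitions)

print(sum(partials))  # Stage 2: combine partial results -> 9
```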