2018-10-09

## Partitioning

For very large datasets or very high query throughput, we need to break the data up into partitions.

• A partition contains several records/rows.
• A large dataset can be distributed across many nodes.
• Query load can be distributed across many processors.

Synonyms:

• partition
• shard in MongoDB, Elasticsearch, SolrCloud
• region in HBase
• tablet in Bigtable
• vnode in Cassandra, Riak
• vBucket in Couchbase

### Partitioning and replication

• Copies of one partition may be stored on multiple nodes.
• A node may store more than one partition (see the sketch below).
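
A minimal sketch of such a layout (node and partition names are made up): each node stores several partitions, and each partition is replicated on several nodes, typically with one replica acting as leader.

```python
# Hypothetical cluster layout: every node holds multiple partitions,
# and every partition has copies (one leader, the rest followers) on multiple nodes.
cluster = {
    "node1": {"p0": "leader",   "p1": "follower", "p2": "follower"},
    "node2": {"p1": "leader",   "p2": "follower", "p0": "follower"},
    "node3": {"p2": "leader",   "p0": "follower", "p1": "follower"},
}

# Every partition appears on more than one node,
# and every node stores more than one partition.
for node, parts in cluster.items():
    print(node, sorted(parts))
```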

### Partitioning of key-value data

• Skewed: The partitioning is unfair, so that some partitions have more data or queries than others.
• Hot spot: A partition with disproportionately high load.

#### Partitioning by key range

• Assign a continuous range of keys to each partition.
• Problem: The ranges of keys are not necessarily evenly spaced.
• Because the data may not be evenly distributed.
• Within each partition, keys are in sorted order.
• Easy to do range scans.
• Applications: Bigtable, HBase, RethinkDB, early MongoDB
• Downsides
• Certain access patterns can lead to hot spots.
• Heavy writes if using timestamp as key.
• All writes for today go to the same partition.
• Can be improved (see the sketch after this list)
• Use something other than the timestamp as the first element of the key.
• e.g., Device ID
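
A minimal sketch of key-range lookup, assuming made-up boundary keys and partition names: the boundaries are kept sorted, so finding the partition for a key is a binary search.

```python
import bisect

# Hypothetical boundary keys: partition i holds keys in [boundaries[i-1], boundaries[i]).
boundaries = ["f", "m", "t"]            # splits the key space into 4 ranges
partitions = ["p0", "p1", "p2", "p3"]   # made-up partition names

def partition_for(key: str) -> str:
    """Binary-search the sorted boundaries for the partition that owns `key`."""
    return partitions[bisect.bisect_right(boundaries, key)]

print(partition_for("apple"))   # p0
print(partition_for("melon"))   # p2

# Keys that share a prefix (e.g., today's timestamp) fall into the same range,
# hence the hot spot; prefixing with a device ID spreads them across ranges.
```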

#### Partitioning by hash of key

• A good hash function takes skewed data and makes it uniformly distributed.
• The hash function need not be cryptographically strong.
• MD5 is used by Cassandra and MongoDB.
• The Fowler-Noll-Vo function is used by Voldemort.
• Built-in hash functions in programming languages may not be suitable.
• The same key may have a different hash value in different processes.
• Range queries are not supported well, because adjacent keys are scattered across the partitions (see the sketch after this list).
• Cassandra achieves a compromise between the two partitioning strategies.
• A table in Cassandra can be declared with a compound primary key consisting of several columns.
• Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
• If the first column is fixed to a specific value, queries can perform an efficient range scan over the other columns of the key.
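
A minimal sketch of hash partitioning, assuming MD5 and four partitions: each partition owns a contiguous range of the hash space, and adjacent keys end up scattered.

```python
import hashlib

NUM_PARTITIONS = 4
HASH_SPACE = 2 ** 32

def key_hash(key: str) -> int:
    # MD5 only because it spreads keys uniformly; cryptographic strength is not needed.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def partition_for(key: str) -> int:
    # Each partition owns a contiguous range of the hash space (not hash mod N),
    # so ranges can later be split or merged without rehashing every key.
    return key_hash(key) * NUM_PARTITIONS // HASH_SPACE

# Adjacent keys usually land in different partitions, which is why range queries suffer.
print(partition_for("2018-10-09T12:00:00"), partition_for("2018-10-09T12:00:01"))
```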

#### Skewed workloads and relieving hot spots

• Unusual workloads: a celebrity user with millions of followers may cause a storm of activity on a single key, and hashing doesn't help because all of that activity still maps to the same partition.
• Such skew has to be handled in application code.
• If one key is known to be very hot, add a random number to the beginning or end of the key to spread writes over many sub-keys; reads then have to fetch and combine all of them (see the sketch after this list).
• Keep track of which keys are being split.
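
A minimal sketch of splitting a hot key, assuming a made-up split factor of 100: writes get a random suffix, and reads have to fan out over all sub-keys and merge the results.

```python
import random

SPLIT_FACTOR = 100  # assumed: one hot key is split across 100 sub-keys

def write_key(hot_key: str) -> str:
    # Appending a random suffix spreads writes for the hot key across partitions.
    return f"{hot_key}#{random.randrange(SPLIT_FACTOR)}"

def read_keys(hot_key: str) -> list[str]:
    # Reads must fetch every sub-key and combine them, so only known-hot keys should be split.
    return [f"{hot_key}#{i}" for i in range(SPLIT_FACTOR)]

print(write_key("celebrity-42"))        # e.g. celebrity-42#17
print(len(read_keys("celebrity-42")))   # 100 reads to reassemble one logical key
```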

### Partitioning and secondary indexes

• Secondary indexes
• Basic feature in relational databases.
• Common in document databases too.
• HBase and Voldemort avoid them.
• The problem is that secondary indexes don't map neatly to partitions.
• Two approaches to partitioning with secondary indexes:
• Document-based
• Term-based

#### Partitioning secondary indexes by document

• Example: a website for selling used cars
• Primary key/document ID: car ID
• Secondary keys: color, make
• Partition the database by document ID.
• Local index: each partition maintains its own secondary index, covering only its documents.
• Writes only need to update the index on the partition containing the document ID.
• Reads must query all partitions and combine the results (scatter/gather; see the sketch after this list).
• e.g., MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB
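
A minimal sketch of a document-partitioned (local) index with made-up car data: each partition indexes only its own documents, so a query by color has to be scattered to every partition and the results gathered.

```python
# Each partition keeps a local secondary index covering only its own documents.
partitions = [
    {"docs": {191: {"color": "red", "make": "Honda"}},
     "index": {"color:red": [191], "make:Honda": [191]}},
    {"docs": {306: {"color": "red", "make": "Ford"}},
     "index": {"color:red": [306], "make:Ford": [306]}},
]

def find_cars(term: str) -> list[int]:
    # Scatter the query to every partition, then gather and merge the matches.
    matches: list[int] = []
    for p in partitions:
        matches.extend(p["index"].get(term, []))
    return matches

print(find_cars("color:red"))  # [191, 306] -- every partition had to be asked
```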

#### Partitioning secondary indexes by term

• Use a global index that covers data in all partitions.
• e.g., an entry for the term color:red listing every matching document (see the sketch after this list).
• The index itself must also be partitioned across nodes, e.g., by term or by a hash of the term.
• Downside
• Writes are more complicated and slower: a single document write may affect index entries on several partitions.
• Keeping the index synchronously up to date would require a distributed transaction across partitions, which not all databases support; in practice index updates are often asynchronous.
• e.g., DynamoDB's global secondary indexes, Riak's search feature, Oracle data warehouse
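
A minimal sketch of a term-partitioned (global) index, assuming two index partitions and a CRC32 hash of the term: the entry for a term lives on one index partition, but a single document write can touch several of them.

```python
import zlib

NUM_INDEX_PARTITIONS = 2  # assumed

def index_partition_for(term: str) -> int:
    # The partition holding an index entry is determined by the term itself
    # (hashed here; partitioning by term range also works and helps range scans).
    return zlib.crc32(term.encode()) % NUM_INDEX_PARTITIONS

def index_updates(doc_id: int, doc: dict) -> list[tuple[int, str, int]]:
    # One document write may produce index updates on several partitions,
    # which is why writes are slower and need cross-partition coordination.
    return [(index_partition_for(f"{field}:{value}"), f"{field}:{value}", doc_id)
            for field, value in doc.items()]

print(index_updates(306, {"color": "red", "make": "Ford"}))
```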

### Re-balancing partitions

• Re-balancing: moving load from one node to another.
• Requirements:
• After re-balancing, load should be fairly distributed.
• While re-balancing, database should continue accepting reads and writes.
• Only necessary data is moved to minimize network and disk I/O.

#### Strategies for re-balancing

• Avoid hash mod N
• When the number of nodes N changes, most keys have to be moved to a different node, making re-balancing very expensive (see the sketch after this list).
• Fixed number of partitions
• Create many more partitions than nodes and assign several partitions to each node.
• e.g., 1000 partitions on 10 nodes
• To add a node, the new node steals a few partitions from every old node.
• Old partition assignment is used while the transfer is in progress.
• More powerful nodes can be assigned more partitions so that they take a larger share of the load.
• e.g., Riak, Elasticsearch, Couchbase, Voldemort
• The number of partitions is usually fixed when the database is first set up, which keeps operations simple.
• Choosing the right number is hard when it is fixed up front but the dataset size varies.
• High enough that partitions stay reasonably small and future growth is accommodated.
• Not so high that the per-partition management overhead becomes significant.
• Dynamic partitioning
• Key range-partitioned databases create partitions dynamically.
• Split large partitions and merge small partitions.
• Transfer one split sub-partition to another node to balance the load.
• e.g., HBase, RethinkDB
• Number of partitions adapts to the total data volume, thus overhead is small.
• Caveat
• An initially empty database starts off with a single partition.
• All load goes to a single node until the first split happens, while the other nodes sit idle.
• Solution: Pre-splitting, HBase and MongoDB allow an initial set of partitions to be configured.
• Pre-splitting requires that you already know what the key distribution is going to look like.
• Can also support hash partitioning.
• Partitioning proportionally to nodes
• Make the number of partitions proportional to the number of nodes.
• Fixed number of partitions per node.
• When a new node joins, it randomly picks a fixed number of existing partitions to split, takes one half of each split, and leaves the other half in place.
• e.g., Cassandra, Ketama
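
A minimal sketch of why hash mod N is a poor fit for re-balancing, using CRC32 and made-up keys, followed by the fixed-number-of-partitions alternative:

```python
import zlib

def node_for(key: str, n_nodes: int) -> int:
    return zlib.crc32(key.encode()) % n_nodes

keys = [f"user{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys change node when going from 10 to 11 nodes")
# Typically around 90%: almost the entire dataset would have to be moved.

# Fixed number of partitions: the key-to-partition mapping never changes;
# re-balancing only moves whole partitions between nodes.
NUM_PARTITIONS = 1000
def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS
```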

#### Operations: automatic or manual re-balancing

• Fully automated re-balancing can be convenient.
• But it can be unpredictable and dangerous, e.g., when combined with automatic failure detection it may overload nodes or the network.
• Good to have a human in the loop.

### Request routing

When a client makes a request, which node should it connect to in order to reach the right partition?

• An instance of the problem called service discovery.
• Approaches
• Clients can contact any node.
• The node owning the partition can handle the request.
• Otherwise, forward the request to the appropriate node; receive the reply and pass the reply to the client.
• Send all client requests to a routing tier first, which forwards each request to the appropriate node (a minimal sketch follows this list).
• Require clients to be aware of the assignment of partitions.
• Difficulty: How does the routing component learn about changes in partition assignment?
• All participants must agree on the new changes.
• Consensus in a distributed system, hard to implement.
• Many distributed data systems rely on a separate coordination service such as ZooKeeper.
• Keep track of this cluster metadata.
• Nodes register themselves in ZooKeeper.
• ZooKeeper maintains the authoritative mapping of partitions to nodes.
• Other components can subscribe to this information in ZooKeeper.
• Whenever assignment changes, ZooKeeper notifies those components.
• e.g.
• Espresso uses Helix (on ZooKeeper).
• HBase, SolrCloud and Kafka use ZooKeeper.
• MongoDB uses its own service.
• Cassandra and Riak use a gossip protocol among the nodes.
• Use DNS to find the IP address of routing tier or random nodes.
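
A minimal sketch of a routing tier (all names are made up): it caches the partition-to-node assignment, refreshes it when the coordination service reports a change, and forwards each request to the node owning the key.

```python
import zlib

class Router:
    """Toy partition-aware routing tier."""

    def __init__(self, num_partitions: int, assignment: dict[int, str]):
        self.num_partitions = num_partitions
        self.assignment = assignment          # partition id -> node address

    def on_assignment_change(self, assignment: dict[int, str]) -> None:
        # In practice this would be triggered by a watch/notification from a
        # coordination service such as ZooKeeper.
        self.assignment = assignment

    def node_for(self, key: str) -> str:
        partition = zlib.crc32(key.encode()) % self.num_partitions
        return self.assignment[partition]

router = Router(4, {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-c"})
print(router.node_for("user123"))  # the node this request should be forwarded to
```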

#### Parallel query execution

• Massively parallel processing (MPP) relational databases support much more complex queries than simple single-key lookups.
• The MPP query optimizer breaks a query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes (see the sketch below).
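
A minimal sketch of the idea, with made-up data: an aggregate query is split into per-partition stages that run in parallel, and their partial results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands for the rows of one partition (made-up data).
partitions = [[3, 5, 8], [1, 2], [7, 9, 4, 6]]

def partial_sum(rows: list[int]) -> int:
    # Stage 1: each partition computes its share of the aggregate in parallel.
    return sum(rows)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

# Stage 2: combine the per-partition results into the final answer.
print(sum(partials))  # 45
```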