2018-10-09

## Partitioning

For very large datasets or very high query throughput, we need to break the data up into partitions.

• A partition contains several records/rows.
• A large dataset can be distributed across many nodes.
• Query load can be distributed across many processors.

Synonyms:

• partition
• shard in MongoDB, Elasticsearch, SolrCloud
• region in HBase
• tablet in Bigtable
• vnode in Cassandra, Riak
• vBucket in Couchbase

### Partitioning and replication

• Copies of one partition may be stored on multiple nodes.
• A node may store more than one partition (see the sketch below).
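
A minimal sketch of such a layout (node and partition names are made up): each node stores several partitions, and each partition is replicated on several nodes, typically with one replica acting as leader.

```python
# Hypothetical cluster layout: every node holds multiple partitions,
# and every partition has copies (one leader, the rest followers) on multiple nodes.
cluster = {
    "node1": {"p0": "leader",   "p1": "follower", "p2": "follower"},
    "node2": {"p1": "leader",   "p2": "follower", "p0": "follower"},
    "node3": {"p2": "leader",   "p0": "follower", "p1": "follower"},
}

# Every partition appears on more than one node,
# and every node stores more than one partition.
for node, parts in cluster.items():
    print(node, sorted(parts))
```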

### Partitioning of key-value data

• Skewed: The partitioning is unfair, so that some partitions have more data or queries than others.
• Hot spot: A partition with disproportionately high load.

#### Partitioning by key range

• Assign a continuous range of keys to each partition.
• Problem: The ranges of keys are not necessarily evenly spaced.
• Because the data may not be evenly distributed.
• Within each partition, keys are in sorted order.
• Easy to do range scans.
• Applications: Bigtable, HBase, RethinkDB, early MongoDB
• Downsides
• Certain access patterns can lead to hot spots.
• Heavy writes if using timestamp as key.
• All writes for today go to the same partition.
• Can be improved (see the sketch after this list)
• Use something other than the timestamp as the first element of the key.
• e.g., Device ID
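
A minimal sketch of key-range lookup, assuming made-up boundary keys and partition names: the boundaries are kept sorted, so finding the partition for a key is a binary search.

```python
import bisect

# Hypothetical boundary keys: partition i holds keys in [boundaries[i-1], boundaries[i]).
boundaries = ["f", "m", "t"]            # splits the key space into 4 ranges
partitions = ["p0", "p1", "p2", "p3"]   # made-up partition names

def partition_for(key: str) -> str:
    """Binary-search the sorted boundaries for the partition that owns `key`."""
    return partitions[bisect.bisect_right(boundaries, key)]

print(partition_for("apple"))   # p0
print(partition_for("melon"))   # p2

# Keys that share a prefix (e.g., today's timestamp) fall into the same range,
# hence the hot spot; prefixing with a device ID spreads them across ranges.
```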

#### Partitioning by hash of key

• A good hash function takes skewed data and makes it uniformly distributed.
• The hash function need not be cryptographically strong.
• MD5 is used by Cassandra and MongoDB.
• The Fowler-Noll-Vo function is used by Voldemort.
• Built-in hash functions in programming languages may not be suitable.
• The same key may have a different hash value in different processes.
• Range queries are not supported well, because adjacent keys are scattered across the partitions (see the sketch after this list).
• Cassandra achieves a compromise between the two partitioning strategies.
• A table in Cassandra can be declared with a compound primary key consisting of several columns.
• Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
• If the first column is fixed to a specific value, queries can perform an efficient range scan over the other columns of the key.
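
A minimal sketch of hash partitioning, assuming MD5 and four partitions: each partition owns a contiguous range of the hash space, and adjacent keys end up scattered.

```python
import hashlib

NUM_PARTITIONS = 4
HASH_SPACE = 2 ** 32

def key_hash(key: str) -> int:
    # MD5 only because it spreads keys uniformly; cryptographic strength is not needed.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def partition_for(key: str) -> int:
    # Each partition owns a contiguous range of the hash space (not hash mod N),
    # so ranges can later be split or merged without rehashing every key.
    return key_hash(key) * NUM_PARTITIONS // HASH_SPACE

# Adjacent keys usually land in different partitions, which is why range queries suffer.
print(partition_for("2018-10-09T12:00:00"), partition_for("2018-10-09T12:00:01"))
```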

#### Skewed workloads and relieving hot spots

• Unusual workloads: a celebrity user with millions of followers may cause a storm of activity on a single key, and hashing doesn't help because all of that activity still maps to the same partition.
• Such skew has to be handled in application code.
• If one key is known to be very hot, add a random number to the beginning or end of the key to spread writes over many sub-keys; reads then have to fetch and combine all of them (see the sketch after this list).
• Keep track of which keys are being split.
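
A minimal sketch of splitting a hot key, assuming a made-up split factor of 100: writes get a random suffix, and reads have to fan out over all sub-keys and merge the results.

```python
import random

SPLIT_FACTOR = 100  # assumed: one hot key is split across 100 sub-keys

def write_key(hot_key: str) -> str:
    # Appending a random suffix spreads writes for the hot key across partitions.
    return f"{hot_key}#{random.randrange(SPLIT_FACTOR)}"

def read_keys(hot_key: str) -> list[str]:
    # Reads must fetch every sub-key and combine them, so only known-hot keys should be split.
    return [f"{hot_key}#{i}" for i in range(SPLIT_FACTOR)]

print(write_key("celebrity-42"))        # e.g. celebrity-42#17
print(len(read_keys("celebrity-42")))   # 100 reads to reassemble one logical key
```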

### Partitioning and secondary indexes

• Secondary indexes
• Basic feature in relational databases.
• Common in document databases too.
• HBase and Voldemort avoid them.
• The problem is that secondary indexes don't map neatly to partitions.
• Two approaches to partitioning with secondary indexes:
• Document-based
• Term-based

#### Partitioning secondary indexes by document

• Example: a website for selling used cars
• Primary key/document ID: car ID
• Secondary keys: color, make
• Partition the database by document ID.
• Local index: each partition maintains its own secondary index, covering only its documents.
• Writes only need to update the index on the partition containing the document ID.
• Reads must query all partitions and combine the results (scatter/gather; see the sketch after this list).
• e.g., MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB
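
A minimal sketch of a document-partitioned (local) index with made-up car data: each partition indexes only its own documents, so a query by color has to be scattered to every partition and the results gathered.

```python
# Each partition keeps a local secondary index covering only its own documents.
partitions = [
    {"docs": {191: {"color": "red", "make": "Honda"}},
     "index": {"color:red": [191], "make:Honda": [191]}},
    {"docs": {306: {"color": "red", "make": "Ford"}},
     "index": {"color:red": [306], "make:Ford": [306]}},
]

def find_cars(term: str) -> list[int]:
    # Scatter the query to every partition, then gather and merge the matches.
    matches: list[int] = []
    for p in partitions:
        matches.extend(p["index"].get(term, []))
    return matches

print(find_cars("color:red"))  # [191, 306] -- every partition had to be asked
```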

#### Partitioning secondary indexes by term

• Use a global index that covers data in all partitions.
• e.g., an entry for the term color:red listing every matching document (see the sketch after this list).
• The index itself must also be partitioned across nodes, e.g., by term or by a hash of the term.
• Downside
• Writes are more complicated and slower: a single document write may affect index entries on several partitions.
• Keeping the index synchronously up to date would require a distributed transaction across partitions, which not all databases support; in practice index updates are often asynchronous.
• e.g., DynamoDB's global secondary indexes, Riak's search feature, Oracle data warehouse
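
A minimal sketch of a term-partitioned (global) index, assuming two index partitions and a CRC32 hash of the term: the entry for a term lives on one index partition, but a single document write can touch several of them.

```python
import zlib

NUM_INDEX_PARTITIONS = 2  # assumed

def index_partition_for(term: str) -> int:
    # The partition holding an index entry is determined by the term itself
    # (hashed here; partitioning by term range also works and helps range scans).
    return zlib.crc32(term.encode()) % NUM_INDEX_PARTITIONS

def index_updates(doc_id: int, doc: dict) -> list[tuple[int, str, int]]:
    # One document write may produce index updates on several partitions,
    # which is why writes are slower and need cross-partition coordination.
    return [(index_partition_for(f"{field}:{value}"), f"{field}:{value}", doc_id)
            for field, value in doc.items()]

print(index_updates(306, {"color": "red", "make": "Ford"}))
```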

### Re-balancing partitions

• Re-balancing: moving load from one node to another.
• Requirements:
• After re-balancing, load should be fairly distributed.
• While re-balancing, database should continue accepting reads and writes.
• Only necessary data is moved to minimize network and disk I/O.

#### Strategies for re-balancing

• Avoid hash mod N
• When the number of nodes N changes, most keys have to be moved to a different node, making re-balancing very expensive (see the sketch after this list).
• Fixed number of partitions
• Create many more partitions than nodes and assign several partitions to each node.
• e.g., 1000 partitions on 10 nodes
• To add a node, the new node steals a few partitions from every old node.
• Old partition assignment is used while the transfer is in progress.
• More powerful nodes can be assigned more partitions so that they take a larger share of the load.
• e.g., Riak, Elasticsearch, Couchbase, Voldemort
• The number of partitions is usually fixed when the database is first set up, which keeps operations simple.
• Choosing the right number is hard when it is fixed up front but the dataset size varies.
• High enough that partitions stay reasonably small and future growth is accommodated.
• Not so high that the per-partition management overhead becomes significant.
• Dynamic partitioning
• Key range-partitioned databases create partitions dynamically.
• Split large partitions and merge small partitions.
• Transfer one split sub-partition to another node to balance the load.
• e.g., HBase, RethinkDB
• Number of partitions adapts to the total data volume, thus overhead is small.
• Caveat
• An initially empty database starts off with a single partition.
• All load goes to a single node until the first split happens, while the other nodes sit idle.
• Solution: Pre-splitting, HBase and MongoDB allow an initial set of partitions to be configured.
• Pre-splitting requires that you already know what the key distribution is going to look like.
• Can also support hash partitioning.
• Partitioning proportionally to nodes
• Make the number of partitions proportional to the number of nodes.
• Fixed number of partitions per node.
• When a new node joins, it randomly picks a fixed number of existing partitions to split, takes one half of each split, and leaves the other half in place.
• e.g., Cassandra, Ketama
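
A minimal sketch of why hash mod N is a poor fit for re-balancing, using CRC32 and made-up keys, followed by the fixed-number-of-partitions alternative:

```python
import zlib

def node_for(key: str, n_nodes: int) -> int:
    return zlib.crc32(key.encode()) % n_nodes

keys = [f"user{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys change node when going from 10 to 11 nodes")
# Typically around 90%: almost the entire dataset would have to be moved.

# Fixed number of partitions: the key-to-partition mapping never changes;
# re-balancing only moves whole partitions between nodes.
NUM_PARTITIONS = 1000
def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS
```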

#### Operations: automatic or manual re-balancing

• Fully automated re-balancing can be convenient.
• But it can be unpredictable and dangerous, e.g., when combined with automatic failure detection it may overload nodes or the network.
• Good to have a human in the loop.

### Request routing

When a client makes a request, which node should it connect to in order to reach the right partition?

• An instance of the problem called service discovery.
• Approaches
• Clients can contact any node.
• The node owning the partition can handle the request.
• Otherwise, forward the request to the appropriate node; receive the reply and pass the reply to the client.
• Send all client requests to a routing tier first, which forwards each request to the appropriate node (a minimal sketch follows this list).
• Require clients to be aware of the assignment of partitions.
• Difficulty: How does the routing component learn about changes in partition assignment?
• All participants must agree on the new changes.
• Consensus in a distributed system, hard to implement.
• Many distributed data systems rely on a separate coordination service such as ZooKeeper.
• Keep track of this cluster metadata.
• Nodes register themselves in ZooKeeper.
• ZooKeeper maintains the authoritative mapping of partitions to nodes.
• Other components can subscribe to this information in ZooKeeper.
• Whenever assignment changes, ZooKeeper notifies those components.
• e.g.
• Espresso uses Helix (on ZooKeeper).
• HBase, SolrCloud and Kafka use ZooKeeper.
• MongoDB uses its own service.
• Cassandra and Riak use a gossip protocol among the nodes.
• Use DNS to find the IP address of routing tier or random nodes.
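
A minimal sketch of a routing tier (all names are made up): it caches the partition-to-node assignment, refreshes it when the coordination service reports a change, and forwards each request to the node owning the key.

```python
import zlib

class Router:
    """Toy partition-aware routing tier."""

    def __init__(self, num_partitions: int, assignment: dict[int, str]):
        self.num_partitions = num_partitions
        self.assignment = assignment          # partition id -> node address

    def on_assignment_change(self, assignment: dict[int, str]) -> None:
        # In practice this would be triggered by a watch/notification from a
        # coordination service such as ZooKeeper.
        self.assignment = assignment

    def node_for(self, key: str) -> str:
        partition = zlib.crc32(key.encode()) % self.num_partitions
        return self.assignment[partition]

router = Router(4, {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-c"})
print(router.node_for("user123"))  # the node this request should be forwarded to
```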

#### Parallel query execution

• Massively parallel processing (MPP) relational databases support much more complex queries than simple single-key lookups.
• The MPP query optimizer breaks a query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes (see the sketch below).
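
A minimal sketch of the idea, with made-up data: an aggregate query is split into per-partition stages that run in parallel, and their partial results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands for the rows of one partition (made-up data).
partitions = [[3, 5, 8], [1, 2], [7, 9, 4, 6]]

def partial_sum(rows: list[int]) -> int:
    # Stage 1: each partition computes its share of the aggregate in parallel.
    return sum(rows)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

# Stage 2: combine the per-partition results into the final answer.
print(sum(partials))  # 45
```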