Read "Bigtable, A Distributed Storage System for Structured Data"

Reference

Abstract

Distributed storage for managing structured data.
Scale to petabytes of data across thousands of servers.
Satisfying different client demands such as data size, latency etc.

1 Introduction

Goals:
- Wide applicability, scalability, high performance and high availability.
Simple data model, not a full relational data model.
- Dynamic control over data layout and format.
- Clients can reason about locality properties.
Indexed using row and column names.
Data are treated as uninterpreted strings.
- Clients can serialize various data types into them.
Clients can control whether to serve data out of memory or from disk.

2 Data model

Bigtable is a sparse, distributed, persistent multi-dimensional sorted map, indexed by a row key, column key and a timestamp. The value is an uninterpreted string. (row:string, column:string, time:int64) -> string

A sample piece of data for web page is as follows. Row key is url; column key is content, anchor:cnnsi.com, etc.

Data model example

2.1 Row

Size: up to 64 KB.
Each read/write under a single row key is atomic.
- No matter how many columns are involved.
Data are lexicographic ordered by row key.
Tablet: dynamically partitioned row range.
- Unit of distribution and load balancing.
Data access can gain good locality.
- i.e., use reversed URL.

2.2 Column families

Column families: grouped sets of column keys.
- Basic unit of access control: different user can read/write different groups individually.
- Basic unit of disk and memory accounting.
Data in a family are usually of the same type.
Number of column families should be small (up to hundreds) and rarely changed during operations.
Column key: family:qualifier
- In the example of web page data, anchor is a family.

2.3 Timestamps

Represents multiple versions of the same data.
Can be assigned by Bigtable or clients.
Decreasing ordered.
Versions can be garbage collected automatically.

3 API

Table-related
- Creates/Deletes tables and column families.
- Changes cluster, table and column family metadata, e.g., access control.
Values
- Writes/Deletes values.
- Looks up values from individual rows.
- Iterates over a subset of data in a table.
Other
- Single-row transactions.
  - e.g., read-modify-write.
- Batching writes across row keys.
- Cells can be used as integer counters.
- Execution of client-supplied scripts.
- Can be used with MapReduce, as inputs/outputs.

4 Building blocks

GFS: stores logs and data files.
Google SSTable is used to store data.
- Persistent, ordered, immutable map from key to values.
- A SSTable contains a sequence of blocks.
  - Block indexes are used to locate blocks. (Stored at the end of SSTable)
  - Data lookup: Load index into memory -> Find the block by binary search -> Read the block.
Chubby
- Highly available, persistent distributed lock service.
- Usages for Bigtable:
  - Ensure at most one active master.
  - Store bootstrap location of data.
  - Discover tablet servers and finalize tablet server deaths.
  - Store Bigtable schema information.
  - Store access control lists.

5 Implementation

3 major components: a library linked into every clients, one master server and many tablet servers.

Master is responsible for
- assigning tablets to tablet serves;
- detecting addition and deletion of tablet servers;
- load-balancing;
- garbage collection;
- schema changes.
Tablet server
- managers a set of tablets (10 - ~1000);
  - A tablet is a row range in the table, of size 100-200 MB by default.
- handles read and writes operations;
  - Clients talk with it directly, not via the master.
- splits tablets.

Tablet location

3-level hierarchy to store tablet location information. Locations are also cached on client library.

A Chubby file contains the location of a root tablet.
- Root tablet contains the locations of all tablets in a special METADATA table.
- Root tablet is the first tablet of METADATA table, but never split.
  - So no more than 3 levels.
METADATA table
- Contains locations of user tablets.
User tablets

Tablet assignment

Master keeps tracks of:

the set of live tablet servers
- Uses Chubby exclusive locks.
- If a tablet server is unavailable, the master can detect it and reassign its tablets.
  - By trying to acquire its lock.
current assignment of tablets to tablet servers, including unassigned ones
- Master needs to discover the assignments upon startup.
  - scans and communicates with all tablet servers.
  - scans METADATA table for all tablets.

Master initiates the creation, deletion of table and merging two tablets, thus keeps tracks of these changes regarding tablet assignments. Besides them, tablet split is initialized by tablet server. This change would be known by the master upon change committing or reading the split tablet.

Tablet serving

A commit log for redo records.
A memtable for recent updates.
A sequence of SSTable.

Operations:

Recover tablet: reads the indices of SSTables into memory and reconstructs the memtable by applying all updates since redo points.
Writes:
- Write to the commit log.
- Write to the memtable.
- Group commit is good for improving throughput of small mutations.
Reads: the server reads the merged view of the sequence of SSTables and memtable.

Compactions

Minor compaction
- Write the reach-threshold memtable into SSTable.
  - Reduce memory usage.
  - Reduce amount of data to read during recovery.
Merging compaction
- Merge a few SSTables and memtable into a new SSTable.
Major compaction
- Merge all SSTables into a new SSTable.
- Remove all deletion entries(tombstone) in old SSTables.
  - Tombstone of a record in a newer SSTable prevent the use of the record in a older SSTable.

Locality groups
- Groups multiple column families to use a separate SSTable, in a tablet.
  - Accessed together frequently.
- Can be declared to be in-memory.
Compression
- Can specify whether to compress and compression format for locality groups.
- Example compression format: 1st pass using Bently & McIlroy’s scheme; 2nd pass with a fast compression algorithm.
  - Often 10-to-1 reduction in space.
Caching for read performance
- Scan cache: higher-level caching the k-v pairs returned by the SSTable interface.
- Block cache: lower-level caching the SSTable blocks that were read from GFS.
Bloom filters
- Whether a SSTable might contain any data for a specified row/column pair.
Commit-log implementation
- Appends mutations to a single commit log per tablet server, for all tablets.
  - Avoids a large number of log file being written concurrently to GFS.
  - Recovery becomes complicated since tablets will be moved to many other tablet servers.
    - Master will initialize a sort of the log entries with respect with keys to tackle this.
- Tablet server will maintain two log writing threads. If one thread performs poorly, switches to the other one.
Speeding up tablet recovery
- The source tablet server with do minor compaction on tablets and the log to speed up.
Exploiting immutability
- All generated SSTables are immutable.
  - To delete data, use garbage collection on obsolete SSTable data.
  - Can split the tablet quickly. Child tablets share the SSTable of the parent tablet.
- Memtables are mutable; we use copy-on-write to reduce contentions during reads and reads and writes can be proceeded in parallel.

Performance evaluation

Performance doesn’t increase linearly with respect to the number of servers.
- There is a significant drop in per-server throughput.
  - Imbalance in load in multiple server configurations.
  - Network links are saturated.

Lessons

Systems are vulnerable to many types of failures, expected or unexpected.
Add new features until it’s clear how they would be used.
Proper system-level monitoring.
Value of simple design.

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 16

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 16

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 16

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1