Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Bigtable, A Distributed Storage System for Structured Data"



Link to paper


  • Distributed storage for managing structured data.
  • Scale to petabytes of data across thousands of servers.
  • Satisfying different client demands such as data size, latency etc.

1 Introduction

  • Goals:
    • Wide applicability, scalability, high performance and high availability.
  • Simple data model, not a full relational data model.
    • Dynamic control over data layout and format.
    • Clients can reason about locality properties.
  • Indexed using row and column names.
  • Data are treated as uninterpreted strings.
    • Clients can serialize various data types into them.
  • Clients can control whether to serve data out of memory or from disk.

2 Data model

Bigtable is a sparse, distributed, persistent multi-dimensional sorted map, indexed by a row key, column key and a timestamp. The value is an uninterpreted string. (row:string, column:string, time:int64) -> string

A sample piece of data for web page is as follows. Row key is url; column key is content,, etc.

Data model example

2.1 Row

  • Size: up to 64 KB.
  • Each read/write under a single row key is atomic.
    • No matter how many columns are involved.
  • Data are lexicographic ordered by row key.
  • Tablet: dynamically partitioned row range.
    • Unit of distribution and load balancing.
  • Data access can gain good locality.
    • i.e., use reversed URL.

2.2 Column families

  • Column families: grouped sets of column keys.
    • Basic unit of access control: different user can read/write different groups individually.
    • Basic unit of disk and memory accounting.
  • Data in a family are usually of the same type.
  • Number of column families should be small (up to hundreds) and rarely changed during operations.
  • Column key: family:qualifier
    • In the example of web page data, anchor is a family.

2.3 Timestamps

  • Represents multiple versions of the same data.
  • Can be assigned by Bigtable or clients.
  • Decreasing ordered.
  • Versions can be garbage collected automatically.


  • Table-related
    • Creates/Deletes tables and column families.
    • Changes cluster, table and column family metadata, e.g., access control.
  • Values
    • Writes/Deletes values.
    • Looks up values from individual rows.
    • Iterates over a subset of data in a table.
  • Other
    • Single-row transactions.
      • e.g., read-modify-write.
    • Batching writes across row keys.
    • Cells can be used as integer counters.
    • Execution of client-supplied scripts.
    • Can be used with MapReduce, as inputs/outputs.

4 Building blocks

  • GFS: stores logs and data files.
  • Google SSTable is used to store data.
    • Persistent, ordered, immutable map from key to values.
    • A SSTable contains a sequence of blocks.
      • Block indexes are used to locate blocks. (Stored at the end of SSTable)
      • Data lookup: Load index into memory -> Find the block by binary search -> Read the block.
  • Chubby
    • Highly available, persistent distributed lock service.
    • Usages for Bigtable:
      • Ensure at most one active master.
      • Store bootstrap location of data.
      • Discover tablet servers and finalize tablet server deaths.
      • Store Bigtable schema information.
      • Store access control lists.

5 Implementation

3 major components: a library linked into every clients, one master server and many tablet servers.

  • Master is responsible for
    • assigning tablets to tablet serves;
    • detecting addition and deletion of tablet servers;
    • load-balancing;
    • garbage collection;
    • schema changes.
  • Tablet server
    • managers a set of tablets (10 - ~1000);
      • A tablet is a row range in the table, of size 100-200 MB by default.
    • handles read and writes operations;
      • Clients talk with it directly, not via the master.
    • splits tablets.

Tablet location

3-level hierarchy to store tablet location information. Locations are also cached on client library.

  1. A Chubby file contains the location of a root tablet.
    • Root tablet contains the locations of all tablets in a special METADATA table.
    • Root tablet is the first tablet of METADATA table, but never split.
      • So no more than 3 levels.
  2. METADATA table
    • Contains locations of user tablets.
  3. User tablets

Tablet assignment

Master keeps tracks of:

  • the set of live tablet servers
    • Uses Chubby exclusive locks.
    • If a tablet server is unavailable, the master can detect it and reassign its tablets.
      • By trying to acquire its lock.
  • current assignment of tablets to tablet servers, including unassigned ones
    • Master needs to discover the assignments upon startup.
      • scans and communicates with all tablet servers.
      • scans METADATA table for all tablets.

Master initiates the creation, deletion of table and merging two tablets, thus keeps tracks of these changes regarding tablet assignments. Besides them, tablet split is initialized by tablet server. This change would be known by the master upon change committing or reading the split tablet.

Tablet serving

  • A commit log for redo records.
  • A memtable for recent updates.
  • A sequence of SSTable.


  • Recover tablet: reads the indices of SSTables into memory and reconstructs the memtable by applying all updates since redo points.
  • Writes:
    • Write to the commit log.
    • Write to the memtable.
    • Group commit is good for improving throughput of small mutations.
  • Reads: the server reads the merged view of the sequence of SSTables and memtable.


  • Minor compaction
    • Write the reach-threshold memtable into SSTable.
      • Reduce memory usage.
      • Reduce amount of data to read during recovery.
  • Merging compaction
    • Merge a few SSTables and memtable into a new SSTable.
  • Major compaction
    • Merge all SSTables into a new SSTable.
    • Remove all deletion entries(tombstone) in old SSTables.
      • Tombstone of a record in a newer SSTable prevent the use of the record in a older SSTable.


  • Locality groups
    • Groups multiple column families to use a separate SSTable, in a tablet.
      • Accessed together frequently.
    • Can be declared to be in-memory.
  • Compression
    • Can specify whether to compress and compression format for locality groups.
    • Example compression format: 1st pass using Bently & McIlroy’s scheme; 2nd pass with a fast compression algorithm.
      • Often 10-to-1 reduction in space.
  • Caching for read performance
    • Scan cache: higher-level caching the k-v pairs returned by the SSTable interface.
    • Block cache: lower-level caching the SSTable blocks that were read from GFS.
  • Bloom filters
    • Whether a SSTable might contain any data for a specified row/column pair.
  • Commit-log implementation
    • Appends mutations to a single commit log per tablet server, for all tablets.
      • Avoids a large number of log file being written concurrently to GFS.
      • Recovery becomes complicated since tablets will be moved to many other tablet servers.
        • Master will initialize a sort of the log entries with respect with keys to tackle this.
    • Tablet server will maintain two log writing threads. If one thread performs poorly, switches to the other one.
  • Speeding up tablet recovery
    • The source tablet server with do minor compaction on tablets and the log to speed up.
  • Exploiting immutability
    • All generated SSTables are immutable.
      • To delete data, use garbage collection on obsolete SSTable data.
      • Can split the tablet quickly. Child tablets share the SSTable of the parent tablet.
    • Memtables are mutable; we use copy-on-write to reduce contentions during reads and reads and writes can be proceeded in parallel.

Performance evaluation

  • Performance doesn’t increase linearly with respect to the number of servers.
    • There is a significant drop in per-server throughput.
      • Imbalance in load in multiple server configurations.
      • Network links are saturated.


  • Systems are vulnerable to many types of failures, expected or unexpected.
  • Add new features until it’s clear how they would be used.
  • Proper system-level monitoring.
  • Value of simple design.

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2020. All rights reserved by melonskin. Powered by Jekyll.