2018-10-03

## Storage and Retrieval

### Data structures that power your DB

Indexing:

• An index is an additional structure that is derived from the primary data.
• An important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.

#### Hash indexes

For the case of key-value storage

• In-memory hash table to store key and the record’s byte offset in the data file
• Good for situations where value for each key is updated frequently
• A lot of writes
• Not too many distinct key
• To avoid running out of disk space, break the log into segments of a certain size
• Compaction: throw away duplicate keys, keep only the most recent one
• Merge: merge smaller segments
• Keep number of segments small
• Both can be done in a background thread, and it won’t affect reads and writes
• Each segment maintains its own hash table

Implementation issues:

• File format
• Just use a binary format, first encode the length of a string in bytes, followed by the raw string
• Deleting records
• Append a special deletion record to the data file and to be handled during merging
• Crash recovery
• In-memory hash tables is lost during a crash
• Store a snapshot of each segment’s hash map on disk to avoid reading entire dataset
• Partially written records
• DB may crash while writing a record
• Include checksums in files to detect corrupted parts
• Concurrency control
• Use only one write thread to keep sequential order

Append-only log:

• Appending and segment merging are sequential write operations, which are generally much faster than random writes, on HDD or somehow SSD.
• Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
• Merging old segments avoids the problem of data files getting fragmented over time.

Limitations:

• What if the memory cannot hold the hash table?
• Slow if we put the hash table on disk
• Range queries are not efficient.

#### SSTables and LSM-Trees

SSTable stands for Sorted String Table. Two changes comparing with hash table.

• The sequence of k-v pairs is sorted by key.
• Unique key within each merged segment file. (ensured by the compaction)

Advantages comparing with log segments with hash table:

• Support range queries pretty well
• Works well while DB size is much larger than memory
• High write throughput because of sequential writes
• Merging segments is simple and efficient, even if the files are bigger than the available memory.
• mergesort
• Don’t need to keep all keys in memory (can be sparse, like kBs), just find a key range in memory including the key to be searched and linear scan the range

Constructing and maintaining SSTables:

1. For a write, add it to an in-memory balanced tree (e.g., red-black tree), called a memtable.
2. If a memtable is bigger than a threshold (a few MB), write it out the disk as an SSTable.
3. For a read, find the key in current memtable, then most recent on-disk segment, then next-older one, etc.
4. merge and compaction in the background

SSTable’s problem is that we may lose data in current memtable. To solve it, we keep a separate log on disk to append every write immediately. Discard records while putting memtable into SSTable.

Originally this indexing structure was described by Patrick O’Neil et al. under the name Log-Structured Merge-Tree (or LSM-Tree).

Optimization:

• For non-existing keys, look up can be slow because we need to look up every segment
• Use bloom filters (a memory-efficient data structure for approximating the contents of a set)
• We can know if a key appear in the DB.
• Different strategies to determine the order and timing of how SSTables are compacted and merged.

#### B-Trees

Most widely used indexing structure. k-v pairs sorted by key, similar to LSM Tree. It supports efficient key lookups and range queries.

Implementation: Check out this note

• Break the DB down into fixed-size blocks or pages (traditionally 4 KB), and read or write one page at a time.
• Each page can be identified using an address (on disk), which can be used to refer another page.
• A tree of pages
• Update, insert, delete records
• May need to split or merge nodes
• $O(\log(n))$ operation
• Most DB can fit into 3 or 4-level tree
• 4-level of 4 KB pages with branching factor of 500 stores up to 256 TB

Reliability

• Write operation is usually to overwrite a page, which is not sequential.
• A lot of overhead if we consider disk drive operations (move head, etc)
• Several different pages may need to be overwritten, thus dangerous if crash happens.
• Solution is to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
• append-only
• must be written before doing anything to the tree
• Concurrency control:
• Without it, a thread may see the tree in an inconsistent state.
• Typically done by protecting the tree’s data structures with latches (lightweight locks).

#### Comparing B-Trees and LSM-Trees

• Generally, LSM-trees are typically faster for writes
• B-trees are thought to be faster for reads
• May need to check on specific application and data

• Write a piece of data only once (what about writing to the log?)
• B-tree writes twice, the tree page and write-ahead log.
• LSM-tree may need to rewrite data multiple times due to compaction and merging.
• write amplification, harmful to SSD.
• may be a performance bottleneck due to limited disk bandwidth, in the case of heavy writes.
• Sustain higher write throughput than B-trees
• sometimes has lower write amplification.
• sequential writes instead of random.
• Can be compressed better, thus often smaller files on disk than B-trees.
• inner fragmentation in B-trees
• SSD flavors lower write amplification and reduced fragmentation

• The compaction process can sometimes interfere with the performance of ongoing reads and writes.
• Disks have limited resources
• The response time of queries to log-structured storage engines can sometimes be quite high.
• At high write throughput, the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
• Compaction cannot keep up with the rate of incoming writes.
• more and more segments until using up disk
• slow reads because of more segments to check
• Unique key in B-tree, maybe multiple copies in LSM-tree
• Unique key supports strong transactional semantics. We can directly attach locks to the tree
• In many relational databases, transaction isolation is implemented using locks on ranges of keys

#### Other indexing structures

We can create secondary index for a record to query joins efficiently. The keys are not unique. Either we can have a index value contains a list of matching row ID, or make the value unique by attaching the row ID.

• The value of the index can refer to a place storing a heap file (a row).
• Good if we have multiple indexing, referring to the same file
• But for read performance, we can also store actual data in the index (clustered index)
• Primary key in InnoDB, 2nd index refers to the primary key
• May be inconsistent
• A compromise
• Covering index or index with included columns
• Stores some of a table’s columns within the index

Storage for other types (other than k-v)

• Multi-column indexes
• Concatenated index
• Combine several fields into one key
• Multi-dimensional indexes
• e.g., 2D locations; date and temperature
• R-trees, HyperDex
• Full-text search and fuzzy indexes
• SSTable-like structure for term dict
• A small in-memory index that tells queries at which offset to look for a key, similar to trie
• Search text within a certain edit distance
• Keeping everything in memory
• Cheaper RAM now
• Caching: memcached, ok if losing everything
• Durability:
• Special hardware like battery-powered RAM
• Writes changes, periodic snapshots into disk
• Better performance because we don’t need to encode in-memory data into disk format。
• Not because we don’t need to read from disk
• Disk data can be cached in memory, thus other methods require a few reads too。
• Provides advanced data structures like priority queue and set.
• The DB size can be larger than memory。
• Anti-caching approach
• Move least recent used data into disk
• But still require indexes fit into memory

### Transaction processing or analytics?

• Online transaction processing
• Looks up a small number of records by some key
• Records are inserted or updated based on user’s input
• Online analytic processing
• Scan over a huge number of records, a few columns per record
• Calculate aggregate statistics
• Data warehouse: special database for this kind of processing

#### Data warehousing

• Not safe to do OLAP on OLTP DB
• OLTP database is expected to process transactions with low latency.
• Analytic queries are often expensive
• Data warehouse
• Contains all read-only copies from all OLTP DBs
• Data is extracted from OLTP DBs periodically, transformed into an analysis-friendly schema
• The transform process is called Extract-Transform-Load (ETL).
• Without affecting OLTP DB
• Indexing works not quite good on OLAP DB compared with OLTP DB.
• Data model for it is usually relational, to use SQL.
• DB vendors are focusing on optimizing either OLAP or OLTP, not both.

#### Stars and snowflakes: Schema for analytics

• Many data warehouses use a star schema (dimensional modeling).
• A fact table in the center
• usually more than 100 columns
• A row represents an event.
• contains foreign keys to other tables (e.g., a fact table contains retail transactions, a row may have a column for store ID)
• Many dimension tables around to be referred
• who, what, where, when, how, why for the event
• snowflake schema
• Dimensions are further broken down into sub-dimensions.
• dim_product may contains foreign keys to dim_brand

### Column-oriented storage

• A fact table can have trillions of rows and hundred of columns.
• Dimension tables are usually smaller, millions of rows
• A typical data warehouse query only accesses 4 or 5 of columns at one time.
• Column oriented
• Store all the values from each column together
• Not all values from one row

#### Column compression

• Bitmap encoding
• The number n of distinct values in a column is usually small.
• Take one bitmap for each distinct value, one bit for each row.
• If n is relatively large, use run-length encoding.
• 9, 1, 3, 2, 9 zeros, 1 one, 3 zeros, 2 ones, rest zeros
• Easy to use bitwise OR operator for IN and AND for AND in SQL
• Other methods

Memory bandwidth and vectorized processing

• OLAP bottlenecks
• Disk bandwidth from disk to memory
• Bandwidth from memory to CPU cache
• Avoid branch mispredictions and bubbles in CPU instruction processing pipeline.
• Make use of single-instruction-multi-data (SIMD) instructions.
• Column-oriented storage is also good for efficient usage of CPU cycles.
• Load relatively less data and iterate through it in a tight loop without function calls.
• Vectorized processing

#### Sort order in column storage

• We can keep rows sorted and use it as an indexing.
• e.g., sorted by date and product_sku
• Reduce number of records to scan if we only want results in some range.
• Sorting can also help compression with run-length encoding.
• Mostly on the first sort key
• Can have multiple copies of data sorted by different keys

#### Writing to column-oriented storage

• All techniques mentioned may make writing difficult.
• Column-oriented storage, compression, sorting
• Use LSM-trees as a solution
• In-memory to disk
• Queries need to check both memory and disk.

#### Aggregation: Data cubes and materialized views

• Materialized aggregates
• Cache statistical result
• COUNT, SUM, AVG, …
• Materialized view
• An actual copy of the query results, written to disk
• Like a view, but don’t execute the real query on all data
• Usually used by read-heavy data warehouses
• Data cube or OLAP cube
• A grid of aggregates grouped by different dimensions
• We can aggregate the grid by row or column to get the result for less dimensions.
• Not quite flexible as querying raw data directly

## Encoding and Evolution

• We may need to change the format of data storing in DB.
• Old and new data can exist at the same time.
• Compatibility
• Forward: old code can read new data.
• Harder, old code needs to ignore additions made by new code.

### Formats for encoding data

Two representations of data:

• In memory, data is kept in objects, structs, lists, hash tables, trees, etc, to be efficiently accessed.
• In disk or during a transfer over the network, we need to encode it as some kind of self-contained sequence of bytes (e.g., JSON doc).

Translation between two representations:

• Encoding, serialization, marshalling: in-memory to byte sequence.
• Decoding, parsing, deserialization, unmarshalling: reverse.

#### Language-specific formats

• e.g.
• Java: java.io.Serializable, Kyro
• Ruby: Marshal
• Python: pickle
• Objects can be saved and restored with minimal additional code.
• Problems
• Tied to a particular language, not easy to read data with another language.
• Decoding process needs to be able to instantiate arbitrary classes.
• In order to restore data in the same object types.
• Security problem
• Attackers can get the application to instantiate arbitrary classes.
• Inconvenient problems of forward and backward compatibility

#### JSON, XML and binary variants

e.g., JSON, XML, CSV

Problems:

• Ambiguity around the encoding of numbers
• XML and CSV cannot differ number with string.
• JSON cannot distinguish integers and float numbers.
• Integers greater than $2^{53}$ cannot be exactly represented in a language using floating-point numbers (Javascript).
• JSON and XML supports unicodes but not binary strings (sequences of bytes without a char encoding).
• Need to use Base64 to encode it, but hacky and increase the data size by 33%.
• Optional schema support for both XML and JSON
• XML schema is widely used.
• Many JSON tools don’t use schemas. They need to hardcode encoding/decoding logic.
• CSV doesn’t have schema. We need to handle the changes like adding or removing columns and rows.
• Suite vague, what if the string contains comma or newline?

Binary encoding

• JSON: MessagePack, BSON, BJSON, UBJSON, BISON, and Smile
• XML: WBXML and Fast Infoset
• Others with schemas
• Thrift, protobuf, Avro
• Advantages of those with schemas
• Much more compact, no field name
• Schema is good for documentation and up-to-date automatically.
• Check f/b compatibility of schema changes.
• Generate code from schema in the case of static languages.

#### Thrift and protocol buffers

Well-known binary encoding libs:

• Thrift
• Protocol Buffers (protobuf)

Usage:

• Both require a schema for any data to be encoded.
• Contains type, field tags and length in encoded string.
• Field tags(ID) solves b/f compatibility.
• Dangerous to change data types
• 64-bit may be truncated.
• List type
• Protocol doesn’t have list type; it has a repeated field instead.
• Can support the type change between single value and multi value.
• Thrift has list type, doesn’t support such type change.

#### Avro

• Another binary encoding format as a subproject of Hadoop.
• Use a schema
• One for human editing, Avro IDL
• One more machine-readable, based on JSON
• No field tags, must decode the field in the order defined in schema.
• Encoding and decoding must use compatible schemas.
• Avro lib can resolve the differences between writer’s schema and reader’s schema.
• Translate data from the writer’s schema into reader’s schema.
• To maintain compatibility, we may only add/remove a field that has a default value.
• Changing name is tricky.

A reader needs to know writer’s schema. For 3 cases:

• Large file with a lot of records
• Include the schema at the beginning of the file.
• Database
• Version number for every record
• Store version number and schema pairs in DB.
• Espresso
• Sending records over network
• Version number
• Avro RPC

• No field tags, good for dynamically generated schemas.
• Handle schema changes easily.
• Thrift needs to manually assign tags to a new field.
• File encoded is self-describing since it includes all necessary metadata.
• No worry about code generation.
• Thrift and protobuf rely on code generation, after schema changes.
• Works well with dynamical programming languages.
• Code generation is optional.

### Modes of dataflow

#### Dataflow through databases

• Write into and read from databases.
• Backward/forward compatibility is necessary.
• Old code should keep unknown fields in new data untouched.
• Can add new fields with default null values.
• DB snapshot is usually done with latest schema.
• For backup or data warehouse.

#### Dataflow through services, REST and RPC

If processes needs to communicate over network,

• Server provides service API over network.
• Client can connect to the server and make requests to API.

Web services:

• REST
• Design philosophy based on HTTP
• Simple data format
• URLs to identify resources
• HTTP features for cache control, auth and content type negotiation
• SOAP
• XML-based protocol for making network API requests
• Independent of HTTP
• Use Web Services Description Language, or WSDL.
• Rely heavily on tool support, code generation, and IDEs.
• Not quite popular now

RPC, Remote procedure calls:

Problems with RPC:

• Network request is unpredictable.
• Request or response may be lost.
• Remote machine may be unavailable.
• Network request may return without a result, due to timeout.
• Retrying a failed network request may be duplicate executions.
• Much slower than a local call and latency is wildly variable.
• Parameters need to be encoded into a byte sequence.
• Client and server may be implemented with different languages.

RPC Implementations:

• Thrift
• Avro
• gRPC on protobuf
• Finagle on Thrift
• Rest.li on JSON and HTTP

Improvements:

• More explicit about the fact RPC is not a local call.
• Service discovery: A client can discover at which IP and port it can find a particular service.
• Main focus of RPC is on requests between services of the same org.

RPC Compatibility:

• Update server first, then client.
• Requests: backward
• Responses: forward
• RPC schemes inherit same compatibility rules of its encoding.
• e.g., Thrift, gRPC, Avro RPC
• SOAP use XML: can be evolved, but some pitfalls.
• RESTful API: JSON, good
• Clients may not update.
• Maintain compatibility.
• Or maintain multiple versions of the service.

#### Message-passing dataflow

Requests and responses are going through a message queue.

• Act like a buffer if some side is unavailable, to improve reliability.
• Redeliver messages to a process that has crashed.
• Senders don’t need to know the IP and port of the recipients.
• One message to multiple recipients.
• Decouple sender from recipient.

But the sender is usually not expected a response. The recipient can send out a message too.

Implementations:

• RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka
• TIBCO, IBM WebSphere, webMethods

Distributed actor frameworks:

• Actor model
• A programming model for concurrency in a single process
• Logic is encapsulated in actors; one actor is a client.
• A client communicates with others by sending and receiving asynchronous messages (may be lost).
• Each actor processes one message at a time.