Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Zanzibar, Google’s Consistent, Global Authorization System"

2019-07-28

Reference

Link to paper, slides

Abstract

Zanzibar is a consistent, global authorization systems used by Google Calendar, Cloud, Drive, Maps, Photos and Youtube. It scales to billions of users, trillions of ACLs (access control list) and millions of QPS. It maintains 95th latency of less than 10 milliseconds and 5-9s availability over 3 years of production use.

Introduction

Zanzibar is a unified system that allows inter-application usages.

Goals for Zanzibar:

  • Correctness
    • Consistency
  • Flexibility
    • Support a rich set of access control policies.
  • Low latency
  • High availability
  • Large scale

Zanzibar features:

  • A simple data model and a powerful configuration language
    • User can define arbitrary relations between users and objects, like owner.
    • Support set-algebraic operators such as intersection and union.
  • ACL: user U has relation R to object O.
    • U can also be a user group, referring to another ACL.
      • Commenter is also viewer.
    • User group can contains other user groups. Nested groups may lead to long chain for ACL checks.
    • An ACL check request may fan out to multiple servers.
  • All ACLs are globally replicated to server large QPS quickly.
  • Global consistency enabled by Spanner.

For low latency and high availability:

  • Its consistency protocol allows the vast majority of requests to be served with locally replicated data, without requiring cross-region round trips.
  • Handle hot spots by caching final and intermediate results, deduping simultaneous requests.
  • Handle deeply nested sets by hedging request and optimizing computations.

Model, language and API

Relation tuples

In Zanzibar, ACLs are represented as relation tuples.

  • (tuple) ::= (object)#(relation)@(user)
  • (object) ::= (namespace):(object_id)
  • (user) ::= (user_id)|(userset)
  • (userset) ::= (object)#(relation)

(userset) allows us to refer a group to support nested group membership.

Examples of tuples:

  • doc:readme#owner@10
  • group:eng#member@11
  • doc:readme#viewer@group:eng#member
    • Group eng’s members are viewers of readme.

Consistency model

“new enemy problem”, considering a scenario:

  • Alice removes Bob from the ACL of a folder.
  • Alice adds a new content into the folder.

Problems:

  • Fail to preserve ordering of ACL changes.
    • Wrong order will let Bob see the new content.
  • Misapply old ACL to new content.
    • If ACL check is evaluated on the old ACL, Bob will see the content.

Features Zanzibar provided to avoid such problems are as below. Both are enabled by Spanner (my note).

  • External consistency
    • Ensure casual ordering.
  • Snapshot reads with bounded staleness
    • Won’t see stale results.

Zookie:

  • Zanzibar zookies are basically Spanner timestamp assigned to the ACL update.
  • Write path
    • When the content modification is about to be saved, the client requests a zookie for each content version.
      • Via a content-change ACL check.
        • Does not need to be in the same transaction as the content change. (Note: Why?)
    • The client stores the content change and the zookie in an atomic write.
  • Check path
    • The client sends this zookie in subsequent ACL check requests to get fresh results compared with the update.

Advantages of using zookies:

  • Most of checks use default staleness with already replicated data.
    • Low latency.
  • Free to choose even fresher snapshot.
    • To avoid hot spots.
    • High availability.

Namespace configuration

Namespace configuration specifies:

  • Relations
    • e.g., viewer, editor
  • Storage parameters
    • Sharding setting
    • Encoding for object ID like string, integer.

Relation configs and user rewrites

Clients can define object-agnostic relationships via userset rewrite rules in relation configs. The userset rewrite rules are defined per relation. It can be used to express this relation with other relations. The expressions can be union, intersection and exclusion.

The userset rewrite rule specifies a function with an object ID as input and a userset expression tree as output. The leaf nodes of the tree can be:

  • _this
    • All users from tuples for <object#relation> pair.
    • Including indirect ACLs by usersets from tuples.
    • Default when no rules specified.
  • computed_userset
    • Compute a new userset for the input object.
    • e.g., viewer relation can refer to editor relation.
  • tuple_to_userset
    • Compute a tupleset from the input object; fetch relation tuples matching the tupleset; then compute a userset from every fetched relation tuple.

An example of relation config, expressing that viewer contains editor and viewer from parent folder; editor contains owner.

name: "doc"

relation { name: "owner" }

relation {
 name: "editor"
 userset_rewrite {
  union {
   child { _this {} }
   child { computed_userset { relation: "owner" } }
}}}

relation {
 name: "viewer"
 userset_rewrite {
  union {
   child { _this {} }
   child { computed_userset { relation: "editor" } }
   child { tuple_to_userset {
     tupleset { relation: "parent" }
     computed_userset {
       object: $TUPLE_USERSET_OBJECT  # parent folder
       relation: "viewer" 
    }}}
}}}

API

  • Read
    • A request specifies one or more tuplesets and zookie.
    • tupleset:
      • A single tuple key or all tuples with a a given object ID or userset in a namespace.
      • Can be optionally constrained by a relation name.
    • Response
      • look up a specific membership entry;
      • read all entries in an ACL or group;
      • read all groups with a given user.
    • Note Read doesn’t reflect userset rewrite rules, thus only retrieve direct users or objects.
      • Use Expand to get effective ACLs.
  • Write
    • Can modify single relation tuple.
    • Can also modify all tuples related to an object via a read-modify-write cycle.
      • Steps:
        1. Read all relation tuples of an object, including a per-object lock tuple.
        2. Generate tuples to write or delete. Send the writes along with the lock tuple.
        3. If the lock tuple is not modified, commit. Otherwise, back to step 1.
    • A write issues a content-change check against the latest snapshot.
      • If authorized, Check will response with a zookie, to be sent back in the write response to the client.
  • Watch
    • Request: one or more namespaces, zookie representing the time starting to watch.
    • Response: all tuple modification events in time order and a heartbeat zookie (representing the end time).
    • Clients can use the heartbeat zookie to send a new request to resume watching.
    • Can be used to maintain secondary indexes on client side.
  • Check
    • Request: a userset <object#relation>, a putative user and optional zookie.
  • Expand
    • Get the effective userset given an <object#relation> pair and an optional zookie.
    • Follow indirect references expressed through userset rewrite rules.
    • Result: a userset tree
      • Leaf nodes: user IDs or usersets pointing to other <object#relation> pairs.
      • Intermediate nodes: expressions like union, intersection or exclusion.

Architecture and implementation

Zanzibar architecture

  • aclservers
    • Serve Check, Read, Expand, Write requests.
    • Organized in clusters.
    • The initial server may fan out work to other servers; other server may contact another server as well.
    • The initial server gathers final result and sends it back to the client.
  • Storage
    • ACLs and metadata are stored in Spanner.
    • One database per namespace.
    • One database stores all namespace configurations.
    • One changelog database shared across all namespaces.
  • watchservers
    • Serve Watch requests.
    • Tail the changelog and serve a stream of changes.
  • Offline pipelines
    • e.g., dumps, garbage collection on old versions.
  • Leopard indexing system
    • Used to optimize operations on large and deeply nested sets.

Storage

Relation tuple storage:

  • One database per namespace.
  • Primary key to identify a row:
    • shard ID, object ID, relation, user, commit timestamp.
  • Shard ID
    • Usually determined by object ID solely.
      • If a namespace stores groups with very large numbers of members, shard ID can be computed from object ID and user.

Changelog:

  • Primary key:
    • changelog shard ID, timestamp, unique update ID.
    • Shard is randomly selected for each write.
  • Relation tuple and changelog are both updated in a single transaction.
  • Use the Spanner server hosting changelog shard as the transaction manager.
    • To minimize blocking of changelog reads on pending transactions.
      • Note: Save the time for sending commit timestamp to non-coordinator participant leaders in 2PC.

Namespace config storage:

  • Two tables:
    • One stores the configs keyed by namespace IDs.
    • The other is a changelog of config updates, keyed by commit timestamps.
  • Zanzibar servers can load all configs upon startup and monitor the changelog to refresh configs.

Replication:

  • Data is replicated to be close to clients.
    • Dozens of locations around the world.
  • 5 voting replicas in eastern and central US.
    • in 3 metropolitan areas to isolate failures;
    • within 25 ms of each other to allow Paxos transactions commit quickly.

Serving

Evaluation timestamp

Zanzibar always respects zookie if provided. Zanzibar also does statistics on out-of-zone reads probability. Default staleness bound will be updated to a safe value according to the result. Use default staleness for requests without zookies. The signal of whether it’s out-of-zone is sent back by Spanner.

Config consistency

Zanzibar chooses a single timestamp for the config to evaluate the same request on all servers. This timestamp is chosen from a range in which all timestamps are available to all servers.

  • This range is maintained by a monitoring job.
  • Each server load config changes continuously.

Check evaluation

Zanzibar converts check requests to boolean expressions. A simple example without userset rewrite rules:

  • A check request fans out on all indirect ACLs or groups, recursively.
    • Deep or write ACLs cases are optimized by Leopard indexing system.
  • All leaf nodes of the boolean expression tree are evaluated concurrently.
    • Eager cancellation: If the outcome of one node determines the result of a subtree, evaluations on other nodes are cancelled.
  • Use pooling mechanism to group reads for the same ACL check to reduce the number of RPCs to Spanner.

Leopard indexing system

For namespaces with deeply nested groups or large number of child groups, they can be selected to use Leopard indexing system to reduce latency of checks. The idea is to flatten the group to group paths. A query to Leopard system is an expression of UNION, INTERSECTION or EXCLUSION of named sets (indexing tuples). Response is a set ordered by element ID up to a specified number of results.

An indexing tuple (T, s, e):

  • T: the set type.
    • GROUP2GROUP(s) -> {e}: For a group s, return all directly or indirectly descendant groups {e}.
    • MEMBER2GROUP(s) -> {e}: For a user s, return all direct parent groups {e}.
  • s: set ID.
  • e 64-bit element ID.

Below is the check of user on group .

3 parts of Leopard system:

  • A serving system for consistent and low-latency operations across sets
    • Tuples are stored like a skip list, to perform operations (UNION, INTERSECTION) in .
    • Index is sharded by element IDs.
      • Note: Why? This way elements in the same set don’t stay together like a list.
  • An offline periodic index building system
    • Generate index shards from a snapshot of relation tuples and configs; and replicate.
    • Respect the userset rewrite rules.
  • An online layer for continuously updating the serving system as tuple changes occur
    • Incrementally update indexes since the offline snapshot.
    • An update is (T, s, e, t, d).
      • t is the timestamp of update.
      • d is deletion maker.
    • Implementation
      • Rely on Zanzibar Watch API.
      • A relation tuple update may trigger tens of thousands Leopard tuple events.
      • Every Leopard server receives the Watch update stream and do the update with minimal impact on serving.

Handling hot spots

ACL reads and checks often lead to hot spots. 4 improvements:

  • Distributed cache
    • Distributed in a cluster of servers with consistent hashing.
      • Compute forwarding key with object ID.
        • Because a check on <object#relation> often involves checks on other relation of the same object. These can be processed on the same server to save number of internal RPCs.
    • Both caller and callee of the internal RPCs cache the results.
    • Encode snapshot timestamp in cache key.
      • Avoid using caches of old snapshots.
      • Round up timestamps from the zookie.
        • Ensure freshness.
        • Most checks can share cached result at the same timestamp.
  • Lock table
    • To handle “cache stampede” problem.
      • Concurrent requests create flash hot spots before the cache is populated with results.
    • When receive concurrent requests of the same cache key, process one and block others until cache is ready.
  • Cache all relation tuples for hot check targets.
    • Many users may issue current requests on the same <object#relation> pair.
    • Cache all relation tuples for the pair.
    • Trade read bandwidth for cacheability.
    • Hot objects are determined dynamically by tracking the number of reads on each objects.
  • Delay eager cancellation
    • Eager cancellation: indirect ACL checks are often cancelled when parent check result is determined.
    • Problem: Checks are cancelled before cache keys are populated; thus concurrent requests are blocked on the lock table entry.
      • Note: The indirect ACL check is not cached. This request is the only one allowed to perform that check. Other concurrent requests are blocked on the check. Apparently, they need to wait longer if the check is cancelled.
    • Improvement: delay eager cancellation when there are waiters on the corresponding lock table entry.

Performance isolation

  • Requirement: If one client fails to provision enough resources for an unexpected usage pattern, other clients won’t be affected.
  • Control
    • CPU capacity
      • Per client
      • Throttling if exceeding the limit.
    • Total number of outstanding RPCs for memory usage
      • On Zanzibar server
      • Per server
      • Per client
    • Number of concurrent reads
      • On Spanner server
      • Per (object, client)
      • Per client
  • Use different lock table keys for different clients.
    • Prevent throttling on one client affecting others.

Tail latency mitigation

Zanzibar’s distributed processing needs to accommodate slow tasks.

  • For calls to Spanner and Leopard servers
    • Request hedging: Send the same request to multiple servers; use the first response and cancel others.
    • Place at least 2 replicas of these backends in every geographical region.
    • Practice: Send the first request; defer sending hedged requests until the initial one is known to be slow.
      • Hedging delay threshold: dynamically calculate Nth percentile latency.
      • thus only hedge a small fraction of total traffic;
  • For requests to Zanzibar servers
    • No hedging
      • Hedging is effective only when requests have similar costs.
      • Some Zanzibar checks are meant to be more time-consuming.
      • Therefore, hedging will worsen latency by duplicate expensive requests.
    • Rely on sharding and monitoring mechanisms.
      • Detect and avoid slow servers.

Experience

Some statistics as of 2019:

  • Namespace: 1,500.
  • Relation tuples
    • 2 trillion relation tuples, close to 100 terabytes.
    • Number of tuples per namespace: 10 to 1 trillion.
      • Median: 15,000.
  • Replication location: more than 30.
  • QPS: more than 10 million.
    • Example of a peak event:
      • Check: 4.2M.
      • Read: 8.2M.
      • Expand: 760K.
      • Write: 25K.
  • Server: more than 10,000.
    • Organized in several dozen clusters.
    • Number of servers per cluster: 100 to 1,000.

Requests

Two categories of requests are as below based on zookie oldness. The number of safe requests is about two orders of magnitude higher than that of recent requests. (QPS: 1.2M vs 13K as of 12/2018)

  • Safe
    • Zookies are more than 10 seconds old.
    • Requests can be served locally.
  • Recent
    • Zookies are less than 10 seconds old.
    • Requests need to be routed to leader replica, requiring inter-region round trips.

Latency

  • Measurement: server side, live traffic.
  • Check: 3ms at both safe and recent 50th.
  • Write: 127ms at the 50th.
  • Safe requests are much faster than recent ones.

Availability

  • Define an available request: succeed in 5s for safe, 15s for recent requests.
  • Measurement: replay some real requests with modified zookies and extended deadlines.
  • 99.999% available for past 3 years.

Internals

  • Delegated RPCs: Zanzibar servers contact other servers, internally.
    • 22 million QPS at peak, half for reads, half for checks.
    • Caching handles 200 million lookups per second at peak.
      • 150M for checks, 50M for reads.
    • Cache hit and RPCs saved by lock table
      • Checks
        • Delegate side: 10% cache hit, 12% by lock table.
        • Delegator side: 2% cache hit, 3% by lock table.
        • Save 500K RPCs per second.
      • Reads
        • Delegate side: 24% cache hit, 9% by lock table.
        • Delegator side: less than 1% cache hit.
        • For super-hot groups(0.1%), cache full set of members in advance.
  • Spanner side
    • 20 million read RPCs to Spanner, thanks to caching and read request pooling.
    • Number of rows per RPC: 1.5 at the median, 1,000 at the 99th.
    • Latency: 0.5ms at the median, 2ms at the 95th.
    • 1% of Spanner reads benefit from hedging.
  • Leopard system
    • 1.56M QPS at the median, 2.22M at the 99th.
    • Latency: 150 microsec at the median, under 1ms at the 99th.
    • Index incremental updates: 500 updates per second at the median, 1,500 at the 99th.

Lessons learned

  • Flexibility to accommodate differences between clients.
    • Access control patterns vary widely.
      • Continuously add new features to support clients’ use cases.
    • Freshness requirements are often but not always loose.
      • Zookie ensures bounded staleness.
  • Performance optimization to support client behaviors observed in production.
    • Request hedging is the key to reduce tail latency.
    • Hot-spot mitigation is critical for high availability.
      • Distributed cache, lock table, cache prefetching for hot items, delay eager cancellation.
    • Performance isolation protects against misbehaving clients.
  • Access control in multi-user OS
    • Multics, UNIX, POSIX ACLs, VMS.
  • Taos supports identity in distributed systems.
  • Role-based access control
    • Role is similar to relation in Zanzibar.
  • Identity and Access Management (IAM)
    • Amazon, Google, Microsoft’s cloud services.
    • Unified ACL storage and RPC-based API.
    • Google Cloud IAM is built on Zanzibar.
  • TAO is a distributed storage for social graphs in Facebook.
    • No external consistency.
  • Chubby and Zookeeper don’t have features required to be Zanzibar’s storage.

Conclusion

  • Unified access control data and logic
  • Flexible data model and configuration
  • External consistency
  • Scalability, low latency and high availability

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.