Google File System (GFS)
- Many nodes
- master, chunk servers
- Single master
- Store metadata, like file and chunk namespaces
- Replicated somewhere else as shadow master
- Write files by appending, never overwrite
- Optimize to do replica replacement with cluster topology
- Maximize data reliability and availability
- Maximize network bandwidth utilization
- Detection of stale replicas (failed to be updated at some point)
- Considers a node’s physical location when allocating storage and scheduling tasks
- Tools for checking the health of the file system and rebalance data
- Supports roll back: bring back the previous version of HDFS after an upgrade
- Efficient operation: allows a single operator/admin to maintain a cluster of thousands of nodes
HDFS reliability in Facebook
Check it online.
- NameNode as single point of failure
- Heavy load on NameNode in large cluster
- Highly available NameNode (primary and standby)
- Highly available, two-node
- hot failover and failback
- Datanode sends block reports to both the Primary and the Standby
- The host name of the current master is kept in Zookeeper
- A modified HDFS client checks Zookeeper before beginning each transaction, and again in the middle of a transaction if one fails.
HDFS scale in Facebook
Check it online.
- Hive to query Hadoop using a subset of SQL
- HiPal, a graphical tool that talks to Hive and enables data discovery, query authoring, charting, and dashboard creation
- Prism project is the company’s answer to reach new heights in Hadoop capacity
- a logical abstraction layer is added so that a Hadoop cluster can run across multiple data centers, effectively removing limits on capacity
- Automatic parallelization, distribution
- I/O scheduling
- Load balancing
- Network and data transfer optimization
- Fault tolerance
- Scale out (more machines)
- A master node communicate with HDFS NameNode to find input data
- Map: extract something you care about from each record
- Shuffle and Sort to exchange the extracted information
- Reduce: aggregate, summarize, filter, or transform
- Write the results
Mappers and Reducers
- Add more mappers and reducers to scale
- single threaded and deterministic
- Worker failure
- Detect failure via periodic heartbeats
- Re-execute in-progress map/reduce tasks
- Master failure
- Single point of failure; Resume from Execution Log
- Skip bad records
- A high-level processing layer that runs on Hadoop
- Translated into a series of MapReduce jobs
- Pig Latin
- High-level scripting language
- No metadata or schema required
- Translated into MapReduce jobs
- Interactive shell
- Repo for user defined functions
- Pig Latin
- Data operations
- JOIN, SORT, FILTER, GROUP, ORDER, LIMIT
- Data types
- Tuple, ordered set of values
- Bag, unordered collection of tuples
- Map, collection of k-v pairs
- LOAD(HCat) -> TRANSFORM(Pig) -> DUMP(Pig)
- local mode
Dataflow in Google
Check it online.
- handle streams and batches of big data, replacing MapReduce
- Dataflow is both a software development kit and a managed service that lets customers build the data capture and transformation process that they wish to use
- Competitors: Amazon Kinesis, Hadoop
- MapReduce’s poor performance began once the amount of data reached the multipetabyte range
- Many problems do not lend themselves to the two-step process of map and reduce
- Spark can do map and reduce much faster than Hadoop can
- Google Cloud Dataflow seems to have a resemblance to Spark, which also leverages memory and avoids the overhead of MapReduce
- Cloud Dataflow aren’t likely to migrate petabytes of data into it from an existing Hadoop installation
- more likely Cloud Dataflow will be used to enhance applications already written for Google Cloud
- Pick region, available region
- EC2 for computing
- S3, EBS for store
- Many Microsoft services can be migrated to cloud
- Google Cloud
- powerful infrastructures
- public cloud on those infras
- VMware vCloud
- virtual data center
- share basic resources like management, storage, network, firewall
- Cloud infrastructure
- Cloud management
- Open source cloud OS
- Turn hypervisors in datacenters into pool of resources
- A control layer over hypervisors
- Dashboard, GUI for management
- Written in python
- Goal: scalability and elasticity
- Optional plugins
- Play with devstack to get started
Cloud storage examples
- Scalable, durable, available
- Only pay for storage and bandwidth used
- Can be accessed by RESTful web service
- 2 independent 512-bit shared secret keys
- Storage analytics
- Background async copy
- Secure access
- via HTTPS
- simple named files with meta data
- structured storage
- application messages
Front-end -> partition servers -> distributed file system layer
- Elasticity to scale
- Auto replicate data
- Version history recover
- Only pay what is used
- Storage type
- S3 standard
- S3 standard infrequent access
- Amazon glacier for archive
- Control authorization and security
- Migrate large scale data to S3 by shipping hardware
- petabyte scale
- Snowball Edge: collect data into snowball directly
- a truck loaded with data
- up to 100 petabytes per job, 500 Gbps
- site survey
- Processing: split file into blocks, encrypt them and synchronize only those modified
- Storage: act as a Content Addressable Storage system, each individual encrypted file block retrieved based on hash value
- Metadata: metadata is kept in its own discrete storage service as index for data in users’ accounts.
- sharded and replicated
- Notification: monitor changes
- by closing polling connection
File data storage
- metadata on Dropbox servers
- file contents on AWS or Magic Pocket, Dropbox’s in-house storage
- secure and reliable
- fast, responsive
- resilient, recover from failure
- load-balancing, redundancy
- Delta sync: only sync modified file blocks
- Streaming sync: sync to other devices before uploading done
- LAN sync: updated from LAN computers, without Dropbox server
- Discovery engine, find machines on the network to sync with
- Client, request file blocks
- Server, serve the requested file blocks
- redundant copies, N+2 availability
- File content
- two geo regions
- replicated within region
- Incident response
- Promptly respond to alerts of potential incidents
- Business continuity
- resume or continue providing services to users
- Business impact and risk assessments
- Business continuity plans
- Plan testing/exercising
- Review and approval of BCMS
- Disaster recovery
- maintain a disaster recovery plan
- third-party service providers are responsible for the physical, environmental, and operational security controls at the boundaries of Dropbox infrastructure.
- Dropbox is responsible for the logical, network, and application security of our infrastructure housed at third-party data centers.
Cloud offering models
- X: something made available to customers over Internet
- Vendor manages X completely
- Customer pays to use X based on usage
- Customer has very little control over the internal of X
- user OS on host virtualization layer
- e.g., AWS
- user data and application on host runtime, middleware, OS
- e.g., Heroku
- user on host applications
- e.g., google docs
Cloud user examples
Walmart’s cloud story
- Self-service automation (e.g., deploy VMs)
- scale out and in, for holidays
Failure injection by Netflix
Embracing Failure: Fault Injection and Service Resilience at Netflix
- with AWS
- use CDN
- Goal is to improve both availability and rate of change
- design and test for failure
- difficult to test distributed systems
- cause failures in production
Amazon EC2 and RDS failure in 2011
- a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations.
- EBS is a distributed, replicated block data store that is optimized for consistency and low latency read and write access from EC2 instances.
- The configuration change was to upgrade the capacity of the primary network.
- The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.
- the secondary network couldn’t handle the traffic level it was receiving.
- a large number of EBS nodes in a single EBS cluster lost connection to their replicas.
- these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data
- the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop
- install a large amount of additional new capacity to replace that capacity in the cluster.
- re-establishing EBS control plane API access to the affected Availability Zone and restoring access to the remaining “stuck” volumes.
Microsoft Azure failure in 2012
- Cluster of 1000 servers, managed by a fabric controller
- Server, managed by host agent
- Application has guest agent
- FC deploys app secrets to HA
- GA and HA communicate with each other
- transfer certificate
- encrypted with public and private keys
- The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
- When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.
- If a clean VM (one in which no customer code has executed) times out its GA connection three times in a row, the HA decides that a hardware problem must be the cause since the GA would otherwise have reported an error.
- The HA then reports to the FC that the server is faulty and the FC moves it to a state called Human Investigate (HI).
- the FC has an HI threshold, that when hit, essentially moves the whole cluster to a similar HI state. At that point the FC stops all internally initiated software updates and automatic service healing is disabled.
- disabled service management functionality in all clusters worldwide
- created a test and rollout plan
- Prevention – how the system can avoid, isolate, and/or recover from failures
- Detection – how to rapidly surface failures and prioritize recovery
- Response – how to support our customers during an incident
- Recovery – how to reduce the recovery time and impact on our customers
Amazon ELB failure in 2012
- Elastic Load Balancing Service
- a portion of the ELB state data was logically deleted
- used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region
- issues only occurred after the ELB control plane attempted to make changes to a running load balancer.
- disabled several of the ELB control plane workflows to prevent additional running load balancers from being affected by the missing ELB state data.
- restore the ELB state data to a point-in-time
- slowly re-enabling the ELB service workflows and APIs
- modified the access controls on our production ELB state data to prevent inadvertent modification
- modified our data recovery process
- reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state.
Netflix’s take on the Amazon ELB failure in 2012
- Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls.
- Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application.
- Improvement: investigate running Netflix in more than one AWS region
Amazon S3 failure in 2017
- S3 team was debugging an issue causing the S3 billing system to progress more slowly than expected.
- an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
- The servers that were inadvertently removed supported two other S3 subsystems.
- the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.
- the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.
- Removing a significant portion of the capacity caused each of these systems to require a full restart.
- S3 was unable to service requests.
- remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.
- audit our other operational tools to ensure we have similar safety checks.
- make changes to improve the recovery time of key S3 subsystems
Facebook outage in 2010
- The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
- The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
- we made a change to the persistent copy of a configuration value that was interpreted as invalid.
- This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
- every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key.
- As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
- stop all traffic to this database cluster, which meant turning off the site.
- turned off the system that attempts to correct configuration values.