Overview in cloud computing 2

Google File System (GFS)

Fault-tolerant
Many nodes
- master, chunk servers
Single master
- Store metadata, like file and chunk namespaces
- Replicated somewhere else as shadow master
Write files by appending, never overwrite
Optimize to do replica replacement with cluster topology
- Maximize data reliability and availability
- Maximize network bandwidth utilization
Detection of stale replicas (failed to be updated at some point)

HDFS

Considers a node’s physical location when allocating storage and scheduling tasks
Tools for checking the health of the file system and rebalance data
Supports roll back: bring back the previous version of HDFS after an upgrade
Efficient operation: allows a single operator/admin to maintain a cluster of thousands of nodes

HDFS reliability in Facebook

Check it online.

Observation
- NameNode as single point of failure
- Heavy load on NameNode in large cluster
Solution
- Highly available NameNode (primary and standby)
Avatarnode
- Highly available, two-node
- hot failover and failback
- Datanode sends block reports to both the Primary and the Standby
- The host name of the current master is kept in Zookeeper
- A modified HDFS client checks Zookeeper before beginning each transaction, and again in the middle of a transaction if one fails.

HDFS scale in Facebook

Check it online.

Hive to query Hadoop using a subset of SQL
HiPal, a graphical tool that talks to Hive and enables data discovery, query authoring, charting, and dashboard creation
Prism project is the company’s answer to reach new heights in Hadoop capacity
- a logical abstraction layer is added so that a Hadoop cluster can run across multiple data centers, effectively removing limits on capacity

MapReduce

Automatic parallelization, distribution
I/O scheduling
- Load balancing
- Network and data transfer optimization
Fault tolerance
Scale out (more machines)
A master node communicate with HDFS NameNode to find input data

Steps:

Map: extract something you care about from each record
Shuffle and Sort to exchange the extracted information
Reduce: aggregate, summarize, filter, or transform
Write the results

Mappers and Reducers

Add more mappers and reducers to scale
single threaded and deterministic

Failure:

Worker failure
- Detect failure via periodic heartbeats
- Re-execute in-progress map/reduce tasks
Master failure
- Single point of failure; Resume from Execution Log
Skip bad records
Robust

Pig

A high-level processing layer that runs on Hadoop
Translated into a series of MapReduce jobs
Components
- Pig Latin
  - High-level scripting language
  - No metadata or schema required
  - Translated into MapReduce jobs
- Grunt
  - Interactive shell
- Piggybank
  - Repo for user defined functions
Data operations
- JOIN, SORT, FILTER, GROUP, ORDER, LIMIT
Data types
- Tuple, ordered set of values
- Bag, unordered collection of tuples
- Map, collection of k-v pairs
Flow
- LOAD(HCat) -> TRANSFORM(Pig) -> DUMP(Pig)
debug
- illustrate, explain, describe
- local mode

Dataflow in Google

Check it online.

handle streams and batches of big data, replacing MapReduce
Dataflow is both a software development kit and a managed service that lets customers build the data capture and transformation process that they wish to use
Competitors: Amazon Kinesis, Hadoop
MapReduce’s poor performance began once the amount of data reached the multipetabyte range
Many problems do not lend themselves to the two-step process of map and reduce
Spark can do map and reduce much faster than Hadoop can
Google Cloud Dataflow seems to have a resemblance to Spark, which also leverages memory and avoids the overhead of MapReduce
Cloud Dataflow aren’t likely to migrate petabytes of data into it from an existing Hadoop installation
more likely Cloud Dataflow will be used to enhance applications already written for Google Cloud

Cloud providers

AWS
- Pick region, available region
- EC2 for computing
- S3, EBS for store
Azure
- Many Microsoft services can be migrated to cloud
Google Cloud
- powerful infrastructures
- public cloud on those infras
VMware vCloud
- virtual data center
- software-defined
- share basic resources like management, storage, network, firewall
- Components
  - Cloud infrastructure
  - Cloud management

OpenStack

Open source cloud OS
Turn hypervisors in datacenters into pool of resources
A control layer over hypervisors
Dashboard, GUI for management
Resources:
- Compute
- Networking
- Storage
Written in python
Goal: scalability and elasticity
Optional plugins
Play with devstack to get started

Cloud storage examples

Microsoft Azure

Scalable, durable, available
Only pay for storage and bandwidth used
Can be accessed by RESTful web service
2 independent 512-bit shared secret keys

Features:

Geo-replication
Storage analytics
Background async copy
Secure access
- via HTTPS

Storage abstractions:

blobs
- simple named files with meta data
tables
- structured storage
queues
- application messages

Architecture:

Front-end -> partition servers -> distributed file system layer

Amazon S3

Elasticity to scale
Auto replicate data
Version history recover
Only pay what is used
Storage type
- S3 standard
- S3 standard infrequent access
- Amazon glacier for archive
Control authorization and security

Amazon Snowball

Migrate large scale data to S3 by shipping hardware
petabyte scale
Snowball Edge: collect data into snowball directly

Snowmobile

a truck loaded with data
up to 100 petabytes per job, 500 Gbps
site survey
dispatch
fill
return

Dropbox architecture

Services:

Processing: split file into blocks, encrypt them and synchronize only those modified
Storage: act as a Content Addressable Storage system, each individual encrypted file block retrieved based on hash value
Metadata: metadata is kept in its own discrete storage service as index for data in users’ accounts.
- MySQL
- sharded and replicated
Notification: monitor changes
- by closing polling connection

File data storage

metadata on Dropbox servers
file contents on AWS or Magic Pocket, Dropbox’s in-house storage
secure and reliable

Sync

fast, responsive
resilient, recover from failure
load-balancing, redundancy

Type:

Delta sync: only sync modified file blocks
Streaming sync: sync to other devices before uploading done
LAN sync: updated from LAN computers, without Dropbox server
- Discovery engine, find machines on the network to sync with
- Client, request file blocks
- Server, serve the requested file blocks

Reliability

Metadata
- redundant copies, N+2 availability
File content
- two geo regions
- replicated within region
Incident response
- Promptly respond to alerts of potential incidents
Business continuity
- resume or continue providing services to users
- Business impact and risk assessments
- Business continuity plans
- Plan testing/exercising
- Review and approval of BCMS
Disaster recovery
- maintain a disaster recovery plan

Data centers

third-party service providers are responsible for the physical, environmental, and operational security controls at the boundaries of Dropbox infrastructure.
Dropbox is responsible for the logical, network, and application security of our infrastructure housed at third-party data centers.

Cloud offering models

X as-a-service:

X: something made available to customers over Internet
as-a-service
- Vendor manages X completely
- Customer pays to use X based on usage
- Customer has very little control over the internal of X

Models:

IaaS
- infrastructure
- user OS on host virtualization layer
- e.g., AWS
PaaS
- platform
- user data and application on host runtime, middleware, OS
- e.g., Heroku
Saas
- software
- user on host applications
- e.g., google docs
Other
- Metal-aaS
- Security-aaS

Cloud user examples

Walmart’s cloud story

Agility
- Self-service automation (e.g., deploy VMs)
Elasticity
- scale out and in, for holidays
Datacenter-as-a-Service

Failure injection by Netflix

Embracing Failure: Fault Injection and Service Resilience at Netflix

with AWS
use CDN
Goal is to improve both availability and rate of change
design and test for failure
- difficult to test distributed systems
cause failures in production

Failures

Instance
Datacenter
Region
Latency

Cloud failures

Amazon EC2 and RDS failure in 2011

a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations.
EBS is a distributed, replicated block data store that is optimized for consistency and low latency read and write access from EC2 instances.
The configuration change was to upgrade the capacity of the primary network.
The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.
the secondary network couldn’t handle the traffic level it was receiving.
a large number of EBS nodes in a single EBS cluster lost connection to their replicas.
these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data
the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop

Fix

install a large amount of additional new capacity to replace that capacity in the cluster.
re-establishing EBS control plane API access to the affected Availability Zone and restoring access to the remaining “stuck” volumes.

Microsoft Azure failure in 2012

Cluster of 1000 servers, managed by a fabric controller
Server, managed by host agent
- Application has guest agent
- FC deploys app secrets to HA
- GA and HA communicate with each other
  - transfer certificate
  - encrypted with public and private keys

Bug:

The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.
If a clean VM (one in which no customer code has executed) times out its GA connection three times in a row, the HA decides that a hardware problem must be the cause since the GA would otherwise have reported an error.
The HA then reports to the FC that the server is faulty and the FC moves it to a state called Human Investigate (HI).
the FC has an HI threshold, that when hit, essentially moves the whole cluster to a similar HI state. At that point the FC stops all internally initiated software updates and automatic service healing is disabled.

Fix:

disabled service management functionality in all clusters worldwide
created a test and rollout plan

Improvements:

Prevention – how the system can avoid, isolate, and/or recover from failures
Detection – how to rapidly surface failures and prioritize recovery
Response – how to support our customers during an incident
Recovery – how to reduce the recovery time and impact on our customers

Amazon ELB failure in 2012

Elastic Load Balancing Service
a portion of the ELB state data was logically deleted
- used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region
issues only occurred after the ELB control plane attempted to make changes to a running load balancer.
disabled several of the ELB control plane workflows to prevent additional running load balancers from being affected by the missing ELB state data.
restore the ELB state data to a point-in-time
slowly re-enabling the ELB service workflows and APIs

Improvements:

modified the access controls on our production ELB state data to prevent inadvertent modification
modified our data recovery process
reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state.

Netflix’s take on the Amazon ELB failure in 2012

Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls.
Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application.
Improvement: investigate running Netflix in more than one AWS region

Amazon S3 failure in 2017

S3 team was debugging an issue causing the S3 billing system to progress more slowly than expected.
an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
The servers that were inadvertently removed supported two other S3 subsystems.
the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.
the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.
Removing a significant portion of the capacity caused each of these systems to require a full restart.
S3 was unable to service requests.

Improvements:

remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.
audit our other operational tools to ensure we have similar safety checks.
make changes to improve the recovery time of key S3 subsystems

Facebook outage in 2010

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
we made a change to the persistent copy of a configuration value that was interpreted as invalid.
This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key.
As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
stop all traffic to this database cluster, which meant turning off the site.
turned off the system that attempts to correct configuration values.

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 16

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 16

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 16

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1

Overview in cloud computing 2

Google File System (GFS)

HDFS

HDFS reliability in Facebook

HDFS scale in Facebook

MapReduce

Pig

Dataflow in Google

Cloud providers

Cloud storage examples

Microsoft Azure

Amazon S3

Amazon Snowball

Snowmobile

Dropbox architecture

File data storage

Sync

Reliability

Data centers

Cloud offering models