
TAMU Operating System 5 I/O and Disk Management

2018-04-28

I/O management and device driver

Structure of I/O system

  1. Applications
  2. System call interface
  3. Device interface
    • Block I/O for disk, CD-ROM
    • Streaming for keyboard, mouse, printer
    • Network for Ethernet, Bluetooth
  4. Device Management layer
  5. Device driver
  6. SW/HW interface
  7. Device controller
  8. Device

Device controllers vs. Device drivers

Device controller:

  • opcode register
    • The driver writes the command to be executed
  • operand register
    • Stores parameters for command
  • busy register
    • Check if completed
  • status register
    • Check status, such as error code
  • data buffer
    • Data is available here
    • Can be used to transfer data to device

Device driver:

  • Write opcode and operand register
  • Read busy and status register
  • Read/write data buffer

Programmed I/O, Polling, Interrupts

To access device register:

  • Explicit I/O
    • Special I/O instructions address device registers through port numbers
  • Memory-mapped
    • Map a portion of memory to the device register

Polling:

  • The device driver keeps checking the busy register
  • Frequency needs to be chosen carefully
  • Easy to implement
  • CPU bandwidth spent on I/O is controlled by the CPU, not the device
    • Better protection against misbehaving devices (which may flood the CPU with interrupts)
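The polled-I/O path above can be sketched in C against a hypothetical memory-mapped register block. The layout and field names below are illustrative, not a real device:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register layout for a simple device.
 * Fields mirror the controller registers listed above. */
struct dev_regs {
    volatile uint32_t opcode;   /* command to be executed */
    volatile uint32_t operand;  /* parameters for the command */
    volatile uint32_t busy;     /* nonzero while the device works */
    volatile uint32_t status;   /* error code after completion */
    volatile uint32_t data;     /* data buffer window */
};

/* Polled command: program the registers, then spin on `busy`. */
int dev_exec_polled(struct dev_regs *dev, uint32_t cmd, uint32_t arg)
{
    dev->operand = arg;        /* parameters first */
    dev->opcode  = cmd;        /* writing the opcode starts the operation */
    while (dev->busy)          /* the driver keeps checking the busy register */
        ;                      /* a real driver would back off or yield here */
    return (int)dev->status;   /* 0 = success, else an error code */
}
```

The `volatile` qualifier keeps the compiler from caching register reads, which is what makes the busy-wait loop observe device updates.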

Interrupts:

  • Interrupt is raised once the operation is done

Direct Memory Access (DMA)

  • Letting the CPU copy data between the device buffer and memory is simple, but expensive: the CPU is tied up for the whole transfer
  • DMA offloads the transfer to a DMA channel, freeing the CPU

Modern I/O Architectures

  • Under the I/O management layer: we have
    • File system driver
    • Hard disk driver
  • Once a request is made
    • The I/O manager issues an IRB (I/O Request Block)
    • The file system driver figures out the location on disk and sends the IRB to the disk driver (through the I/O layer)
    • The disk driver updates the IRB and sends it back to the FS driver (through the I/O layer)
    • The FS driver returns the IRB to the I/O manager
    • The system call returns
  • This structure can place a volume manager driver between the FS driver and the disk driver
  • It is also easy to add power management and plug-and-play management on top of the I/O layer

Structure of device drivers

  • Top end
    • Write, issue command
  • Bottom end, executed when the device raises its completion interrupt
    • interrupt service routine
      • critical, executed immediately
    • deferred procedure call (DPC) routine
      • can be delayed
  • Other parts
    • Device management
      • add-device routine
      • initialization routine
      • power management
    • Interaction with I/O manager
      • dispatch routines

Disks

Structure of hard disk drive

  • Platters, the plates that store data
  • Spindle, in the middle of the platters; rotates them
  • Heads, read/write data; one head per platter surface
  • Arm, holds the heads

For a platter:

  • Surface
  • Track, one circular ring on a surface
    • Cylinder, the set of tracks at the same arm position across all surfaces
      • can be read without moving the arm
  • Sector, an arc of a track

Performance modeling

  • Seek latency, moving the arm
    • time: T_seek = alpha * x + s
    • alpha, track traversal time (time to cross one track)
    • x, number of tracks traversed
    • s, startup time
  • Rotational latency, rotating the platter (half a revolution on average)
    • time: T_rot = 1 / (2 * R)
    • R, number of revolutions per time unit
  • Transfer latency, transferring data from disk to memory
    • time: T_transfer = b / (R * N)
    • b, bytes to be transferred
    • N, bytes on a track
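Plugging sample numbers into this model (seek = alpha * x + s, rotation = 1 / (2R), transfer = b / (R * N)) makes the relative magnitudes concrete. The drive parameters in the test below are made up but typical for a 7200 RPM disk:

```c
/* Latency model from the notes; all times in seconds. */

/* Seek: startup time plus per-track traversal cost. */
double seek_time(double alpha, double x, double s)
{
    return alpha * x + s;
}

/* Rotation: half a revolution on average. */
double rot_time(double R)
{
    return 1.0 / (2.0 * R);
}

/* Transfer: fraction of a track read per revolution. */
double transfer_time(double b, double R, double N)
{
    return b / (R * N);
}
```

With alpha = 0.05 ms/track, s = 2 ms, x = 100 tracks, R = 120 rev/s (7200 RPM), b = 4 KB and N = 500 KB/track, seek dominates (~7 ms) over rotation (~4.2 ms) and transfer (~0.07 ms), which is why disk scheduling targets seek latency.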

Disk scheduling

The goal is to optimize the seek latency (specifically, the number of tracks traversed).

Algorithms:

  • FIFO
  • Shortest-Seek-Time-First (SSTF)
    • Pros: short service times
    • Cons: starvation of far requests
      • Can be improved by aging
  • Elevator (SCAN)
    • Scan in one direction; on reaching one end, reverse direction
    • Cons:
      • Tracks the head has just passed wait longest for service
      • Requests at the two ends are serviced less often than those in the middle
  • Circular SCAN (C-SCAN)
    • Only scan in one direction, e.g., left to right, then jump back to the start
    • Cons: wasted movement if there are no requests near the end
  • LOOK, Circular-LOOK
    • Improvements on SCAN and C-SCAN
    • Reverse (or jump back) at the last pending request in each direction instead of travelling to the physical end
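The elevator order can be produced by sorting the requests and sweeping up from the head position, then back down. A minimal sketch, assuming the head moves toward higher track numbers first (helper names are my own):

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Elevator (SCAN) service order: requests at or above `head` in
 * ascending order (upward sweep), then the rest in descending order
 * (downward sweep).  `out` must have room for `n` entries. */
void scan_order(const int *req, int n, int head, int *out)
{
    int *sorted = malloc(n * sizeof *sorted);
    int i, k = 0;

    for (i = 0; i < n; i++)
        sorted[i] = req[i];
    qsort(sorted, n, sizeof *sorted, cmp_int);

    for (i = 0; i < n; i++)        /* upward sweep */
        if (sorted[i] >= head)
            out[k++] = sorted[i];
    for (i = n - 1; i >= 0; i--)   /* downward sweep */
        if (sorted[i] < head)
            out[k++] = sorted[i];

    free(sorted);
}
```

For requests {98, 183, 37, 122, 14, 124, 65, 67} with the head at 53, this yields 65, 67, 98, 122, 124, 183, then 37, 14 on the return sweep.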

RAID

RAID: Redundant Arrays of Independent Disks, Striping + redundancy

Common characteristics of RAID:

  1. Array of physical disks that are visible as single device to OS.
  2. Data is distributed across physical drives of array.
  3. Redundant disk capacity is used for error detection/correction.
  • Pros:
    • Improved I/O performance
    • Enables incremental upgrade
  • Cons: reliability suffers, since more devices increase the probability of a failure
    • Solution: redundancy

RAID Levels

RAID is Striping + redundancy. For striping, we have:

  • Block-level striping
    • Block is the atomic element among disks
  • Bit-level striping
    • Split blocks into smaller strips and distribute them among disks

Levels:

RAID 0

  • Block-level striped set without parity (redundancy)
  • Distribute blocks among disks
  • Pros:
    • Increase capacity
    • Read/Write performance
  • Cons:
    • Increase failure rate (no tolerance)
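Block-level striping in RAID 0 is just a round-robin mapping from logical block number to a (disk, offset) pair. A sketch; real arrays usually stripe in larger chunks:

```c
/* RAID 0 block-level striping, round-robin across ndisks.
 * Maps a logical block number to the disk holding it and the
 * block index within that disk. */
void raid0_map(int block, int ndisks, int *disk, int *offset)
{
    *disk   = block % ndisks;  /* which disk holds the block */
    *offset = block / ndisks;  /* block index within that disk */
}
```

Consecutive logical blocks land on different disks, which is where the read/write parallelism comes from.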

RAID 1

  • Mirrored Set without parity
  • Performance:
    • Good for read
    • Penalty for write
  • Cons:
    • Cost (100% redundancy)

RAID 2

  • Memory-style error-correcting parity
  • Split a block into small strips and distribute them
  • Heads and spindles synchronized
  • Error correction code calculated over bits of data disks (Hamming), stored in other special disks
  • Appropriate for systems with many failures
  • Typically not implemented

RAID 3

  • Bit-interleaved parity
  • Bit-level striping
  • Heads and spindles synchronized
  • Simple parity bits stored in another disk (can be used to infer the correct bits for a failed disk)

RAID 4

  • Block level parity
  • Similar to RAID 3
  • Disks not synchronized
  • Each block on parity disk contains parity information for all data disks
  • To update the parity block, we don’t need to read the other data disks, just the old data and the old parity
    • new parity = old parity XOR old data XOR new data
  • Problem: parity disk gets lots of loads
    • Solution: RAID 5
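The small-write parity update is a byte-wise XOR of the old parity, old data, and new data. A sketch, not a production RAID path:

```c
#include <stddef.h>
#include <stdint.h>

/* RAID 4/5 small-write parity update:
 *   P_new = P_old XOR D_old XOR D_new
 * No other data disk needs to be read, because XOR-ing out the old
 * data and XOR-ing in the new data leaves all other contributions
 * to the parity untouched. */
void parity_update(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```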

RAID 5

  • Striped set with interleaved parity
  • Same as RAID 4, but the parity is spread across all disks
  • No synchronization across disks
  • Large strips
  • Problem: Failure may take long time to recover
    • If 2nd failure happens then, system will lose data

RAID 6

  • Striped set with dual interleaved parity
  • Stores two parity blocks per stripe
  • Uses two independent parity functions, written to two different disks
  • Tolerates two simultaneous disk failures

RAID combinations

  • RAID 1 + 0 = RAID 10
    • RAID 0 built on RAID 1

Flash disks

NAND Flash

  • Solid-state persistent storage technology with “disk-like” user interface.
  • READs/WRITEs are sector-based (page-based)
  • Suited for storage of sequential data.
  • Random access can be “faked” by caching (“shadowing”) data in RAM main memory.

A block (e.g. 128KB) in NAND Flash is divided into several pages (e.g. 2KB).

Operations on NAND flash

Operations:

| Operation | Level | Function            | Latency (µs) |
|-----------|-------|---------------------|--------------|
| READ      | page  | read                | 75           |
| PROGRAM   | page  | write to empty page | 220          |
| ERASE     | block | erase               | 500          |

Note:

  • We cannot overwrite data on a page
  • Cannot erase a single page
  • Blocks start failing after about 100K PROGRAM/ERASE cycles

Flash translation layer (FTL)

  • Use a log-structured FTL
  • FTL dynamically remaps page number with a translation table
  • It also tracks the state of each page: valid, empty, or invalid
  • A pointer to the head of the empty-page array is maintained
  • To overwrite a page:
    • write the data into the first empty page
    • mark the old page as invalid
    • record the new page number in the translation table
    • advance the empty-page pointer
  • Once the empty array reaches a low-water mark, garbage-collect the invalid pages:
    • copy the valid pages of blocks that mix valid and invalid pages into the remaining empty pages, and mark the old copies invalid
    • erase the blocks whose pages are now all invalid
    • update the page marks and move the pointer
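The remap-on-overwrite path can be sketched as a toy FTL. Garbage collection is omitted, and the sizes and names are made up for illustration:

```c
#include <stdint.h>

#define NPAGES   8   /* physical pages (toy size) */
#define NLOGICAL 4   /* logical pages exposed upward */

enum pstate { EMPTY, VALID, INVALID };

/* Toy log-structured FTL: translation table + empty-page pointer. */
struct ftl {
    int map[NLOGICAL];          /* logical page -> physical page (-1 = unmapped) */
    enum pstate state[NPAGES];  /* per-page state */
    int head;                   /* next empty physical page */
    uint8_t data[NPAGES];       /* one byte of "data" per page */
};

void ftl_init(struct ftl *f)
{
    for (int i = 0; i < NLOGICAL; i++) f->map[i] = -1;
    for (int i = 0; i < NPAGES; i++)   f->state[i] = EMPTY;
    f->head = 0;
}

/* Overwrite = write into the next empty page, invalidate the old copy,
 * remap the logical page, and advance the empty-page pointer. */
int ftl_write(struct ftl *f, int lpage, uint8_t byte)
{
    if (f->head >= NPAGES)
        return -1;                           /* would trigger GC here */
    int p = f->head++;
    f->data[p]  = byte;
    f->state[p] = VALID;
    if (f->map[lpage] != -1)
        f->state[f->map[lpage]] = INVALID;   /* old copy is now stale */
    f->map[lpage] = p;
    return 0;
}

uint8_t ftl_read(const struct ftl *f, int lpage)
{
    return f->data[f->map[lpage]];           /* follow the translation table */
}
```

Overwriting the same logical page twice leaves the first physical page marked invalid and the translation table pointing at the fresh copy, which is exactly what lets NAND avoid in-place updates.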

Reference

These are my class notes from taking CSCE 611 at TAMU. Credit to the instructor, Dr. Bettati.


Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.