Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Operating System 5 I/O and Disk Management

2018-04-28

I/O management and device driver

Structure of I/O system

  1. Applications
  2. System call interface
  3. Device interface
    • Block I/O for disk, CD-ROM
    • Streaming for keyboard, mouse, printer
    • Network for Ethernet, Bluetooth
  4. Device Management layer
  5. Device driver
  6. SW/HW interface
  7. Device controller
  8. Device

Device controllers vs. Device drivers

Device controller:

  • opcode register
    • User writes command to be executed
  • operand register
    • Stores parameters for command
  • busy register
    • Check if completed
  • status register
    • Check status, such as error code
  • data buffer
    • Data is available here
    • Can be used to transfer data to device

Device driver:

  • Write opcode and operand register
  • Read busy and status register
  • Read/write data buffer

Programmed I/O, Polling, Interrupts

To access device register:

  • Explicit
    • Port CPU register on device register
  • Memory-mapped
    • Map a portion of memory to the device register

Polling:

  • The device driver keeps checking the busy register
  • Frequency needs to be chosen carefully
  • Easy to implement
  • CPU bandwidth related to I/O is controlled by CPU, not device
    • Better protected against malicious devices (may flooding CPU with interrupts)

Interrupts:

  • Interrupt is raised once the operation is done

Direct Memory Access (DMA)

  • It is easy to let CPU transfer data from buffer to memory, but expensive
  • DMA can use DMA channel to do the transfer

Modern I/O Architectures

  • Under the I/O management layer: we have
    • File system driver
    • Hard disk driver
  • Once a request is made
    • I/O management issues a IRB (I/O Request Block)
    • File system driver figures out the location on disk and sends IRB to disk driver (through I/O layer)
    • Disk driver updates IRB and sends back to FS (through I/O layer)
    • FS returns IRB to I/O handler
    • System call returns
  • This structure can have a volume manager disk driver between FS and disk driver
  • Easy to add power management, plug-play management on top of I/O layer

Structure of device drivers

  • Top end
    • Write, issue command
  • Bottom end, executed after completion of interrupt
    • interrupt service routine
      • critical, executed immediately
    • deferred procedure call (DPC) routine
      • can be delayed
  • Other parts
    • Device management
      • add-device routine
      • initialization routine
      • power management
    • Interaction with I/O manager
      • dispatch routines

Disks

Structure of hard disk drive

  • Platters, plates for data
  • Spindle, in the middle of platter, can rotate
  • Head, read data, one for a platter surface
  • Arm, hold heads

For a platter,

  • Surface
  • Track, a round
    • Cylinder, multiple tracks with the same size
      • can be read without moving arm
  • Sector, a part of a track

Performance modeling

  • Seek latency, move arm
    • time:
    • , track traversal time
    • , num of tracks traversed
    • , startup time
  • Rotational latency, rotate platter
    • time:
    • , number of revolutions per time unit
  • Transfer latency, transfer data from disk to memory
    • time:
    • , bytes to be transferred
    • , bytes on a track

Disk scheduling

Goal is to optimize the seek latency. (especially, number of track traversed).

Algorithms:

  • FIFO
  • Short-Seek-Time-First
    • Pros: short service times
    • Cons: starvation of far requests
      • Can be improved by aging
  • Elevator (SCAN)
    • Scan one-direction, reach one end, change direction
    • Cons:
      • Few requests after the head, since just passing through
      • Requests on two ends are handled rarely
  • Circular SCAN
    • only scan in one direction, etc, left to right
    • Cons: Waste if no requests at end
  • LOOK, Circular-LOOK
    • Improve on SCAN
    • Look forward to check if there are requests at two ends

RAID

RAID: Redundant Arrays of Independent Disks, Striping + redundancy

Common characteristics of RAID:

  1. Array of physical disks that are visible as single device to OS.
  2. Data is distributed across physical drives of array.
  3. Redundant disk capacity is used for error detection/correction.
  • Pros:
    • Improved I/O performance
    • Enables incremental upgrade
  • Cons: Reliability, more devices increase the probability of failure
    • Solution: redundancy

RAID Levels

RAID is Striping + redundancy. For striping, we have:

  • Block-level striping
    • Block is the atomic element among disks
  • Bit-level striping
    • Strip blocks into smaller parts and distribute them among disk

Levels:

RAID 0

  • Block-level striped set without parity (redundancy)
  • Distribute blocks among disks
  • Pros:
    • Increase capacity
    • Read/Write performance
  • Cons:
    • Increase failure rate (no tolerance)

RAID 1

  • Mirrored Set without parity
  • Performance:
    • Good for read
    • Penalty for write
  • Cons:
    • Cost (100% redundancy)

RAID 2

  • Memory-style error-correcting parity
  • Strip a block into small strips and distribute
  • Heads and spindles synchronized
  • Error correction code calculated over bits of data disks (Hamming), stored in other special disks
  • Appropriate for systems with many failures
  • Typically not implemented

RAID 3

  • Bit-interleaved parity
  • Bit-level striping
  • Heads and spindles synchronized
  • Simple parity bits stored in another disk (can be used to infer the correct bits for a failed disk)

RAID 4

  • Block level parity
  • Similar to RAID 3
  • Disks not synchronized
  • Each block on parity disk contains parity information for all data disks
  • To update the parity bits, we don’t need to read other disk, just use old values
    • use
  • Problem: parity disk gets lots of loads
    • Solution: RAID 5

RAID 5

  • Striped set with interleaved parity
  • Same as RAID 4, but parity spread across all data disks
  • No synchronization across disks
  • Large strips
  • Problem: Failure may take long time to recover
    • If 2nd failure happens then, system will lose data

RAID 6

  • Striped set with dual interleaved parity
  • Use two bits to store parity
  • Use two independent parity functions and writes to two disks
  • Tolerates two failures

RAID combinations

  • RAID 1 + 0 = RAID 10
    • RAID 0 built on RAID 1

Flash disks

NAND Flash

  • Solid-state persistent storage technology with “disk-like” user interface.
  • READs/WRITEs are sector-based (page-based)
  • Suited for storage of sequential data.
  • Random access can be “faked” by caching (“shadowing”) data in RAM main memory.

A block (e.g. 128KB) in NAND Flash is divided into several pages (e.g. 2KB).

Operations on NAND flash

Operations:

Operation Level Function Latency/microsec
READ page read 75
PROGRAM page write to empty page 220
ERASE block erase 500

Note:

  • We cannot overwrite data on a page
  • Cannot erase a single page
  • Blocks start failing after about 100K PROGRAM/ERASE cycles

Flash translation layer (FTL)

  • Use a log-structured FTL
  • FTL dynamically remaps page number with a translation table
  • It also keeps track of state of each page, valid, empty, invalid
  • A pointer as head of empty page array is maintained
  • To overwrite a page
    • write data into the first empty page
    • mark old page as invalid
    • map new page number in the translation table
    • advance pointer of empty pages
  • Once empty array reaches low-water mark, do garbage collection of invalid pages
    • move valid pages in combined valid/invalid pages into remaining empty pages and mark old pages as invalid
    • erase blocks with all invalid pages
    • change page marks and move pointer

Reference

This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2025. All rights reserved by melonskin. Powered by Jekyll.