
TAMU Operating System 6 File System

2018-04-28

Introduction

File

  • File: A collection of data elements, grouped together for purpose of access control, retrieval, and modification.
  • Persistence: Often, files are mapped onto physical storage devices, usually nonvolatile.
  • Some modern systems define a file simply as a sequence, or stream, of data units.

File system: interface, naming, persistence

File System: Software responsible for

  • creating, destroying, reading, modifying, moving files
  • naming files
  • controlling access to files
  • managing the resources used by files.

File access

  • A file is perceived as an ordered collection of records
  • Record: Contiguous block of information transferred during a logical read/write operation
    • Can be of fixed or variable length

Access methods:

  • Sequential access: access records in sequence
  • Random access: access records in arbitrary order

File organization

Pile:

  • Records stored in chronological order
  • Variable-length records
  • Random access to record by search of whole file
  • Difficult to insert/delete/modify records
    • Solution: Updated records can be appended to a log file and merged back periodically

Sequential:

  • Fixed-format records
  • Records often stored in order of key field.
  • Good for applications that process all records.
  • No adequate support for random access.
  • What about adding a new record?
    • A separate pile file serves as a log or transaction file; new records are appended there and merged in later.

Indexed Sequential:

  • Similar to sequential file, with two additions
    • Index to file supports random access
    • Overflow file indexed from main file
  • Record is added by appending it to overflow file and providing link from predecessor
  • Variable-length records
  • Multiple indices can be added for one record
  • One index can point to multiple records (partial index)

File allocation

Requirement of file allocation

  • Space on storage device must be utilized effectively
  • File operations must be supported efficiently
    • Random access
    • Sequential access
    • Append/insert/delete file data
    • Create/delete files

Contiguous allocation

  • Example: CD, DVD
  • File mapped onto a sequence of adjacent physical blocks
  • A simple table, stored at the start of the disk, records each file's name, start block, and length
  • Pros:
    • Minimize head movements
    • Simple for sequential and random access
  • Cons:
    • Difficult to change size
    • Size must be known a priori
    • External fragmentation
    • Solution: preallocate the size at creation time, though this leads to internal fragmentation
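Random access under contiguous allocation is a single arithmetic step: the physical block is the start block plus the byte offset divided by the block size. A minimal sketch in C, with an invented file-table entry and an assumed 512-byte block size:

```c
#include <stddef.h>

/* Hypothetical file-table entry for contiguous allocation:
   each file occupies 'length' adjacent blocks starting at 'start'. */
struct file_entry {
    const char *name;
    size_t start;   /* first physical block */
    size_t length;  /* number of blocks */
};

enum { BLOCK_SIZE = 512 };

/* Map a byte offset within the file to a physical block number.
   Returns -1 if the offset lies beyond the allocated extent. */
long physical_block(const struct file_entry *f, size_t offset) {
    size_t logical = offset / BLOCK_SIZE;
    if (logical >= f->length)
        return -1;
    return (long)(f->start + logical);
}
```

Growing a file means finding a new contiguous region and copying it, which is why changing the size is difficult.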

Linked allocation

  • Scatter logical blocks all over the disk
  • Link each block to next one by forward pointer
  • The file table records each file's name, start block, and end block
  • Pros:
    • Easy to insert and delete blocks
    • No upper limit on file size necessary a priori
  • Cons:
    • Random access is difficult and expensive
      • Start from beginning
    • Sequential access expensive
      • Head travels a lot
    • Overhead for next pointers in blocks
      • Storage
    • Reliability
      • If a block pointer is lost or corrupted, all following blocks become unreachable

Variations of linked allocation

  • Maintain all pointers as separate list
  • Preferably load the list into main memory

File Allocation Table (FAT)

  • Usage: Many Windows products, USB, SD cards
  • e.g., in FAT-16, the file allocation table contains 16-bit entries
  • One entry per cluster (a sequence of contiguous physical blocks)
  • Each entry stores the index of the next cluster of the file
  • The table also tracks free clusters, whose next pointers are null
  • The table may be stored twice to tolerate failures
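The next-cluster table can be sketched as a simple array simulation; the entry values below (0 for free, 0xFFFF for end-of-chain) are assumptions for illustration, loosely modeled on FAT-16 conventions:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy FAT-16-style table: fat[i] holds the index of the cluster that
   follows cluster i in its file. Marker values are assumptions for
   this sketch: FAT_FREE marks a free cluster, FAT_EOC end of chain. */
enum { FAT_FREE = 0x0000, FAT_EOC = 0xFFFF, FAT_SIZE = 16 };

/* Count the clusters of the file starting at 'first', following
   the chain until the end-of-chain marker. */
size_t chain_length(const uint16_t fat[], uint16_t first) {
    size_t n = 0;
    for (uint16_t c = first; c != FAT_EOC; c = fat[c])
        n++;
    return n;
}
```

Random access to a byte offset still requires walking the chain from the first cluster, which is why the table is usually cached in memory.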

Indexed allocation

  • Keep all pointers to blocks in one location, index block
    • One index block per file
  • Pros:
    • Supports random access
    • No external fragmentation
  • Cons:
    • Internal fragmentation in index blocks
      • Trade-off: fragmentation vs. file length
  • Support large files: Use linked index blocks
    • The last entry in the index block points to another index block
    • Multilevel indexing

UNIX Scheme

  • The inode of a file has 13 block-address entries
  • Direct indices: the first 10 entries point directly to data blocks
  • Single indirect index: 1 entry points to an index block whose entries point to data blocks
  • Double indirect index: 1 entry points to an index block whose entries point to single indirect index blocks
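Assuming 512-byte blocks and 2-byte block addresses (so 256 pointers per index block), the scheme above maps a logical block number to its indirection level. A sketch; the constants follow these notes, not any particular real implementation:

```c
#include <stddef.h>

/* Sketch of the inode indexing scheme described above, assuming
   512-byte blocks and 256 pointers per index block. */
enum { NDIRECT = 10, NINDIRECT = 256 };

typedef enum { DIRECT, SINGLE_INDIRECT, DOUBLE_INDIRECT, TOO_BIG } level_t;

/* Classify which kind of inode entry covers logical block 'lbn'. */
level_t block_level(size_t lbn) {
    if (lbn < NDIRECT)                        /* entries 0..9 */
        return DIRECT;
    lbn -= NDIRECT;
    if (lbn < NINDIRECT)                      /* one index block */
        return SINGLE_INDIRECT;
    lbn -= NINDIRECT;
    if (lbn < (size_t)NINDIRECT * NINDIRECT)  /* index block of index blocks */
        return DOUBLE_INDIRECT;
    return TOO_BIG;
}
```

With these numbers, direct entries cover the first 10 blocks, the single indirect entry the next 256, and the double indirect entry the next 256 × 256.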

Unix file system

File system layout

  • boot block
    • bootstrap code for operating system
  • superblock
    • information about the state of file system, e.g., size, free space info
  • inode list
    • and some other structures for file metadata
    • For an inode
      • If marked large, the first 7 addresses point to index blocks; the last one is a double indirect index.
  • data blocks

File-system information

// superblock
struct superblock {
    ushort isize;      /* number of blocks reserved for the inode list */
    ushort fsize;      /* size of the file system, in blocks */
    ushort nfree;      /* number of valid entries in free[] */
    ushort free[100];  /* free-block list; linked overflow blocks allow more */
    ushort ninode;     /* number of valid entries in inode[] */
    ushort inode[100]; /* cache of free inode numbers in the inode list */
    char flock, ilock, fmod; /* free-list lock, inode-list lock, modified flag;
                                if modified, write superblock back to disk */
    ushort time[2];    /* time of last modification */
};
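The free[] array behaves as an in-core stack of free block numbers. A minimal sketch of allocation and deallocation, ignoring the linked overflow blocks, the flock lock, and the fmod flag:

```c
#include <stddef.h>

typedef unsigned short ushort;

/* Minimal in-memory model of the superblock's free-list cache.
   The linked overflow blocks and locking are omitted. */
struct sb {
    ushort nfree;      /* number of valid entries in free[] */
    ushort free[100];  /* stack of free block numbers */
};

/* Pop a free block number; 0 signals "none available" in this sketch
   (in the real system, block 0 is never a data block). */
ushort alloc_block(struct sb *s) {
    if (s->nfree == 0)
        return 0;                /* would refill free[] from an overflow block */
    return s->free[--s->nfree];
}

/* Push a freed block number back onto the cache. */
void free_block(struct sb *s, ushort b) {
    if (s->nfree < 100)
        s->free[s->nfree++] = b; /* else: would spill free[] to an overflow block */
}
```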

Inodes

// inode
struct { /* 32 Byte */
    ushort flags; /* used, perm, large, type, etc. */
    char nlinks; /* # of hard links (names) to file */
    char uid; /* user id of owner */
    char gid; /* group id of owner */
    char size0; /* high byte of 24-bit size */
    ushort size1; /* low 2 byte of 24-bit size */
    ushort addr[8]; /* allocation table array */
    ushort actime[2]; /* time of last access */
    ushort modtime[2]; /* time of last modification */
}
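The 24-bit file size is split across size0 (the high byte) and size1 (the low two bytes); recombining it is a shift and an OR:

```c
#include <stdint.h>

/* Recombine the 24-bit file size from the inode fields above:
   size0 holds the high byte, size1 the low two bytes. */
uint32_t inode_size(unsigned char size0, uint16_t size1) {
    return ((uint32_t)size0 << 16) | size1;
}
```

The maximum representable file size under this encoding is 2^24 - 1 bytes, i.e., 16 MB minus one byte.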

Naming and directories

  • Name information is kept in separate directory files, which contain entries of the form (inode number, file name)
  • Files can have multiple names (hard links)
    • The system call link("/dirA/n1", "/dirB/n2") adds a name
    • The system call unlink("/dirA/n1") removes a name
    • When the number of names drops to 0, the file's data is removed
  • A directory file is created by assigning an inode to it
  • Directory files are given names by adding entries to their parent directory files
  • The first inode in the inode list is reserved for the root directory
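This behavior can be observed with the POSIX link() and unlink() calls; st_nlink reports the current number of names for an inode. The file names in this sketch are arbitrary:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Demonstrate hard links: two names, one inode; the data survives
   until the last name is removed. Returns the link count observed
   while both names exist (expected: 2), or -1 on setup failure. */
int demo_links(void) {
    struct stat st;
    unlink("n1.tmp");                  /* clean any leftovers */
    unlink("n2.tmp");

    FILE *f = fopen("n1.tmp", "w");    /* create a file (one name) */
    if (!f) return -1;
    fputs("hello", f);
    fclose(f);

    link("n1.tmp", "n2.tmp");          /* second name, same inode */
    stat("n1.tmp", &st);
    int nlinks = (int)st.st_nlink;     /* now 2 */

    unlink("n1.tmp");                  /* remove one name; data remains */
    stat("n2.tmp", &st);               /* still reachable via n2.tmp */
    unlink("n2.tmp");                  /* link count hits 0: data freed */
    return nlinks;
}
```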

Evaluation

Pros:

  • Data in small files can be accessed directly from the inode.
    • One read to fetch inode, another to fetch first data block.
  • Larger files can be accessed efficiently
  • No external fragmentation.

Cons:

  • Original file system uses 512-byte blocks.
    • Seek and rotational overhead is large relative to a 512-byte transfer
    • Number of entries in a block is small
  • Inodes kept separately from data, causing long seeks to access data.
    • Head needs to move far
  • Inodes of files in a common directory not kept together, causing low performance when searching directories.
  • Data blocks of a file are not stored together.
  • Free list quickly scrambles, increasing overhead of finding free blocks.

Unix fast file system

FFS approach:

  • Increase block size
  • Make file system “disk-aware”
  • Some functional improvements

Increase block size

  • Increased to at least 4096 bytes
    • Block size power-of-two multiple of 4K bytes
    • Defined during file system creation
    • Stored in superblock
  • Internal fragmentation increases as well
    • Solution: split blocks into fragments
    • Fragment size is defined at file system creation time
      • e.g., a 4096/1024 configuration has 4KB blocks and 1KB fragments
    • A block can hold fragments of different files

Disk-aware FS

  1. Cylinder groups
    • Groups of multiple adjacent cylinders
    • Each group maintains its own copy of the superblock, plus an inode bitmap, data bitmap, inodes, and data blocks
    • Allocation of directories and files:
      • keep related stuff together
        • blocks of same file
        • inodes and data blocks
        • files and directories
  2. FS parameterization
    • Goal: Parameterize processor capabilities and disk characteristics so that blocks can be allocated in an optimal, configuration-dependent way.
    • Allocate new block rotationally well-positioned. (minimize rotation latency)
    • Disk Parameters:
      • number of blocks per track
      • disk spin rate.
    • CPU Parameters:
      • expected time to service interrupt and schedule new disk transfer

Functional enhancements

  1. Arbitrary-length file names
  2. File locking
  3. Symbolic links
  4. “Rename” system call added
  5. Quotas
    • allow the administrator to specify the maximum number of inodes and blocks that a user can allocate

Journaled file systems

System crash

Steps to append to a file:

  1. Write data block
  2. Update data bitmap
  3. Update inode

If the system crashes after:

  1. step 1 only: OK, the FS is consistent, though the appended data is lost
  2. steps 1 and 2: bad, the block is marked used but never referenced (leaked)
  3. step 3 under write order (2, 3, 1): bad, the FS is inconsistent; the inode points to a block whose data was never written

Post-recovery, file system checkers

  1. Check if superblock is OK
    • If corrupted, it can be restored from a redundant copy kept elsewhere on disk
  2. Scan inodes
    • All data blocks not referred in inodes can be freed in data bitmap
  3. Update inode bitmap
  4. Check state of inodes
    • If error cannot be fixed, delete inode and data
  5. Check link counters in inodes
    • Start from root, check if consistent
  6. Check duplicate pointer to blocks
  7. Check for bad block pointers
    • out of range
  8. Check content of directories

Pros: No overhead during normal disk operations

Cons:

  • Requires very detailed information about file system
  • Cannot correct all types of errors
  • Expensive

Pro-active, journaling

  • Log transactions in a separate portion of the disk, called the journal.
    • Journal writes may be buffered in memory first and flushed later
  • Log first, then flush the changes to their final locations on disk
  • Delete transactions from the journal after a successful flush
  • On reboot after a crash, redo the logged transactions
  • Write the transaction-end record to the journal only after all of the transaction's entries are on disk

A typical transaction:

  • TX begin
  • Write data, bitmap, inode
  • TX end
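Crash recovery then reduces to replaying only transactions that were closed by a TX-end record. A toy in-memory simulation (all names and types here are invented for the sketch):

```c
#include <stddef.h>

/* Toy journal replay: records describe block writes grouped into
   transactions. On reboot, only transactions closed by TX_END are
   replayed; an incomplete tail transaction is discarded. */
typedef enum { TX_BEGIN, TX_WRITE, TX_END } rec_type;

struct record {
    rec_type type;
    size_t block;  /* target block (TX_WRITE only) */
    int value;     /* data to write (TX_WRITE only) */
};

/* Replay 'n' journal records into 'disk'. Returns the number of
   complete transactions applied. */
int replay(const struct record *j, size_t n, int *disk) {
    int applied = 0;
    size_t start = 0;
    for (size_t i = 0; i < n; i++) {
        if (j[i].type == TX_BEGIN)
            start = i + 1;                      /* remember transaction body */
        else if (j[i].type == TX_END) {
            for (size_t k = start; k < i; k++)  /* commit this transaction */
                disk[j[k].block] = j[k].value;
            applied++;
        }
    }
    return applied;
}
```

The incomplete tail transaction is simply ignored, matching the rule that TX end is written only after everything else is safely in the journal.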

Data vs. Meta-data Journaling

  • Write operations are slow because journaling doubles the writes.
  • Journal meta data only (inodes, bitmap)
  • Order is important
    • Write data block first
    • Write meta data

Journaling and block re-use

Problem: re-use blocks

  • Newly written data blocks may be overwritten when a previously journaled transaction is replayed
  • Solution:
    • Do not re-use blocks until after the “delete” has been purged from the journal.
    • Linux ext3: special “revoke” records prevent deleted (or to-be-deleted) data blocks from being overwritten during replay.

Log-structured file systems

New changes in hardware

  • CPU speed, RAM size increases exponentially
  • Disk
    • Transfer bandwidth increases significantly with RAID
    • Latency: no major improvement
  • More RAM leads to
    • Caches handle large portions of READ requests
    • WRITE requests dominate disk traffic

Large write buffers:

  • Buffer large number of WRITE requests before writing to disk
  • Increase efficiency of individual WRITE operation
    • sequential transfer rather than random
  • Buffered changes may be lost on a crash

Problems with Unix FFS

  • FFS attempts to lay out file data sequentially, but
    • Files are physically separated.
    • inodes are separate from file content.
    • Directory entries are separate from file content.
  • FFS’s WRITE operations are synchronous.
    • File data is written asynchronously
    • Meta-data (inodes, directories, bitmaps) is written synchronously
      • It must be written to disk immediately

Log-structured FS

Fundamental idea: Focus on Write performance!

  • Buffer file system changes in file cache.
    • File data, directories, inodes, …
  • Write changes to disk sequentially.
    • Aggregate small random writes into large asynchronous sequential writes.
    • To a sequence of blocks
    • Don’t overwrite old data, just append to the end

Write buffering and sequential writing

Read data

  • To read data from log, we need to find inodes
    • Use an inode map to keep the location of inodes
  • LFS adds index structures in log to allow for random access.
    • inode map
    • Traditional “logs” require sequential scans to retrieve data
  • inode is identical to FFS:
    • Once inode is read, number of disk I/Os to read file is same for LFS and FFS
  • inode position is not fixed
    • Therefore, store mapping of files to inodes in inode-maps
    • inode maps largely cached in memory

Write data

  • Find space for the blocks
    • Remove stale inodes and data so that their blocks can be freed
  • Maintain sufficiently long segments (free space) to allow sequential writes
    • Solution 1: Thread log through available “holes”.
      • Problem: Fragmentation
    • Solution 2: De-Fragment disk space (compact live data)
      • Problem: cost of copying live data.
    • LFS Solution: Eliminate fragmentation through fixed-sized “holes” (segments)
      • Reclaim segments by copying segment and cleaning

Garbage collection

Compact live data in segments by:

  1. Read a number of segments into memory.
  2. Identify live data in these segments.
  3. Write live data back into smaller number of segments. (update inode map as well)

To identify live data blocks, LFS maintains a segment summary block in each segment; with this, no free-block list is needed.
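The three cleaning steps can be sketched for a single segment: the live array stands in for the segment summary block, and live blocks are compacted to the write frontier. Names and sizes are invented for the sketch:

```c
#include <stddef.h>

/* Toy LFS segment cleaner: copy only the live blocks of a segment
   to a fresh output segment, after which the whole input segment
   can be reclaimed. */
enum { SEG_BLOCKS = 8 };

/* Compact segment 'seg' into 'out' starting at position 0;
   returns the number of live blocks copied. */
size_t clean_segment(const int seg[SEG_BLOCKS],
                     const int live[SEG_BLOCKS],
                     int out[SEG_BLOCKS]) {
    size_t n = 0;
    for (size_t i = 0; i < SEG_BLOCKS; i++)
        if (live[i])            /* summary says block i is still live */
            out[n++] = seg[i];  /* real LFS also updates the inode map */
    return n;
}
```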

Reference

This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.

