TAMU Operating System 6 File System

Introduction

File

File: A collection of data elements, grouped together for purpose of access control, retrieval, and modification.
Persistence: Often, files are mapped onto physical storage devices, usually nonvolatile.
Some modern systems define a file simply as a sequence, or stream, of data units.

File system: interface, naming, persistence

File System: Software responsible for

creating, destroying, reading, modifying, moving files
naming files
controlling access to files
management of resources used by files.

File access

A file is perceived as an ordered collection of records
Record: Contiguous block of information transferred during a logical read/write operation
- Can be of fixed or variable length

Access methods:

Sequential access: access records in sequence
Random access: access records in arbitrary order

File organization

Pile:

Records stored in chronological order
Variable-length records
Random access to record by search of whole file
Difficult to insert/delete/modify records
- Solution: Updated record can be appended to log file and merged at some moment

Sequential:

Fixed-format records
Records often stored in order of key field.
Good for applications that process all records.
No adequate support for random access.
What about adding new record?
- Separate pile file keeps log file or transaction file.

Indexed Sequential:

Similar to sequential file, with two additions
- Index to file supports random access
- Overflow file indexed from main file
Record is added by appending it to overflow file and providing link from predecessor
Variable-length records
Multiple indices can be added for one record
One index can point to multiple records (partial index)

File allocation

Requirement of file allocation

Space on storage device must be utilized effectively
File operations must be supported efficiently
- Random access
- Sequential access
- Append/insert/delete file data
- Create/delete files

Contiguous allocation

Example: CD, DVD
File mapped onto a sequence of adjacent physical blocks
Use a simple table to store file, start, length, written on start of a disk
Pros:
- Minimize head movements
- Simple for sequential and random access
Cons:
- Difficult to change size
- Size must be known a priori
- External fragmentation
- Solution: set size while initializing, but lead to internal fragmentation

Linked allocation

Scatter logical blocks all over the disk
Link each block to next one by forward pointer
File table contains file, start block, end block
Pros:
- Easy to insert and delete blocks
- No upper limit on file size necessary a priori
Cons:
- Random access is difficult and expensive
  - Start from beginning
- Sequential access expensive
  - Head travels a lot
- Overhead for next pointers in blocks
  - Storage
- Reliability
  - If lose a block, all following blocks are lost

Variations of linked allocation

Maintain all pointers as separate list
Preferably load the list into main memory

File Allocation Table (FAT)

Usage: Many Windows products, USB, SD cards
e.g., FAT-16, file allocation tables contains 16-bits entries
One entry per cluster (a sequence of continuous physical blocks)
The table stores the index of the next blocks
The table also maintains free blocks whose next pointers are null
Table may be stored twice to tolerant failure

Indexed allocation

Keep all pointers to blocks in one location, index block
- One index block per file
Pros:
- Supports random access
- No external fragmentation
Cons:
- Internal fragmentation in index blocks
  - Trade-off: fragmentation vs. file length
Support large files: Use linked index blocks
- The last entry in index block points to another index blocks
- Multilevel indexing

UNIX Scheme

The inode of a file has 13 entries of block address
Direct indices: first 10 entries point to data blocks
Single indirect index: 1 entry points to a index block, whose entries point to data blocks
Double indirect index: 1 entry points to a single indirect index block

Unix file system

File system layout

boot block
- bootstrap code for operating system
superblock
- information about the state of file system, e.g., size, free space info
inode list
- and some other structures for file metadata
- For an inode
  - If marked large, first 7 addr point to index block, last one is double indirect.
data blocks

File-system information

// superblock
struct {
    ushort isize; // size of inode list, number of blocks reserved
    ushort fsize; // size of file system block
    ushort nfree; // counter of free blocks
    ushort free[100]; // free blocks, can use linked overflow blocks to allow more
    ushort ninode; // number of free inodes
    ushort inode[100]; // store position of free inodes in inode list, as a cache
    char flock, ilock, fmod; // lock; and if been modified, if so, write superblock back to disk
    ushort time[2]; // when modified last
}

Inodes

// inode
struct { /* 32 Byte */
    ushort flags; /* used, perm, large, type, etc. */
    char nlinks; /* # of hard links (names) to file */
    char uid; /* user id of owner */
    char gid; /* group id of owner */
    char size0; /* high byte of 24-bit size */
    ushort size1; /* low 2 byte of 24-bit size */
    ushort addr[8]; /* allocation table array */
    ushort actime[2]; /* time of last access */
    ushort modtime[2]; /* time of last modification */
}

Naming and directories

Name information is contained in separate Directory Files, which contain entries of type: inode number of file, name of file
Files can have multiple names (hard links)
- system call link("/dirA/n1", "/dirB/n2")
- system call remove name unlink("/dirA/n1")
- if number of names is 0, remove data
Can create directory file by assigning inode to it
Directory files are given names by add entries in their parent directory files
First inode in inode list is for the root directory

Evaluation

Pros:

Data in small files can be accessed directly from the inode.
- One read to fetch inode, another to fetch first data block.
Larger files can be accessed efficiently
No external fragmentation.

Cons:

Original file system uses 512-byte blocks.
- Significant overhead to read 512 bytes
- Number of entries in a block is small
Inodes kept separately from data, causing long seeks to access data.
- Head needs to move far
Inodes of files in a common directory not kept together, causing low performance when searching directories.
Data blocks of a file are not stored together.
Free list quickly scrambles, increasing overhead of finding free blocks.

Unix fast file system

FFS approach:

Increase block size
Make file system “disk-aware”
Some functional improvements

Increase block size

Increased to at least 4096 bytes
- Block size power-of-two multiple of 4K bytes
- Defined during file system creation
- Stored in superblock
Internal fragmentation increased as well
- Split blocks into fragments
- Fragments size defined at file system creation time
  - e.g., 4096/1024 has 4KB blocks and 1KB fragments
- A block can have fragments for different files

Disk-aware FS

Cylinder groups
- Groups of multiple adjacent cylinders
- Each group maintains owen copy of superblock, inode bitmap, data bitmap, inodes and data blocks
- Allocation of directories and files:
  - keep related stuff together
    - blocks of same file
    - inodes and data blocks
    - files and directories
FS parameterization
- Goal: Parameterize processor capabilities and disk characteristics so that blocks can be allocated in an optimal, configuration-dependent way.
- Allocate new block rotationally well-positioned. (minimize rotation latency)
- Disk Parameters:
  - number of blocks per track
  - disk spin rate.
- CPU Parameters:
  - expected time to service interrupt and schedule new disk transfer

Functional enhancements

Arbitrary-length file names
File locking
Symbolic links
“Rename” system call added
Quotas
- allows admin to specify max number of inodes and blocks that user can allocate

Journaled file systems

System crash

Steps to append to a file:

Write data block
Update data bitmap
Update inode

If system crashed after:

step 1, OK, FS consistent, data lost
step 2, Bad, block leaked
step 3 with order (2, 3, 1), Bad, FS inconsistent

Post-recovery, file system checkers

Check if superblock is OK
- If corrupted, can restore the data from somewhere else
Scan inodes
- All data blocks not referred in inodes can be freed in data bitmap
Update inode bitmap
Check state of inodes
- If error cannot be fixed, delete inode and data
Check link counters in inodes
- Start from root, check if consistent
Check duplicate pointer to blocks
Check for bad block pointers
- out of range
Check content of directories

Pros: No overhead during normal disk operations

Cons:

Requires very detailed information about file system
Cannot correct all types of errors
Expensive

Pro-active, journaling

Log transactions at a separated portion in disk, called journal.
- May write to memory first, the flush sometimes later
Log then flush changes to the disk
Delete transactions after successful flush
Redo transactions at reboot from crash
Write the transaction end record to the journal after writing all stuff in the disk

A typical transaction:

TX begin
Write data, bitmap, inode
TX end

Data vs. Meta-data Journaling

Write operations are slow because journaling doubles the writes.
Journal meta data only (inodes, bitmap)
Order is important
- Write data block first
- Write meta data

Journaling and block re-use

Problem: re-use blocks

Newly updated data blocks may be overwritten by previously journaled transaction
Solution:
- Do not re-use blocks until after “delete” purged from journal.
- Linux ext3: special “revoke” records, which prevent data blocks (deleted or to be deleted) to be written to disk.

Log-structured file systems

New changes in hardware

CPU speed, RAM size increases exponentially
Disk
- Transfer bandwidth increases significantly with RAID
- Latency: no major improvement
More RAM leads to
- Caches handle large portions of READ requests
- WRITE requests dominate disk traffic

Large write buffers:

Buffer large number of WRITE requests before writing to disk
Increase efficiency of individual WRITE operation
- sequential transfer rather than random
Changes may lose

Problems with Unix FFS

FFS attempts to lay out file data sequentially, but
- Files are physically separated.
- inodes are separate from file content.
- Directory entries are separate from file content.
FFS’s WRITE operations are synchronized.
- File data is written asynchronously
- Meta-data (inodes, directories, bitmaps) written synchronously
  - Need to be written to disk directly

Log-structured FS

Fundamental idea: Focus on Write performance!

Buffer file system changes in file cache.
- File data, directories, inodes, …
Write changes to disk sequentially.
- Aggregate small random writes into large asynchronous sequential writes.
- To a sequence of blocks
- Don’t overwrite old data, just append to the end

Write buffering and sequential writing

Read data

To read data from log, we need to find inodes
- Use an inode map to keep the location of inodes
LFS adds index structures in log to allow for random access.
- inode map
- Traditional “logs” require sequential scans to retrieve data
inode is identical to FFS:
- Once inode is read, number of disk I/Os to read file is same for LFS and FFS
inode position is not fixed
- Therefore, store mapping of files to inodes in inode-maps
- inode maps largely cached in memory

Write data

Find space for the blocks
- Remove stale inode and data, blocks can be freed
Maintain sufficiently long segments (free space) to allow sequential writes
- Solution 1: Thread log through available “holes”.
  - Problem: Fragmentation
- Solution 2: De-Fragment disk space (compact live data)
  - Problem: cost of copying live data.
- LFS Solution: Eliminate fragmentation through fixed-sized “holes” (segments)
  - Reclaim segments by copying segment and cleaning

Garbage collection

Compact live data in segments by:

Read a number of segments into memory.
Identify live data in these segments.
Write live data back into smaller number of segments. (update inode map as well)

To identify live data blocks, we maintain segment summary block in segment. We don’t need free block list.

Reference

This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.

AI 2

Algorithm 17

Amazon 1

Authorization 1

Blog 3

Bootstrap 1

C++ 1

CCpp 5

CSS 2

Cloud 3

Code 1

Crawler 1

DNS 1

Database 17

DeepLearning 1

Design 17

Development 1

Docker 1

English 1

Express 1

GDB 1

Go 3

Google 4

HTML 3

IOS 1

Java 17

Javascript 4

Jekyll 1

Linux 4

MacOS 2

MachineLearning 17

Markdown 4

Mobile 1

MongoDB 2

Multi-threading 3

NAS 1

Network 11

NeuralNetwork 10

Node 1

OS 8

Public-speaking 1

Python 15

RESTful 1

Rails 9

React 1

Redis 1

Ruby 6

Shell 2

Spring 2

System 17

TCP 1

TDD 1

Thread 2

Vim 1

awk 1

git 1

jQuery 1

media 1

network 1

php 1