Introduction
File
- File: A collection of data elements, grouped together for purpose of access control, retrieval, and modification.
- Persistence: Often, files are mapped onto physical storage devices, usually nonvolatile.
- Some modern systems define a file simply as a sequence, or stream, of data units.
File system: interface, naming, persistence
File System: Software responsible for
- creating, destroying, reading, modifying, moving files
- naming files
- controlling access to files
- management of resources used by files.
File access
- A file is perceived as an ordered collection of records
- Record: Contiguous block of information transferred during a logical read/write operation
- Can be of fixed or variable length
Access methods:
- Sequential access: access records in sequence
- Random access: access records in arbitrary order
File organization
Pile:
- Records stored in chronological order
- Variable-length records
- Random access to record by search of whole file
- Difficult to insert/delete/modify records
- Solution: Updated record can be appended to log file and merged at some moment
Sequential:
- Fixed-format records
- Records often stored in order of key field.
- Good for applications that process all records.
- No adequate support for random access.
- What about adding new record?
- Separate pile file keeps log file or transaction file.
Indexed Sequential:
- Similar to sequential file, with two additions
- Index to file supports random access
- Overflow file indexed from main file
- Record is added by appending it to overflow file and providing link from predecessor
- Variable-length records
- Multiple indices can be added for one record
- One index can point to multiple records (partial index)
File allocation
Requirement of file allocation
- Space on storage device must be utilized effectively
- File operations must be supported efficiently
- Random access
- Sequential access
- Append/insert/delete file data
- Create/delete files
Contiguous allocation
- Example: CD, DVD
- File mapped onto a sequence of adjacent physical blocks
- Use a simple table to store
file, start, length
, written on start of a disk - Pros:
- Minimize head movements
- Simple for sequential and random access
- Cons:
- Difficult to change size
- Size must be known a priori
- External fragmentation
- Solution: set size while initializing, but lead to internal fragmentation
Linked allocation
- Scatter logical blocks all over the disk
- Link each block to next one by forward pointer
- File table contains
file, start block, end block
- Pros:
- Easy to insert and delete blocks
- No upper limit on file size necessary a priori
- Cons:
- Random access is difficult and expensive
- Start from beginning
- Sequential access expensive
- Head travels a lot
- Overhead for next pointers in blocks
- Storage
- Reliability
- If lose a block, all following blocks are lost
- Random access is difficult and expensive
Variations of linked allocation
- Maintain all pointers as separate list
- Preferably load the list into main memory
File Allocation Table (FAT)
- Usage: Many Windows products, USB, SD cards
- e.g., FAT-16, file allocation tables contains 16-bits entries
- One entry per cluster (a sequence of continuous physical blocks)
- The table stores the index of the next blocks
- The table also maintains free blocks whose next pointers are null
- Table may be stored twice to tolerant failure
Indexed allocation
- Keep all pointers to blocks in one location, index block
- One index block per file
- Pros:
- Supports random access
- No external fragmentation
- Cons:
- Internal fragmentation in index blocks
- Trade-off: fragmentation vs. file length
- Internal fragmentation in index blocks
- Support large files: Use linked index blocks
- The last entry in index block points to another index blocks
- Multilevel indexing
UNIX Scheme
- The inode of a file has 13 entries of block address
- Direct indices: first 10 entries point to data blocks
- Single indirect index: 1 entry points to a index block, whose entries point to data blocks
- Double indirect index: 1 entry points to a single indirect index block
Unix file system
File system layout
- boot block
- bootstrap code for operating system
- superblock
- information about the state of file system, e.g., size, free space info
- inode list
- and some other structures for file metadata
- For an inode
- If marked large, first 7 addr point to index block, last one is double indirect.
- data blocks
File-system information
// superblock
struct {
ushort isize; // size of inode list, number of blocks reserved
ushort fsize; // size of file system block
ushort nfree; // counter of free blocks
ushort free[100]; // free blocks, can use linked overflow blocks to allow more
ushort ninode; // number of free inodes
ushort inode[100]; // store position of free inodes in inode list, as a cache
char flock, ilock, fmod; // lock; and if been modified, if so, write superblock back to disk
ushort time[2]; // when modified last
}
Inodes
// inode
struct { /* 32 Byte */
ushort flags; /* used, perm, large, type, etc. */
char nlinks; /* # of hard links (names) to file */
char uid; /* user id of owner */
char gid; /* group id of owner */
char size0; /* high byte of 24-bit size */
ushort size1; /* low 2 byte of 24-bit size */
ushort addr[8]; /* allocation table array */
ushort actime[2]; /* time of last access */
ushort modtime[2]; /* time of last modification */
}
Naming and directories
- Name information is contained in separate Directory Files, which contain entries of type:
inode number of file, name of file
- Files can have multiple names (hard links)
- system call
link("/dirA/n1", "/dirB/n2")
- system call remove name
unlink("/dirA/n1")
- if number of names is 0, remove data
- system call
- Can create directory file by assigning inode to it
- Directory files are given names by add entries in their parent directory files
- First inode in inode list is for the root directory
Evaluation
Pros:
- Data in small files can be accessed directly from the inode.
- One read to fetch inode, another to fetch first data block.
- Larger files can be accessed efficiently
- No external fragmentation.
Cons:
- Original file system uses 512-byte blocks.
- Significant overhead to read 512 bytes
- Number of entries in a block is small
- Inodes kept separately from data, causing long seeks to access data.
- Head needs to move far
- Inodes of files in a common directory not kept together, causing low performance when searching directories.
- Data blocks of a file are not stored together.
- Free list quickly scrambles, increasing overhead of finding free blocks.
Unix fast file system
FFS approach:
- Increase block size
- Make file system “disk-aware”
- Some functional improvements
Increase block size
- Increased to at least 4096 bytes
- Block size power-of-two multiple of 4K bytes
- Defined during file system creation
- Stored in superblock
- Internal fragmentation increased as well
- Split blocks into fragments
- Fragments size defined at file system creation time
- e.g., 4096/1024 has 4KB blocks and 1KB fragments
- A block can have fragments for different files
Disk-aware FS
- Cylinder groups
- Groups of multiple adjacent cylinders
- Each group maintains owen copy of superblock, inode bitmap, data bitmap, inodes and data blocks
- Allocation of directories and files:
- keep related stuff together
- blocks of same file
- inodes and data blocks
- files and directories
- keep related stuff together
- FS parameterization
- Goal: Parameterize processor capabilities and disk characteristics so that blocks can be allocated in an optimal, configuration-dependent way.
- Allocate new block rotationally well-positioned. (minimize rotation latency)
- Disk Parameters:
- number of blocks per track
- disk spin rate.
- CPU Parameters:
- expected time to service interrupt and schedule new disk transfer
Functional enhancements
- Arbitrary-length file names
- File locking
- Symbolic links
- “Rename” system call added
- Quotas
- allows admin to specify max number of inodes and blocks that user can allocate
Journaled file systems
System crash
Steps to append to a file:
- Write data block
- Update data bitmap
- Update inode
If system crashed after:
- step 1, OK, FS consistent, data lost
- step 2, Bad, block leaked
- step 3 with order (2, 3, 1), Bad, FS inconsistent
Post-recovery, file system checkers
- Check if superblock is OK
- If corrupted, can restore the data from somewhere else
- Scan inodes
- All data blocks not referred in inodes can be freed in data bitmap
- Update inode bitmap
- Check state of inodes
- If error cannot be fixed, delete inode and data
- Check link counters in inodes
- Start from root, check if consistent
- Check duplicate pointer to blocks
- Check for bad block pointers
- out of range
- Check content of directories
Pros: No overhead during normal disk operations
Cons:
- Requires very detailed information about file system
- Cannot correct all types of errors
- Expensive
Pro-active, journaling
- Log transactions at a separated portion in disk, called journal.
- May write to memory first, the flush sometimes later
- Log then flush changes to the disk
- Delete transactions after successful flush
- Redo transactions at reboot from crash
- Write the
transaction end
record to the journal after writing all stuff in the disk
A typical transaction:
- TX begin
- Write data, bitmap, inode
- TX end
Data vs. Meta-data Journaling
- Write operations are slow because journaling doubles the writes.
- Journal meta data only (inodes, bitmap)
- Order is important
- Write data block first
- Write meta data
Journaling and block re-use
Problem: re-use blocks
- Newly updated data blocks may be overwritten by previously journaled transaction
- Solution:
- Do not re-use blocks until after “delete” purged from journal.
- Linux ext3: special “revoke” records, which prevent data blocks (deleted or to be deleted) to be written to disk.
Log-structured file systems
New changes in hardware
- CPU speed, RAM size increases exponentially
- Disk
- Transfer bandwidth increases significantly with RAID
- Latency: no major improvement
- More RAM leads to
- Caches handle large portions of READ requests
- WRITE requests dominate disk traffic
Large write buffers:
- Buffer large number of WRITE requests before writing to disk
- Increase efficiency of individual WRITE operation
- sequential transfer rather than random
- Changes may lose
Problems with Unix FFS
- FFS attempts to lay out file data sequentially, but
- Files are physically separated.
- inodes are separate from file content.
- Directory entries are separate from file content.
- FFS’s WRITE operations are synchronized.
- File data is written asynchronously
- Meta-data (inodes, directories, bitmaps) written synchronously
- Need to be written to disk directly
Log-structured FS
Fundamental idea: Focus on Write performance!
- Buffer file system changes in file cache.
- File data, directories, inodes, …
- Write changes to disk sequentially.
- Aggregate small random writes into large asynchronous sequential writes.
- To a sequence of blocks
- Don’t overwrite old data, just append to the end
Write buffering and sequential writing
Read data
- To read data from log, we need to find inodes
- Use an inode map to keep the location of inodes
- LFS adds index structures in log to allow for random access.
- inode map
- Traditional “logs” require sequential scans to retrieve data
- inode is identical to FFS:
- Once inode is read, number of disk I/Os to read file is same for LFS and FFS
- inode position is not fixed
- Therefore, store mapping of files to inodes in inode-maps
- inode maps largely cached in memory
Write data
- Find space for the blocks
- Remove stale inode and data, blocks can be freed
- Maintain sufficiently long segments (free space) to allow sequential writes
- Solution 1: Thread log through available “holes”.
- Problem: Fragmentation
- Solution 2: De-Fragment disk space (compact live data)
- Problem: cost of copying live data.
- LFS Solution: Eliminate fragmentation through fixed-sized “holes” (segments)
- Reclaim segments by copying segment and cleaning
- Solution 1: Thread log through available “holes”.
Garbage collection
Compact live data in segments by:
- Read a number of segments into memory.
- Identify live data in these segments.
- Write live data back into smaller number of segments. (update inode map as well)
To identify live data blocks, we maintain segment summary block in segment. We don’t need free block list.
Reference
This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.