A Fast File System for UNIX

McKusick, et al.

 

4.2BSD re-implementation of the UNIX file system.

Retained same interface to file I/O operations and increased performance / throughput.

 

Old UNIX FS

Each disk has 1 or more partitions.  Each partition is defined by a superblock: number of blocks in FS, max number of files, pointer to free list.  Inodes: file ownership, last-modified timestamp, array of indices pointing to data blocks: direct pointers to the first 8 blocks, one singly indirect ptr, one doubly indirect ptr, one triply indirect ptr.  (block size = 512 bytes, ptr = 4 bytes => 128 entries / indirect block)

 

150MB FS: 4MB of inodes followed by 146MB of data -> inodes separated from their data -> multiple disk seeks required to read one file.  Also, data is often non-sequential (next data block on a different cylinder) -> bad throughput.

 

“Old FS” = first round of work -> changed to 1024-byte blocks, and improved reliability (clean repair possible after a crash) -> throughput increased from 2% to 4% of available disk bandwidth.  Main problem: the free list becomes highly fragmented quickly -> files become highly fragmented -> one seek per block access!

 

New FS: Superblock is replicated to protect against catastrophic loss.  (Note: the superblock is never modified after the FS is created.)  Min block size = 4096 bytes -> largest possible file = 2^32 bytes with only two levels of indirection.  Block size is fixed at FS creation.

 

Cylinder groups: one or more consecutive cylinders on the disk.  Each contains: a redundant copy of the superblock, space for inodes, a bitmap of available blocks (fragments, actually) replacing the free list, and summary usage information.  One inode is allocated per 2048 bytes of space (more than will ever be needed).  Cylinder group bookkeeping info is NOT stored at the beginning of each cylinder group, as this would place all copies on the top platter of the disk -> bad for reliability -> a single head crash on the top platter would wipe out the bookkeeping for many cylinder groups.  Solution: offset the start of the bookkeeping info by one track per group -> spirals the redundant info down the disk.  (The space before the bookkeeping info is used for data blocks.)

 

4K blocks are big (most UNIX files are small) -> lots O(50%) of wasted space if only one block size is supported -> blocks are divided into fragments.

 

Linus Torvalds believes fragments are bad.  Solaris uses fragments.  What are the advantages/disadvantages?

 

Expanding a file one fragment at a time may cause fragments to be re-copied many times as the file grows into a full block.  (An app can minimize this effect by writing a full block at a time.)

 

Using fragments reduced wasted space to about the same level as the 1K-block FS.  In general, disk utilization is about the same when new FS fragment size = old FS block size.

Must reserve some free space to keep disk access efficient (i.e., keep the disk under ~90% full).  As free blocks -> 0, throughput -> 50% of optimal.

 

New FS is parameterized to hardware / processor characteristics.  For example, if the processor is slow, skip-sector allocation can be used.  Generalizing -> the superblock contains rotational layout tables.  (I didn’t understand this... clarify with somebody.)  Moving a disk parameterized for a 2ms latency to a processor with a 4ms latency would have a very bad effect on performance: it would miss the next block every time!

 

Layout Policies: global routines and local routines

 

Global policies: decide on the placement of files and directories, spread unrelated data across different cylinder groups, localize data that is accessed concurrently.  (I.e., put all inodes for files in the same directory in the same cylinder group so “ls” runs fast; put all data blocks of a file in the same cylinder group, preferably at rotationally optimal locations in a cylinder.  If file size > 48K, spread it across multiple cylinder groups to prevent any one group from getting too full.)

 

Local policies: use locally optimal scheme to lay out data in blocks

if a given block is requested but is already being used:

1)      look for next rotationally close block (same cylinder)

2)      same cylinder group

3)      hash cylinder group number to figure out next cylinder group to go to
(“quadratic” hash is used because good at finding unused spots in full hash tables)

4)      exhaustive search across all cylinder groups

 

Performance: (compared to old UNIX file org)

ls is 2x faster for dirs with many sub-dirs; up to 8x as fast for dirs containing only a large # of files.

 

(tests had to be run w/ enough data read/written that OS buffering would not affect the results)

 

In new FS, transfer rates don’t degrade over time like in old FS.  (Files don’t get fragmented and split up as much due to new policies :))  Instead, performance only gets worse when the disk is over 90% full.  New FS gets up to 50% of disk bandwidth, mostly due to larger block size and better layout.  Perf is now limited by the speed of the memory copy from kernel disk buffers to user-space buffers (this takes 40% of the time).  Didn’t optimize further because that would require user buffers to be aligned on page boundaries to take advantage of DMA.

 

Further improvements could be obtained by chaining together kernel buffers to read/write more than one block at a time, and by allocating blocks to files more than one at a time for large writes.  Neither was implemented, because the current bottleneck was processor speed.

 

 

Other Enhancements.

 

Long filenames.  Old FS used fixed-size directory entries; new FS uses variable-length entries... max 255 chars per filename.

 

File Locking.  Hard locks: enforced by OS.  Advisory locks: programs cooperate, but no hard enforcement.  On UNIX the super-user can override either scheme; a process running as root would get a lock whenever it wants, so hard enforcement would need a separate mechanism.  Advisory locks: shared or exclusive.  Many processes can hold a shared lock, or one can hold an exclusive lock.  (Files must be open to obtain locks, and need not be closed before another lock can be obtained.  This allows a process to upgrade from a shared to an exclusive lock without having to close the file.)  If a lock can’t be granted, the process blocks, or it can alternatively use a non-blocking call that returns an error if the lock cannot be granted.  No deadlock detection is implemented.

 

Symbolic links.  Hard link: a directory entry pointing to the inode of a file, which may also be linked from other directories.  Soft link: a file that contains a pathname to the file being linked to.  The pathname can be either absolute or relative.  Three system calls were added to support detection, reading, and writing of symbolic links.

 

A system call for renaming, and the ability to enforce disk quotas (soft & hard limits), were also added.