Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, Randolph Y. Wang,
Serverless Network File Systems,
TOCS 14(1):41-79, Feb. 1996
(Summary by George Candea)
This paper presents xFS, a network file system based on an "anything,
anywhere" design philosophy, which also introduces the concept of
transforming fast LANs into I/O backplanes. It draws heavily on Zebra,
LFS, and RAID. The authors argue that centralized file systems are
fundamentally unscalable, performance-limited, expensive, and
prone to poor availability because of their single points of failure. xFS
banks on the observation that, in contemporary LANs, fetching data
from remote memory is faster than from local disk, and fetching from
local memory is faster than from remote memory. In addition to a fast
network, xFS also requires that the communicating machines trust each
other's kernels.
xFS's key contributions are:
- scalable, distributed metadata and cache consistency management,
allowing for dynamic post-failure reconfiguration
- scalable subsetting of storage servers into groups
- scalable log cleaning (since xFS is log based)
Metadata & Data
Control of metadata and cache consistency is split among several
managers. Clients find managers through the manager map,
which is a table indexed by a subset of the bits in a file's
index number, akin to a page table. The map can be dynamically
reconfigured as managers enter/leave the system or in order to balance
load. Since the map is small and changes infrequently, it is
distributed globally to all managers and clients for availability
reasons.
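As a rough sketch of the lookup, one can think of the manager map as a
small array indexed by a few low-order bits of the index number, much
like a page table; the bit width, names, and list representation below
are illustrative assumptions, not details from the paper.

    MAP_BITS = 8                                  # hypothetical: 2^8 map entries
    manager_map = [None] * (1 << MAP_BITS)        # entry -> id of the responsible manager

    def manager_for(index_number):
        # The low-order bits of the index number select a map entry, much as
        # a virtual page number indexes a page table.
        entry = index_number & ((1 << MAP_BITS) - 1)
        return manager_map[entry]

    def reassign(entry, new_manager):
        # Reconfiguration (a manager joins or fails, or load is rebalanced)
        # just rewrites entries and redistributes the small table to all
        # clients and managers.
        manager_map[entry] = new_manager
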
A manager controls cache consistency state and disk location metadata.
- For each block, the manager keeps either a list of the clients
caching the block or the identity of the client holding exclusive
write ownership.
- Disk metadata consists of:
- Directories map file names to index numbers. Just
like in FFS, directories are regular files.
- The imap turns a file's index number into the log address of
the corresponding index node (inode), which in turn contains
the disk addresses of the file's data blocks (or indirect blocks, if
the file is large). Each manager holds the imap for its portion of the
index number space.
- Each stripe group consists of a separate subset of the storage
servers, and clients write each segment across a single stripe group
rather than across all of the system's storage servers. The stripe
group map maps a disk log address to an entry containing the stripe
group's ID, the members of the group, and the group's status
(current or obsolete, depending on which servers have entered or left
the system). The stripe group map is replicated globally to all
clients (see the sketch after this list).
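Taken together, these structures form a chain from a file name to the
storage servers holding a block. A minimal illustration with simplified
and assumed types (the real on-disk structures and the stripe group
map's keying are more involved; everything here is simplified):

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Inode:
        block_addrs: List[int]            # log addresses of the file's data blocks
                                          # (indirect blocks omitted for brevity)

    @dataclass
    class StripeGroup:
        group_id: int
        members: List[str]                # storage servers forming this group
        current: bool                     # False once the membership is obsolete

    directory: Dict[str, int] = {}        # file name -> index number (itself a regular xFS file)
    imap: Dict[int, int] = {}             # index number -> log address of the file's inode
    log: Dict[int, Inode] = {}            # log address -> inode (data blocks elided)
    group_of: Dict[int, StripeGroup] = {} # log address -> stripe group storing that part of the log

    def locate_block(name, block_no):
        # directory -> index number -> imap -> inode -> block log address
        # -> stripe group whose storage servers hold that segment
        index_number = directory[name]
        inode = log[imap[index_number]]
        block_addr = inode.block_addrs[block_no]
        return group_of[block_addr], block_addr
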
System Operation
Reads and writes operate as expected, with clients buffering writes
in local memory until they are committed to a stripe group. A noteworthy
decision is to cache index nodes only at managers, not at clients. While
caching index nodes at clients would let them read directly from storage
servers, doing so has three disadvantages (the resulting manager-routed
read path is sketched after this list):
- since cooperative caching of blocks would not be used, it could
cause more disk reads at the storage servers than necessary
- once a block was read from a storage server, the block's
manager would still need to be told that the client is caching the
block (for future invalidation purposes)
- it would increase complexity by requiring an additional cache
consistency protocol for index nodes; as xFS stands, only the manager
responsible for a given index number ever handles the corresponding
index node
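A minimal sketch of the resulting read path, in which every local miss
is routed through the manager so that cooperative caching works and the
manager learns who caches each block; the classes and method names here
are assumptions, not xFS's actual interfaces:

    class Manager:
        def __init__(self, storage):
            self.cachers = {}          # block id -> set of clients caching it
            self.storage = storage     # block id -> data; stands in for the storage servers

        def read(self, block_id, requester):
            # Prefer a copy already in some other client's memory (cooperative
            # caching); otherwise fall back to the storage servers.
            data = None
            for client in self.cachers.get(block_id, set()):
                if client is not requester:
                    data = client.cache[block_id]
                    break
            if data is None:
                data = self.storage[block_id]
            # Remember the new reader so its copy can be invalidated later.
            self.cachers.setdefault(block_id, set()).add(requester)
            return data

    class Client:
        def __init__(self, manager):
            self.manager = manager
            self.cache = {}            # block id -> data, held in local memory

        def read(self, block_id):
            if block_id in self.cache:
                return self.cache[block_id]            # local hit, no messages
            data = self.manager.read(block_id, self)   # miss: go through the manager
            self.cache[block_id] = data
            return data

Routing misses through the manager is what lets a second client's read be
served from the first client's memory instead of from disk, while keeping
the manager's per-block list of cachers accurate for later invalidations.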
Cache consistency is very similar to the one in Sprite and AFS, except
that xFS manages consistency on a per-block rather than per-file basis.
xFS uses a policy called first writer: when a client creates a
file, xFS chooses an index number that assigns the file's management
to a manager colocated with the client. In traces run by the authors,
they found that, of the 55% of writes that required contacting a
manager, the manager was colocated with the writer 90% of the
time. When a manager
had to invalidate stale cached blocks, the cache being invalidated was
local 1/3 of the time. Finally, when clients flushed data to disk and
informed the manager of the data's new storage location (necessary
because xFS is log structured, so flushed data lands at a new place in
the log), they performed this notification locally 90% of the time. An
interesting observation is that write sharing is rare - 96% of all
block overwrites or deletes are by the block's previous writer.
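A sketch of what first-writer index number assignment could look like,
reusing the page-table-style manager map idea from above; the encoding,
the number of map entries, and the number of managers are all
assumptions made for illustration:

    MAP_BITS = 4                                             # hypothetical: 16 map entries
    manager_map = {e: e % 4 for e in range(1 << MAP_BITS)}   # entry -> manager id (4 managers, made up)
    next_seq = {}                                            # map entry -> next sequence number

    def create_index_number(local_manager_id):
        # Pick an index number whose low-order bits select a map entry owned
        # by the manager on the creating client's machine; the high-order
        # bits just make the number unique within that entry.
        for entry, mgr in manager_map.items():
            if mgr == local_manager_id:
                seq = next_seq.get(entry, 0)
                next_seq[entry] = seq + 1
                return (seq << MAP_BITS) | entry
        raise ValueError("no manager-map entry for this manager")

Because the chosen index number resolves to the colocated manager, most
of the subsequent consistency and location traffic for that file stays
on the writer's own machine, which is what the trace numbers above show.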
Segment cleaning is done similarly to LFS and Zebra, but the log
cleaner is distributed to avoid turning it into a bottleneck. The
responsibility for maintaining each segment's utilization status lies
with the client that wrote the segment, in order to achieve
bookkeeping parallelism and good locality. Clients keep segment
utilization info in s-files, which are regular xFS files. There
is one s-file per client per stripe group, covering all the segments
that client has written to that group, and the s-files are grouped in
per-client directories. Cleaning is
initiated by a leader in each stripe group whenever the number of free
segments drops below a low-water mark or when the group is idle. It
then distributes to each cleaner a subset of s-files to work on, hence
parallelizing the process. Concurrency between cleaner updates and
normal xFS writes is controlled optimistically.
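A sketch of the bookkeeping side of this scheme: each client tracks the
liveness of the segments it wrote (its s-files), and the stripe group's
leader hands out subsets of s-files to cleaners once free segments run
low. The threshold, names, and the round-robin assignment below are
assumptions:

    LOW_WATER_MARK = 32                    # hypothetical free-segment threshold

    class SFile:
        # Utilization info for the segments one client wrote to one stripe group.
        def __init__(self):
            self.live_bytes = {}           # segment id -> bytes still live

        def segment_written(self, segment_id, nbytes):
            self.live_bytes[segment_id] = nbytes        # a fresh segment is fully live

        def blocks_died(self, segment_id, nbytes):
            # The client that wrote the segment decrements its utilization
            # when blocks in it are overwritten or deleted.
            self.live_bytes[segment_id] -= nbytes

    def assign_cleaning_work(free_segments, s_files, cleaners):
        # Run by the stripe group's leader: once free segments fall below the
        # low-water mark, hand each cleaner a subset of the s-files so the
        # cleaning itself proceeds in parallel.
        if free_segments >= LOW_WATER_MARK:
            return {}
        work = {c: [] for c in cleaners}
        for i, sf in enumerate(s_files):
            work[cleaners[i % len(cleaners)]].append(sf)
        return work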