Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, Randolph Y. Wang,
Serverless Network File Systems,
TOCS 14(1):41-79, Feb. 1996

(Summary by George Candea)

This paper presents xFS, a network file system based on an "anything, anywhere" design philosophy, which also introduces the concept of transforming fast LANs into I/O backplanes. It draws heavily on Zebra, LFS, and RAID. The authors argue that centralized file systems are fundamentally unscalable, performance-limited, financially expensive, and offer poor availability due to single points of failure. xFS banks on the observation that, in contemporary LANs, fetching data from remote memory is faster than fetching from local disk, and fetching from local memory is faster than fetching from remote memory. In addition to a fast network, xFS also requires that the communicating machines trust each other's kernels.

Metadata & Data

Control of metadata and cache consistency is split among several managers. Clients locate a file's manager through the manager map, a table indexed by a subset of bits in the file's index number, akin to a page table. The map can be reconfigured dynamically as managers enter or leave the system, or in order to balance load. Because the map is small and changes infrequently, it is replicated globally at all managers and clients for availability reasons. A manager controls cache consistency state and disk location metadata.
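The following is a minimal sketch of the manager-map lookup described above, assuming the low-order bits of the index number select the bucket; the type and function names (ManagerMap, lookupManager) are illustrative, not xFS's actual data structures.

package main

import "fmt"

// ManagerMap is a small, globally replicated table: a few bits of a file's
// index number pick a bucket, and the bucket names the responsible manager,
// much as a page table maps a virtual page to a frame.
type ManagerMap struct {
	entries []int // entries[i] = ID of the manager responsible for bucket i
	bits    uint  // number of low-order index-number bits used as the bucket
}

// lookupManager returns the manager responsible for the file with the given
// index number.
func (m *ManagerMap) lookupManager(indexNumber uint64) int {
	bucket := indexNumber & ((1 << m.bits) - 1)
	return m.entries[bucket]
}

func main() {
	// 4 bits -> 16 buckets spread round-robin over 4 managers; a real system
	// would reassign buckets as managers enter/leave or to rebalance load.
	mm := ManagerMap{entries: make([]int, 16), bits: 4}
	for i := range mm.entries {
		mm.entries[i] = i % 4
	}
	fmt.Println(mm.lookupManager(0x2a)) // index number 0x2a -> bucket 10 -> manager 2
}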

System Operation

Reads and writes operate as expected, with clients buffering writes in local memory until they are committed to a stripe group. A noteworthy design decision is to cache index nodes only at managers, not at clients; while caching them at clients would let clients read directly from storage servers, the authors cite three disadvantages of doing so. Cache consistency is handled much as in Sprite and AFS, except that xFS manages consistency on a per-block rather than per-file basis.
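To make the per-block granularity concrete, here is a hypothetical sketch of consistency bookkeeping at a manager: before a client gets write ownership of a block, cached copies of that one block (not the whole file) are invalidated elsewhere. The names (blockState, grantWrite) and the exact protocol steps are assumptions for illustration, not xFS's implementation.

package main

import "fmt"

type blockID struct {
	file  uint64 // file index number
	block uint32 // block offset within the file
}

type blockState struct {
	cachedAt map[string]bool // clients holding a cached copy of this block
	owner    string          // client currently allowed to write the block
}

type manager struct {
	blocks map[blockID]*blockState
}

// grantWrite revokes other clients' cached copies of one block and records
// the requester as the block's writer; other blocks of the file are untouched.
func (m *manager) grantWrite(id blockID, client string) {
	st := m.blocks[id]
	if st == nil {
		st = &blockState{cachedAt: map[string]bool{}}
		m.blocks[id] = st
	}
	for c := range st.cachedAt {
		if c != client {
			fmt.Printf("invalidate %v at client %s\n", id, c)
			delete(st.cachedAt, c)
		}
	}
	st.owner = client
	st.cachedAt[client] = true
}

func main() {
	m := &manager{blocks: map[blockID]*blockState{}}
	id := blockID{file: 7, block: 0}
	m.blocks[id] = &blockState{cachedAt: map[string]bool{"clientA": true, "clientB": true}}
	m.grantWrite(id, "clientA") // only clientB's cached copy is invalidated
}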

xFS uses a policy called first writer: when a client creates a file, xFS chooses an index number that assigns the file's management to a manager colocated with that client. In the traces the authors analyzed, for the 55% of writes that required contacting a manager, the manager was colocated with the writer 90% of the time. When a manager had to invalidate stale cached blocks, the cache being invalidated was local one third of the time. Finally, when clients flushed data to disk and informed the manager of the data's new storage location (necessary because storage is log-structured), the operation was local 90% of the time. An interesting observation is that write sharing is rare: 96% of all block overwrites or deletes are performed by the block's previous writer.
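A minimal sketch of how a first-writer assignment could be implemented on top of the manager-map sketch above: the creating client picks an index number whose map bucket belongs to its local manager, so later metadata traffic stays local. The helpers (bucketsOf, chooseIndexNumber) and the index-number layout are assumptions, not xFS's API.

package main

import "fmt"

const mapBits = 4 // 16 buckets, matching the manager-map sketch above

// bucketsOf returns the manager-map buckets assigned to a given manager.
func bucketsOf(entries []int, managerID int) []uint64 {
	var buckets []uint64
	for b, m := range entries {
		if m == managerID {
			buckets = append(buckets, uint64(b))
		}
	}
	return buckets
}

// chooseIndexNumber forms an index number whose low-order bits land in one of
// the local manager's buckets; the high-order bits are a per-bucket counter.
func chooseIndexNumber(entries []int, localManager int, nextInBucket map[uint64]uint64) uint64 {
	buckets := bucketsOf(entries, localManager)
	b := buckets[0] // a real policy might also balance across local buckets
	n := nextInBucket[b]
	nextInBucket[b]++
	return n<<mapBits | b
}

func main() {
	entries := make([]int, 1<<mapBits)
	for i := range entries {
		entries[i] = i % 4
	}
	next := map[uint64]uint64{}
	// A file created by the client colocated with manager 2 gets an index
	// number in one of manager 2's buckets, so that manager handles its metadata.
	fmt.Printf("%#x\n", chooseIndexNumber(entries, 2, next))
}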

Segment cleaning is done similarly to LFS and Zebra, but the log cleaner is distributed to avoid turning it into a bottleneck. The responsibility for maintaining each segment's utilization status lies with the client that wrote the segment, in order to achieve bookkeeping parallelism and good locality. Clients keep segment utilization info in s-files, which are regular xFS files. There is one s-file for all segments a client has written to a stripe group, and the s-files are grouped in per-client directories. Cleaning is initiated by a leader in each stripe group whenever the number of free segments drops below a low-water mark or when the group is idle. The leader then distributes to each cleaner a subset of s-files to work on, hence parallelizing the process. Concurrency between cleaner updates and normal xFS writes is controlled optimistically.
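Below is a hypothetical sketch of the leader's role in the distributed cleaner: trigger cleaning when free segments fall below a low-water mark (or the group is idle) and hand each cleaner the s-files for segments its own client wrote, preserving the locality described above. The names, paths, and threshold are illustrative assumptions, not values from the paper.

package main

import "fmt"

const lowWaterMark = 32 // assumed threshold: free segments below this trigger cleaning

// assignSFiles hands each cleaner the s-files describing segments that the
// corresponding client wrote, so bookkeeping stays local and parallel.
func assignSFiles(sFilesByClient map[string][]string) map[string][]string {
	work := make(map[string][]string)
	for client, sFiles := range sFilesByClient {
		work[client] = sFiles // the cleaner on `client` cleans its own segments
	}
	return work
}

// maybeClean is what a stripe group's leader might run periodically.
func maybeClean(freeSegments int, idle bool, sFilesByClient map[string][]string) {
	if freeSegments >= lowWaterMark && !idle {
		return // enough free segments and the group is busy: do nothing
	}
	for cleaner, sFiles := range assignSFiles(sFilesByClient) {
		fmt.Printf("cleaner %s: process %v\n", cleaner, sFiles)
	}
}

func main() {
	sFiles := map[string][]string{
		"clientA": {"/sfiles/clientA/group0"}, // hypothetical per-client s-file paths
		"clientB": {"/sfiles/clientB/group0"},
	}
	maybeClean(12, false, sFiles) // below the low-water mark, so cleaning starts
}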