Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, Randolph Y. Wang,
Serverless Network File Systems,
TOCS 14(1):41-79, Feb. 1996
(Summary by George Candea)
This paper presents xFS, a network file system based on an "anything,
anywhere" design philosophy, which also introduces the concept of
transforming fast LANs into I/O backplanes. It draws heavily on Zebra,
LFS, and RAID. The authors argue that centralized file systems are
fundamentally unscalable, performance-limited, expensive, and
prone to poor availability because of their single points of failure. xFS
banks on the observation that, in contemporary LANs, fetching data
from remote memory is faster than from local disk, and fetching from
local memory is faster than from remote memory. In addition to a fast
network, xFS also requires that the communicating machines trust each
other's kernels.
xFS's key contributions are:
- scalable, distributed metadata and cache consistency management,
allowing for dynamic post-failure reconfiguration
- scalable subsetting of storage servers into groups
- scalable log cleaning (since xFS is log based)
Metadata & Data
Control of metadata and cache consistency is split among several
managers. Clients find managers through the manager map,
which is a table indexed by a subset of the bits in a file's
index number, akin to a page table. The map can be dynamically
reconfigured as managers enter/leave the system or in order to balance
load. Since the map is small and changes infrequently, it is
distributed globally to all managers and clients for availability
reasons.
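As a rough sketch of the lookup, one can think of the manager map as a
small array indexed by a few low-order bits of the index number, much
like a page table; the bit width, names, and list representation below
are illustrative assumptions, not details from the paper.

    MAP_BITS = 8                                  # hypothetical: 2^8 map entries
    manager_map = [None] * (1 << MAP_BITS)        # entry -> id of the responsible manager

    def manager_for(index_number):
        # The low-order bits of the index number select a map entry, much as
        # a virtual page number indexes a page table.
        entry = index_number & ((1 << MAP_BITS) - 1)
        return manager_map[entry]

    def reassign(entry, new_manager):
        # Reconfiguration (a manager joins or fails, or load is rebalanced)
        # just rewrites entries and redistributes the small table to all
        # clients and managers.
        manager_map[entry] = new_manager
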
A manager controls cache consistency state and disk location metadata.
- For each block, the manager keeps either a list of the clients
caching the block or the identity of the client holding exclusive
write ownership.
- Disk metadata consists of:
- Directories map file names to index numbers. Just
like in FFS, directories are regular files.
- The imap turns a file's index number into the log address of
the corresponding index node (inode), which in turn contains
the disk addresses of the file's data blocks (or indirect blocks, if
the file is large). Each manager holds the imap for its portion of the
index number space.
- Each stripe group consists of a separate subset of the storage
servers, and clients write each segment across a single stripe group
rather than across all of the system's storage servers. The stripe
group map maps a disk log address to an entry containing the stripe
group's ID, the members of the group, and the group's status
(current or obsolete, depending on which servers have entered or left
the system). The stripe group map is replicated globally to all
clients (see the sketch after this list).
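Taken together, these structures form a chain from a file name to the
storage servers holding a block. A minimal illustration with simplified
and assumed types (the real on-disk structures and the stripe group
map's keying are more involved; everything here is simplified):

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Inode:
        block_addrs: List[int]            # log addresses of the file's data blocks
                                          # (indirect blocks omitted for brevity)

    @dataclass
    class StripeGroup:
        group_id: int
        members: List[str]                # storage servers forming this group
        current: bool                     # False once the membership is obsolete

    directory: Dict[str, int] = {}        # file name -> index number (itself a regular xFS file)
    imap: Dict[int, int] = {}             # index number -> log address of the file's inode
    log: Dict[int, Inode] = {}            # log address -> inode (data blocks elided)
    group_of: Dict[int, StripeGroup] = {} # log address -> stripe group storing that part of the log

    def locate_block(name, block_no):
        # directory -> index number -> imap -> inode -> block log address
        # -> stripe group whose storage servers hold that segment
        index_number = directory[name]
        inode = log[imap[index_number]]
        block_addr = inode.block_addrs[block_no]
        return group_of[block_addr], block_addr
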
System Operation
Reads and writes operate as expected, with clients buffering writes
in local memory until they are committed to a stripe group. A noteworthy
decision is to cache index nodes only at managers, not at clients. While
caching index nodes at clients would let them read directly from storage
servers, doing so has three disadvantages (the resulting manager-routed
read path is sketched after this list):
- since cooperative caching of blocks would not be used, it could
cause more disk reads at the storage servers than necessary
- once a block was read from a storage server, the block's
manager would still need to be told that the client is caching the
block (for future invalidation purposes)
- it would increase complexity by requiring an additional cache
consistency protocol for index nodes; as xFS stands, only the manager
responsible for a given index number ever handles the corresponding
index node
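A minimal sketch of the resulting read path, in which every local miss
is routed through the manager so that cooperative caching works and the
manager learns who caches each block; the classes and method names here
are assumptions, not xFS's actual interfaces:

    class Manager:
        def __init__(self, storage):
            self.cachers = {}          # block id -> set of clients caching it
            self.storage = storage     # block id -> data; stands in for the storage servers

        def read(self, block_id, requester):
            # Prefer a copy already in some other client's memory (cooperative
            # caching); otherwise fall back to the storage servers.
            data = None
            for client in self.cachers.get(block_id, set()):
                if client is not requester:
                    data = client.cache[block_id]
                    break
            if data is None:
                data = self.storage[block_id]
            # Remember the new reader so its copy can be invalidated later.
            self.cachers.setdefault(block_id, set()).add(requester)
            return data

    class Client:
        def __init__(self, manager):
            self.manager = manager
            self.cache = {}            # block id -> data, held in local memory

        def read(self, block_id):
            if block_id in self.cache:
                return self.cache[block_id]            # local hit, no messages
            data = self.manager.read(block_id, self)   # miss: go through the manager
            self.cache[block_id] = data
            return data

Routing misses through the manager is what lets a second client's read be
served from the first client's memory instead of from disk, while keeping
the manager's per-block list of cachers accurate for later invalidations.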
Cache consistency is very similar to the one in Sprite and AFS, except
that xFS manages consistency on a per-block rather than per-file basis.
xFS uses a policy called first writer: when a client creates a
file, xFS chooses an index number that assigns the file's management
to a manager colocated with the client. In traces run by the authors,
they found that, of the 55% of writes that required contacting a
manager, the manager was colocated with the writer 90% of the
time. When a manager
had to invalidate stale cached blocks, the cache being invalidated was
local 1/3 of the time. Finally, when clients flushed data to disk and
informed the manager of the data's new storage location (necessary
because xFS is log structured, so flushed data lands at a new place in
the log), they performed this notification locally 90% of the time. An
interesting observation is that write sharing is rare - 96% of all
block overwrites or deletes are by the block's previous writer.
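A sketch of what first-writer index number assignment could look like,
reusing the page-table-style manager map idea from above; the encoding,
the number of map entries, and the number of managers are all
assumptions made for illustration:

    MAP_BITS = 4                                             # hypothetical: 16 map entries
    manager_map = {e: e % 4 for e in range(1 << MAP_BITS)}   # entry -> manager id (4 managers, made up)
    next_seq = {}                                            # map entry -> next sequence number

    def create_index_number(local_manager_id):
        # Pick an index number whose low-order bits select a map entry owned
        # by the manager on the creating client's machine; the high-order
        # bits just make the number unique within that entry.
        for entry, mgr in manager_map.items():
            if mgr == local_manager_id:
                seq = next_seq.get(entry, 0)
                next_seq[entry] = seq + 1
                return (seq << MAP_BITS) | entry
        raise ValueError("no manager-map entry for this manager")

Because the chosen index number resolves to the colocated manager, most
of the subsequent consistency and location traffic for that file stays
on the writer's own machine, which is what the trace numbers above show.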
Segment cleaning is done similarly to LFS and Zebra, but the log
cleaner is distributed to avoid turning it into a bottleneck. The
responsibility for maintaining each segment's utilization status lies
with the client that wrote the segment, in order to achieve
bookkeeping parallelism and good locality. Clients keep segment
utilization info in s-files, which are regular xFS files. There
is one s-file per client per stripe group, covering all the segments
that client has written to that group, and the s-files are grouped in
per-client directories. Cleaning is
initiated by a leader in each stripe group whenever the number of free
segments drops below a low-water mark or when the group is idle. It
then distributes to each cleaner a subset of s-files to work on, hence
parallelizing the process. Concurrency between cleaner updates and
normal xFS writes is controlled optimistically.
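A sketch of the bookkeeping side of this scheme: each client tracks the
liveness of the segments it wrote (its s-files), and the stripe group's
leader hands out subsets of s-files to cleaners once free segments run
low. The threshold, names, and the round-robin assignment below are
assumptions:

    LOW_WATER_MARK = 32                    # hypothetical free-segment threshold

    class SFile:
        # Utilization info for the segments one client wrote to one stripe group.
        def __init__(self):
            self.live_bytes = {}           # segment id -> bytes still live

        def segment_written(self, segment_id, nbytes):
            self.live_bytes[segment_id] = nbytes        # a fresh segment is fully live

        def blocks_died(self, segment_id, nbytes):
            # The client that wrote the segment decrements its utilization
            # when blocks in it are overwritten or deleted.
            self.live_bytes[segment_id] -= nbytes

    def assign_cleaning_work(free_segments, s_files, cleaners):
        # Run by the stripe group's leader: once free segments fall below the
        # low-water mark, hand each cleaner a subset of the s-files so the
        # cleaning itself proceeds in parallel.
        if free_segments >= LOW_WATER_MARK:
            return {}
        work = {c: [] for c in cleaners}
        for i, sf in enumerate(s_files):
            work[cleaners[i % len(cleaners)]].append(sf)
        return work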