Design and Implementation of the Sun Network Filesystem

Sandberg, Goldberg, Kleiman, Walsh, Lyon

Overview

NFS uses RPC and XDR to provide a system-independent protocol for accessing a remote filesystem. It uses a stateless protocol with idempotent operations, which obviates the need for server-side crash recovery: a client simply retransmits a request until it gets an answer. NFS is implemented in the kernel, and is transparent to existing applications; programs need do nothing different to access remote files.
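To make "stateless" concrete, here is roughly what a read request carries (a sketch, not the actual protocol definition): every request names the file, the offset, and the count explicitly, so the server keeps no per-client state and a retransmitted request is harmless.

    /* Sketch only: what makes a request "stateless" is that it carries
     * everything the server needs.  A read names the file and the offset
     * explicitly, so the server has no notion of an open file or a
     * current seek position. */
    struct nfs_fh { unsigned char data[32]; };   /* opaque to the client */

    struct read_args {
        struct nfs_fh file;      /* which file         */
        unsigned int  offset;    /* where to read from */
        unsigned int  count;     /* how many bytes     */
    };

    /* Retrying an identical read after a timeout is harmless (idempotent):
     * the server just performs the same self-contained operation again. */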

The Protocol

Files and directories are identified not by pathnames (which are UNIX-specific; NFS is intended to run on non-UNIX machines as well), nor by inodes, but by file handles. A file handle is an opaque identifier for a file. The server returns a file handle to a client when the client creates a new file or directory, or looks up a file by its component name (not its full pathname); in each case, the directory in which the operation is performed is itself identified by a file handle. The file handle for the root of an exported filesystem (called the root file handle, which has nothing to do with the user called root) is obtained via a completely separate protocol called mount. The mount protocol is the one that does all of the permission checking: it decides whether the client is allowed to access the exported filesystem, and if so, returns the root file handle.

This means that if you can find out the root file handle for some exported disk (often you can do this just by having a user account on a machine which mounts the disk; for example, the file handle for /var/spool/mail on orodruin is 0000000000001e0f00080000100021cf3a45000000080000100021cf3a450000), you can go home to your Linux box and mount the disk, often getting complete read/write access (except sometimes for files owned by root; see below). File handles, which are supposed to be opaque, are in practice composed of a filesystem id, plus the inode number and inode generation number of both the file being referenced and the root directory. (A generation number is a field in the inode which is incremented each time the inode is freed; this prevents a stale file handle from referencing the wrong file after the original file is deleted and a different file gets its inode number.) Because handles have this predictable structure, it is usually possible to mount an entire remote partition from an NFS server, even if only part of the partition was meant to be exported.
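The handles are nominally opaque, but a hypothetical layout along these lines (the real encoding is server-specific) makes it clear why they can be constructed by anyone who can learn or guess the filesystem id and inode numbers:

    /* Hypothetical file-handle layout (real servers differ in the
     * details, and handles are supposed to be opaque); it shows why a
     * handle is guessable from the fsid and inode numbers. */
    struct svc_fh {
        unsigned int fsid;        /* which exported filesystem           */
        unsigned int file_ino;    /* inode number of the file            */
        unsigned int file_gen;    /* generation #: bumped each time the
                                     inode is freed and reused           */
        unsigned int root_ino;    /* inode number of the export root     */
        unsigned int root_gen;
        /* padded out to the fixed 32-byte handle size */
    };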

When a user on the local machine accesses a remote file, the local kernel sends an RPC request which contains the file handle of the file, the numerical uid and gids of the user, the operation (read, rename, etc.), and any necessary arguments. The server uses the uid and gids in the request to decide whether to grant access to the file, using normal Unix permission semantics, with the exception that root on the local machine is given the permissions of nobody on the remote machine (this seems to be the limit of the thought that was put into security). This scheme requires a global uid/gid namespace, and complete trust that a user on one client (with or without root access) won't forge an RPC request claiming to be a different user. The server eventually returns a result (after synchronously updating its disk image), and the local kernel then returns from the system call that initiated the RPC. If the server has crashed, the client will hang, periodically retransmitting the request, until the server comes back up and replies.
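Roughly, the credentials in each request and the one check the server applies to them look like this (a sketch with made-up names, not the real server code):

    /* Sketch of the credentials carried in every request, and the one
     * piece of "security" applied to them.  Names and the nobody uid
     * value are illustrative assumptions. */
    #define NOBODY_UID 65534
    #define NOBODY_GID 65534

    struct auth_unix {
        unsigned int uid;
        unsigned int gid;
        unsigned int gids[16];   /* supplementary groups */
        unsigned int ngids;
    };

    static void squash_root(struct auth_unix *cred)
    {
        /* The server believes whatever uid the client sent, except that
         * a claimed uid of 0 (root) is demoted to nobody. */
        if (cred->uid == 0) {
            cred->uid   = NOBODY_UID;
            cred->gid   = NOBODY_GID;
            cred->ngids = 0;
        }
    }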

Implementation

In order to make NFS transparent to user programs, the concept of an inode was abstracted away (all problems in computer science...) into a VFS and a vnode. A VFS represents a mounted filesystem of any type (local, MSDOS, NFS, etc.). Each vnode represents one file in a VFS, and a set of operations (see the paper) is defined that is assumed to be all-encompassing for any filesystem you would want to mount. The advantage of the VFS/vnode system is that a single machine can have multiple types of filesystems mounted simultaneously, in what appears to be a single directory hierarchy.
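A much-abbreviated, hypothetical version of the interface (the paper lists the full operation set) looks something like this:

    /* Hypothetical, abbreviated vnode operations table: each filesystem
     * type supplies its own function pointers, so the rest of the kernel
     * can operate on any file through the same interface. */
    struct vnode;

    struct vnodeops {
        int (*vn_open)  (struct vnode *vp, int flags);
        int (*vn_close) (struct vnode *vp);
        int (*vn_read)  (struct vnode *vp, void *buf, unsigned len, long off);
        int (*vn_write) (struct vnode *vp, const void *buf, unsigned len, long off);
        int (*vn_lookup)(struct vnode *dvp, const char *name, struct vnode **res);
        int (*vn_remove)(struct vnode *dvp, const char *name);
        /* ... plus getattr, setattr, mkdir, rename, readdir, etc. */
    };

    struct vnode {
        struct vnodeops *v_op;    /* local ufs ops, nfs ops, msdos ops, ... */
        void            *v_data;  /* per-filesystem private data            */
    };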

Initially, the NFS server was also part of the kernel. The in-kernel server needs a process context in which to block, either when no requests are pending or when waiting for the disk, so a user-level process called nfsd was run which made a single system call that never returned; the kernel then used that process's context whenever the server needed to sleep. (A similar trick was used until recently by the bdflush daemon in the Linux kernel; the correct solution is kernel threads, which Linux has since adopted. Nowadays, nfsd is actually a user-level process on most systems, so the point is moot.) On the client side, a block I/O daemon (biod) does the same thing, providing a context to sleep in while the (blocking) RPC read-ahead and write-behind requests are outstanding. Running several of these daemons lets multiple requests be made and handled simultaneously.
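The user-level half of the trick is almost nothing; a sketch (nfssvc() here is a stand-in name for whatever entry point a given system actually provides, and the socket setup is omitted):

    /* Sketch: nfsd is a trivial user process that makes one system call
     * which never returns; the kernel uses the process's context whenever
     * the in-kernel server needs to sleep. */
    #include <unistd.h>

    extern int nfssvc(int sock);          /* hypothetical kernel entry point */

    int main(void)
    {
        int sock = 0;                     /* assume a UDP socket bound to 2049 */

        for (int i = 0; i < 4; i++)       /* several nfsd processes means      */
            if (fork() == 0)              /* several requests serviced at once  */
                break;

        nfssvc(sock);                     /* enters the kernel; never returns  */
        return 1;                         /* reached only on error             */
    }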

Non-implementation

Several (arguably desirable) features were not included in the initial version of NFS:

Root filesystems

You could not have your / directory on a remote machine. The paper argues against mapping / on many clients to the same directory on the server, but there's nothing wrong with mounting client A's / from /usr/exports/A on the server and client B's / from /usr/exports/B. More recently, this has in fact been achieved; a Linux system can be booted with nothing but a network card and a boot PROM.

Filesystem Naming

The name of an exported directory on a server has nothing to do with the name of the directory on which it gets mounted on the client; the client is free to mount it on top of any directory (even another remote one). What happens if clients are also servers, and machine A has a local directory /usr/A, and machine B has a local directory /usr/B, and A mounts B:/usr/B onto /usr/A, and B mounts A:/usr/A onto /usr/B? An access to /usr/A on A would go back and forth from machine to machine, trying to resolve what directory is actually there. NFS avoided this problem by not re-exporting directories. In the above example, after the mounts, /usr/A on A would appear to contain the original contents of /usr/B on B, and vice versa. A couple of NFS daemons today have options to follow mount points when exporting filesystems; it's up to the admins not to create loops like this.

Security

As mentioned above, the only security that was implemented is the mapping of root to nobody. All this does is prevent root on a client machine from accessing files that only root on the server can access; if any other uid can access a file on the server, root on the client can simply switch to that uid and access the file. More recently, small improvements have been made. For example, most versions of mountd have an option (though it's often not used, as it rejects valid mounts from some older systems, like Ultrix) to accept mount requests only from port numbers reserved for root. This prevents an ordinary user on a machine that is allowed to mount a filesystem from discovering that filesystem's root file handle. Also, very recently (I've only seen the Linux nfsd do this), nfsd will reject packets from clients that aren't listed in the exports file; before this, it was assumed that any client that knew a file handle must have been previously verified by mountd, so it was OK.
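The reserved-port check itself is trivial; something along these lines (illustrative only):

    /* Sketch of the reserved-port check mountd can apply: requests from
     * source ports >= 1024 cannot have come from root on the client, so
     * they are refused before any file handle is handed out. */
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #ifndef IPPORT_RESERVED
    #define IPPORT_RESERVED 1024
    #endif

    static int from_privileged_port(const struct sockaddr_in *peer)
    {
        return ntohs(peer->sin_port) < IPPORT_RESERVED;
    }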

File Locking

NFS doesn't have any; that would be state. Sun made a separate daemon called lockd which handles lock requests. I've found it's not very useful in a heterogeneous environment. Similarly, there's a problem with multiple concurrent writers: they tend to overwrite each other's data, as there is no way to atomically append to a file.

Open File Semantics

Under Unix, access to a file is checked when it is opened. After that point, the file can be made unreadable or unwritable, or even deleted entirely, and the already-open descriptor is still supposed to work. Many programs open a temporary file and immediately delete it, then read and write through the open descriptor. To make this work over stateless NFS, a client that has a remote file open never asks the server to remove it while it stays open; it asks to have it renamed to a temporary name instead, and removes that temporary name when the file is finally closed. This only solves the deletion problem, and only for a single client: if one client has a remote file open and another client deletes it, the first client loses access.
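The client-side workaround is usually called a "silly rename"; a sketch of the idea, with made-up names rather than the actual client code:

    /* Sketch: an unlink of a file that is still open locally is turned
     * into a rename to a hidden temporary name; the temporary name is
     * removed for real when the last local close happens. */
    #include <stdio.h>

    struct open_file {
        int  still_open;        /* does a local process still have it open?   */
        char silly_name[32];    /* non-empty => remove this name at close time */
    };

    /* stand-ins for the RPCs the real client would issue */
    static void rpc_rename(const char *from, const char *to) { (void)from; (void)to; }
    static void rpc_remove(const char *name)                  { (void)name; }

    static void client_unlink(struct open_file *f, const char *name)
    {
        if (f->still_open) {
            snprintf(f->silly_name, sizeof f->silly_name, ".nfs%04x", 0x1a2b);
            rpc_rename(name, f->silly_name);   /* hide it instead of removing */
        } else {
            rpc_remove(name);
        }
    }

    static void client_close(struct open_file *f)
    {
        f->still_open = 0;
        if (f->silly_name[0]) {
            rpc_remove(f->silly_name);         /* now really delete it */
            f->silly_name[0] = '\0';
        }
    }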

Consistency

The paper does not mention consistency at all in the context of simultaneous access, but from other papers (and personal experience), we know it was basically ignored. Changes made by one client may not show up on another client for up to 30 seconds.

Concluding Remark

This is a perfect example of why not to ignore security when designing a system. Security is not a "feature" to be added in later; it must be an integral part of your system design.