Peter M. Chen, David E. Lowell, Reliability Hierarchies, HotOS 1999

(Summary by George Candea)

The paper's premise is that the sharp distinction between stable storage as absolutely safe and volatile storage as absolutely unsafe is incorrect. Instead, the authors view data stores as a hierarchy of levels with varying reliability (cache, RAM, disk, tape), in the same way these levels form a hierarchy of varying performance. As one moves down the performance hierarchy, cost/bit improves while performance worsens; as one moves down the reliability hierarchy, reliability improves while overhead worsens. Here, overhead could be in terms of performance, cost, power required to read/write, etc. In either type of hierarchy, some applications may wish (and should be allowed) to bypass certain levels.
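For concreteness, the two orderings coincide for the canonical levels (the table below is my illustration of the premise, not reproduced from the paper):

    level   performance   reliability
    cache   fastest       lost on crash or power loss
    RAM     fast          lost on crash or power loss
    disk    slow          survives crashes; lost on media failure
    tape    slowest       survives disk failures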

Common policies for transferring data down the hierarchy are based either on elapsed time or on the amount of accumulated data. In the time-based case, for example, data may be transferred from the buffer cache to disk at least every 15 seconds, and from disk to tape at least every day. This suggests the notion of how old data may become before it leaves level L (call it delay_L).
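As a concrete illustration of a time-based policy (my sketch, not code from the paper; the level names and delay values are hypothetical), each level tracks when data arrived and pushes anything older than delay_L down one level:

    import time

    DELAYS = {                     # hypothetical delay_L values, in seconds
        "buffer_cache": 15,        # cache -> disk at least every 15 s
        "disk": 24 * 60 * 60,      # disk -> tape at least every day
    }

    class Level:
        def __init__(self, name, lower=None):
            self.name = name
            self.lower = lower     # next level down the hierarchy, if any
            self.data = {}         # item -> arrival timestamp

        def write(self, item):
            self.data[item] = time.time()

        def flush_old(self):
            # Enforce delay_L: move items older than the delay down one level.
            delay = DELAYS.get(self.name)
            if delay is None or self.lower is None:
                return             # bottom level (e.g., tape): nothing to do
            now = time.time()
            for item, arrived in list(self.data.items()):
                if now - arrived >= delay:
                    self.lower.write(item)
                    del self.data[item]

    # A background task would invoke flush_old() on each level periodically:
    tape = Level("tape")
    disk = Level("disk", lower=tape)
    cache = Level("buffer_cache", lower=disk)
    cache.write("block-42")
    cache.flush_old()              # moves "block-42" once it is >= 15 s old
    disk.flush_old()

An amount-based policy would instead trigger the transfer when a level's accumulated data exceeds a size threshold.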

A transfer policy is clearly a tradeoff between overhead and reliability. To evaluate this tradeoff, the authors define metrics for quantifying reliability. A simple one is MTTDL (mean time to data loss). As defined in the paper, however, MTTDL is relevant only when a data loss is equally bad regardless of how much data was lost; e.g., it does not distinguish between a disk failure that wipes out an entire file system and a bit error that corrupts a single memory word.
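For intuition (my formulation, under assumptions the paper need not share): if faults that destroy data at level L arrive as independent Poisson processes with rates \lambda_L, then the expected time until the first loss of any data, however small, is

    \mathrm{MTTDL} = \frac{1}{\sum_L \lambda_L}

This counts a lost memory word and a lost file system as the same event, which is exactly the shortcoming noted above.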

To compensate for this, they define the data loss rate as the fraction of data lost over time due to failures in the storage hierarchy. This measure, however, does not account for correlated failures (e.g., a power outage that destroys the contents of cache and RAM simultaneously). This is rectified by summing over fault types rather than reliability levels, so that each fault is charged the total data it destroys across all levels.
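A sketch of what such a metric might look like (my notation, not necessarily the paper's): if fault type f occurs at rate \lambda_f and destroys an expected \mathrm{bytes}_f of data per occurrence, then

    \text{data loss rate} = \sum_{f \in \text{fault types}} \lambda_f \cdot \mathrm{bytes}_f

Because the sum ranges over fault types, a correlated fault that damages several levels at once contributes a single term carrying its full damage.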

Some cute tidbits worth noting: