Measuring real-world data availability
Larry Lancaster and Alan Rowe, Network Appliance
Summary by AF
One-line summary: Using diagnostic emails automatically sent to NetApp
from its server boxes, the authors measure the discretionary availability (or
unavailability) of NetApp fileservers serving NFS.  Discretionary
unavailability deliberately excludes operator error, scheduled downtime,
etc.; it captures only downtime caused by failure of the system under normal
operating conditions.
Overview/Main Points
- Each diagnostic email sent by the appliance contains the cumulative minutes
    of unplanned downtime since NFS was first licensed.  Pairs of emails
    can be subtracted to get deltas and form a time series (an event-based
    model) of unplanned downtime due to appliance failures (see the first
    sketch after this list).  Sometimes additional log information in the
    email contains clues to the source of the outage; other times it does not.
- In principle, an unplanned failure is serious because it indicates a
    "non-simple" failure.  (Disks are RAIDed, memory is ECC, etc.,
    so single, simple failures shouldn't cause a system failure.)
- Heuristics are used to categorize failures as: disk hardware,
    non-disk-subsystem hardware, software panic (including the operator
    deliberately creating a diagnostic corefile), definite power failure
    (multiple units at the same site lose both power supplies simultaneously),
    likely power failure (a single unit loses both power supplies
    simultaneously), or operator failure (e.g., if the operator doesn't do a
    clean shutdown before swapping hardware, the log will show a dual
    power-supply "failure" followed by detection of the swapped hardware after
    reboot).  See the second sketch after this list.
- From the raw data: Software panic seems to account for 5-10% of total
    unplanned downtime (measured in minutes, not in number of failures);
    operator error for almost none; power failures for the majority (65% and up
    across multiple sites).  About 30% of discretionary downtime is spent
    waiting for replacement parts (could be reduced by keeping spare parts
    onsite).
- Availability is computed from the sum of the unavailability deltas for all
    sampled systems divided by the sum of the total lifetimes of all sampled
    systems (availability is one minus that ratio).  There are sources of
    random error in the downtime deltas, including the granularity of the
    delta reporting in the appliance OS (about 10 s), the manual process of
    factoring out time spent waiting for parts to arrive, etc.  To address
    these, the authors compute the standard error of their availability
    measurement using the bootstrap procedure: create j new sample sets, each
    of size N, by sampling N points with replacement from the original sample
    set, then compute the classical standard error of the estimate over the j
    resampled sets (see the third sketch after this list).  This rests on the
    assumption that the availtime deltas and totaltime deltas are IID with
    some unknown distribution.  When the SE is added, the upper figure for
    availability at one of the best-managed sites approached 5 9's (99.995%).
- Of total downtime, about 42+-3% was product related; 37+-5% was customer
    deferred ("I'll fix it tomorrow"); 21+-3% was waiting for parts.
- Discretionary availability with SE ranged from 99.975+-0.002% for
    standalone systems to 99.987+-0.008% for "select clustered"
    systems.
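To make the delta computation concrete, here is a minimal sketch (my own, not
the authors' code) that turns a sequence of diagnostic-email reports, each
carrying a cumulative unplanned-downtime counter, into per-interval downtime
deltas; the field names "timestamp" and "cum_downtime_min" are hypothetical.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Report:
        """One diagnostic email; field names are hypothetical."""
        timestamp: datetime        # when the email was sent
        cum_downtime_min: float    # cumulative unplanned downtime since NFS was licensed

    def downtime_deltas(reports):
        """Subtract consecutive cumulative counters to get an event-based time
        series of (interval_end, interval_minutes, downtime_delta_minutes)."""
        reports = sorted(reports, key=lambda r: r.timestamp)
        series = []
        for prev, cur in zip(reports, reports[1:]):
            interval_min = (cur.timestamp - prev.timestamp).total_seconds() / 60.0
            delta = cur.cum_downtime_min - prev.cum_downtime_min
            if delta < 0:
                continue  # counter reset (e.g. a reinstall); skip rather than guess
            series.append((cur.timestamp, interval_min, delta))
        return series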
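The categorization heuristics could be encoded roughly as follows; this is my
own sketch using made-up boolean event flags, not the authors' implementation.

    def categorize(event):
        """event: dict of hypothetical boolean flags extracted from the log
        snippets in a diagnostic email; returns a coarse failure category."""
        if event.get("dual_psu_loss") and event.get("swapped_hw_detected"):
            # an unclean shutdown before a hardware swap shows up as a dual
            # power-supply "failure" followed by new hardware on reboot
            return "operator failure"
        if event.get("site_wide_psu_loss"):
            return "definite power failure"  # multiple units at the site lost both supplies
        if event.get("dual_psu_loss"):
            return "likely power failure"    # a single unit lost both supplies
        if event.get("panic"):
            return "software panic"          # includes deliberate diagnostic corefiles
        if event.get("disk_failure"):
            return "disk hardware"
        return "non-disk-subsystem hardware"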
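A sketch of the availability estimate and its bootstrap standard error as
described above, assuming each sampled system contributes a
(downtime_minutes, lifetime_minutes) pair; the variable names and example
numbers are mine, not the paper's.

    import random

    def availability(samples):
        """samples: list of (downtime_min, lifetime_min) pairs, one per system.
        Availability = 1 - (total downtime / total lifetime)."""
        total_down = sum(d for d, _ in samples)
        total_life = sum(t for _, t in samples)
        return 1.0 - total_down / total_life

    def bootstrap_se(samples, j=1000, rng=random.Random(0)):
        """Resample N systems with replacement j times, recompute availability
        each time, and return the classical standard error over the j estimates."""
        n = len(samples)
        estimates = [availability([rng.choice(samples) for _ in range(n)])
                     for _ in range(j)]
        mean = sum(estimates) / j
        var = sum((a - mean) ** 2 for a in estimates) / (j - 1)
        return var ** 0.5

    # Example with made-up numbers: three systems, (downtime, lifetime) in minutes.
    systems = [(12.0, 525600.0), (0.0, 262800.0), (45.0, 1051200.0)]
    print(f"availability = {availability(systems):.6f} +/- {bootstrap_se(systems):.6f}")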
Relevance
A real-life measurement of availability (though perhaps flawed, see below),
based on collecting lots of data from systems in actual use in the field.
Flaws
I think the statistical methods are questionable here.  On the other
hand, the authors at least computed a metric and explained how they did it, and
their approach is reproducible.  There's a legitimate question whether the
way they computed "average" or "typical" availability by
taking a mean and a standard error is really defensible.  They did break up
the observed sites into categories (clustered or not, very well managed with
high customer support vs. low visibility to customer support, etc.) and give
the per-category availabilities.
Also, the unavailability A is equated with the probability that data
will be unavailable during a particular sliver of time (p.98, subsection
"Discretionary data availability").  I don't think this is true
unless the "unavailability periods" are uniformly distributed over the
lifetime of the system.
Not including operator error may make sense for this domain, since NetApp
servers are 'appliances' that theoretically require zero administration... but
is this really true?
Availability appears to be defined as 0 or 1, not smoothly degrading.
I assume this is because the server boxes don't smoothly degrade.