Measuring real-world data availability
Larry Lancaster and Alan Rowe, Network Appliance
Summary by AF
One-line summary: Using diagnostic emails automatically sent to NetApp
by its server boxes, the authors measure the discretionary availability (or
unavailability) of NetApp fileservers serving NFS. Discretionary
unavailability deliberately excludes operator error, scheduled downtime,
etc.; it captures only downtime caused by failure of the system under
normal operating conditions.
Overview/Main Points
- Each diagnostic email sent by the appliance contains the cumulative minutes
of unplanned downtime since NFS was first licensed. Pairs of emails
can be subtracted to get deltas and thus to form a time series (an event-based
model) of unplanned downtime due to appliance failures (see the
delta-computation sketch after this list). Sometimes additional log info in
the email contains clues to the source of the outage; other times it does not.
- In principle, an unplanned failure is serious because it indicates a
"non-simple" failure. (Disks are RAIDed, memory is ECC, etc., so single,
simple failures shouldn't cause a system failure.)
- Heuristics are used to categorize failures as: disk hardware,
non-disk-subsystem hardware, software panic (including the operator
deliberately creating a diagnostic corefile), definite power failure
(multiple units at the same site all lose both power supplies
simultaneously), likely power failure (a single unit loses both power
supplies simultaneously), or operator failure (e.g. if the operator doesn't
do a clean shutdown before swapping hardware, the log will show a dual
power supply "failure" followed by detection of swapped hardware after
reboot). See the classification sketch after this list.
- From the raw data: Software panic seems to account for 5-10% of total
unplanned downtime (measured in minutes, not in number of failures);
operator error for almost none; power failures for the majority (65% and up
across multiple sites). About 30% of discretionary downtime is spent
waiting for replacement parts (could be reduced by keeping spare parts
onsite).
- Availability is computed from the downtime deltas: the sum of the
unavailability deltas for all sampled systems divided by the sum of the
total lifetimes of all sampled systems gives unavailability, and
availability is one minus that ratio. There are sources of random error in
the downtime deltas, including the granularity of the delta reporting in
the appliance OS (about 10s), the manual process of factoring out time
spent waiting for parts to arrive, etc. To address these, the authors
compute the standard error of their availability measurement using the
bootstrap procedure: create j new sample sets, each of size N, by sampling
N points with replacement from the original sample set, repeat this j
times, and then compute the classical standard error over all j sets (see
the bootstrap sketch after this list). (This rests on the assumption that
the availtime deltas and totaltime deltas are IID with some unknown
distribution.) With the SE applied, the upper figure for availability at
one of the best-managed sites approached five nines (99.995%).
- Of total downtime, about 42±3% was product related; 37±5% was customer
deferred ("I'll fix it tomorrow"); 21±3% was waiting for parts.
- Discretionary availability with SE ranged from 99.975±0.002% for
standalone systems to 99.987±0.008% for "select clustered" systems.
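
To make the delta computation concrete, here is a minimal sketch in Python;
the field names ("timestamp", "cum_downtime_min") are my own invention, since
the paper does not give the email schema.

    # Turn cumulative downtime counters from successive diagnostic emails into
    # per-interval deltas. "emails" is assumed to be a list of dicts sorted by
    # timestamp; the field names are hypothetical.
    def downtime_deltas(emails):
        deltas = []
        for prev, curr in zip(emails, emails[1:]):
            delta = curr["cum_downtime_min"] - prev["cum_downtime_min"]
            if delta < 0:
                continue  # counter reset or out-of-order email; skip this pair
            deltas.append(((prev["timestamp"], curr["timestamp"]), delta))
        return deltas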
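The failure-categorization heuristics could be sketched roughly as follows;
every boolean flag here is a hypothetical stand-in for what the authors
actually infer from the log text in each email.

    # Rough sketch of the categorization heuristics described above; the
    # fields of "event" are hypothetical, not defined by the paper.
    def classify_outage(event):
        if event.get("panic"):
            return "software panic"        # includes operator-forced corefiles
        if event.get("disk_error"):
            return "disk hardware"
        if event.get("dual_psu_loss"):
            if event.get("site_wide"):
                return "definite power failure"  # multiple units lost both supplies at once
            if event.get("hw_swapped_after_reboot"):
                return "operator failure"        # unclean shutdown before a hardware swap
            return "likely power failure"
        return "non-disk-subsystem hardware"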
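And a minimal sketch of the availability estimate and its bootstrap standard
error, assuming each sampled system contributes an IID (downtime, lifetime)
pair:

    import random

    def availability(samples):
        # samples: list of (unavailable_minutes, total_lifetime_minutes) per system
        down = sum(d for d, _ in samples)
        total = sum(t for _, t in samples)
        return 1.0 - down / total

    def bootstrap_se(samples, j=1000):
        # Resample N systems with replacement j times, then take the classical
        # standard error (standard deviation) of the j availability estimates.
        n = len(samples)
        estimates = [availability([random.choice(samples) for _ in range(n)])
                     for _ in range(j)]
        mean = sum(estimates) / j
        var = sum((a - mean) ** 2 for a in estimates) / (j - 1)
        return var ** 0.5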
Relevance
A real-life measurement of availability (though perhaps flawed; see below)
based on collecting lots of data from systems in actual use in the field.
Flaws
I think the statistical methods are questionable here. On the other hand,
the authors at least computed a metric and explained how they did it, and
their approach is reproducible. There's a legitimate question whether the
way they computed "average" or "typical" availability by taking a mean and
a standard error is really defensible. They did break the observed sites up
into categories (clustered or not, very well managed with high customer
support vs. low visibility to customer support, etc.) and give the
per-category availabilities.
Also, the unavailability A is equated with the probability that data
will be unavailable during a particular sliver of time (p.98, subsection
"Discretionary data availability"). I don't think this is true
unless the "unavailability periods" are uniformly distributed over the
lifetime of the system.
Not including operator error may make sense for this domain, since NetApp
servers are 'appliances' that theoretically require zero administration...
but is this really true?
Availability appears to be defined as 0 or 1, not smoothly degrading. I
assume this is because the server boxes themselves don't degrade smoothly.