Measuring real-world data availability
Larry Lancaster and Alan Rowe, Network Appliance
Summary by AF
One-line summary: Using diagnostic emails automatically sent to NetApp
from its server boxes, the authors measure the discretionary availability (or
unavailability) of NetApp fileservers serving NFS.  Discretionary
unavailability deliberately excludes operator error, scheduled downtime,
etc.; it captures only downtime caused by failure of the system under normal
operating conditions.
Overview/Main Points
- Each diagnostic email sent by the appliance contains the cumulative minutes
    of unplanned downtime since NFS was first licensed.  Pairs of emails
    can be subtracted to get deltas and form a time series (an event-based
    model) of unplanned downtime due to appliance failures (see the first
    sketch after this list).  Sometimes additional log information in the
    email contains clues to the source of the outage; other times it does not.
- In principle, an unplanned failure is serious because it indicates a
    "non-simple" failure.  (Disks are RAIDed, memory is ECC, etc.,
    so single, simple failures shouldn't cause a system failure.)
- Heuristics are used to categorize failures as: disk hardware,
    non-disk-subsystem hardware, software panic (including the operator
    deliberately creating a diagnostic corefile), definite power failure
    (multiple units at the same site lose both power supplies simultaneously),
    likely power failure (a single unit loses both power supplies
    simultaneously), or operator failure (e.g., if the operator doesn't do a
    clean shutdown before swapping hardware, the log will show a dual
    power-supply "failure" followed by detection of the swapped hardware after
    reboot).  See the second sketch after this list.
- From the raw data: Software panic seems to account for 5-10% of total
    unplanned downtime (measured in minutes, not in number of failures);
    operator error for almost none; power failures for the majority (65% and up
    across multiple sites).  About 30% of discretionary downtime is spent
    waiting for replacement parts (could be reduced by keeping spare parts
    onsite).
- Availability is computed from the sum of the unavailability deltas for all
    sampled systems divided by the sum of the total lifetimes of all sampled
    systems (availability is one minus that ratio).  There are sources of
    random error in the downtime deltas, including the granularity of the
    delta reporting in the appliance OS (about 10 s), the manual process of
    factoring out time spent waiting for parts to arrive, etc.  To address
    these, the authors compute the standard error of their availability
    measurement using the bootstrap procedure: create j new sample sets, each
    of size N, by sampling N points with replacement from the original sample
    set, then compute the classical standard error of the estimate over the j
    resampled sets (see the third sketch after this list).  This rests on the
    assumption that the availtime deltas and totaltime deltas are IID with
    some unknown distribution.  When the SE is added, the upper figure for
    availability at one of the best-managed sites approached 5 9's (99.995%).
- Of total downtime, about 42+-3% was product related; 37+-5% was customer
    deferred ("I'll fix it tomorrow"); 21+-3% was waiting for parts.
- Discretionary availability with SE ranged from 99.975+-0.002% for
    standalone systems to 99.987+-0.008% for "select clustered"
    systems.
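To make the delta computation concrete, here is a minimal sketch (my own, not
the authors' code) that turns a sequence of diagnostic-email reports, each
carrying a cumulative unplanned-downtime counter, into per-interval downtime
deltas; the field names "timestamp" and "cum_downtime_min" are hypothetical.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Report:
        """One diagnostic email; field names are hypothetical."""
        timestamp: datetime        # when the email was sent
        cum_downtime_min: float    # cumulative unplanned downtime since NFS was licensed

    def downtime_deltas(reports):
        """Subtract consecutive cumulative counters to get an event-based time
        series of (interval_end, interval_minutes, downtime_delta_minutes)."""
        reports = sorted(reports, key=lambda r: r.timestamp)
        series = []
        for prev, cur in zip(reports, reports[1:]):
            interval_min = (cur.timestamp - prev.timestamp).total_seconds() / 60.0
            delta = cur.cum_downtime_min - prev.cum_downtime_min
            if delta < 0:
                continue  # counter reset (e.g. a reinstall); skip rather than guess
            series.append((cur.timestamp, interval_min, delta))
        return series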
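The categorization heuristics could be encoded roughly as follows; this is my
own sketch using made-up boolean event flags, not the authors' implementation.

    def categorize(event):
        """event: dict of hypothetical boolean flags extracted from the log
        snippets in a diagnostic email; returns a coarse failure category."""
        if event.get("dual_psu_loss") and event.get("swapped_hw_detected"):
            # an unclean shutdown before a hardware swap shows up as a dual
            # power-supply "failure" followed by new hardware on reboot
            return "operator failure"
        if event.get("site_wide_psu_loss"):
            return "definite power failure"  # multiple units at the site lost both supplies
        if event.get("dual_psu_loss"):
            return "likely power failure"    # a single unit lost both supplies
        if event.get("panic"):
            return "software panic"          # includes deliberate diagnostic corefiles
        if event.get("disk_failure"):
            return "disk hardware"
        return "non-disk-subsystem hardware"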
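A sketch of the availability estimate and its bootstrap standard error as
described above, assuming each sampled system contributes a
(downtime_minutes, lifetime_minutes) pair; the variable names and example
numbers are mine, not the paper's.

    import random

    def availability(samples):
        """samples: list of (downtime_min, lifetime_min) pairs, one per system.
        Availability = 1 - (total downtime / total lifetime)."""
        total_down = sum(d for d, _ in samples)
        total_life = sum(t for _, t in samples)
        return 1.0 - total_down / total_life

    def bootstrap_se(samples, j=1000, rng=random.Random(0)):
        """Resample N systems with replacement j times, recompute availability
        each time, and return the classical standard error over the j estimates."""
        n = len(samples)
        estimates = [availability([rng.choice(samples) for _ in range(n)])
                     for _ in range(j)]
        mean = sum(estimates) / j
        var = sum((a - mean) ** 2 for a in estimates) / (j - 1)
        return var ** 0.5

    # Example with made-up numbers: three systems, (downtime, lifetime) in minutes.
    systems = [(12.0, 525600.0), (0.0, 262800.0), (45.0, 1051200.0)]
    print(f"availability = {availability(systems):.6f} +/- {bootstrap_se(systems):.6f}")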
Relevance
A real-life measurement of availability (though perhaps flawed, see below),
based on collecting lots of data from systems in actual use in the field.
Flaws
I think the statistical methods are questionable here.  On the other
hand, the authors at least computed a metric and explained how they did it, and
their approach is reproducible.  There's a legitimate question whether the
way they computed "average" or "typical" availability by
taking a mean and a standard error is really defensible.  They did break up
the observed sites into categories (clustered or not, very well managed with
high customer support vs. low visibility to customer support, etc.) and give
the per-category availabilities.
Also, the unavailability A is equated with the probability that data
will be unavailable during a particular sliver of time (p.98, subsection
"Discretionary data availability").  I don't think this is true
unless the "unavailability periods" are uniformly distributed over the
lifetime of the system.
Not including operator error may make sense for this domain, since NetApp
servers are 'appliances' that theoretically require zero administration... but
is this really true?
Availability appears to be defined as 0 or 1, not smoothly degrading.
I assume this is because the server boxes don't smoothly degrade.