Measuring real-world data availability

Larry Lancaster and Alan Rowe, Network Appliance

Summary by AF

One-line summary: Using diagnostic emails automatically sent to Netapp from its server boxes, the authors measure the discretionary availability (or unavailability) of Netapp fileservers serving NFS.  Discretionary unavailability deliberately excludes operator error, scheduled downtime, etc. - it captures only downtime caused by failure of the system under normal operating conditions.

Overview/Main Points


A real-life measurement of availability (though perhaps flawed, see below) based on collecting lots of data from systems in actual use in the field.


I think the statistical methods here are questionable.  On the other hand, the authors at least computed a metric, explained how they did it, and their approach is reproducible.  There's a legitimate question whether computing "average" or "typical" availability by taking a mean and a standard error is really defensible.  They did break up the observed sites into categories (clustered or not, very well managed with high customer support vs. low visibility to customer support, etc.) and give the per-category availabilities.
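For concreteness, the per-category statistic in question can be sketched as follows. This is a hypothetical reconstruction, not the paper's code: the category names and availability figures are invented for illustration.

```python
# Sketch of a per-category "average availability": sample mean and
# standard error of the mean over the sites in each category.
# All numbers below are made up for illustration.
import math

def mean_and_stderr(values):
    """Sample mean and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

# Per-site availability fractions, grouped the way the paper groups sites.
sites_by_category = {
    "clustered": [0.9999, 0.9997, 0.9998],
    "unclustered": [0.998, 0.995, 0.999, 0.997],
}

for category, avails in sites_by_category.items():
    m, se = mean_and_stderr(avails)
    print(f"{category}: mean={m:.5f} stderr={se:.5f}")
```

The objection above is that availability fractions across heterogeneous sites may not be well summarized by a mean plus a standard error at all.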

Also, the unavailability A is equated with the probability that data will be unavailable during a particular sliver of time (p. 98, subsection "Discretionary data availability").  I don't think this holds unless the unavailability periods are uniformly distributed over the lifetime of the system.
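A minimal sketch of the distinction being raised, with invented outage data: the time-fraction metric equals the probability of being down at a *uniformly random* instant no matter where the outages fall, but at a *fixed* instant the probability depends entirely on whether outages cluster there.

```python
# Illustrating unavailability-as-time-fraction vs. probability at an instant.
# The lifetime and outage intervals are invented for illustration.
import random

lifetime = 10_000.0  # hours the system has been deployed
# (start, end) of each outage; here all outages are bunched early in life.
outages = [(100.0, 110.0), (120.0, 135.0), (200.0, 225.0)]

down_time = sum(end - start for start, end in outages)
unavailability = down_time / lifetime  # the time-fraction metric

def down_at(t):
    return any(start <= t < end for start, end in outages)

# At an instant chosen uniformly over the lifetime, P(down) matches the
# time fraction regardless of where the outages fall...
random.seed(0)
trials = 100_000
hits = sum(down_at(random.uniform(0, lifetime)) for _ in range(trials))
print(unavailability, hits / trials)

# ...but at a fixed instant the answer depends on the outage distribution:
# t=5000 falls long after this system's burn-in failures, so it is never down.
print(down_at(5000.0))
```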

Not including operator error may make sense in this domain, since Netapp servers are 'appliances' that theoretically require zero administration... but is this really true?

Availability appears to be defined as 0 or 1, not smoothly degrading.  I assume this is because the server boxes themselves don't degrade smoothly.

