07/01/01 Notes from EASY and/or DSN 2001

A. Avizienis, Immune System Paradigm for F/T systems

An agenda:

  1. Clearly define a class of cluster-based services using a quantitative model that captures H/Y, result fidelity (DQ), consistency (TACT, SSTP), cluster granularity, functionality (queries, record updates, reconciliation, etc), and the utility of recovery vs. timeliness of recovery. The model should be able to characterize the workload as well: are some transactions/queries more important than others?  What is the cost of losing a transaction vs. delaying the reply vs. decreasing fidelity?  (A strawman sketch follows this list.)
  2. Indicate how some specific services are captured by model
  3. Indicate the extent to which RR can be employed by model as a fn of the parameters
  4. Get industrial data on what kinds of faults occur in this app class
  5. Show how to use RR to mitigate faults, distribute them more evenly using randomness (ie avoid worst cases), increase uptime, etc
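
A strawman of what the model in item 1 might look like - my own sketch, not from the session; the request classes, weights, and costs below are made up:

    # Strawman utility model for a cluster-based service (illustrative only).
    # Each request class gets weights for the cost of dropping it, of delaying
    # the reply, and of degrading result fidelity (harvest/DQ).
    from dataclasses import dataclass

    @dataclass
    class RequestClass:
        name: str
        drop_cost: float          # cost of losing the transaction entirely
        delay_cost_per_s: float   # cost per second of delayed reply
        fidelity_cost: float      # cost per unit of lost harvest/fidelity (0..1)

    def request_utility(cls, dropped, delay_s, fidelity):
        """Utility (negative cost) of one request outcome."""
        if dropped:
            return -cls.drop_cost
        return -(cls.delay_cost_per_s * delay_s + cls.fidelity_cost * (1.0 - fidelity))

    # Record updates are worth more than browse queries in this made-up workload.
    update = RequestClass("record update", drop_cost=10.0, delay_cost_per_s=0.5, fidelity_cost=8.0)
    query  = RequestClass("browse query", drop_cost=1.0, delay_cost_per_s=0.2, fidelity_cost=0.5)

    print(request_utility(update, dropped=False, delay_s=2.0, fidelity=1.0))  # delayed but complete
    print(request_utility(query, dropped=False, delay_s=0.1, fidelity=0.6))   # fast but partial answer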

 Background reading: D.S. Katz and P.L. Springer, Development of a spaceborne embedded cluster, in IEEE Cluster 2000, Chemnitz, Germany

Breakout session with Brendan Murphy

Gap between availability as measured inside vs outside your boundary of control, eg inside your cluster vs your cluster as seen by the aggregate of users.  This gets tricky if you have something like a geographically distributed secondary: if a network failure (rather than a server failure) diverts users to that secondary, your availability is effectively primary AND secondary (not primary OR secondary).  It's all in who controls the failover.
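
Back-of-envelope version of the AND-vs-OR point (the per-site numbers are made up, and independent failures are assumed):

    # If you control failover, users see the service up whenever EITHER site is
    # up (OR); if failover happens outside your boundary (eg the network quietly
    # diverts users), users effectively need BOTH sites up (AND).
    a_primary, a_secondary = 0.999, 0.995   # made-up per-site availabilities

    a_or  = 1 - (1 - a_primary) * (1 - a_secondary)   # you choose where users go
    a_and = a_primary * a_secondary                   # users end up needing both

    print(f"OR  (failover under your control):  {a_or:.6f}")    # 0.999995
    print(f"AND (failover outside your control): {a_and:.6f}")  # 0.994005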

There are (implicit) SLA's every time you cross a control/administrative boundary.  Make the boundaries well defined and the SLA's well defined (they can be probabilistic); design measurability into the interfaces/SLA's, so we can identify which components aren't living up to their SLA's.  Can also synthesize end-to-end availability from this, in addition to measuring it.
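
A trivial sketch of synthesizing end-to-end availability from per-component SLA numbers - assumes a serial request path and independent failures; the components and figures are invented:

    # Synthesize end-to-end availability from per-component SLA numbers,
    # assuming independent failures and a serial request path.
    from functools import reduce

    sla = {"load balancer": 0.9999, "app cluster": 0.999, "database": 0.9995, "WAN link": 0.998}

    end_to_end = reduce(lambda a, b: a * b, sla.values(), 1.0)
    print(f"end-to-end availability: {end_to_end:.5f}")   # ~0.99640

    # Run the same structure in reverse against measurements: the component whose
    # measured availability falls below its SLA is the one not living up to it.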

Deutsch's fallacies ("transparency considered harmful") (from Greg Papadopoulos's ISCA/DSN keynote):

  1. The network is reliable
  2. latency is 0
  3. b/w is infinite
  4. network is secure
  5. topology doesn't change
  6. there is a network admin
  7. transport is free
  8. network is homogeneous

Corollaries:

  1. network unplannable/unknowable
  2. sys arch driven from the edge
  3. component decomp is the "thermodynamic certainty" of the industry

Greg's  modest predictions:

 

SLA discussion:  What gets measured in SLA's for networks?  (Ask Peter D., Udi)  What could get measured for network apps?  (Ask Inktomi)  Oracle, PeopleSoft, other outsourced ASP's; B2B non-internet, airlines, FedEx, etc.  What if you build your system so that the properties you want to SLA are measured?  eg Inktomi queries are designed to have a hard upper bound on latency.  Financial trading services?  (SLAAPP: SLA's for Applications)
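
If the properties you want to SLA are designed to be measurable (the Inktomi-style hard latency ceiling above), the check itself is easy to script.  A sketch, with made-up bounds:

    # Check an application-level SLA from logged per-query latencies.
    # Made-up terms: 99% of queries under 500 ms, and a 2 s hard ceiling.
    def sla_report(latencies_ms, p99_bound=500.0, hard_bound=2000.0):
        n = len(latencies_ms)
        frac_under = sum(1 for l in latencies_ms if l <= p99_bound) / n
        worst = max(latencies_ms)
        return {"fraction_under_bound": frac_under,
                "meets_99pct_target": frac_under >= 0.99,
                "worst_ms": worst,
                "meets_hard_ceiling": worst <= hard_bound}

    print(sla_report([120, 180, 240, 450, 90, 610, 300]))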

Eric A: introduce randomness/nondeterminism in algorithms systematically to avoid deterministic worst-case behaviors.  What kinds of nondeterminism can be induced?
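
One textbook kind (not Eric's example): randomize any choice that an unlucky or adversarial input could otherwise drive to the worst case, eg the quicksort pivot:

    import random

    # Quicksort with a fixed first-element pivot degrades to O(n^2) on already
    # sorted input; choosing the pivot at random makes that worst case
    # vanishingly unlikely no matter how the input is ordered.
    def quicksort(xs):
        if len(xs) <= 1:
            return xs
        pivot = random.choice(xs)
        less    = [x for x in xs if x < pivot]
        equal   = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return quicksort(less) + equal + quicksort(greater)

    print(quicksort(list(range(20))))   # sorted input is no longer a bad case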

Combine with assertions!  Do it a couple of different ways that differ nondeterministically, and cross-check them.  ("Repeatable nondeterminism" is the phrase of the day.)  More outfits are shipping production s/w with debugging and assertions turned ON (eg Pathfinder, WinXP) instead of optimized/stripped binaries.  (Position paper: "-O considered harmful")
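
A toy version of "do it a couple of different ways and cross check" - the two paths and the tolerance here are my own illustration:

    import random

    def total_in_order(values):
        return sum(sorted(values))

    def total_shuffled(values):
        # The same computation via a nondeterministically ordered path.
        shuffled = list(values)
        random.shuffle(shuffled)
        return sum(shuffled)

    def checked_total(values):
        a, b = total_in_order(values), total_shuffled(values)
        # Assertion stays on in production: divergent answers mean a bug (or a
        # genuine order-sensitivity) got caught instead of silently shipped.
        assert abs(a - b) < 1e-9, f"cross-check failed: {a} vs {b}"
        return a

    print(checked_total([0.1, 0.2, 0.3, 4, 5]))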

Tim Chou, Oracle

- Recovery oriented computing;  real data; wisdom
- jim gray's views
- shift to services model: a different slice thru software

Oracle going to 100% online model.  Recent release of 11i with a bunch of new features (process mfg & devel) - they did 100 patches in 5 days after it went online!  Never could've done that offline.

How does this change the way you write s/w?  Right now, not much.  Soon, config will become simpler.  (But remember Yahoo - managed as independent properties w/narrow API's.)  Also, "fix once, fix everywhere" due to shared code will become more common.  We estimate that Oracle can offer the same functionality at 1/10 the cost with this model.

With online model, the buck stops with you.  Old world: 1/3, 1/6, 1/4, 1/4 (spec, code, unit test, sys test).  Now, when you deploy online, you have many more people immediately testing new features, and no disparity in who's using which version - you fix the bug and it is fixed for everyone.  "The customer is doing the QA, at scale."  Tools are different, techniques are different.  The "new kids" have engineered this from ground zero.

Tandem process pairs faded into obscurity b/c there was no programming model (other than smart people) for exploiting it; eventually transactions became the unit of recovery/protection/etc.  (And xact recovery times got fast enough that nonstop was less relevant.)

Why do systems fail?  1) Humans/operations.  2) Change vs. reliability.  Churn is bad - don't touch it.  This is OK for slow-moving, non-24x7, large systems (daily bank reconciliations, shift mfg., etc).  But in the real world, churn is a business reality.

Ravi Iyer's students instrumented spoolers on a Tandem(?) system to see how often process 2 takes over from process 1.  Even though the code is the same!!  So there must have been environmental factors (Heisenbugs!) - look in IEEE Computer.  Benefit of rebooting vs checkpointing is lower likelihood of restarting from a corrupted state.
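
The reboot-vs-checkpoint point can be made with an almost trivial model (my toy numbers, not Ravi's data):

    # Toy model: before each crash, running state is silently corrupted with
    # probability p. Restoring a checkpoint carries that corruption into the
    # next incarnation; rebooting from the pristine image discards it.
    p = 0.05        # made-up per-crash corruption probability
    crashes = 10

    clean_with_checkpoints = (1 - p) ** crashes   # clean only if never corrupted (~0.60)
    clean_with_reboots = 1.0                      # every recovery starts from the clean image

    print(clean_with_checkpoints, clean_with_reboots)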

Communication apps are nothing like enterprise apps - maybe they're a better target for non-ACID stuff.  (Synchronous and maybe async.)  Fuzzier; may tolerate losses better (integrity); may tolerate inconsistency better; may be latency sensitive.  The iRoom is an example of this category.  Especially for multi-peer apps, the model is more complex than just client/server.  And availability is more important - degraded quality now is arguably better than perfect quality later.

Shuki Bruck (Caltech) - some recent work on reliability?  Most of this, and traditional work on reliability, focuses on "reliable by construction/design", ie Byzantine agreement.  We're also about "reliable during operations".

Oracle doesn't use app-level SLA's!  They have a carte-blanche "quality guarantee" - you get 20% of the monthly fee back if you're unhappy for any reason.  SLA's are too impractical; it's hard to figure out what went wrong, especially for the customer.  But smaller companies would have trouble pulling this off.  SLA's are a bad basis for a business relationship.  Are they good for just monitoring in a quantitative manner?  --> Don't use them for enforcement -- but maybe use them as the basis of monitoring.  Even if you can't enforce them, at least you know what you are monitoring.