07/01/01 Notes from EASY and/or DSN 2001

A. Avizienis, Immune System Paradigm for F/T systems

An agenda:

  1. Clearly define a class of cluster-based services using a quantitative model that captures H/Y, result fidelity (DQ), consistency (TACT, SSTP), cluster granularity, functionality (queries, record updates, reconciliation, etc), and the utility of recovery vs. timeliness of recovery. The model should be able to characterize the workload as well: are some transactions/queries more important than others?  What is the cost of losing a transaction vs. delaying the reply vs. decreasing fidelity?  (A strawman sketch follows this list.)
  2. Indicate how some specific services are captured by model
  3. Indicate the extent to which RR can be employed by model as a fn of the parameters
  4. Get industrial data on what kinds of faults occur in this app class
  5. Show how to use RR to mitigate faults, distribute them more evenly using randomness (ie avoid worst cases), increase uptime, etc
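
A strawman of what the model in item 1 might look like - my own sketch, not from the session; the request classes, weights, and costs below are made up:

    # Strawman utility model for a cluster-based service (illustrative only).
    # Each request class gets weights for the cost of dropping it, of delaying
    # the reply, and of degrading result fidelity (harvest/DQ).
    from dataclasses import dataclass

    @dataclass
    class RequestClass:
        name: str
        drop_cost: float          # cost of losing the transaction entirely
        delay_cost_per_s: float   # cost per second of delayed reply
        fidelity_cost: float      # cost per unit of lost harvest/fidelity (0..1)

    def request_utility(cls, dropped, delay_s, fidelity):
        """Utility (negative cost) of one request outcome."""
        if dropped:
            return -cls.drop_cost
        return -(cls.delay_cost_per_s * delay_s + cls.fidelity_cost * (1.0 - fidelity))

    # Record updates are worth more than browse queries in this made-up workload.
    update = RequestClass("record update", drop_cost=10.0, delay_cost_per_s=0.5, fidelity_cost=8.0)
    query  = RequestClass("browse query", drop_cost=1.0, delay_cost_per_s=0.2, fidelity_cost=0.5)

    print(request_utility(update, dropped=False, delay_s=2.0, fidelity=1.0))  # delayed but complete
    print(request_utility(query, dropped=False, delay_s=0.1, fidelity=0.6))   # fast but partial answer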

 Background reading: D.S. Katz and P.L. Springer, Development of a spaceborne embedded cluster, in IEEE Cluster 2000, Chemnitz, Germany

Breakout session with Brendan Murphy

Gap between availability as measured inside vs outside your boundary of control, eg inside your cluster vs your cluster as seen by the aggregate of users.  This gets tricky if you have something like a geographically distributed secondary: if a network failure (rather than a server failure) diverts users to that secondary, your availability is effectively primary AND secondary (not primary OR secondary).  It's all in who controls the failover.
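
Back-of-envelope version of the AND-vs-OR point (the per-site numbers are made up, and independent failures are assumed):

    # If you control failover, users see the service up whenever EITHER site is
    # up (OR); if failover happens outside your boundary (eg the network quietly
    # diverts users), users effectively need BOTH sites up (AND).
    a_primary, a_secondary = 0.999, 0.995   # made-up per-site availabilities

    a_or  = 1 - (1 - a_primary) * (1 - a_secondary)   # you choose where users go
    a_and = a_primary * a_secondary                   # users end up needing both

    print(f"OR  (failover under your control):  {a_or:.6f}")    # 0.999995
    print(f"AND (failover outside your control): {a_and:.6f}")  # 0.994005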

There are (implicit) SLA's every time you cross a control/administrative boundary.  Make the boundaries well defined and the SLA's well defined (they can be probabilistic); design measurability into the interfaces/SLA's, so we can identify which components aren't living up to their SLA's.  Can also synthesize end-to-end availability from this, in addition to measuring it.
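
A trivial sketch of synthesizing end-to-end availability from per-component SLA numbers - assumes a serial request path and independent failures; the components and figures are invented:

    # Synthesize end-to-end availability from per-component SLA numbers,
    # assuming independent failures and a serial request path.
    from functools import reduce

    sla = {"load balancer": 0.9999, "app cluster": 0.999, "database": 0.9995, "WAN link": 0.998}

    end_to_end = reduce(lambda a, b: a * b, sla.values(), 1.0)
    print(f"end-to-end availability: {end_to_end:.5f}")   # ~0.99640

    # Run the same structure in reverse against measurements: the component whose
    # measured availability falls below its SLA is the one not living up to it.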

Deutsch's fallacies ("transparency considered harmful") (from Greg Papadopoulos's ISCA/DSN keynote):

  1. The network is reliable
  2. latency is 0
  3. b/w is infinite
  4. network is secure
  5. topology doesn't change
  6. there is a network admin
  7. transport is free
  8. network is homogeneous

Corollaries:

  1. network unplannable/unknowable
  2. sys arch driven from the edge
  3. component decomp is the "thermodynamic certainty" of the industry

Greg's  modest predictions:

 

SLA discussion:  What gets measured in SLA's for networks?  (Ask Peter D., Udi)  What could get measured for network apps?  (Ask Inktomi)  Oracle, PeopleSoft, other outsourced ASP's; B2B non-internet, airlines, FedEx, etc.  What if you build your system so that the properties you want to SLA are measured?  eg Inktomi queries are designed to have a hard upper bound on latency.  Financial trading services?  (SLAAPP: SLA's for Applications)
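
If the properties you want to SLA are designed to be measurable (the Inktomi-style hard latency ceiling above), the check itself is easy to script.  A sketch, with made-up bounds:

    # Check an application-level SLA from logged per-query latencies.
    # Made-up terms: 99% of queries under 500 ms, and a 2 s hard ceiling.
    def sla_report(latencies_ms, p99_bound=500.0, hard_bound=2000.0):
        n = len(latencies_ms)
        frac_under = sum(1 for l in latencies_ms if l <= p99_bound) / n
        worst = max(latencies_ms)
        return {"fraction_under_bound": frac_under,
                "meets_99pct_target": frac_under >= 0.99,
                "worst_ms": worst,
                "meets_hard_ceiling": worst <= hard_bound}

    print(sla_report([120, 180, 240, 450, 90, 610, 300]))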

Eric A: introduce randomness/nondeterminism in algorithms systematically to avoid deterministic worst-case behaviors.  What kinds of nondeterminism can be induced?
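
One textbook kind (not Eric's example): randomize any choice that an unlucky or adversarial input could otherwise drive to the worst case, eg the quicksort pivot:

    import random

    # Quicksort with a fixed first-element pivot degrades to O(n^2) on already
    # sorted input; choosing the pivot at random makes that worst case
    # vanishingly unlikely no matter how the input is ordered.
    def quicksort(xs):
        if len(xs) <= 1:
            return xs
        pivot = random.choice(xs)
        less    = [x for x in xs if x < pivot]
        equal   = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return quicksort(less) + equal + quicksort(greater)

    print(quicksort(list(range(20))))   # sorted input is no longer a bad case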

Combine with assertions!  Do it a couple of different ways that differ nondeterministically, and cross-check them.  ("Repeatable nondeterminism" is the phrase of the day.)  More outfits are shipping production s/w with debugging and assertions turned ON (eg Pathfinder, WinXP) instead of optimized/stripped binaries.  (Position paper: "-O considered harmful")
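
A toy version of "do it a couple of different ways and cross check" - the two paths and the tolerance here are my own illustration:

    import random

    def total_in_order(values):
        return sum(sorted(values))

    def total_shuffled(values):
        # The same computation via a nondeterministically ordered path.
        shuffled = list(values)
        random.shuffle(shuffled)
        return sum(shuffled)

    def checked_total(values):
        a, b = total_in_order(values), total_shuffled(values)
        # Assertion stays on in production: divergent answers mean a bug (or a
        # genuine order-sensitivity) got caught instead of silently shipped.
        assert abs(a - b) < 1e-9, f"cross-check failed: {a} vs {b}"
        return a

    print(checked_total([0.1, 0.2, 0.3, 4, 5]))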

Tim Chou, Oracle

- Recovery oriented computing;  real data; wisdom
- jim gray's views
- shift to services model: a different slice thru software

Oracle going to 100% online model.  Recent release of 11i with a bunch of new features (process mfg & devel) - they did 100 patches in 5 days after it went online!  Never could've done that offline.

How does this change the way you write s/w?  Right now, not much.  Soon, config will become simpler.  (But remember Yahoo - managed as independent properties w/narrow API's.)  Also, "fix once, fix everywhere" due to shared code will become more common.  We estimate that Oracle can offer the same functionality at 1/10 the cost with this model.

With online model, the buck stops with you.  Old world: 1/3, 1/6, 1/4, 1/4 (spec, code, unit test, sys test).  Now, when you deploy online, you have many more people immediately testing new features, and no disparity in who's using which version - you fix the bug and it is fixed for everyone.  "The customer is doing the QA, at scale."  Tools are different, techniques are different.  The "new kids" have engineered this from ground zero.

Tandem process pairs faded into obscurity b/c there was no programming model (other than smart people) for exploiting it; eventually transactions became the unit of recovery/protection/etc.  (And xact recovery times got fast enough that nonstop was less relevant.)

Why do systems fail?  1) Humans/operations.  2) Change vs. reliability.  Churn is bad - don't touch it.  This is OK for slow-moving, non-24x7, large systems (daily bank reconciliations, shift mfg., etc).  But in the real world, churn is a business reality.

Ravi Iyer's students instrumented spoolers on a Tandem(?) system to see how often process 2 takes over from process 1.  Even though the code is the same!!  So there must have been environmental factors (Heisenbugs!) - look in IEEE Computer.  Benefit of rebooting vs checkpointing is lower likelihood of restarting from a corrupted state.
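
The reboot-vs-checkpoint point can be made with an almost trivial model (my toy numbers, not Ravi's data):

    # Toy model: before each crash, running state is silently corrupted with
    # probability p. Restoring a checkpoint carries that corruption into the
    # next incarnation; rebooting from the pristine image discards it.
    p = 0.05        # made-up per-crash corruption probability
    crashes = 10

    clean_with_checkpoints = (1 - p) ** crashes   # clean only if never corrupted (~0.60)
    clean_with_reboots = 1.0                      # every recovery starts from the clean image

    print(clean_with_checkpoints, clean_with_reboots)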

Communication apps are nothing like enterprise apps - maybe they're a better target for non-ACID stuff.  (Synchronous and maybe async.)  Fuzzier; may tolerate losses better (integrity); may tolerate inconsistency better; may be latency sensitive.  The iRoom is an example of this category.  Especially for multi-peer apps, the model is more complex than just client/server.  And availability is more important - degraded quality now is arguably better than perfect quality later.

Shuki Bruck (Caltech) - some recent work on reliability?  Most of this, and traditional work on reliability, focuses on "reliable by construction/design", ie Byzantine agreement.  We're also about "reliable during operations".

Oracle doesn't use app-level SLA's!  They have a carte-blanche "quality guarantee" - you get 20% of the monthly fee back if you're unhappy for any reason.  SLA's are too impractical; it's hard to figure out what went wrong, especially for the customer.  But smaller companies would have trouble pulling this off.  SLA's are a bad basis for a business relationship.  Are they good for just monitoring in a quantitative manner?  --> Don't use them for enforcement -- but maybe use them as the basis of monitoring.  Even if you can't enforce them, at least you know what you are monitoring.