Notes from E-commerce Dependability Workshop at DSN 2002

Run by Lisa Spainhower

Notes by Armando Fox

Phil Koopman and John DeVale, Robust Software: No More Excuses

Automatically "harden" software by directing the compiler to insert runtime guards, so that exceptional conditions are more likely to be detected before they lead to an operation that puts the system in an unrecoverable state (i.e., before mutating important state, stomping on a memory structure, etc.).  This is done by validating inputs on the way into a function (e.g., check a pointer before doing a memmove or memchr).  You provide specific functions that validate your own data structures, e.g., check the integrity of a data structure, invariants across sets of values, etc.  To reduce the performance penalty, cache the fact that specific values have been validated; whenever any event occurs that might change already-validated values, flush the cache.  Net 5-10% performance penalty with caching in place, until the cache starts thrashing.  Observation: caching hides the cost of the added validation, and advances in microarchitecture could help hide the cost of the added instructions in the future.
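
To make the mechanism concrete, here is a minimal sketch in C of the validate-and-cache idea (my own reconstruction; the names and the trivial validator are hypothetical, not the paper's implementation):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define CACHE_SLOTS 64

    static const void *validated_cache[CACHE_SLOTS];   /* recently validated pointers */

    static bool cache_lookup(const void *p) {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (validated_cache[i] == p) return true;
        return false;
    }

    static void cache_insert(const void *p) {
        static int next = 0;
        validated_cache[next] = p;
        next = (next + 1) % CACHE_SLOTS;                /* simple round-robin eviction */
    }

    /* Call whenever an event might change already-validated values. */
    void cache_flush(void) {
        memset(validated_cache, 0, sizeof validated_cache);
    }

    /* Stand-in for an application-supplied validator (check data-structure
     * integrity, invariants across sets of values, etc.). */
    static bool buffer_is_valid(const void *p, size_t len) {
        (void)len;
        return p != NULL;
    }

    /* Hardened wrapper: detect bad arguments before the copy can corrupt state. */
    int hardened_memmove(void *dst, const void *src, size_t len) {
        if (dst == NULL || src == NULL) return -1;
        if (!cache_lookup(dst)) {
            if (!buffer_is_valid(dst, len)) return -1;
            cache_insert(dst);
        }
        if (!cache_lookup(src)) {
            if (!buffer_is_valid(src, len)) return -1;
            cache_insert(src);
        }
        memmove(dst, src, len);
        return 0;
    }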

Complementary paper by Chris Fetzer and ?? Xiao, AT&T Research, on wrapping libraries for robustness; they are interested in "crash failures".  The idea is to wrap library functions to check argument validity (same flavor as above), but they can largely automate the generation of the wrappers, using a combination of examining header files and injecting faults ("prototype extraction" is their name for the combined process).  About 1/3 of the functions in libc had prototypes in the man pages; for another 1/3 they found prototypes manually using grep.  They also found that 10% of the man pages specified the wrong include file!  All their checks are accurate (no false positives), but not complete.  Note that inaccuracy may break otherwise-correct app semantics, whereas incompleteness just produces less robustness.  Basically, to avoid having to know the semantics of each possible argument value for a given function, the test case generator groups argument values into disjoint equivalence classes.  If all values in a class are incorrect, that class is rejected by the wrapper; if all values are correct, the class is accepted; if some of each, also accepted.  Multi-argument functions are handled by an extension that allows the allowable classes for one argument to depend on the value of another; they then propagate these constraints through a "type hierarchy" graph and use the results to guide fault injection to find the boundary values that the wrappers will check.  They used Ballista to test a robustified GNU libc.  This was an interesting paper worth a full read and discussion, perhaps in conjunction with the previous one and a representative Ballista paper or a brief review of how it works... maybe the Fig folks could lead this discussion at Santa Cruz?
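
For flavor, a hand-written sketch of the kind of wrapper their tool generates (my reconstruction, not their code): reject argument values that fall into equivalence classes found to cause crash failures, e.g. NULL or unreadable pointers.  The readability probe below uses the write-to-/dev/null trick, which reports EFAULT instead of crashing:

    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>

    static int devnull_fd = -1;

    /* Probe whether the first byte of p is readable without risking a SIGSEGV:
     * write() returns -1 with errno == EFAULT for an unreadable buffer.  If the
     * probe fails for some other reason, err on the side of calling through. */
    static int is_readable(const void *p) {
        if (p == NULL) return 0;
        if (devnull_fd < 0) devnull_fd = open("/dev/null", O_WRONLY);
        errno = 0;
        return write(devnull_fd, p, 1) == 1 || errno != EFAULT;
    }

    /* Wrapped memchr: turn a would-be crash failure into an error return when
     * the pointer argument falls into a rejected equivalence class. */
    void *robust_memchr(const void *s, int c, size_t n) {
        if (n > 0 && !is_readable(s))
            return NULL;
        return memchr(s, c, n);
    }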

 Eric Siegel (Keynote): managing ebiz systems to increase availability

The Web is not end-to-end but "end-to-N": packet routes are different in each direction of a round trip, even for a packet and its ack, due to the economics and legalities of peering arrangements.  At a higher layer, a single web page view touches many servers (images, DoubleClick, etc.), each of which may be replicated, Akamaized, etc., but if they don't all work, the user blames the operator of the main page.  If you're that operator, how can you manage this when most of the packet routing, replication decisions, etc. are beyond your control?

He showed a nice time-series-type slide that breaks down all the delays in downloading the NYTimes home page (which includes Akamai and DoubleClick content), and it shows the effects of the app server in its "fail-stutter" mode: as it starts to overload, it starts putting requests in the "backlog queue", so delays for clients get longer and longer; when the backlog queue fills, the app server stops accepting connections and at that point often just fails (since some clients have pending connections, they see a broken-image icon as that server instance goes away).

Other problems include: DNS badness/human errors, a missing file at a distributed location (updates haven't propagated to all replicas), misbehaving servers hidden behind a load-distribution device, DB/transaction failures, etc.  Moral of the story: the client is the only endpoint in common, so some recovery will have to move to the client.  Is there an easy way to do "detect failure of idempotent HTTP requests and automatically Refresh" as a browser plug-in?

One way to cut MTTR: convince the correct team that a particular failure is their problem; they have the specialized knowledge and tools.  In other words, rapid diagnosis is essential to lowering MTTR, especially when human interaction will be required for recovery.

Set "alarm thresholds" differently at different times.  E.g., an increase of 2 seconds in latency may just be due to "noise" during the morning peak hours, but that same increase in the middle of the night under no load may be a yellow flag.
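
A toy illustration in C (mine, not Keynote's) of what a time-dependent alarm threshold might look like:

    #include <stdbool.h>

    /* Hypothetical policy: tolerate a larger latency increase during peak hours
     * (when "noise" is expected) than in the middle of the night. */
    static double warn_increase_seconds(int hour_of_day) {
        return (hour_of_day >= 7 && hour_of_day <= 19) ? 3.0 : 0.5;
    }

    bool should_raise_yellow_flag(double baseline_s, double observed_s, int hour_of_day) {
        return observed_s - baseline_s > warn_increase_seconds(hour_of_day);
    }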

Test abandonment behavior: if you're having a problem, users will abandon the site, which affects load testing of "later" pages.  Keynote also measures "satisfaction", based on response time.

"Concurrent users" != presented load!  Because of abandonment (or, if the system is working well, because of task completion), load may DECREASE.  Also, there is no notion of sessions, so what does "concurrent" mean?  Example: if 10 users come in, 9 get frustrated and abandon, and the remaining 1 completes his session, that's indistinguishable from 10 "concurrent users" over the course of a session.  Need to look at arrival rates and distinguish specific users; concurrent users is more like an output than an input.  So what metric do they use??

Some systems may slow down or break entirely after a sustained period of high load (SW aging?).

In testing dialup systems, measurement location is not important, since the dialup links still contribute most of the latency!  Emulation of dialups doesn't work because modem compression effects are unpredictable - you sometimes can't even compare the same website on different modems.  Throttling routers to 56k does not simulate modems well.  There's a paper on the Keynote website about this.

"Benchmark objects" for diagnosis:

Use web logs to select representative "agents" for your benchmark

Compare your competitors' sites to the "aggregated indices" that are available (i.e., is this our problem or everyone's problem?)

Is the problem in your piece of the Internet or elsewhere?  Is TCP Connect clearly worse for your site?  (This is the "purest measurement" of communications connectivity.)

Peering problems in the Internet: this kind of measurement gives a more complete and correct picture of peering delays than traceroute, since packet routes may differ in each direction and the same TCP connection may take different routes at different times.  Usually, measure a large number of TCP Connect transactions to the same destination, and compare the delays at the various peering boundaries along the path.
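
A minimal sketch (mine) of one such measurement: time just the connect() handshake to a fixed destination, repeat it many times, and compare the delay distributions seen across peering boundaries.  The target address below is a placeholder:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Returns the connect() latency in milliseconds, or -1 on failure. */
    double tcp_connect_ms(const char *ip, int port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        int rc = connect(fd, (struct sockaddr *)&addr, sizeof addr);
        gettimeofday(&t1, NULL);
        close(fd);

        if (rc != 0) return -1;
        return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    }

    int main(void) {
        for (int i = 0; i < 10; i++)                          /* in practice: many samples */
            printf("connect: %.1f ms\n", tcp_connect_ms("192.0.2.1", 80));  /* example addr */
        return 0;
    }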

Internet loads are very heavy-tailed!  Don't use arithmetic means to measure anything!  Also, standard deviation is misleading for measuring heavy-tailed data (since it goes as the square of the badness) - one outlier may counterbalance 10,000 good "typical" measurements.  Confidence intervals don't work well either unless the mode is narrow and/or the curve is symmetric (as opposed to heavy-tailed).  So they use "geometric deviation" = 10^(variance(log(x_i))).  Log space has skews about 8x smaller than .... - they have a whitepaper on this on the Keynote site: "Keynote data accuracy and statistical analysis for performance trending and service level mgt/SLA".  They have a full-time PhD statistician (Chris Overton; he's attended some ROC retreats, we should get him to come back and give a stats talk).
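
A worked toy example (mine, not from the Keynote whitepaper) of why the arithmetic statistics mislead: one pathological sample among 1,000 good ones swamps the arithmetic mean and standard deviation, while the log-space (geometric) statistics barely move.  I use the standard deviation of the logs for the spread; the whitepaper's exact definition may differ:

    #include <math.h>
    #include <stdio.h>

    static void stats(const double *x, int n) {
        double sum = 0, sumsq = 0, lsum = 0, lsumsq = 0;
        for (int i = 0; i < n; i++) {
            sum += x[i];             sumsq += x[i] * x[i];
            double l = log10(x[i]);  lsum += l;  lsumsq += l * l;
        }
        double mean  = sum / n;
        double sd    = sqrt(sumsq / n - mean * mean);
        double lmean = lsum / n;                    /* mean of log10(x): geometric mean in log space */
        double lsd   = sqrt(lsumsq / n - lmean * lmean);
        printf("arith mean %.2f  stdev %.2f   geom mean %.2f  geom deviation %.2f\n",
               mean, sd, pow(10, lmean), pow(10, lsd));
    }

    int main(void) {
        double typical[1000], with_outlier[1000];
        for (int i = 0; i < 1000; i++) typical[i] = with_outlier[i] = 1.0;  /* 1 s responses */
        with_outlier[0] = 10000.0;                   /* one pathological 10,000 s sample */
        stats(typical, 1000);
        stats(with_outlier, 1000);
        return 0;
    }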

Corollary 1: "design for predictability" will become a major engineering target!  Keynote has an ancillary business where they approach a potential customer jointly with an insurance underwriter who is prepared to insure against some failures based on Keynote's statistical models.  Security guys want this but don't know how to measure risk yet; Keynote has solved this for certain kinds of SLAs.

Corollary 2: SLAs need to capture time to recovery and the number of outages.  At a recent ISPCON, ISPs said their worst nightmare was losing their uplink ("Once a month, users are mad; twice a month, it's a nightmare; three times, I'm out of business"), so their uplink choices are not based (primarily) on cost or even high performance.  Current SLAs don't capture this.

Carl Hutzler, AOL, Sr. Manager Mailbox Operations

[email protected]

Matt Kersner (?), Windows Reliability Team

Notes from a conversation with Ravi Iyer and colleagues

I finally got some of Ravi Iyer's time at DSN, along with some other colleagues (see below), some of their students, and Steve Lumetta.  It was hard to actually get a good chunk of time, but once we did the discussion was very animated.  Here are some high-order bits.  (Other than this discussion and Lisa's workshop, DSN was not impressive.)

"Ravi" = Ravi Iyer; "Kishor" = Kishor Trivedi (Duke), who did a lot of the SW rejuvenation work; "Rafi" = Rafi Some, JPL manager/technologist

I tried to explain the area ROC is looking at (Internet services) and why these are interesting - among other things, there are services where users may be willing to tolerate temporary lapses in performance or other temporary degradation as preferable to total unavailability, and sometimes this ability to trade off can be translated into simpler engineering mechanisms for recovery.

Ravi:

Rafi:

All: