Notes from E-commerce Dependability Workshop at DSN 2002
Run by Lisa Spainhower
Notes by Armando Fox
Phil Koopman and John deVale, Robust software: no more excuses
Automatically "harden" software by directing compiler to insert
runtime guards, so that exceptional conditions are more likely to be detected
before they lead to an operation that results in going to an unrecoverable state
(ie before mutating important state, stomping on a memory structure,etc)
This is done by validating inputs on the way in to a function (eg check pointer
before doing a memmove or memchr). you provide specific functions that
validate your own data structures, eg check integrity of a data structure,
invariants across sets of values, etc. to reduce performance penalty,
cache the fact that specific values have been validated; whenevre any event
occurs that might change already-validated values, flush the cache. net
5-10% performance penalty with caching in place, until cache thrashing. Observation:
using caching hides cost of added validation, and advances in microarchitecture
could help hide cost of added instructions in the future.
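To make the caching idea concrete, here's a minimal Python sketch of the
mechanism as I understood it (their real system inserts the guards at compile
time in C; the names here are mine):

    _validated = set()   # identities of values already checked this "epoch"

    def check(value, validator):
        """Run validator(value) unless value was already validated."""
        if id(value) in _validated:
            return value                  # cache hit: skip the expensive check
        if not validator(value):
            raise ValueError("runtime guard tripped before state mutation")
        _validated.add(id(value))
        return value

    def flush_validation_cache():
        """Call on any event that might change already-validated values."""
        _validated.clear()

    # Usage: guard a buffer before a destructive operation.
    buf = bytearray(16)
    check(buf, lambda b: len(b) >= 16)    # validated and cached
    check(buf, lambda b: len(b) >= 16)    # cache hit, near-zero cost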
Complementary paper by Chris Fetzer and ?? Xiao, AT&T
Research, on wrapping libraries for robustness. Interested in "crash
failures". It's for wrapping library functions to check argument
validity (same flavor as above), but they can largely automate the generation
of the wrappers, using a combination of examining header files and injecting
faults ("prototype extraction" is their name for the combined process).
About 1/3 of the funcs in libc had prototypes in the man page; for another 1/3 they
found prototypes manually using grep. They also found that 10% of the man pages specified
the wrong include file! All their checks are accurate (no false
positives), but not complete. Note that inaccuracy may break
otherwise-correct app semantics, whereas incompleteness just produces less
robustness. Basically, to avoid having to know the semantics of each possible
argument value for a given function, the test-case generator groups argument
values into disjoint equivalence classes. If all values in a class are
incorrect, that class is rejected by the wrapper; if all values are correct, the
class is accepted; if some of each, also accepted. Multi-argument funcs
are handled by an extension that allows the allowable classes for one arg to
depend on the value of another; they then propagate these constraints thru a
"type hierarchy" graph and use the results to guide fault injection to
find the boundary values that the wrappers will check. They used Ballista
to test a robustified GNU libc. This
was an interesting paper worth a full read and discussion, perhaps in conjunction
w/the previous one and a representative Ballista paper or brief review of how it
works...maybe the Fig folks could lead this discussion at Santa Cruz?
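For flavor, here's what one such wrapper might look like, hand-written as a
Python/ctypes sketch (their system generates C wrappers automatically;
safe_memchr and the specific rejected classes are my invention):

    import ctypes, ctypes.util

    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.memchr.restype = ctypes.c_void_p
    libc.memchr.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.c_size_t]

    def safe_memchr(buf, ch, n):
        """Reject argument classes that fault injection flagged as crashing
        the raw call (bad buffer, out-of-range length), then pass through."""
        if buf is None or n < 0 or n > len(buf):
            raise ValueError("argument falls in a rejected equivalence class")
        return libc.memchr(buf, ch, n)    # address of match, or None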
Eric Siegel (Keynote): managing ebiz systems to increase
availability
Web is not end-to-end but "end-to-N": packet routes are
different each way in a round trip, even for a packet and its ack, due to
the economics/legalities of peering arrangements. At a higher layer, a single web
page view touches many servers (images, doubleclick, etc.), each of which may be
replicated, akamaized, etc., but if they don't all work, the user blames the
operator of the main page. If you're that operator, how can you manage
this when most of the packet routing, replication decisions, etc. are beyond your
control?
He showed a nice time-series-type slide that breaks down all the delays in
downloading the NYTimes home page (which includes Akamai and Doubleclick), and it
shows the effects of the app server in its "fail-stutter" mode: as it
starts to overload, it starts putting requests in the "backlog queue",
so delays for clients get longer and longer; when the backlog queue fills, the app
server stops accepting connections and at that point often just fails (since
some clients have pending connections, they see a broken-image icon as that
server instance goes away).
Other problems include: DNS badness/human errors, a file missing from a
distributed location (updates haven't propagated to all replicas), misbehaving
servers hidden behind a load-distribution device, DB/xact failure, etc. Moral
of the story: the client is the only endpoint in common, so some recovery
will have to move to the client. Is there an easy way to do a "detect failure
of idempotent http reqs and automatically Refresh" as a browser plug-in?
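A rough sketch of what that client-side recovery could look like (entirely
speculative on my part; a real plug-in would also need a way to know which
requests are idempotent):

    import time, urllib.request

    def get_with_retry(url, attempts=3, backoff_s=1.0):
        """Auto-refresh an idempotent GET a few times before giving up."""
        for i in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except OSError:              # URLError and timeouts subclass OSError
                if i == attempts - 1:
                    raise
                time.sleep(backoff_s * (2 ** i))   # back off, then "Refresh"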
One way to cut MTTR: convince the correct team that a particular
failure is their problem; they have specialized knowledge and tools. In
other words, rapid diagnosis is essential to lowering MTTR, especially when
human interaction will be required for recovery.
Set "alarm thresholds" diffedrently and different times.
Eg an increase of 2sec latency may just be due to "noise" in the
morning peak hours, but that same increase in the middle of the night under no
load may be a yellow flag.
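A toy version of time-dependent thresholds (the numbers and the function are
invented, not Keynote's):

    def latency_alarm(latency_s, baseline_s, hour):
        """Flag a latency increase only if it exceeds the expected noise
        for that time of day."""
        noise_s = 3.0 if 7 <= hour < 19 else 0.5   # peak hours are noisier
        return (latency_s - baseline_s) > noise_s

    latency_alarm(4.0, 2.0, hour=9)   # False: +2 sec is noise at morning peak
    latency_alarm(4.0, 2.0, hour=3)   # True: same +2 sec at 3am is a yellow flag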
Test abandonment behavior: if you're having a problem, users will
abandon the site, which affects load testing of "later" pages.
Keynote also measures "satisfaction", based on response time.
"Concurrent users" != presented load! because of
abandonment (or if system is working well, because of task completion), load may
DECREASE. Also, no notion of sessions, so what does "concurrent"
mean? Ex: if 10 users come in, 9 get frustrated and abandon, the remaining
1 completes his session, it's indistinguishable from 10 "concurrent
users" over course of a session. need to look at arrival rates and
distinguish specfic users; concurrent users is more like an output than an
input. So what metric do they
use??
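One way to see why concurrency is an output: by Little's law, concurrent
users = arrival rate * average time on site, so abandonment shrinks session
times and hence concurrency even at constant presented load. A toy calculation
(my numbers, not Keynote's):

    # Little's law: L = lambda * W
    arrival_rate = 10.0    # users/minute presented to the site (the real input)
    w_complete   = 5.0     # minutes on site when a session completes
    w_abandon    = 0.5     # minutes on site when a frustrated user bails

    healthy = arrival_rate * w_complete                            # 50 concurrent
    failing = arrival_rate * (0.1 * w_complete + 0.9 * w_abandon)  # 9.5 concurrent
    # Same presented load, wildly different "concurrent users".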
Some systems may slow down or break entirely after a sustained period of
high load (SW aging?).
In testing dialup systems, location is not important since dialups still
contribute most of the latency! Emulation of dialups doesn't work because
modem compression effects are unpredictable - can't even compare the same website on
different modems sometimes. Throttling routers to 56k does not simulate
modems well. There's a paper on the Keynote website about this.
"Benchmark objcts" for diagnosis:
- a GIF ona white box that compares diferent production servers
- special TCP connections that route only thru one ISP, to test specific
network paths
Use web logs to select representativ "agents" for your benchmark
Compare your competitors' sitest to "aggregated indices" that are
available (ie is this our problem or everyone's problem?)
Is the problem in your piece of the Internet or elsewhere? Is TCP
Connect clearly worse for your site? (This is the "purest
measurement" of communications connectivity.)
Peering problems in the Internet: measured peering delays give a more complete and
correct picture than traceroute, since packet routes may
differ in each direction and the same TCP connection may take different routes at
different times. Usually, they measure a large number of TCP Connect
transactions to the same destination and compare the delays at the various peering
boundaries along the path.
Internet loads are very heavy-
tailed! Don't use arithmetic means to measure anything! Also,
StDev is misleading in measuring heavy-tailed data (since it goes as the square of
the badness) - one outlier may counterbalance 10,000 good "typical"
measurements. Conf intervals don't work well either unless the mode is narrow
and/or the curve is symmetric (as opposed to heavy-tailed). So they use "geometric
deviation" = 10^(variance(log(x_i))). Log-space has skews about 8x
smaller than .... - they have a whitepaper
on this on the Keynote site: "Keynote data accuracy and statistical analysis
for performance trending and service level mgt/SLA". They have
a full-time PhD statistician (Chris
Overton; he's attended some ROC retreats, we should get him to come back
and give a stats talk).
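The formula as I jotted it may be garbled; the usual geometric-deviation
definition exponentiates the standard deviation of the log-values, roughly as
in this sketch (naming is mine):

    import math

    def geometric_stats(samples):
        """Summarize heavy-tailed data in log space, where a single outlier
        can't counterbalance 10,000 typical measurements."""
        logs = [math.log10(x) for x in samples]
        mu = sum(logs) / len(logs)
        var = sum((v - mu) ** 2 for v in logs) / len(logs)
        return 10 ** mu, 10 ** math.sqrt(var)  # geometric mean, geometric deviation

    geometric_stats([0.2, 0.3, 0.25, 0.3, 40.0])
    # The 40 s outlier distorts this far less than an arithmetic mean would be.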
Corollary 1: "design for predictability" will become a major
engineering target! Keynote has an ancillary business where they approach
a potential customer jointly with an insurance underwriter who is prepared to
insure against some failures based on Keynote's statistical models.
Security guys want this but don't know how to measure risk yet; Keynote has
solved this for certain kinds of SLA's.
Corollary 2: SLA's need to capture time to recovery and number of
outages. At a recent ISPCON, ISPs said their worst nightmare was losing the
uplink ("Once a month, users mad; twice a month, a nightmare; three times,
I'm out of business"), so their uplink choices are not based (primarily) on
cost or even high perf. Current SLA's don't capture this.
Carl Hutzler, AOL, Sr. Manager Mailbox Operations
[email protected]
- Today: biggest 'growth' is in Spam and other illegitimate uses (DoS,
virus distribution); about 400M msgs/day handled total
- 100M accts, 400M msgs/day peak, email message database completely
"turned over" every 9 days, 26M recipients/hr (record during a
400M-email day), 100+ TB storage, 2x system growth per year (in storage,
almost in capacity; most elements of growth scale linearly, except
attachments, since they get a lot of spam now), 55,000 TPS for complex
transactions, 3.5B total recipients. (One "recipient" = one AOL user receiving
a particular message, so user accounts = recipients with duplicates
removed.) Even when most AOL members aren't online it doesn't affect
the workload of their email servers, since >80% of the email handled comes
from the outside Internet.
- System arch: mailboxes (headers and metadata) stored in Tandem+SQL; mail
contents in Sybase; attachments in Informix as blobs; embedded images in a
Unix FS (soon to be folded into Informix); legacy email (AOL <=3.0) on a
separate Stratus system that is slowly going away.
- They use DB's for recovery, even from human errors and admin/install
errors. They store the email metadata as structured data in the
mailbox DB; not just static stuff like headers, but dynamic stuff like whom
it has been forwarded to, whether an attachment has been forwarded/shared,
etc. So they can basically recover most of the "history" of
an email message by looking at the tables in the mailbox DB!
Cool. With the # of users they have, the ability to do this has
huge economy of scale, i.e., if someone forwards the same attachment to N other AOL
users, they only store 1 copy of the attachment and annotate N rows in the
mailbox DB (see the sketch after this list).
- Each message averages 1.5-3 recipients...this average does
include Spam effects, but as Eric Siegel pointed out, spammers have become
smarter and no longer send the same message to a bunch of people, instead
sending a large number of small messages with distinct recipient
lists. Would be interesting to compare their Spam thresholding
technology with Yahoo's SpamGuard.
- Availability for front-end/gateway apps: mostly stateless, lots of servers
w/failover. We probably knew this, but this is a published
reference: Carl Hutzler, Challenges of Operating the world's largest
24x7x365 email system, Workshop on Dependability of E-commerce Systems at
DSN 2002, Washington, DC.
- Attachments use RAID, message contents are replicated, all on commodity
hardware (mostly Sun & HP, looking at Linux); so neither of those kinds
of outages knocks out capacity for a single user entirely. Single pt of
failure is the mailbox, hence Tandem. They do partial installs of code
(affecting a subset of users) to prove changes before complete deployment
(similar to Akamai and probably Yahoo).
- In addition to failure monitoring and alarms internally, they also have
"simulated users" who dial in like real users, to do end-to-end
checks of email systems. They have very tight thresholds because above
those thresholds they get very fast queueing buildup and congestion delays
that can cause cascades, so they immediately fail out suspect
components rather than letting them slow down the whole system.
- Human error is a relatively minor source of down-minutes, because they
have very senior staff. Often, a "simple" HW failure turns
out to cause more downtime than it should because the OS/SW doesn't handle
it well. Once in a great while, a deterministic bug will take out both
processes in the Tandem process pair.
- Typical outage affects 1.5% of members for 1-2 hrs; average fewer than
4/year. The 1-2 hr recovery time is kind of the "fixed cost" to
diagnose and get someone on call to fix the failure. Human errors tend
to take longer to fix, but are less visible to end users because of
redundancy elsewhere. Root cause of the typical failure is a HW
failure not handled gracefully by SW.
- Their minimum number of users online is now 1.3M (~4am in the US, but =4pm
in Asia).
- Future needs: faster/more reliable data migration (they can move sections
of the DB now, but it takes many hours, during which some users will have limited
access to email features); better testing of SW (both ext. and their own);
partial upgrade capability; better change mgt techniques; planning for
disaster recovery.
- During the panel session, mentioned that "MTTR is the driving factor for
how painful it is to handle a failure." Hot-pluggable HW is
easy to handle; SW bugs are hard because, if deterministic, they require
(eventually) a full rollout of the patch; HW bugs that don't result in
graceful failures are also bad.
- Measure of an "outage": of the transactions requested against the
Tandem mailbox server via the front-ends, they count how many cannot reach the
Tandem, don't return within the 30-sec timeout, or fail. This indirectly
captures end-user impact because the # of email transactions is proportional to the
# of users. An "outage" is a sustained burst (for more than
about 1 minute) of timeouts or failures (see the sketch after this list).
Another site uses "user x outage-minutes". AOL is close to four nines (99.98?)
w/r/t this measure.
- Churn for features: new stuff every 2-3 weeks. This goes into the
Webby stuff, not the Tandem! The Tandem changes very slowly and takes a
long time to upgrade (there are nearly 100? machines that must be upgraded,
and it takes about a day to do each).
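Two little sketches to make the above concrete (both are my toy
reconstructions, not AOL code). First, the single-copy attachment scheme from
the mailbox-DB bullet: forwarding to N users adds N metadata rows but stores
the blob once.

    attachments  = {}    # attachment_id -> blob (Informix, in their system)
    mailbox_rows = []    # (recipient, message_id, attachment_id) metadata rows

    def deliver(recipients, message_id, attachment_id, blob):
        attachments.setdefault(attachment_id, blob)   # stored at most once
        for r in recipients:
            mailbox_rows.append((r, message_id, attachment_id))

    deliver(["u1", "u2", "u3"], "m42", "a7", b"...big attachment...")
    # len(attachments) == 1, len(mailbox_rows) == 3

Second, a guess at how the outage measure might be computed, treating
"sustained burst" as an unbroken run of failed/timed-out transactions lasting
more than about a minute (my interpretation of the definition above):

    def find_outages(events, burst_s=60):
        """events: time-sorted (timestamp_s, ok) pairs, one per front-end
        transaction against the mailbox server."""
        outages, burst_start = [], None
        for t, ok in events:
            if ok:
                if burst_start is not None and t - burst_start > burst_s:
                    outages.append((burst_start, t))
                burst_start = None            # a success ends the burst
            elif burst_start is None:
                burst_start = t               # first failure of a new burst
        if events and burst_start is not None and events[-1][0] - burst_start > burst_s:
            outages.append((burst_start, events[-1][0]))
        return outages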
Matt Kersner(?), Windows Reliability Team
- He alluded to the facility James talked about, where minidumps and reason
codes are sent to MS after unplanned reboots.
- Q: are you working on optimizing reboot time, given that people do
proactive orderly reboots? A: to some extent.
- Q: how about optimizing unplanned-reboot time, i.e., time spent rebuilding
data structures, checking the FS, etc.? A: a lot of unplanned-reboot time
is spinning up RAIDs, etc., and then application-level recovery.
- Slide claim: "customers can achieve 99.99+% availability on the Windows
platform today" - .NET Server features will make this "even
easier". OK, so the
onus is on ROC to demonstrate there is something fishy about this
measurement...under what conditions is it taken? (I didn't have time
to ask, but another audience member I talked to afterward said it's taken
with conservative/small installations of hardened apps, since a lot of
Windows downtime seems to be app-specific.)
- This would be the right
person to invite to a future ROC retreat...Brendan Murphy may be able to
twist his arm a bit to let someone stay up at MS for a few weeks and look at
sanitized data collected from Windows rebooting, the stuff that James
Hamilton was talking about.
Notes from a conversation with Ravi Iyer and colleagues
I finally got some of Ravi Iyer's time at DSN, along with some other
colleagues (see below), some of their students, and Steve Lumetta. It was
hard to actually get a good chunk of time, but once we did the discussion was
very animated. Here are some high-order bits. (Other than this discussion
and Lisa's workshop, DSN was not impressive.)
"Ravi" = Ravi Iyer, "kishor" = Kishor Trivedi (Duke) who
did a lot of the SW rejuvenation work, "rafi" = Rafi Some, JPL
manager/technologist
I tried to explain the area ROC is looking at (inet services) and why it's
interesting - among other things, there are services where users may be willing
to tolerate temporary lapses in performance or other temporary degradation as
preferable to total unavailability, and sometimes this ability to trade
off can be translated into simpler engineering mechanisms for recovery.
- It was a challenge to explain the salient features of the services we're
looking at (inet services) and what makes them an interesting area. I think
the two communities are really used to looking at VERY different apps: our
world of communication-centric, logically-client-server apps where hard
guarantees are rare is NOT what they normally deal with. We have a lot of
cross-education still to do.
- Rafi made cynical comments about how, if all we want to do is something
that makes *some* improvement for *some* customer who isn't sophisticated
enough to really measure how well they are doing anyway, and thereby get
ourselves continued funding, then this stuff may be a good idea. He eventually
retracted his comment under severe duress from me, or claimed it was taken
the wrong way, though I thought it was pretty unequivocal.
Ravi:
- it's "easy" to get 2 or 3 9's. the "entry level"
techniques you allude to (heartbeats, restart, etc) can get you there.
getting to 4 or 5 9's is much harder and more subtle.
- it first requires measuring how many 9's you have and where the weaknesses
are that can be strengthened to get better. (I agree, and was trying
to explain that how this is measured--performability, availability,
availaility under degraded performance, etc.--is exactly one of the
interesting 'degrees of freedom' of ROC.)
- I heard through the grapevine that during one of the post-speaker Q&A
sessions, Ravi made a comment about how people will keep publishing in fault
injection because it's a good way to get tenure. The grapevine wasn't clear on
whether Ravi intended this comment to be cynical or not - I hope he did. I
didn't attend that session, so I don't know.
Rafi:
- A big problem with this whole "internet services" space is that
they're fuzzy to measure. There's no model of what they do - no computation
model, no model that captures different criticality of data, etc. Why don't
you instead look at financial-type apps or similar - typically those have
very hard bounds on correctness, performance, and therefore dependability,
so it will make your job easier.
- I said I agreed and was trying to develop a stricter model for such
services. Rafi said: if you do succeed in coming up with a credible
model for the other kind of services (where degraded performance, longer
response times, etc. can sometimes be traded for other properties, or can be
acceptable to users on an occasional basis), AND if you can get that model
accepted by the Big Players and have it become a de facto standard, you win
big and may even get recognized in the community for having made a meaningful
contribution...
- ...but it's really hard, and you'll probably spend a lot of time banging
your head against it unless you have a way to concurrently do
prototyping/sanity checking/whatever, instead of just emerging from a cave
after 3 years with a model that nobody wants or buys into; and this is not a
good way to get tenure. (Yikes...that's the second person I've
heard this from.)
- If you're going to proceed anyway, keep in mind that coming up
with a model whose primitives capture a usefully large class of services
is comparable to coming up with a benchmark that covers those
services/primitives.
All:
- Agreed that we don't know enough about each other's prior work, don't go
to enough of the same conferences; still too much parochialism. Ravi/Rafi
suggested we do a "small workshop" around ROC. I told them
that EASY is already scheduled; they looked puzzled until Ravi looked at the CFP
and realized Steve Lumetta was a co-organizer. ("Oh, you mean Steve
Lumetta's dependability workshop!") Ravi and Kishor will make sure at
least some of their students attend EASY and similar stuff. I will ping them
each by email as EASY approaches to remind them to send students.