Notes from E-commerce Dependability Workshop at DSN 2002
Run by Lisa Spainhower
Notes by Armando Fox
Phil Koopman and John deVale, Robust software: no more excuses
Automatically "harden" software by directing compiler to insert
runtime guards, so that exceptional conditions are more likely to be detected
before they lead to an operation that results in going to an unrecoverable state
(ie before mutating important state, stomping on a memory structure,etc)
This is done by validating inputs on the way in to a function (eg check pointer
before doing a memmove or memchr). you provide specific functions that
validate your own data structures, eg check integrity of a data structure,
invariants across sets of values, etc. to reduce performance penalty,
cache the fact that specific values have been validated; whenevre any event
occurs that might change already-validated values, flush the cache. net
5-10% performance penalty with caching in place, until cache thrashing. Observation:
using caching hides cost of added validation, and advances in microarchitecture
could help hide cost of added instructions in the future.
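To make the caching idea concrete, here's a minimal Python sketch of the
mechanism as I understood it (their real system inserts the guards at compile
time in C; the names here are mine):

    _validated = set()   # identities of values already checked this "epoch"

    def check(value, validator):
        """Run validator(value) unless value was already validated."""
        if id(value) in _validated:
            return value                  # cache hit: skip the expensive check
        if not validator(value):
            raise ValueError("runtime guard tripped before state mutation")
        _validated.add(id(value))
        return value

    def flush_validation_cache():
        """Call on any event that might change already-validated values."""
        _validated.clear()

    # Usage: guard a buffer before a destructive operation.
    buf = bytearray(16)
    check(buf, lambda b: len(b) >= 16)    # validated and cached
    check(buf, lambda b: len(b) >= 16)    # cache hit, near-zero cost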
Complementary paper by Chris Fetzer and ?? Xiao, AT&T
Research, on wrapping libraries for robustness. Interested in "crash
failures". It's for wrapping library functions to check argument
validity (same flavor as above), but they can largely automate the generation
of the wrappers, using a combination of examining header files and injecting
faults ("prototype extraction" is their name for the combined process).
About 1/3 of the funcs in libc had prototypes in the man page; for another 1/3 they
found prototypes manually using grep. They also found that 10% of the man pages specified
the wrong include file! All their checks are accurate (no false
positives), but not complete. Note that inaccuracy may break
otherwise-correct app semantics, whereas incompleteness just produces less
robustness. Basically, to avoid having to know the semantics of each possible
argument value for a given function, the test-case generator groups argument
values into disjoint equivalence classes. If all values in a class are
incorrect, that class is rejected by the wrapper; if all values are correct, the
class is accepted; if some of each, also accepted. Multi-argument funcs
are handled by an extension that allows the allowable classes for one arg to
depend on the value of another; they then propagate these constraints thru a
"type hierarchy" graph and use the results to guide fault injection to
find the boundary values that the wrappers will check. They used Ballista
to test a robustified GNU libc. This
was an interesting paper worth a full read and discussion, perhaps in conjunction
w/the previous one and a representative Ballista paper or brief review of how it
works...maybe the Fig folks could lead this discussion at Santa Cruz?
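For flavor, here's what one such wrapper might look like, hand-written as a
Python/ctypes sketch (their system generates C wrappers automatically;
safe_memchr and the specific rejected classes are my invention):

    import ctypes, ctypes.util

    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.memchr.restype = ctypes.c_void_p
    libc.memchr.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.c_size_t]

    def safe_memchr(buf, ch, n):
        """Reject argument classes that fault injection flagged as crashing
        the raw call (bad buffer, out-of-range length), then pass through."""
        if buf is None or n < 0 or n > len(buf):
            raise ValueError("argument falls in a rejected equivalence class")
        return libc.memchr(buf, ch, n)    # address of match, or None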
Eric Siegel (Keynote): managing ebiz systems to increase
availability
Web is not end-to-end but "end-to-N": packet routes are
different each way in a round trip, even for a packet and its ack, due to
the economics/legalities of peering arrangements. At a higher layer, a single web
page view touches many servers (images, doubleclick, etc.), each of which may be
replicated, akamaized, etc., but if they don't all work, the user blames the
operator of the main page. If you're that operator, how can you manage
this when most of the packet routing, replication decisions, etc. are beyond your
control?
He showed a nice time-series-type slide that breaks down all the delays in
downloading the NYTimes home page (which includes Akamai and Doubleclick), and it
shows the effects of the app server in its "fail-stutter" mode: as it
starts to overload, it starts putting requests in the "backlog queue",
so delays for clients get longer and longer; when the backlog queue fills, the app
server stops accepting connections and at that point often just fails (since
some clients have pending connections, they see a broken-image icon as that
server instance goes away).
Other problems include: DNS badness/human errors, a file missing from a
distributed location (updates haven't propagated to all replicas), misbehaving
servers hidden behind a load-distribution device, DB/xact failure, etc. Moral
of the story: the client is the only endpoint in common, so some recovery
will have to move to the client. Is there an easy way to do a "detect failure
of idempotent http reqs and automatically Refresh" as a browser plug-in?
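A rough sketch of what that client-side recovery could look like (entirely
speculative on my part; a real plug-in would also need a way to know which
requests are idempotent):

    import time, urllib.request

    def get_with_retry(url, attempts=3, backoff_s=1.0):
        """Auto-refresh an idempotent GET a few times before giving up."""
        for i in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except OSError:              # URLError and timeouts subclass OSError
                if i == attempts - 1:
                    raise
                time.sleep(backoff_s * (2 ** i))   # back off, then "Refresh"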
One way to cut MTTR: convince the correct team that a particular
failure is their problem; they have specialized knowledge and tools. In
other words, rapid diagnosis is essential to lowering MTTR, especially when
human interaction will be required for recovery.
Set "alarm thresholds" diffedrently and different times.
Eg an increase of 2sec latency may just be due to "noise" in the
morning peak hours, but that same increase in the middle of the night under no
load may be a yellow flag.
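A toy version of time-dependent thresholds (the numbers and the function are
invented, not Keynote's):

    def latency_alarm(latency_s, baseline_s, hour):
        """Flag a latency increase only if it exceeds the expected noise
        for that time of day."""
        noise_s = 3.0 if 7 <= hour < 19 else 0.5   # peak hours are noisier
        return (latency_s - baseline_s) > noise_s

    latency_alarm(4.0, 2.0, hour=9)   # False: +2 sec is noise at morning peak
    latency_alarm(4.0, 2.0, hour=3)   # True: same +2 sec at 3am is a yellow flag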
Test abandonment behavior: if you're having a problem, users will
abandon the site, which affects load testing of "later" pages.
Keynote also measures "satisfaction", based on response time.
"Concurrent users" != presented load! because of
abandonment (or if system is working well, because of task completion), load may
DECREASE. Also, no notion of sessions, so what does "concurrent"
mean? Ex: if 10 users come in, 9 get frustrated and abandon, the remaining
1 completes his session, it's indistinguishable from 10 "concurrent
users" over course of a session. need to look at arrival rates and
distinguish specfic users; concurrent users is more like an output than an
input. So what metric do they
use??
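One way to see why concurrency is an output: by Little's law, concurrent
users = arrival rate * average time on site, so abandonment shrinks session
times and hence concurrency even at constant presented load. A toy calculation
(my numbers, not Keynote's):

    # Little's law: L = lambda * W
    arrival_rate = 10.0    # users/minute presented to the site (the real input)
    w_complete   = 5.0     # minutes on site when a session completes
    w_abandon    = 0.5     # minutes on site when a frustrated user bails

    healthy = arrival_rate * w_complete                            # 50 concurrent
    failing = arrival_rate * (0.1 * w_complete + 0.9 * w_abandon)  # 9.5 concurrent
    # Same presented load, wildly different "concurrent users".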
Some systems may slow down or break entirely after a sustained period of
high load (SW aging?).
In testing dialup systems, location is not important since dialups still
contribute most of the latency! Emulation of dialups doesn't work because
modem compression effects are unpredictable - can't even compare the same website on
different modems sometimes. Throttling routers to 56k does not simulate
modems well. There's a paper on the Keynote website about this.
"Benchmark objcts" for diagnosis:
- a GIF ona white box that compares diferent production servers
- special TCP connections that route only thru one ISP, to test specific
network paths
Use web logs to select representativ "agents" for your benchmark
Compare your competitors' sitest to "aggregated indices" that are
available (ie is this our problem or everyone's problem?)
Is the problem in your piece of the Internet or elsewhere? Is TCP
Connect clearly worse for your site? (This is the "purest
measurement" of communications connectivity.)
Peering problems in the Internet: measured peering delays give a more complete and
correct picture than traceroute, since packet routes may
differ in each direction and the same TCP connection may take different routes at
different times. Usually, they measure a large number of TCP Connect
transactions to the same destination and compare the delays at the various peering
boundaries along the path.
Internet loads are very heavy-
tailed! Don't use arithmetic means to measure anything! Also,
StDev is misleading in measuring heavy-tailed data (since it goes as the square of
the badness) - one outlier may counterbalance 10,000 good "typical"
measurements. Conf intervals don't work well either unless the mode is narrow
and/or the curve is symmetric (as opposed to heavy-tailed). So they use "geometric
deviation" = 10^(variance(log(x_i))). Log-space has skews about 8x
smaller than .... - they have a whitepaper
on this on the Keynote site: "Keynote data accuracy and statistical analysis
for performance trending and service level mgt/SLA". They have
a full-time PhD statistician (Chris
Overton; he's attended some ROC retreats, we should get him to come back
and give a stats talk).
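The formula as I jotted it may be garbled; the usual geometric-deviation
definition exponentiates the standard deviation of the log-values, roughly as
in this sketch (naming is mine):

    import math

    def geometric_stats(samples):
        """Summarize heavy-tailed data in log space, where a single outlier
        can't counterbalance 10,000 typical measurements."""
        logs = [math.log10(x) for x in samples]
        mu = sum(logs) / len(logs)
        var = sum((v - mu) ** 2 for v in logs) / len(logs)
        return 10 ** mu, 10 ** math.sqrt(var)  # geometric mean, geometric deviation

    geometric_stats([0.2, 0.3, 0.25, 0.3, 40.0])
    # The 40 s outlier distorts this far less than an arithmetic mean would be.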
Corollary 1: "design for predictability" will become a major
engineering target! Keynote has an ancillary business where they approach
a potential customer jointly with an insurance underwriter who is prepared to
insure against some failures based on Keynote's statistical models.
Security guys want this but don't know how to measure risk yet; Keynote has
solved this for certain kinds of SLA's.
Corollary 2: SLA's need to capture time to recovery and number of
outages. At a recent ISPCON, ISPs said their worst nightmare was losing the
uplink ("Once a month, users mad; twice a month, a nightmare; three times,
I'm out of business"), so their uplink choices are not based (primarily) on
cost or even high perf. Current SLA's don't capture this.
Carl Hutzler, AOL, Sr. Manager Mailbox Operations
[email protected]
- Today: biggest 'growth' is in Spam and other illegitimate uses (DoS,
virus distribution); about 400M msgs/day handled total
- 100M accts, 400M msgs/day peak, email message database completely
"turned over" every 9 days, 26M recipients/hr (record during a
400M-email day), 100+ TB storage, 2x system growth per year (in storage,
almost in capacity; most elements of growth scale linearly, except
attachments, since they get a lot of spam now), 55,000 TPS for complex
transactions, 3.5B total recipients. (One "recipient" = one AOL user receiving
a particular message, so user accounts = recipients with duplicates
removed.) Even when most AOL members aren't online it doesn't affect
the workload of their email servers, since >80% of the email handled comes
from the outside Internet.
- System arch: mailboxes (headers and metadata) stored in Tandem+SQL; mail
contents in Sybase; attachments in Informix as blobs; embedded images in a
Unix FS (soon to be folded into Informix); legacy email (AOL <=3.0) on a
separate Stratus system that is slowly going away.
- They use DB's for recovery, even from human errors and admin/install
errors. They store the email metadata as structured data in the
mailbox DB; not just static stuff like headers, but dynamic stuff like whom
it has been forwarded to, whether an attachment has been forwarded/shared,
etc. So they can basically recover most of the "history" of
an email message by looking at the tables in the mailbox DB!
Cool. With the # of users they have, the ability to do this has
huge economy of scale, i.e., if someone forwards the same attachment to N other AOL
users, they only store 1 copy of the attachment and annotate N rows in the
mailbox DB (see the sketch after this list).
- Each message averages 1.5-3 recipients...this average does
include Spam effects, but as Eric Siegel pointed out, spammers have become
smarter and no longer send the same message to a bunch of people, instead
sending a large number of small messages with distinct recipient
lists. Would be interesting to compare their Spam thresholding
technology with Yahoo's SpamGuard.
- Availability for front-end/gateway apps: mostly stateless, lots of servers
w/failover. We probably knew this, but this is a published
reference: Carl Hutzler, Challenges of Operating the world's largest
24x7x365 email system, Workshop on Dependability of E-commerce Systems at
DSN 2002, Washington, DC.
- Attachments use RAID, message contents are replicated, all on commodity
hardware (mostly Sun & HP, looking at Linux); so neither of those kinds
of outages knocks out capacity for a single user entirely. Single pt of
failure is the mailbox, hence Tandem. They do partial installs of code
(affecting a subset of users) to prove changes before complete deployment
(similar to Akamai and probably Yahoo).
- In addition to failure monitoring and alarms internally, they also have
"simulated users" who dial in like real users, to do end-to-end
checks of email systems. They have very tight thresholds because above
those thresholds they get very fast queueing buildup and congestion delays
that can cause cascades, so they immediately fail out suspect
components rather than letting them slow down the whole system.
- Human error is a relatively minor source of down-minutes, because they
have very senior staff. Often, a "simple" HW failure turns
out to cause more downtime than it should because the OS/SW doesn't handle
it well. Once in a great while, a deterministic bug will take out both
processes in the Tandem process pair.
- Typical outage affects 1.5% of members for 1-2 hrs; average fewer than
4/year. The 1-2 hr recovery time is kind of the "fixed cost" to
diagnose and get someone on call to fix the failure. Human errors tend
to take longer to fix, but are less visible to end users because of
redundancy elsewhere. Root cause of the typical failure is a HW
failure not handled gracefully by SW.
- Their minimum number of users online is now 1.3M (~4am in the US, but =4pm
in Asia).
- Future needs: faster/more reliable data migration (they can move sections
of the DB now, but it takes many hours, during which some users will have limited
access to email features); better testing of SW (both ext. and their own);
partial upgrade capability; better change mgt techniques; planning for
disaster recovery.
- During the panel session, mentioned that "MTTR is the driving factor for
how painful it is to handle a failure." Hot-pluggable HW is
easy to handle; SW bugs are hard because, if deterministic, they require
(eventually) a full rollout of the patch; HW bugs that don't result in
graceful failures are also bad.
- Measure of an "outage": of the transactions requested against the
Tandem mailbox server via the front-ends, they count how many cannot reach the
Tandem, don't return within the 30-sec timeout, or fail. This indirectly
captures end-user impact because the # of email transactions is proportional to the
# of users. An "outage" is a sustained burst (for more than
about 1 minute) of timeouts or failures (see the sketch after this list).
Another site uses "user x outage-minutes". AOL is close to four nines (99.98?)
w/r/t this measure.
- Churn for features: new stuff every 2-3 weeks. This goes into the
Webby stuff, not the Tandem! The Tandem changes very slowly and takes a
long time to upgrade (there are nearly 100? machines that must be upgraded,
and it takes about a day to do each).
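Two little sketches to make the above concrete (both are my toy
reconstructions, not AOL code). First, the single-copy attachment scheme from
the mailbox-DB bullet: forwarding to N users adds N metadata rows but stores
the blob once.

    attachments  = {}    # attachment_id -> blob (Informix, in their system)
    mailbox_rows = []    # (recipient, message_id, attachment_id) metadata rows

    def deliver(recipients, message_id, attachment_id, blob):
        attachments.setdefault(attachment_id, blob)   # stored at most once
        for r in recipients:
            mailbox_rows.append((r, message_id, attachment_id))

    deliver(["u1", "u2", "u3"], "m42", "a7", b"...big attachment...")
    # len(attachments) == 1, len(mailbox_rows) == 3

Second, a guess at how the outage measure might be computed, treating
"sustained burst" as an unbroken run of failed/timed-out transactions lasting
more than about a minute (my interpretation of the definition above):

    def find_outages(events, burst_s=60):
        """events: time-sorted (timestamp_s, ok) pairs, one per front-end
        transaction against the mailbox server."""
        outages, burst_start = [], None
        for t, ok in events:
            if ok:
                if burst_start is not None and t - burst_start > burst_s:
                    outages.append((burst_start, t))
                burst_start = None            # a success ends the burst
            elif burst_start is None:
                burst_start = t               # first failure of a new burst
        if events and burst_start is not None and events[-1][0] - burst_start > burst_s:
            outages.append((burst_start, events[-1][0]))
        return outages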
Matt Kersner(?), Windows Reliability Team
- He alluded to the facility James talked about, where minidumps and reason
codes are sent to MS after unplanned reboots.
- Q: are you working on optimizing reboot time, given that people do
proactive orderly reboots? A: to some extent.
- Q: how about optimizing unplanned-reboot time, i.e., time spent rebuilding
data structures, checking the FS, etc.? A: a lot of unplanned-reboot time
is spinning up RAIDs, etc., and then application-level recovery.
- Slide claim: "customers can achieve 99.99+% availability on the Windows
platform today" - .NET Server features will make this "even
easier". OK, so the
onus is on ROC to demonstrate there is something fishy about this
measurement...under what conditions is it taken? (I didn't have time
to ask, but another audience member I talked to afterward said it's taken
with conservative/small installations of hardened apps, since a lot of
Windows downtime seems to be app-specific.)
- This would be the right
person to invite to a future ROC retreat...Brendan Murphy may be able to
twist his arm a bit to let someone stay up at MS for a few weeks and look at
sanitized data collected from Windows rebooting, the stuff that James
Hamilton was talking about.
Notes from a conversation with Ravi Iyer and colleagues
I finally got some of Ravi Iyer's time at DSN, along with some other
colleagues (see below), some of their students, and Steve Lumetta. It was
hard to actually get a good chunk of time, but once we did the discussion was
very animated. Here are some high-order bits. (Other than this discussion
and Lisa's workshop, DSN was not impressive.)
"Ravi" = Ravi Iyer, "kishor" = Kishor Trivedi (Duke) who
did a lot of the SW rejuvenation work, "rafi" = Rafi Some, JPL
manager/technologist
I tried to explain the area ROC is looking at (inet services) and why it's
interesting - among other things, there are services where users may be willing
to tolerate temporary lapses in performance or other temporary degradation as
preferable to total unavailability, and sometimes this ability to trade
off can be translated into simpler engineering mechanisms for recovery.
- It was a challenge to explain the salient features of the services we're
looking at (inet services) and what makes them an interesting area. I think
the two communities are really used to looking at VERY different apps: our
world of communication-centric, logically-client-server apps where hard
guarantees are rare is NOT what they normally deal with. We have a lot of
cross-education still to do.
- Rafi made cynical comments about how, if all we want to do is something
that makes *some* improvement for *some* customer who isn't sophisticated
enough to really measure how well they are doing anyway, and thereby get
ourselves continued funding, then this stuff may be a good idea. He eventually
retracted his comment under severe duress from me, or claimed it was taken
the wrong way, though I thought it was pretty unequivocal.
Ravi:
- it's "easy" to get 2 or 3 9's. the "entry level"
techniques you allude to (heartbeats, restart, etc) can get you there.
getting to 4 or 5 9's is much harder and more subtle.
- it first requires measuring how many 9's you have and where the weaknesses
are that can be strengthened to get better. (I agree, and was trying
to explain that how this is measured--performability, availability,
availaility under degraded performance, etc.--is exactly one of the
interesting 'degrees of freedom' of ROC.)
- I heard through the grapevine that during one of the post-speaker Q&A
sessions, Ravi made a comment about how people will keep publishing in fault
injection because it's a good way to get tenure. The grapevine wasn't clear on
whether Ravi intended this comment to be cynical or not - I hope he did. I
didn't attend that session, so I don't know.
Rafi:
- A big problem with this whole "internet services" space is that
they're fuzzy to measure. There's no model of what they do - no computation
model, no model that captures different criticality of data, etc. Why don't
you instead look at financial-type apps or similar - typically those have
very hard bounds on correctness, performance, and therefore dependability,
so it will make your job easier.
- I said I agreed and was trying to develop a stricter model for such
services. Rafi said: if you do succeed in coming up with a credible
model for the other kind of services (where degraded performance, longer
response times, etc. can sometimes be traded for other properties, or can be
acceptable to users on an occasional basis), AND if you can get that model
accepted by the Big Players and have it become a de facto standard, you win
big and may even get recognized in the community for having made a meaningful
contribution...
- ...but it's really hard, and you'll probably spend a lot of time banging
your head against it unless you have a way to concurrently do
prototyping/sanity checking/whatever, instead of just emerging from a cave
after 3 years with a model that nobody wants or buys into; and this is not a
good way to get tenure. (Yikes...that's the second person I've
heard this from.)
- If you're going to proceed anyway, keep in mind that coming up
with a model whose primitives capture a usefully large class of services
is comparable to coming up with a benchmark that covers those
services/primitives.
All:
- Agreed that we don't know enough about each other's prior work, don't go
to enough of the same conferences; still too much parochialism. Ravi/Rafi
suggested we do a "small workshop" around ROC. I told them
that EASY is already scheduled; they looked puzzled until Ravi looked at the CFP
and realized Steve Lumetta was a co-organizer. ("Oh, you mean Steve
Lumetta's dependability workshop!") Ravi and Kishor will make sure at
least some of their students attend EASY and similar stuff. I will ping them
each by email as EASY approaches to remind them to send students.