05/08/01 Armando's Notes from HDCC
Jim Gray, Internet reliability.
- c.1995: SW masks most HW faults, HW was great. "Hidden"
SW outages (new s/w, online upgrades, etc) motivated spending on improving
SW; Heisenbugs seemed to define 30-yr MTTF ceiling. "A good goal
for availability today would be 100 year MTTF"
- Tandem technology today: <30sec restart/failover
- Today: RAID standard, "cluster in a box" (commodity failover),
remote replication standard. Hope was for five nines
- Today: cellphones 90%, websites 98%, day-long outages (problems harder to
diagnose and fix). (1990: normal phone 5 nines, ATMs 4 nines,
hour-long outages) Also, hackers are now main source of Internet
outages.
- Why complexity compared to old xact servers? Many heterogeneous layers,
each of which is under constant churn: ISP, firewall, Web, DMZ, app server, DB,
..., plus administrator skill level much reduced
- Typical MS data center: 500 4- or 8-headed servers, crossbars, etc.
Hotmail: 7K servers, 100 backend stores totaling 120TB spread across 3 data
centers, 1B msgs/day, 150M mailboxes of which 100M active, 400K new
mailboxes/day, software upgrades every 3 mo. (eg recently switched from
FreeBSD to Windows...[I wonder what the experience was like])
- New threats from hackers: all systems open (can be attacked from
anywhere); complexity makes them hard to protect; concentration of wealth
makes them targets.
- Check out Internet health report and measurements from SLAC
- Most large sites build their own instrumentation - vast repeated work!
(despite commercial attempts to systematize it)... is part of the problem
the lack of availability benchmarking standards?
- Netcraft: asks system how long it's been since site was rebooted [does
that mean per node average? or...?] [Check Netcraft website!]
- MS.com: each individual backend server is four 9's! But overall
system is LESS...system too complex, software/architecture changing
quarterly, less skilled operators (took 18 hrs to diagnose an induced router
misconfiguration), etc. eBay is the most honest; they publish their
operations logs. [Geo, check it out!] Recent results: 99% of scheduled
uptime, about 2 hrs scheduled downtime/week.
- We have great node-level availability, terrible system-level availability.
Interesting observation from Ed Adams: most bugs in mature systems are
50K-year-MTTF bugs. But there are so many of them that you'll hit one every
couple of years!
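[A back-of-the-envelope sketch of both effects, in C, with assumed numbers
except for the four-nines and 50K figures above (and assuming the 50K MTTF is
in years):]
    /* Availability of N four-nines components in series, and expected time
     * to hit one of B independent latent bugs, each with a 50,000-year MTTF.
     * All counts below are hypothetical. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 0.9999;       /* per-component availability (four nines) */
        int n = 20;              /* hypothetical number of layers in series */
        printf("system availability = %.4f\n", pow(a, n));  /* ~0.9980 */

        double bug_mttf_years = 50000.0; /* per-bug MTTF (Adams, per above) */
        int bugs = 20000;                /* hypothetical count of such bugs */
        /* aggregate MTTF shrinks to mttf/B: ~2.5 years here */
        printf("expected years to first bug hit = %.1f\n",
               bug_mttf_years / bugs);
        return 0;
    }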
- In Tandem days, people wanted faster and cheaper than IBM but just as
reliable. Today, people are willing to tolerate stale data,
insecurity, etc. in return for features, basic availability, etc.
- Recommendation: live in fear! we have 10K-node clusters, headed for
1M-node clusters.
- Recommendation: more systematic measurements, automated management
- Recommendation: change security rules! No anonymous access, unified
auth/authorization model, only one kind of interface. Otherwise we
can't win. Firewalls at packet level can't stop current
attacks. (end-to-end argument) Detecting attacks and responding to
them must be app-level.
- Recommendation: single-function "appliance" servers with low
churn. [But how to accommodate new features?]
- "Hard to publish/hard to get tenure" in holistic dependability;
journals want proofs, etc.
Questions from audience:
- We should have a computational model analogous to transactions, for
Internet services
Observation from me:
- We need a computation model that (a) admits of long system MTTF
times despite shorter MTTFs of components, (b) captures "Internet
semantics" (soft state, stale data, etc), (c) allows avaiabilty
benchmarking that is well-defined with respect to a particular app (by
virtue of mapping the app to the model).
Breakout sessions:
Bill Scherlis: projects for new NASA/CMU/etc testbed, eg open source
dependability tools, etc
David Garlan: in charge of new SWE curriculum @ CMU. Tell him to teach
restartability-centric, fault-isolated, etc design in ugrad SWE classes.
Lynn Wheeler, CTO, First Data (the actual operators of Visa and MC's
network)
- 90% of all Visa/MC xacts done at 6 processing centers (not all operated by
FDC)
- Started out in batch/xact world, recently providing support for
e-commerce. TCP/IP has no infrastructure on the service side for
trouble determination; the service spec is not defined anywhere (only the
protocol is). FDC built a "trouble desk grid" with about 40
failure modes and 5 states, for use in troubleshooting TCP-related trouble
calls; in each case the trouble desk must demonstrate recoverability, isolation,
and/or problem determination.
- Similarly: ISO standards for a credit card xact assume the connection is
ckt-based with end-to-end diagnostics built in. This doesn't exist with
TCP-based infrastructure, so couldn't just "reimplement the spec"
over TCP. Inventing these for the payment gateway resulted in about a 5x
code increase, plus ancillary procedures put in place.
- Rule of thumb: Given a well-trained programmer who can write apps to spec,
adding dependability increases code size by a factor of 4 to 5.
Can we actually trade cycles (nearly free) for dependability? Simple
example: in one case with FDC, a string buffer overflow cost billions of dollars
in downtime, diagnosis, etc. Original problem: programmers thought they
could save a few machine instructions by making assumptions about string length,
proper termination of strings, etc. rather than coding defensively. To
what extent could we integrate code like this into libraries, runtime,
etc.? Would this be a good contribution to the open-source project testbed
(a "dependable" or "safer" libc that is more expensive)?
Steve Gonzalez, Chief, Ops Rsrch and Strategic Devel, NASA JSC
- Before: breaks between missions (continuous operation only required during
a mission), redundant isolated flight control rooms, each console hardwired
for one function (cannot fail over to another console), full HW isolation
(if flying on one system, others can be offline, under test, etc)
- Now: everything distributed, incl. application development (each mission
team may do its own control, visualization, etc), hybrid HW/SW isolation
[was this a good idea??]
- Custom vs COTS: expertise is offsite, yet they need <1min failover
during missions...they are often pushing the envelope of performance for
COTS components.
- With ISS, env is constantly changing/evolving, making 24x7 support very
hard. Had to come up with a scheme to allow live upgrades. Current rec
is to go back to full HW isolation!
Bruce Maggs, Akamai/CMU
- Failures seen at Akamai: congestion at public peering points,
misconfigured routers/switches, inaccessible networks. Congestion at
public peering points is bad because for contractual reasons the operators
can't change route tables to ease congestion. Esp. in areas with less
developed NW infrastructure, subnets sometimes become unreachable for weeks at
a time. This makes software upgrades hard (policy is to make everything
backwards compatible).
- Streaming: streams sent from source to as many as 4 different backbones
via "reflectors". The reflectors redundantly distribute
streams to the edges, i.e. a given edge cluster may have redundant feeds
from different backbones. Duplicate filtering, etc is done at edge
cluster, where the stream is then redistributed. Note: essentially no
buffering - a little bit at the edge clusters, but it's OK to lose the
occasional packet.
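[Hypothetical sketch of the edge-side duplicate filtering - my guess at the
mechanism, not Akamai's code: packets arriving on the redundant backbone feeds
carry a stream sequence number; the edge keeps a small window of recently seen
numbers and forwards only the first copy.]
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define WINDOW 1024          /* assumed window of recently seen packets */

    struct dedup {
        uint32_t seen[WINDOW];   /* last sequence number stored in each slot */
        bool     valid[WINDOW];
    };

    /* Returns true if this is the first copy and should be forwarded to
     * clients; false if it is a duplicate from another backbone feed. */
    bool deliver_once(struct dedup *d, uint32_t seq)
    {
        uint32_t slot = seq % WINDOW;
        if (d->valid[slot] && d->seen[slot] == seq)
            return false;        /* duplicate: drop */
        d->seen[slot] = seq;
        d->valid[slot] = true;
        return true;
    }

    int main(void)
    {
        struct dedup d;
        memset(&d, 0, sizeof d);                /* start with an empty window */
        printf("%d %d\n", deliver_once(&d, 7), deliver_once(&d, 7)); /* 1 0 */
        return 0;
    }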
- HW/server failures: they operate 10K machines on 500+ ISPs. All
machines are either Linux or win2k, controlled over ssh. Interesting
failures: SIMMs pop out of sockets (and the machine doesn't crash!), network
cards not seated in their slots, switches configured to drop broadcasts.
- IP-related software failure: machine 1 dies; machine 2 does IP-addr
stealing, but then gets infected with whatever killed machine 1; then the local
DNS server started shutting off other machines there. Problem: someone
had turned on a router with a weird MTU, and Linux had a bug where it couldn't
compensate for the weird MTU size.
- Bugs vs features: IP-stealing to react to server failures is usually a
feature, but turned out to cause cascading failure in case above.
[Similar problems in TACC, or any system where a systematic bug can take out
one node at a time]
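[Toy simulation of the cascade above (not Akamai's code): the takeover
"feature" works fine until the traffic bound to the stolen address is itself
what kills machines, and then every taker dies in turn.]
    #include <stdbool.h>
    #include <stdio.h>

    #define MACHINES 5

    int main(void) {
        /* machine 0 has just died; its address attracts the "poison" traffic
         * (e.g. packets exercising the MTU bug described above) */
        bool alive[MACHINES] = { false, true, true, true, true };
        int poisoned_owner = 0;

        for (;;) {
            int taker = -1;                 /* next live machine volunteers */
            for (int m = 0; m < MACHINES; m++)
                if (alive[m]) { taker = m; break; }
            if (taker < 0) break;
            printf("machine %d steals IP of machine %d\n", taker, poisoned_owner);
            alive[taker] = false;           /* the poison traffic kills it too */
            poisoned_owner = taker;
        }
        printf("cascading failure: no machines left\n");
        return 0;
    }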
- Reporting tools must be as reliable as system itself - if tools say it's
not working, that's just as bad as watching the failure in action.
- Content providers get both weekly traffic reports and near-real-time data
on where the hits are coming from.
- Attacks: internal bugs have been a much larger source of problems than attacks.
Weak spot in system is top-level nameservers (as is true for the DNS root
servers).
- Testing: one aspect of testing allows them to deploy stuff on the live
system in a "passive" or shadow mode: it does everything except
send bytes. (Like make -n)
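[A sketch of how such a shadow mode might be wired in - assumed structure, not
their actual code: the new release runs the whole request path, but the final
network write is gated off, like make -n.]
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool shadow_mode = true;  /* flip to false when release is promoted */

    /* Runs everything except the last step: emitting bytes onto the network. */
    static void send_response(int client_fd, const char *buf, size_t len)
    {
        if (shadow_mode) {
            fprintf(stderr, "[shadow] would send %zu bytes to fd %d\n",
                    len, client_fd);
            return;
        }
        /* in live mode: write(client_fd, buf, len) */
        (void)client_fd; (void)buf;
    }

    int main(void)
    {
        const char *reply = "HTTP/1.0 200 OK\r\n\r\nhello\n";
        send_response(42, reply, strlen(reply));   /* fd is illustrative only */
        return 0;
    }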
- Design: no single point of failure; the rule of thumb is to require 4
simultaneous failures on different networks to cause a systemwide failure,
and 2 simultaneous failures on different networks to cause a single user-perceived
failure. This doesn't count "systematic vulnerabilities"
like the case above; they've considered porting to multiple OS's, etc, but
there's a lot of overhead involved. Currently they rely on Linux for
everything except serving Windows Media streams.
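[Back-of-the-envelope on the 2-failure / 4-failure rule, with an assumed
per-network unavailability: if deployments on different networks fail
independently, the chance of k specific ones being down at the same instant
is u^k.]
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 0.01;   /* assumed unavailability of one network's deployment */
        printf("P(2 down at once) = %g\n", pow(u, 2)); /* single user-visible failure */
        printf("P(4 down at once) = %g\n", pow(u, 4)); /* systemwide failure */
        return 0;
    }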
- No humans in operations loop except for SW upgrades.
- Failover at multiple scales (steal IP, move to another local site, shut
down DNS subtrees, etc). Sophisticated optimization algorithms and
comprehensive network maps, doing data cleaning on probes to figure out
which ones indicate real problems, etc.
- Multiple and disjoint reporting systems (i.e. the monitoring system is not
used to actually load balance, and the measurement network used for billing
is also separate).
HDCC brainstorming
How to define a computational model? One possibility: start relaxing
ACID constraints one by one, since Internet systems are query-like.
Another: instead of capturing something absolute, like consistency, capture
something like eventual consistency (consistency x latency), or basic
availability (consistency x availability), or something like Aaron's avail
benchmarking or Amin's conits, or "OK to say no"-ness.
Jim Gray: snapshotting filesystems - rotating RAID-0 on 2 out of 3 disks
Can machine virtualization be used to achieve the fault isolation needed for
restartability through modularity (FIRM)?
Plans for near term
- RR app + runtime system -> iroom
- Talk to Mendel about VM stuff: fault isolation? fault injection (fake
the drivers?)
- how much does wolfpack do now?
- Which mechanisms will we use in platform
- what end-to-end and module-specific probes to use? how to express
them?
- Jamie: application to ground station s/w
- find out what kind of fault injection we need
- Restructuring apps to make components restartable by separating out the
state (generalized transformations) - from Fox/Mazer ms, and then isolating
the pieces and doing fault injection (need to pick an app)
- Error handling models for Paths: carry exceptions along? What has been
done in dataflow? pipelining? are "precise exceptions"
needed? Talk to Jim Larus also.
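[One way "carrying exceptions along" a path could look - my own framing,
nothing decided: each stage returns either a value or an error tag, and
downstream stages pass the error through untouched so the failure surfaces at
the end of the path with its origin attached.]
    #include <stdio.h>

    struct result {
        int         ok;
        int         value;
        const char *error;    /* which stage failed, if !ok */
    };

    static struct result stage_parse(const char *input)
    {
        if (input == NULL || input[0] == '\0')
            return (struct result){ 0, 0, "parse: empty input" };
        return (struct result){ 1, input[0] - '0', NULL };
    }

    static struct result stage_scale(struct result in)
    {
        if (!in.ok)
            return in;        /* propagate the exception, don't mask it */
        return (struct result){ 1, in.value * 10, NULL };
    }

    int main(void)
    {
        struct result r = stage_scale(stage_parse(""));
        if (!r.ok) printf("path failed at: %s\n", r.error);
        else       printf("path result: %d\n", r.value);
        return 0;
    }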
- Computation model to emerge from readings: approximate answers, eventual
consistency
- Set up a day for: iRoom demos, talk w/Mendel, talk w/Geo, talk
w/Emre/Laurence, + lunch and social.
- Target next fall for "The Case For ROC" papers
Possible projects
- low level isolation mechanisms
- "synthesis" paper (common ideas across different communities)
"Resilient data structures" - internally redundant so they can be
fixed/tolerate losses - would this be useful?
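[One possible reading, sketched with made-up details: keep three copies of
each slot; reads take a majority vote and scrub any copy that disagrees, so a
single corrupted copy is both tolerated and repaired in place.]
    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 64

    struct tmr_array {                       /* triple-modular-redundant array */
        int32_t copy0[SLOTS], copy1[SLOTS], copy2[SLOTS];
    };

    static void tmr_put(struct tmr_array *a, int i, int32_t v)
    {
        a->copy0[i] = a->copy1[i] = a->copy2[i] = v;
    }

    static int32_t tmr_get(struct tmr_array *a, int i)
    {
        int32_t x = a->copy0[i], y = a->copy1[i], z = a->copy2[i];
        int32_t v = (x == y || x == z) ? x : y;      /* majority, <=1 bad copy */
        a->copy0[i] = a->copy1[i] = a->copy2[i] = v; /* repair the odd one out */
        return v;
    }

    int main(void)
    {
        struct tmr_array a;
        tmr_put(&a, 3, 1234);
        a.copy1[3] = -1;                             /* simulate corruption */
        printf("slot 3 reads as %d\n", tmr_get(&a, 3));  /* 1234, now repaired */
        return 0;
    }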
A thought for inexact answers: a utility function that measures answer
"quality" vs. resource consumption (memory, time ...). It may
have a cliff, or the user may have a threshold, for answer quality vs. resource.
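[A toy version of such a utility function, with made-up shapes and a made-up
2-second threshold: utility rises with answer quality but drops off a cliff
once resource use (here, latency) passes the user's threshold, so a late exact
answer can be worth less than an early approximate one.]
    #include <stdio.h>

    static double utility(double quality, double latency_s)
    {
        const double deadline_s = 2.0;   /* assumed user threshold */
        if (latency_s > deadline_s)
            return 0.0;                  /* the cliff: too late to be useful */
        return quality * (1.0 - latency_s / deadline_s);
    }

    int main(void)
    {
        printf("exact but slow:  %.2f\n", utility(1.0, 1.9));  /* ~0.05 */
        printf("approx but fast: %.2f\n", utility(0.8, 0.2));  /* ~0.72 */
        return 0;
    }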