05/08/01 Armando's Notes from HDCC
Jim Gray, Internet reliability.
- c.1995: SW masks most HW faults, HW was great. "Hidden"
SW outages (new s/w, online upgrades, etc) motivated spending on improving
SW; Heisenbugs seemed to define 30-yr MTTF ceiling. "A good goal
for availability today would be 100 year MTTF"
- Tandem technology today: <30sec restart/failover
- Today: RAID standard, "cluster in a box" (commodity failover),
remote replication standard. Hope was for five nines
- Today: cellphones 90%, websites 98%, day-long outages (problems harder to
diagnose and fix). (1990: normal phone 5 nines, ATMs 4 nines,
hour-long outages) Also, hackers are now main source of Internet
outages.
- Why complexity compared to old xact servers? Many heterogeneous layers,
each of which is under constant churn: ISP, firewall, Web, DMZ, app server, DB,
..., plus administrator skill level much reduced
- Typical MS data center: 500 4- or 8-headed servers, crossbars, etc.
Hotmail: 7K servers, 100 backend stores totaling 120TB spread across 3 data
centers, 1B msgs/day, 150M mailboxes of which 100M active, 400K new
mailboxes/day, software upgrades every 3 mo. (eg recently switched from
FreeBSD to Windows...[I wonder what the experience was like])
- New threats from hackers: all systems open (can be attacked from
anywhere); complexity makes them hard to protect; concentration of wealth
makes them targets.
- Check out Internet health report and measurements from SLAC
- Most large sites build their own instrumentation - vast repeated work!
(despite commercial attempts to systematize it)... is part of the problem
the lack of availability benchmarking standards?
- Netcraft: asks system how long it's been since site was rebooted [does
that mean per node average? or...?] [Check Netcraft website!]
- MS.com: each individual backend server is four 9's! But overall
system is LESS...system too complex, software/architecture changing
quarterly, less skilled operators (took 18 hrs to diagnose an induced router
misconfiguration), etc. eBay is the most honest; they publish their
operations logs. [Geo, check it out!] Recent results: 99% of scheduled
uptime, about 2 hrs scheduled downtime/week.
- We have great node-level availability, terrible system-level availability.
Interesting observation from Ed Adams: most bugs in mature systems are
50K-year-MTTF bugs. But there are so many of them that you'll hit one every
couple of years!
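[A back-of-the-envelope sketch of both effects, in C, with assumed numbers
except for the four-nines and 50K figures above (and assuming the 50K MTTF is
in years):]
    /* Availability of N four-nines components in series, and expected time
     * to hit one of B independent latent bugs, each with a 50,000-year MTTF.
     * All counts below are hypothetical. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 0.9999;       /* per-component availability (four nines) */
        int n = 20;              /* hypothetical number of layers in series */
        printf("system availability = %.4f\n", pow(a, n));  /* ~0.9980 */

        double bug_mttf_years = 50000.0; /* per-bug MTTF (Adams, per above) */
        int bugs = 20000;                /* hypothetical count of such bugs */
        /* aggregate MTTF shrinks to mttf/B: ~2.5 years here */
        printf("expected years to first bug hit = %.1f\n",
               bug_mttf_years / bugs);
        return 0;
    }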
- In Tandem days, people wanted faster and cheaper than IBM but just as
reliable. Today, people are willing to tolerate stale data,
insecurity, etc. in return for features, basic availability, etc.
- Recommendation: live in fear! we have 10K-node clusters, headed for
1M-node clusters.
- Recommendation: more systematic measurements, automated management
- Recommendation: change security rules! No anonymous access, unified
auth/authorization model, only one kind of interface. Otherwise we
can't win. Firewalls at packet level can't stop current
attacks. (end-to-end argument) Detecting attacks and responding to
them must be app-level.
- Recommendation: single-function "appliance" servers with low
churn. [But how to accommodate new features?]
- "Hard to publish/hard to get tenure" in holistic dependability;
journals want proofs, etc.
Questions from audience:
- We should have a computational model analogous to transactions, for
Internet services
Observation from me:
- We need a computation model that (a) admits of long system MTTF
times despite shorter MTTFs of components, (b) captures "Internet
semantics" (soft state, stale data, etc), (c) allows avaiabilty
benchmarking that is well-defined with respect to a particular app (by
virtue of mapping the app to the model).
Breakout sessions:
Bill Scherlis: projects for new NASA/CMU/etc testbed, eg open source
dependability tools, etc
David Garlan: in charge of new SWE curriculum @ CMU. Tell him to teach
restartability-centric, fault-isolated, etc design in ugrad SWE classes.
Lynn Wheeler, CTO, First Data (the actual operators of Visa and MC's
network)
- 90% of all Visa/MC xacts done at 6 processing centers (not all operated by
FDC)
- Started out in batch/xact world, recently providing support for
e-commerce. TCP/IP has no infrastructure on the service side for
trouble determination; the service spec is not defined anywhere (only the
protocol is). FDC built a "trouble desk grid" with about 40
failure modes and 5 states, for use in troubleshooting TCP-related trouble
calls; in each case the trouble desk must demonstrate recoverability, isolation,
and/or problem determination.
- Similarly: ISO standards for a credit card xact assume the connection is
ckt-based with end-to-end diagnostics built in. This doesn't exist with
TCP-based infrastructure, so couldn't just "reimplement the spec"
over TCP. Inventing these for the payment gateway resulted in about a 5x
code increase, plus ancillary procedures put in place.
- Rule of thumb: Given a well-trained programmer who can write apps to spec,
adding dependability increases code size by a factor of 4 to 5.
Can we actually trade cycles (nearly free) for dependability? Simple
example: in one case with FDC, a string buffer overflow cost billions of dollars
in downtime, diagnosis, etc. Original problem: programmers thought they
could save a few machine instructions by making assumptions about string length,
proper termination of strings, etc. rather than coding defensively. To
what extent could we integrate code like this into libraries, runtime,
etc.? Would this be a good contribution to the open-source project testbed
(a "dependable" or "safer" libc that is more expensive)?
Steve Gonzalez, Chief, Ops Rsrch and Strategic Devel, NASA JSC
- Before: breaks between missions (continuous operation only required during
a mission), redundant isolated flight control rooms, each console hardwired
for one function (cannot fail over to another console), full HW isolation
(if flying on one system, others can be offline, under test, etc)
- Now: everything distributed, incl. application development (each mission
team may do its own control, visualization, etc), hybrid HW/SW isolation
[was this a good idea??]
- Custom vs COTS: expertise is offsite, yet they need <1min failover
during missions...they are often pushing the envelope of performance for
COTS components.
- With ISS, env is constantly changing/evolving, making 24x7 support very
hard. Had to come up with a scheme to allow live upgrades. Current rec
is to go back to full HW isolation!
Bruce Maggs, Akamai/CMU
- Failures seen at Akamai: congestion at public peering points,
misconfigured routers/switches, inaccessible networks. Congestion at
public peering points is bad because for contractual reasons the operators
can't change route tables to ease congestion. Esp. in areas with less
developed NW infrastructure, subnets sometimes become unreachable for weeks at
a time. This makes software upgrades hard (policy is to make everything
backwards compatible).
- Streaming: streams sent from source to as many as 4 different backbones
via "reflectors". The reflectors redundantly distribute
streams to the edges, i.e. a given edge cluster may have redundant feeds
from different backbones. Duplicate filtering, etc is done at edge
cluster, where the stream is then redistributed. Note: essentially no
buffering - a little bit at the edge clusters, but it's OK to lose the
occasional packet.
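[Hypothetical sketch of the edge-side duplicate filtering - my guess at the
mechanism, not Akamai's code: packets arriving on the redundant backbone feeds
carry a stream sequence number; the edge keeps a small window of recently seen
numbers and forwards only the first copy.]
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define WINDOW 1024          /* assumed window of recently seen packets */

    struct dedup {
        uint32_t seen[WINDOW];   /* last sequence number stored in each slot */
        bool     valid[WINDOW];
    };

    /* Returns true if this is the first copy and should be forwarded to
     * clients; false if it is a duplicate from another backbone feed. */
    bool deliver_once(struct dedup *d, uint32_t seq)
    {
        uint32_t slot = seq % WINDOW;
        if (d->valid[slot] && d->seen[slot] == seq)
            return false;        /* duplicate: drop */
        d->seen[slot] = seq;
        d->valid[slot] = true;
        return true;
    }

    int main(void)
    {
        struct dedup d;
        memset(&d, 0, sizeof d);                /* start with an empty window */
        printf("%d %d\n", deliver_once(&d, 7), deliver_once(&d, 7)); /* 1 0 */
        return 0;
    }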
- HW/server failures: they operate 10K machines on 500+ ISPs. All
machines are either Linux or win2k, controlled over ssh. Interesting
failures: SIMMs pop out of sockets (and the machine doesn't crash!), network
cards not seated in their slots, switches configured to drop broadcasts.
- IP-related software failure: machine 1 dies; machine 2 does IP-addr
stealing, but then gets infected with whatever killed machine 1; then the local
DNS server started shutting off other machines there. Problem: someone
had turned on a router with a weird MTU, and Linux had a bug where it couldn't
compensate for the weird MTU size.
- Bugs vs features: IP-stealing to react to server failures is usually a
feature, but turned out to cause cascading failure in case above.
[Similar problems in TACC, or any system where a systematic bug can take out
one node at a time]
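[Toy simulation of the cascade above (not Akamai's code): the takeover
"feature" works fine until the traffic bound to the stolen address is itself
what kills machines, and then every taker dies in turn.]
    #include <stdbool.h>
    #include <stdio.h>

    #define MACHINES 5

    int main(void) {
        /* machine 0 has just died; its address attracts the "poison" traffic
         * (e.g. packets exercising the MTU bug described above) */
        bool alive[MACHINES] = { false, true, true, true, true };
        int poisoned_owner = 0;

        for (;;) {
            int taker = -1;                 /* next live machine volunteers */
            for (int m = 0; m < MACHINES; m++)
                if (alive[m]) { taker = m; break; }
            if (taker < 0) break;
            printf("machine %d steals IP of machine %d\n", taker, poisoned_owner);
            alive[taker] = false;           /* the poison traffic kills it too */
            poisoned_owner = taker;
        }
        printf("cascading failure: no machines left\n");
        return 0;
    }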
- Reporting tools must be as reliable as system itself - if tools say it's
not working, that's just as bad as watching the failure in action.
- Content providers get both weekly traffic reports and near-real-time data
on where the hits are coming from.
- Attacks: internal bugs have been a much larger source of problems than attacks.
Weak spot in system is top-level nameservers (as is true for the DNS root
servers).
- Testing: one aspect of testing allows them to deploy stuff on the live
system in a "passive" or shadow mode: it does everything except
send bytes. (Like make -n)
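[A sketch of how such a shadow mode might be wired in - assumed structure, not
their actual code: the new release runs the whole request path, but the final
network write is gated off, like make -n.]
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool shadow_mode = true;  /* flip to false when release is promoted */

    /* Runs everything except the last step: emitting bytes onto the network. */
    static void send_response(int client_fd, const char *buf, size_t len)
    {
        if (shadow_mode) {
            fprintf(stderr, "[shadow] would send %zu bytes to fd %d\n",
                    len, client_fd);
            return;
        }
        /* in live mode: write(client_fd, buf, len) */
        (void)client_fd; (void)buf;
    }

    int main(void)
    {
        const char *reply = "HTTP/1.0 200 OK\r\n\r\nhello\n";
        send_response(42, reply, strlen(reply));   /* fd is illustrative only */
        return 0;
    }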
- Design: no single point of failure; the rule of thumb is to require 4
simultaneous failures on different networks to cause a systemwide failure,
and 2 simultaneous failures on different networks to cause a single user-perceived
failure. This doesn't count "systematic vulnerabilities"
like the case above; they've considered porting to multiple OS's, etc, but
there's a lot of overhead involved. Currently they rely on Linux for
everything except serving Windows Media streams.
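[Back-of-the-envelope on the 2-failure / 4-failure rule, with an assumed
per-network unavailability: if deployments on different networks fail
independently, the chance of k specific ones being down at the same instant
is u^k.]
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 0.01;   /* assumed unavailability of one network's deployment */
        printf("P(2 down at once) = %g\n", pow(u, 2)); /* single user-visible failure */
        printf("P(4 down at once) = %g\n", pow(u, 4)); /* systemwide failure */
        return 0;
    }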
- No humans in operations loop except for SW upgrades.
- Failover at multiple scales (steal IP, move to another local site, shut
down DNS subtrees, etc). Sophisticated optimization algorithms and
comprehensive network maps, doing data cleaning on probes to figure out
which ones indicate real problems, etc.
- Multiple and disjoint reporting systems (i.e. the monitoring system is not
used to actually load balance, and the measurement network used for billing
is also separate).
HDCC brainstorming
How to define a computational model? One possibility: start relaxing
ACID constraints one by one, since Internet systems are query-like.
Another: instead of capturing something absolute, like consistency, capture
something like eventual consistency (consistency x latency), or basic
availability (consistency x availability), or something like Aaron's avail
benchmarking or Amin's conits, or "OK to say no"-ness.
Jim Gray: snapshotting filesystems - rotating RAID-0 on 2 out of 3 disks
Can machine virtualization be used to achieve the fault isolation needed for
restartability through modularity (FIRM)?
Plans for near term
- RR app + runtime system -> iroom
- Talk to Mendel about VM stuff: fault isolation? fault injection (fake
the drivers?)
- how much does wolfpack do now?
- Which mechanisms will we use in platform
- what end-to-end and module-specific probes to use? how to express
them?
- Jamie: application to ground station s/w
- find out what kind of fault injection we need
- Restructuring apps to make components restartable by separating out the
state (generalized transformations) - from Fox/Mazer ms, and then isolating
the pieces and doing fault injection (need to pick an app)
- Error handling models for Paths: carry exceptions along? What has been
done in dataflow? pipelining? are "precise exceptions"
needed? Talk to Jim Larus also.
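[One way "carrying exceptions along" a path could look - my own framing,
nothing decided: each stage returns either a value or an error tag, and
downstream stages pass the error through untouched so the failure surfaces at
the end of the path with its origin attached.]
    #include <stdio.h>

    struct result {
        int         ok;
        int         value;
        const char *error;    /* which stage failed, if !ok */
    };

    static struct result stage_parse(const char *input)
    {
        if (input == NULL || input[0] == '\0')
            return (struct result){ 0, 0, "parse: empty input" };
        return (struct result){ 1, input[0] - '0', NULL };
    }

    static struct result stage_scale(struct result in)
    {
        if (!in.ok)
            return in;        /* propagate the exception, don't mask it */
        return (struct result){ 1, in.value * 10, NULL };
    }

    int main(void)
    {
        struct result r = stage_scale(stage_parse(""));
        if (!r.ok) printf("path failed at: %s\n", r.error);
        else       printf("path result: %d\n", r.value);
        return 0;
    }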
- Computation model to emerge from readings: approximate answers, eventual
consistency
- Set up a day for: iRoom demos, talk w/Mendel, talk w/Geo, talk
w/Emre/Laurence, + lunch and social.
- Target next fall for "The Case For ROC" papers
Possible projects
- low level isolation mechanisms
- "synthesis" paper (common ideas across different communities)
"Resilient data structures" - internally redundant so they can be
fixed/tolerate losses - would this be useful?
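[One possible reading, sketched with made-up details: keep three copies of
each slot; reads take a majority vote and scrub any copy that disagrees, so a
single corrupted copy is both tolerated and repaired in place.]
    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 64

    struct tmr_array {                       /* triple-modular-redundant array */
        int32_t copy0[SLOTS], copy1[SLOTS], copy2[SLOTS];
    };

    static void tmr_put(struct tmr_array *a, int i, int32_t v)
    {
        a->copy0[i] = a->copy1[i] = a->copy2[i] = v;
    }

    static int32_t tmr_get(struct tmr_array *a, int i)
    {
        int32_t x = a->copy0[i], y = a->copy1[i], z = a->copy2[i];
        int32_t v = (x == y || x == z) ? x : y;      /* majority, <=1 bad copy */
        a->copy0[i] = a->copy1[i] = a->copy2[i] = v; /* repair the odd one out */
        return v;
    }

    int main(void)
    {
        struct tmr_array a;
        tmr_put(&a, 3, 1234);
        a.copy1[3] = -1;                             /* simulate corruption */
        printf("slot 3 reads as %d\n", tmr_get(&a, 3));  /* 1234, now repaired */
        return 0;
    }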
A thought for inexact answers: a utility function that measures answer
"quality" vs. resource consumption (memory, time ...). It may
have a cliff, or the user may have a threshold, for answer quality vs. resource.
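[A toy version of such a utility function, with made-up shapes and a made-up
2-second threshold: utility rises with answer quality but drops off a cliff
once resource use (here, latency) passes the user's threshold, so a late exact
answer can be worth less than an early approximate one.]
    #include <stdio.h>

    static double utility(double quality, double latency_s)
    {
        const double deadline_s = 2.0;   /* assumed user threshold */
        if (latency_s > deadline_s)
            return 0.0;                  /* the cliff: too late to be useful */
        return quality * (1.0 - latency_s / deadline_s);
    }

    int main(void)
    {
        printf("exact but slow:  %.2f\n", utility(1.0, 1.9));  /* ~0.05 */
        printf("approx but fast: %.2f\n", utility(0.8, 0.2));  /* ~0.72 */
        return 0;
    }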