05/08/01 Armando's Notes from HDCC
Jim Gray, Internet reliability.
  - c. 1995: SW masked most HW faults; HW was great.  "Hidden"
    SW outages (new s/w, online upgrades, etc.) motivated spending on improving
    SW; Heisenbugs seemed to define a 30-yr MTTF ceiling.  "A good goal
    for availability today would be 100-year MTTF."
 
  - Tandem technology today: <30sec restart/failover
 
  - Today: RAID standard, "cluster in a box" (commodity failover),
    remote replication standard.  Hope was for five nines
 
  - Today: cellphones 90%, websites 98%, day-long outages (problems harder to
    diagnose and fix).  (1990: normal phone 5 nines, ATMs 4 nines,
    hour-long outages.)  Also, hackers are now the main source of Internet
    outages.  
 
  - Why so much more complexity than the old xact servers?  Many heterogeneous
    layers, each under constant churn: ISP, firewall, Web, DMZ, app server, DB,
    ..., plus administrator skill level is much reduced.
 
  - Typical MS data center: 500 4- or 8-headed servers, crossbars, etc. 
    Hotmail: 7K servers, 100 backend stores totaling 120TB spread across 3 data
    centers, 1B msgs/day, 150M mailboxes of which 100M active, 400K new
    mailboxes/day, software upgrades every 3 mo. (e.g. recently switched from
    FreeBSD to Windows... [I wonder what the experience was like])
 
  - New threats from hackers: all systems open (can be attacked from
    anywhere); complexity makes them hard to protect; concentration of wealth
    makes them targets.
 
  - Check out the Internet health report and measurements from SLAC
 
  - Most large sites build their own instrumentation - vast repeated work!
    (despite commercial attempts to systematize it)... is part of the problem
    the lack of availability benchmarking standards?
 
  - Netcraft: asks the system how long it's been since the site was rebooted
    [does that mean a per-node average? or...?]  [Check the Netcraft website!]
 
  - MS.com: each individual backend server is four 9's!  But the overall
    system is LESS... the system is too complex, the software/architecture changes
    quarterly, operators are less skilled (took 18 hrs to diagnose an induced router
    misconfiguration), etc.  eBay is the most honest; they publish their
    operations logs.  [Geo, check it out!]  Recent results: 99% of scheduled
    uptime, about 2 hrs scheduled downtime/week.
 
  - We have great node-level availability, terrible system-level availability.
    Interesting observation from Ed Adams: most bugs in mature systems are 50K-MTTF
    bugs.  But there are so many of them that you'll hit one every couple
    of years!  (Rough arithmetic sketched below.)
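    [Back-of-the-envelope sketch (mine, not from the talk) of why many rare bugs
    add up; the 50K figure is from above, but the units (years) and the bug
    count are assumptions:]

      /* With independent failure processes the rates add, so the aggregate
       * MTTF is roughly the per-bug MTTF divided by the number of such bugs. */
      #include <stdio.h>

      int main(void)
      {
          double per_bug_mttf_years = 50000.0;   /* assume the 50K is in years */
          double bug_count          = 20000.0;   /* assumed count of such bugs */
          printf("aggregate MTTF ~ %.1f years\n",
                 per_bug_mttf_years / bug_count); /* ~2.5 years: "every couple of years" */
          return 0;
      }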
 
  - In Tandem days, people wanted faster and cheaper than IBM but just as
    reliable.  Today, people are willing to tolerate stale data,
    insecurity, etc. in return for features, basic availability, etc.
 
  - Recommendation: live in fear!  We have 10K-node clusters, headed for
    1M-node clusters.
 
  - Recommendation: more systematic measurements, automated management
 
  - Recommendation: change the security rules!  No anonymous access, a unified
    auth/authorization model, only one kind of interface.  Otherwise we
    can't win.  Packet-level firewalls can't stop current
    attacks (end-to-end argument); detecting attacks and responding to
    them must be done at the app level.
 
  - Recommendation: single-function "appliance" servers with low
    churn.  [But how to accommodate new features?]
 
  - "Hard to publish/hard to get tenure" in holistic dependability;
    journals want proofs, etc.
 
Questions from audience:
  - We should have a computational model analogous to transactions, for
    Internet services
 
Observation from me:
  - We need a computation model that (a) admits of long system MTTF
    times despite shorter MTTFs of components, (b) captures "Internet
    semantics" (soft state, stale data, etc), (c) allows avaiabilty
    benchmarking that is well-defined with respect to a particular app (by
    virtue of mapping the app to the model).
 
Breakout sessions:
Bill Scherlis: projects for the new NASA/CMU/etc. testbed, e.g. open-source
dependability tools, etc.
David Garlan: in charge of the new SWE curriculum @ CMU.  Tell him to teach
restartability-centric, fault-isolated, etc. design in ugrad SWE classes.
Lynn Wheeler, CTO, First Data (the actual operators of Visa and MC's
network)
  - 90% of all Visa/MC xacts done at 6 processing centers (not all operated by
    FDC)
 
  - Started out in the batch/xact world, recently providing support for
    e-commerce.  TCP/IP has no infrastructure on the service side for
    trouble determination; the service spec is not defined anywhere (only the
    protocol is).  FDC built a "trouble desk grid" with about 40
    failure modes and 5 states, for use in troubleshooting TCP-related trouble
    calls; in each case the trouble desk must demonstrate recoverability, isolation,
    and/or problem determination.
 
  - Similarly: ISO standards for a credit card xact assume the connection is
    ckt-based with end-to-end diagnostics built in.  This doesn't exist with
    TCP-based infrastructure, so they couldn't just "reimplement the spec"
    over TCP.  Inventing these for the payment gateway resulted in about a 5x
    code increase, plus ancillary procedures put in place.
 
  - Rule of thumb: Given a good trained programmer who can write apps to spec,
    adding dependability increases code size by a factor of 4 to 5.
 
Can we actually trade cycles (nearly free) for dependability?  Simple
example: in one case at FDC, a string buffer overflow cost billions of dollars
in downtime, diagnosis, etc.  Original problem: programmers thought they
could save a few machine instructions by making assumptions about string length,
proper termination of strings, etc. rather than coding defensively.  To
what extent could we integrate such defensive code into libraries, the runtime,
etc.?  Would this be a good contribution to the open-source project testbed
(a "dependable" or "safer" libc that is more expensive)?  (Sketch below.)
Steve Gonzalez, Chief, Ops Rsrch and Strategic Devel, NASA JSC
  - Before: breaks between missions (continuous operation only required during
    a mission), redundant isolated flight control rooms, each console hardwired
    for one function (cannot fail over to another console), full HW isolation
    (if flying on one system, others can be offline, under test, etc)
 
  - Now: everything distributed, incl. application development (each mission
    team may do its own control, visualization, etc), hybrid HW/SW isolation
    [was this a good idea??]
 
  - Custom vs COTS: expertise is offsite, yet they need <1min failover
    during missions...they are often pushing the envelope of performance for
    COTS components.  
 
  - With ISS, env is constantly changing/evolving, making 24x7 support very
    hard. Had to come up with a scheme to allow live upgrades.  Current rec
    is to go back to full HW isolation!
 
Bruce Maggs, Akamai/CMU
  - Failures seen at Akamai: congestion at public peering points,
    misconfigured routers/switches, inaccessible networks.  Congestion at
    public peering points is bad because for contractual reasons the operators
    can't change route tables to ease congestion.  Esp. in areas with less-developed
    network infrastructure, subnets sometimes become unreachable for weeks at a
    time.  This makes software upgrades hard (the policy is to make everything
    backwards compatible).
 
  - Streaming: streams are sent from the source to as many as 4 different backbones
    via "reflectors".  The reflectors redundantly distribute the
    streams to the edges, i.e. a given edge cluster may have redundant feeds
    from different backbones.  Duplicate filtering, etc. is done at the edge
    cluster, where the stream is then redistributed (sketch below).  Note, no
    buffering!  A little bit at the edge clusters, but it's OK to lose the
    occasional packet.
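    [My sketch of what edge-cluster duplicate filtering could look like when the
    same stream arrives over several feeds -- the window size and the approach
    are assumptions, not Akamai's design:]

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define WINDOW 1024               /* assumed duplicate/reorder window */

      static uint32_t seen[WINDOW];     /* seq numbers recently forwarded */
      static bool     valid[WINDOW];

      /* Returns true if this sequence number was already forwarded.  If an
       * intervening packet evicts a slot, an occasional duplicate can slip
       * through -- tolerable, just like the occasional lost packet. */
      bool is_duplicate(uint32_t seq)
      {
          uint32_t slot = seq % WINDOW;
          if (valid[slot] && seen[slot] == seq)
              return true;              /* same packet from another feed */
          seen[slot]  = seq;
          valid[slot] = true;
          return false;
      }

      int main(void)
      {
          printf("%d\n", is_duplicate(42));   /* 0: first copy, forward it */
          printf("%d\n", is_duplicate(42));   /* 1: redundant copy, drop it */
          return 0;
      }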
 
  - HW/server failures: they operate 10K machines on 500+ ISPs.  All
    machines are either Linux or Win2K, controlled over ssh.  Interesting
    failures: SIMMs pop out of sockets (and the machine doesn't crash!), network
    cards not seated in their slots, switches configured to drop broadcasts.
 
  - IP-related software failure: machine 1 dies; machine 2 does IP-addr
    stealing, but then gets hit by whatever killed machine 1; then the local
    DNS server started shutting off other machines there.  Root cause: someone
    had turned on a router with a weird MTU, and Linux had a bug where it couldn't
    compensate for the weird MTU size.
 
  - Bugs vs features: IP-stealing to react to server failures is usually a
    feature, but turned out to cause a cascading failure in the case above. 
    [Similar problems in TACC, or any system where a systematic bug can take out
    one node at a time]
 
  - Reporting tools must be as reliable as the system itself - if the tools say it's
    not working, that's just as bad as watching the failure in action.
 
  - Content providers get both weekly traffic reports and near-real-time data
    on where the hits are coming from.
 
  - Attacks: internal bugs have been a much larger source of problems. 
    The weak spot in the system is the top-level nameservers (as is true for the
    DNS root servers).
 
  - Testing: one aspect of testing allows them to deploy stuff on the live
    system in a "passive" or shadow mode: it does everything except
    send bytes.  (Like make -n)
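    [A guess at what the shadow mode amounts to in code -- the maybe_send wrapper
    and the shadow_mode flag are mine, not Akamai's actual mechanism:]

      #include <stdio.h>
      #include <sys/socket.h>
      #include <sys/types.h>

      static int shadow_mode = 1;       /* deployment flag: 1 = don't emit bytes */

      /* New code runs the full path; only the final send is suppressed. */
      ssize_t maybe_send(int sock, const void *buf, size_t len)
      {
          if (shadow_mode) {
              fprintf(stderr, "shadow: would send %zu bytes on fd %d\n", len, sock);
              return (ssize_t)len;      /* pretend success so callers behave normally */
          }
          return send(sock, buf, len, 0);
      }

      int main(void)
      {
          char msg[] = "hello";
          maybe_send(3, msg, sizeof msg - 1);   /* fd is arbitrary: nothing is sent */
          return 0;
      }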
 
  - Design: no single point of failure; the rule of thumb is to require 4
    simultaneous failures on different networks to cause a systemwide failure,
    and 2 simultaneous failures on different networks to cause a single
    user-perceived failure (rough arithmetic below).  This doesn't count
    "systematic vulnerabilities" like the case above; they've considered porting
    to multiple OS's, etc., but there's a lot of overhead involved.  Currently
    they rely on Linux for everything except serving Windows Media streams.
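    [The independence arithmetic behind that rule of thumb, with an assumed
    per-machine unavailability -- my illustration, not Akamai's numbers:]

      #include <stdio.h>

      int main(void)
      {
          double p = 0.01;   /* assumed chance a given machine is down at any instant */
          printf("2 down at once (user-perceived): %g\n", p * p);          /* 1e-4 */
          printf("4 down at once (systemwide):     %g\n", p * p * p * p);  /* 1e-8 */
          /* A correlated ("systematic") bug breaks the independence assumption,
           * which is exactly the caveat in the note above. */
          return 0;
      }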
 
  - No humans in operations loop except for SW upgrades.
 
  - Failover at multiple scales (steal IP, move to another local site, shut
    down DNS subtrees, etc).  Sophisticated optimization algorithms and
    comprehensive network maps, doing data cleaning on probes to figure out
    which ones indicate real problems, etc.
 
  - Multiple and disjoint reporting systems (i.e. the monitoring system is not
    used to actually load balance, and the measurement network used for billing
    is also separate).
 
HDCC brainstorming
How to define a computational model?  One possibility: start relaxing
ACID constraints one by one, since Internet systems are query-like. 
Another: instead of capturing something absolute, like consistency, capture
something like eventual consistency (consistency x latency), or basic
availability (consistency x availability), or something like Aaron's availability
benchmarking or Amin's conits, or "OK to say no"-ness.  (Toy interface sketched below.)
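[A toy of what a relaxed, "OK to say no" read interface might look like -- the
names and the staleness bound are my assumptions:]

  #include <stdio.h>
  #include <time.h>

  /* The caller states how stale an answer it will accept; the replica returns
   * a fresh value, a stale-but-recent value, or refuses rather than blocking
   * for full consistency. */
  enum read_status { READ_FRESH, READ_STALE, READ_REFUSED };

  struct replica { long value; time_t updated; };   /* toy stand-in for a cache */

  enum read_status relaxed_read(const struct replica *r,
                                int max_staleness_sec, long *out)
  {
      double age = difftime(time(NULL), r->updated);
      if (age < 1.0)                { *out = r->value; return READ_FRESH; }
      if (age <= max_staleness_sec) { *out = r->value; return READ_STALE; }
      return READ_REFUSED;          /* "OK to say no" instead of waiting */
  }

  int main(void)
  {
      struct replica r = { 42, time(NULL) - 30 };   /* last updated 30s ago */
      long v = 0;
      enum read_status s = relaxed_read(&r, 60, &v);
      printf("status=%d value=%ld\n", s, v);        /* READ_STALE, 42 */
      return 0;
  }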
Jim Gray: snapshotting filesystems - rotating RAID-0 on 2 out of 3 disks
Can machine virtualization be used to achieve the fault-isolation for
restartability through modularity (FIRM)?
Plans for near term
  - RR app + runtime system -> iroom
    
      - Talk to Mendel about VM stuff: fault isolation? fault injection (fake
        the drivers?)
 
      - how much does wolfpack do now?
 
      - Which mechanisms will we use in platform
 
      - what end-to-end and module-specific probes to use? how to express
        them?
 
      - Jamie: application to ground station s/w
 
      - find out what kind of fault injection we need
 
    
   
  - Restructuring apps to make components restartable by separating out the
    state (generalized transformations) - from Fox/Mazer ms, and then isolating
    the pieces and doing fault injection (need to pick an app)
 
  - Error handling models for Paths: carry exceptions along? What has been
    done in dataflow? pipelining? are "precise exceptions"
    needed?  Talk to Jim Larus also.
 
  - Computation model to emerge from readings: approximate answers, eventual
    consistency
 
  - Set up a day for: iRoom demos, talk w/Mendel, talk w/Geo, talk
    w/Emre/Laurence, + lunch and social.
 
  - Target next fall for "The Case For ROC" papers
 
Possible projects
  - low level isolation mechanisms
 
  - "synthesis" paper (common ideas across different communities)
 
"Resilient data structures" - internally redundant so they can be
fixed/tolerate losses - would this be useful?
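[A toy of what "internally redundant" might mean -- triplicated storage with
majority-vote reads that repair a single corrupted copy; the structure and
names are mine:]

  #include <stdio.h>

  struct resilient_counter {
      long copy[3];                    /* three replicas of the same value */
  };

  void rc_write(struct resilient_counter *c, long v)
  {
      c->copy[0] = c->copy[1] = c->copy[2] = v;
  }

  /* Majority vote, assuming at most one copy is corrupted: any two copies
   * that agree win, and the read rewrites all three to repair the odd one. */
  long rc_read(struct resilient_counter *c)
  {
      long v = (c->copy[0] == c->copy[1] || c->copy[0] == c->copy[2])
                   ? c->copy[0] : c->copy[1];
      rc_write(c, v);
      return v;
  }

  int main(void)
  {
      struct resilient_counter c;
      rc_write(&c, 7);
      c.copy[1] = -999;                /* simulate corruption of one copy */
      printf("%ld\n", rc_read(&c));    /* prints 7 and repairs copy[1] */
      return 0;
  }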
A thought for inexact answers: a utility function that measures answer
"quality" vs. resource consumption (memory, time, ...).  It may
have a cliff, or the user may have a threshold, for answer quality vs. resources.
(Sketch below.)
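[Sketch of that utility idea, with a quality floor standing in for the "cliff";
the weights, numbers, and linear trade-off are all made up:]

  #include <stdio.h>

  /* quality in [0,1]; cost normalized so 1.0 = the full resource budget. */
  double utility(double quality, double cost, double quality_floor)
  {
      if (quality < quality_floor)
          return 0.0;                  /* below the cliff: the answer is worthless */
      return quality - 0.5 * cost;     /* assumed linear trade-off weight */
  }

  int main(void)
  {
      /* e.g. an approximate answer at 80% quality for 20% of the resources
       * can beat an exact answer that consumes the whole budget. */
      printf("approx: %.2f\n", utility(0.80, 0.2, 0.5));   /* 0.70 */
      printf("exact:  %.2f\n", utility(1.00, 1.0, 0.5));   /* 0.50 */
      return 0;
  }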