IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective
Lisa Spainhower, Tom Gregg, IBM eServer division
One-line summary: A lot of HW FT support, including
ECC-protected shadow state that allows lockstep CPUs to back up one instruction
and retry to eliminate a transient fault ("soft error"), and extensive
SEC/DED protection on caches and memory. All I/O has multiple redundant
ports, and power subsystems are N+1 fault tolerant. Summary: instruction retry
recovers from soft (transient) HW faults; CPU sparing (ship checkpointed CPU
state to a spare) deals w/hard errors; the combination of ckt-level detection
and uarch checkpointing takes 25-40% chip area, but doesn't impact cycle time
or CPI.
Overview/Main Points
Highlights of mainframe (360...390, 3090, 9020 series) hardware FT:
- All caches and main mem are SEC/DED (72,64) ECC protected; L2 cache also
has "mark this line permanently invalid" to avoid using cache
lines that have gone bad. (Other designs discussed in the paper use
S4EC/D4ED.) See the SEC/DED sketch after this list.
- Background scrubbing for all caches. (What happens if b/g scrub
detects an uncorrectable error? Do two complements and look for stuck
bits in the result; see the double-complement sketch below.)
- Line sparing: all data on a failing line are moved to a new chip. Each L2
has ~10 lines reserved for this. (See the line-sparing sketch below.)
- Logic supports backing up one instruction from shadow state, to re-execute
if a transient fault is suspected. If the fault is detected twice, it's
considered permanent.
- Twin I-units/E-units run in lockstep. A fault (disagreement) is
detected and treated as a transient; shadow state (1 instruction behind) lives
in an orthogonal R-unit (recovery unit); checkpointed CPU state is reloaded
and the instruction is retried. (The R-unit state is also ECC protected.) The
rest of the machine "never sees" backup/recovery except as added
latency. If after 2 tries there is still disagreement, the fault is
considered permanent, checkpointed state is scanned into a spare CPU, and
that spare CPU takes over in midstream. Since L1 is included in L2,
cache state doesn't need to be transferred. (See the retry/sparing sketch
below.)
- L1 stores are also written to a Store Buffer. SB entries are retired
(marked as committed, to be written to L2) when the store instruction retires
successfully; if the instruction suffers a permanent fault, the corresponding
SB entry is squashed and never makes it out of L1. (See the store-buffer
sketch below.)
- If the R-unit itself fails, try to collect checkpointed state and then stop
the CPU. The R-unit is only about 4% of chip area and is ECC
protected.
- If CPU hard-logic failures couldn't fail over to a spare (i.e., each one
caused app loss), it would cut Mean Time To App Loss from 24 yr to 11 yr. If
array (L1, L2, Branch Hist Tbl) failures couldn't fail over to a spare, it
would cut MTTAL from 11 to 5 yr. (Back-of-envelope check below.) What if CPU
soft failures weren't caught? (Such faults have been observed in
practice.) Not sure what the effect would be.
Since L1 is write-thru, they see only about 5% of the predicted HW transient
error rate (according to the cache supplier); in L2 and in logic it's much
higher. How exposed is this design to the predicted increase in logic
errors (due to smaller feature size, etc.)? Not sure.
- Chip area costs of the extra logic are 25-40%.
- Current designs are driven by the need to work w/existing solutions, etc. -
if you were starting over, you might do something different.
- In a poorly-managed S/390 shop, 68-82% of outages are "process" (read:
people/mgt) problems. A well-managed one had ~18 hrs/year of outage.
- In a well-managed (and audited) Unix installation (don't touch the apps, run
1 rev behind on OS), 43% of unplanned downtime was from HW, 18% from
network, 7% each from OS/Apps, 6% "process", 5% "human
error". These results were the same for standalone Unix and
clustered Unix.
- HP stats from a 1998 Unix investigation: 8.6-76 min/yr of downtime for 2
nodes; but this doesn't necessarily account for app recovery, etc.
- Lessons: mgt discipline is critical; given that, FT servers do make a
difference (i.e., they probably don't make a difference with poor mgt);
clusters are difficult to implement, their HA potential is hard to
achieve, and often the failover is not really transparent.
- Things are getting worse: price/perf is going down (encouraging the use of
currently unsophisticated components), workloads are bigger and more complex,
and skilled operators/mgt are in short supply.
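
A few sketches of the mechanisms above follow; all are my own illustrations
in C, so every name, layout, and fault model is an assumption rather than the
paper's design. First, the SEC/DED sketch: the same idea as the (72,64) code
on the caches and memory, scaled down to an (8,4) extended Hamming code so
the parity masks are easy to see.

```c
#include <stdint.h>
#include <stdio.h>

/* Parity (XOR) of all set bits in v. */
static int parity(uint8_t v) {
    int p = 0;
    while (v) { p ^= v & 1; v >>= 1; }
    return p;
}

/* Encode 4 data bits as an extended Hamming (8,4) codeword.
 * LSB = position 1: [p1 p2 d1 p4 d2 d3 d4], bit 8 = overall parity p0. */
static uint8_t encode(uint8_t data) {
    uint8_t d1 = data & 1, d2 = (data >> 1) & 1,
            d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
    uint8_t cw = p1 | p2 << 1 | d1 << 2 | p4 << 3 | d2 << 4 | d3 << 5 | d4 << 6;
    return cw | parity(cw) << 7; /* overall parity turns SEC into SEC/DED */
}

/* 0 = clean, 1 = single error corrected, 2 = uncorrectable double error. */
static int decode(uint8_t cw, uint8_t *data) {
    uint8_t b[9];
    for (int i = 1; i <= 8; i++) b[i] = (cw >> (i - 1)) & 1;
    int s = (b[1] ^ b[3] ^ b[5] ^ b[7])        /* syndrome bit from p1 */
          | (b[2] ^ b[3] ^ b[6] ^ b[7]) << 1   /* syndrome bit from p2 */
          | (b[4] ^ b[5] ^ b[6] ^ b[7]) << 2;  /* syndrome bit from p4 */
    int overall = parity(cw);                  /* 0 iff overall parity holds */
    int status = 0;
    if (s && overall)       { b[s] ^= 1; status = 1; } /* single: fix it    */
    else if (s && !overall) return 2;                  /* double: flag only */
    else if (overall)       status = 1;                /* p0 itself flipped */
    *data = b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3;
    return status;
}

int main(void) {
    uint8_t out, cw = encode(0xB);
    int st = decode(cw ^ 0x10, &out);          /* flip one bit (pos 5)    */
    printf("one flip : status=%d data=0x%X\n", st, out);
    st = decode(cw ^ 0x30, &out);              /* flip two bits (pos 5,6) */
    printf("two flips: status=%d\n", st);
    return 0;
}
```

The real code is the same idea with 8 check bits over 64 data bits; "mark
line invalid" and line sparing kick in when decode keeps reporting status 2
on the same line.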
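
The double-complement sketch: when a scrub hits an uncorrectable word, write
back the complement of what was read and read it again; any bit that fails
to invert is stuck, and a second complement restores the data. The memory
model and injected fault here are made up.

```c
#include <stdint.h>
#include <stdio.h>

/* Simulated memory word with an injected stuck-at-1 bit (assumed fault). */
static uint64_t cell;
static const uint64_t stuck_at_1 = 1ULL << 17;

static uint64_t mem_read(void)        { return cell | stuck_at_1; }
static void     mem_write(uint64_t v) { cell = v; }

int main(void) {
    mem_write(0xDEADBEEFCAFEF00DULL);
    uint64_t r1 = mem_read();
    mem_write(~r1);                 /* 1st complement: write inverse back */
    uint64_t r2 = mem_read();       /* healthy bits now read inverted     */
    uint64_t stuck = ~(r1 ^ r2);    /* bits that failed to invert = stuck */
    mem_write(~r2);                 /* 2nd complement: restore the data   */
    printf("stuck bit mask: %016llx\n", (unsigned long long)stuck);
    return 0;
}
```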
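
The line-sparing sketch: a small remap table redirects accesses from lines
declared permanently bad to one of the ~10 spares. The table layout is a
guess; real hardware would also copy the failing line's data to the spare,
which lives in separate storage (so a spare index doesn't collide with a
normal line number the way it appears to here).

```c
#include <stdio.h>

#define N_SPARES 10               /* ~10 reserved lines per L2 (from notes) */

static struct { unsigned bad_line, spare; } remap[N_SPARES];
static int n_remapped;

/* Called when ECC/scrubbing declares a line permanently bad: record a
 * redirect to the next free spare. */
static int spare_out(unsigned bad_line) {
    if (n_remapped == N_SPARES) return -1;  /* out of spares: mark invalid */
    remap[n_remapped].bad_line = bad_line;
    remap[n_remapped].spare = n_remapped;   /* index into spare storage    */
    return n_remapped++;
}

/* Every access consults the remap table first. */
static unsigned effective_line(unsigned line) {
    for (int i = 0; i < n_remapped; i++)
        if (remap[i].bad_line == line) return remap[i].spare;
    return line;
}

int main(void) {
    spare_out(1234);
    printf("line 1234 -> spare %u\n", effective_line(1234));
    printf("line 1235 -> line  %u\n", effective_line(1235));
    return 0;
}
```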
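
The retry/sparing sketch: one instruction runs on two lockstep copies; a
mismatch restores the R-unit checkpoint and retries, and a second mismatch
scans the checkpoint into a spare CPU.

```c
#include <stdio.h>

typedef struct { long regs[16]; long pc; } CpuState;

/* Stand-in for one instruction; inject lets us model a faulty E-unit. */
static long execute(const CpuState *s, int inject) {
    long r = s->regs[0] + s->regs[1];
    return inject ? r ^ 1 : r;
}

/* Returns 0 on success (possibly after one retry), 1 if spared out.
 * n_faults = how many consecutive attempts the faulty unit corrupts. */
static int step(CpuState *arch, CpuState *r_unit, int n_faults) {
    for (int attempt = 0; attempt < 2; attempt++) {
        long a = execute(arch, attempt < n_faults); /* I/E-unit copy A */
        long b = execute(arch, 0);                  /* I/E-unit copy B */
        if (a == b) {                    /* lockstep copies agree        */
            arch->regs[2] = a;           /* commit the result            */
            arch->pc++;
            *r_unit = *arch;             /* advance the ECC'd checkpoint */
            return 0;
        }
        *arch = *r_unit;                 /* transient? back up one instr */
    }
    /* Disagreed twice: treat as permanent, scan checkpoint into spare. */
    CpuState spare = *r_unit;
    printf("sparing out: spare CPU resumes at pc=%ld\n", spare.pc);
    return 1;
}

int main(void) {
    CpuState arch = { .regs = {2, 3}, .pc = 0 }, r_unit = arch;
    printf("transient fault -> %d (retried ok)\n", step(&arch, &r_unit, 1));
    printf("permanent fault -> %d (spared)\n",     step(&arch, &r_unit, 2));
    return 0;
}
```

Note what the notes emphasize: nothing outside the CPU sees the backup except
latency, and because L1 is store-thru and included in L2, sparing moves only
the register image, not cache contents.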
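
The store-buffer sketch: an L1 store sits in the buffer until its instruction
retires cleanly; a permanent fault squashes the entry so the bad store never
reaches L2.

```c
#include <stdbool.h>
#include <stdio.h>

#define SB_SIZE 8

static struct { unsigned long addr, data, inst; bool valid; } sb[SB_SIZE];

/* L1 store: buffered, not yet visible beyond L1. */
static void sb_insert(unsigned long addr, unsigned long data, unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (!sb[i].valid) {
            sb[i].addr = addr; sb[i].data = data;
            sb[i].inst = inst; sb[i].valid = true;
            return;
        }
}

/* Instruction retired successfully: its stores may now drain to L2. */
static void sb_retire(unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (sb[i].valid && sb[i].inst == inst) {
            printf("drain to L2: addr=%lx data=%lx\n", sb[i].addr, sb[i].data);
            sb[i].valid = false;
        }
}

/* Instruction hit a permanent fault: its stores never leave L1. */
static void sb_squash(unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (sb[i].valid && sb[i].inst == inst) sb[i].valid = false;
}

int main(void) {
    sb_insert(0x100, 42, /*inst=*/7);
    sb_insert(0x108, 43, /*inst=*/8);
    sb_retire(7);   /* committed: reaches L2       */
    sb_squash(8);   /* faulted: silently discarded */
    return 0;
}
```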
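
Back-of-envelope check on the MTTAL numbers (my arithmetic, assuming
independent failure modes with exponential lifetimes, so failure rates add):

  no CPU-logic failover: 1/11 - 1/24 ≈ 0.049 extra app losses/yr
  no array failover:     1/5  - 1/11 ≈ 0.109 extra app losses/yr

By these numbers the arrays (L1, L2, BHT) account for roughly twice the
hard-failure rate of the logic, which is plausible given how much of the
chip is array.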
Relevance
- What else might you do? Move redundancy further down in the logic (e.g.,
parity- or ECC-protected ALU sub-FUBs)? See the parity-predicted adder sketch
below.
- Dave P: IBM mainframes don't always have enough spare CPUs. Could
we in the future exploit Moore's law to build sealed modules that will
"always have more than enough" spares?
Flaws
Assumes the software is correct, and just assures that it is executing as
designed.