Lisa Spainhower, Tom Gregg, IBM eServer division

One-line summary:  Extensive HW FT support, including ECC-protected shadow state that lets lockstep CPUs back up one instruction and retry to recover from a transient fault ("soft error"), and extensive SEC/DED protection on caches and memory.  All I/O has multiple redundant ports; power subsystems are N+1 fault tolerant.  Summary: instruction retry recovers from soft (transient) hardware faults; CPU sparing (ship checkpointed CPU state to a spare) deals with hard errors; the combination of circuit-level detection and microarchitectural checkpointing takes 25-40% of chip area, but doesn't impact cycle time or CPI.

Overview/Main Points

Highlights of mainframe (360...390, 3090, 9020 series) hardware FT:



Assumes the software is correct, and just assures that it is executing as designed.
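
The retry-then-spare recovery policy from the summary can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the names run, fault, max_retries and spares are hypothetical, each "instruction" is just a function of one integer register, and a fault is modeled as a single flipped bit in one of the two lockstep copies. Like the hardware, the model only checks that the two copies agree; it does not check that the program itself is correct.

```python
def run(program, fault, max_retries=1, spares=1):
    """Execute `program` on a modeled lockstep CPU pair.

    program: list of functions int -> int (one per "instruction").
    fault(cpu, pc, attempt) -> bool: hypothetical fault injector; True means
        one copy of the lockstep pair produces a corrupted result.
    Returns (final_state, log of recovery actions).
    """
    checkpoint = 0   # shadow copy of committed state (ECC-protected in the real machine)
    cpu = 0          # index of the CPU currently in use; higher indices are spares
    log = []
    for pc, op in enumerate(program):
        attempt = 0
        while True:
            # Both halves of the lockstep pair start from the checkpoint.
            a = op(checkpoint)
            b = op(checkpoint) ^ (1 if fault(cpu, pc, attempt) else 0)
            if a == b:
                checkpoint = a          # results agree: commit and move on
                break
            if attempt < max_retries:
                attempt += 1            # back up one instruction and retry
                log.append(("retry", cpu, pc))
            elif cpu < spares:
                cpu += 1                # hard error: ship the checkpoint to a spare
                attempt = 0
                log.append(("spare", cpu, pc))
            else:
                raise RuntimeError("unrecoverable fault at instruction %d" % pc)
    return checkpoint, log


program = [lambda x: x + 5, lambda x: x * 3, lambda x: x - 4]

# Transient fault: appears once on CPU 0, disappears on retry.
soft = lambda cpu, pc, attempt: cpu == 0 and pc == 1 and attempt == 0
# Hard fault: CPU 0 is permanently broken from instruction 1 onward.
hard = lambda cpu, pc, attempt: cpu == 0 and pc >= 1

print(run(program, soft))
print(run(program, hard))
```

In this model a soft error costs one retry of the failing instruction, while a hard error costs a retry plus a move to the spare; in both cases the committed state is never corrupted because nothing is written to the checkpoint until the two copies agree.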

