IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective
Lisa Spainhower, Tom Gregg, IBM eServer division
One-line summary: A lot of HW FT support, including
ECC-protected shadow state that allows lockstep CPUs to back up one instruction
and retry to eliminate a transient fault ("soft error"), and extensive
SEC/DED protection on caches and memory. All I/O has multiple redundant
ports, and power subsystems are N+1 fault tolerant. Summary: instruction retry
recovers from soft (transient) HW faults; CPU sparing (ship checkpointed CPU
state to a spare) deals w/hard errors; the combination of ckt-level detection
and uarch checkpointing takes 25-40% chip area, but doesn't impact cycle time
or CPI.
Overview/Main Points
Highlights of mainframe (360...390, 3090, 9020 series) hardware FT:
- All caches and main mem are SEC/DED (72,64) ECC protected; L2 cache also
has "mark this line permanently invalid" to avoid using cache
lines that have gone bad. (Other designs discussed in the paper use
S4EC/D4ED.) See the SEC/DED sketch after this list.
- Background scrubbing for all caches. (What happens if b/g scrub
detects an uncorrectable error? Do two complements and look for stuck
bits in the result; see the double-complement sketch below.)
- Line sparing: all data on a failing line are moved to a new chip. Each L2
has ~10 lines reserved for this. (See the line-sparing sketch below.)
- Logic supports backing up one instruction from shadow state, to re-execute
if a transient fault is suspected. If the fault is detected twice, it's
considered permanent.
- Twin I-units/E-units run in lockstep. A fault (disagreement) is
detected and treated as a transient; shadow state (1 instruction behind) lives
in an orthogonal R-unit (recovery unit); checkpointed CPU state is reloaded
and the instruction is retried. (The R-unit state is also ECC protected.) The
rest of the machine "never sees" backup/recovery except as added
latency. If after 2 tries there is still disagreement, the fault is
considered permanent, checkpointed state is scanned into a spare CPU, and
that spare CPU takes over in midstream. Since L1 is included in L2,
cache state doesn't need to be transferred. (See the retry/sparing sketch
below.)
- L1 stores are also written to a Store Buffer. SB entries are retired
(marked as committed, to be written to L2) when the store instruction retires
successfully; if the instruction suffers a permanent fault, the corresponding
SB entry is squashed and never makes it out of L1. (See the store-buffer
sketch below.)
- If the R-unit itself fails, try to collect checkpointed state and then stop
the CPU. The R-unit is only about 4% of chip area and is ECC
protected.
- If CPU hard-logic failures couldn't fail over to a spare (i.e., each one
caused app loss), it would cut Mean Time To App Loss from 24 yr to 11 yr. If
array (L1, L2, Branch Hist Tbl) failures couldn't fail over to a spare, it
would cut MTTAL from 11 to 5 yr. (Back-of-envelope check below.) What if CPU
soft failures weren't caught? (Such faults have been observed in
practice.) Not sure what the effect would be.
Since L1 is write-thru, they see only about 5% of the predicted HW transient
error rate (according to the cache supplier); in L2 and in logic it's much
higher. How exposed is this design to the predicted increase in logic
errors (due to smaller feature size, etc.)? Not sure.
- Chip area costs of the extra logic are 25-40%.
- Current designs are driven by the need to work w/existing solutions, etc. -
if you were starting over, you might do something different.
- In a poorly-managed S/390 shop, 68-82% of outages are "process" (read:
people/mgt) problems. A well-managed one had ~18 hrs/year of outage.
- In a well-managed (and audited) Unix installation (don't touch the apps, run
1 rev behind on OS), 43% of unplanned downtime was from HW, 18% from
network, 7% each from OS/Apps, 6% "process", 5% "human
error". These results were the same for standalone Unix and
clustered Unix.
- HP stats from a 1998 Unix investigation: 8.6-76 min/yr of downtime for 2
nodes; but this doesn't necessarily account for app recovery, etc.
- Lessons: mgt discipline is critical; given that, FT servers do make a
difference (i.e., they probably don't make a difference with poor mgt);
clusters are difficult to implement, their HA potential is hard to
achieve, and often the failover is not really transparent.
- Things are getting worse: price/perf is going down (encouraging the use of
currently unsophisticated components), workloads are bigger and more complex,
and skilled operators/mgt are in short supply.
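
A few sketches of the mechanisms above follow; all are my own illustrations
in C, so every name, layout, and fault model is an assumption rather than the
paper's design. First, the SEC/DED sketch: the same idea as the (72,64) code
on the caches and memory, scaled down to an (8,4) extended Hamming code so
the parity masks are easy to see.

```c
#include <stdint.h>
#include <stdio.h>

/* Parity (XOR) of all set bits in v. */
static int parity(uint8_t v) {
    int p = 0;
    while (v) { p ^= v & 1; v >>= 1; }
    return p;
}

/* Encode 4 data bits as an extended Hamming (8,4) codeword.
 * LSB = position 1: [p1 p2 d1 p4 d2 d3 d4], bit 8 = overall parity p0. */
static uint8_t encode(uint8_t data) {
    uint8_t d1 = data & 1, d2 = (data >> 1) & 1,
            d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p4 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
    uint8_t cw = p1 | p2 << 1 | d1 << 2 | p4 << 3 | d2 << 4 | d3 << 5 | d4 << 6;
    return cw | parity(cw) << 7; /* overall parity turns SEC into SEC/DED */
}

/* 0 = clean, 1 = single error corrected, 2 = uncorrectable double error. */
static int decode(uint8_t cw, uint8_t *data) {
    uint8_t b[9];
    for (int i = 1; i <= 8; i++) b[i] = (cw >> (i - 1)) & 1;
    int s = (b[1] ^ b[3] ^ b[5] ^ b[7])        /* syndrome bit from p1 */
          | (b[2] ^ b[3] ^ b[6] ^ b[7]) << 1   /* syndrome bit from p2 */
          | (b[4] ^ b[5] ^ b[6] ^ b[7]) << 2;  /* syndrome bit from p4 */
    int overall = parity(cw);                  /* 0 iff overall parity holds */
    int status = 0;
    if (s && overall)       { b[s] ^= 1; status = 1; } /* single: fix it    */
    else if (s && !overall) return 2;                  /* double: flag only */
    else if (overall)       status = 1;                /* p0 itself flipped */
    *data = b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3;
    return status;
}

int main(void) {
    uint8_t out, cw = encode(0xB);
    int st = decode(cw ^ 0x10, &out);          /* flip one bit (pos 5)    */
    printf("one flip : status=%d data=0x%X\n", st, out);
    st = decode(cw ^ 0x30, &out);              /* flip two bits (pos 5,6) */
    printf("two flips: status=%d\n", st);
    return 0;
}
```

The real code is the same idea with 8 check bits over 64 data bits; "mark
line invalid" and line sparing kick in when decode keeps reporting status 2
on the same line.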
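
The double-complement sketch: when a scrub hits an uncorrectable word, write
back the complement of what was read and read it again; any bit that fails
to invert is stuck, and a second complement restores the data. The memory
model and injected fault here are made up.

```c
#include <stdint.h>
#include <stdio.h>

/* Simulated memory word with an injected stuck-at-1 bit (assumed fault). */
static uint64_t cell;
static const uint64_t stuck_at_1 = 1ULL << 17;

static uint64_t mem_read(void)        { return cell | stuck_at_1; }
static void     mem_write(uint64_t v) { cell = v; }

int main(void) {
    mem_write(0xDEADBEEFCAFEF00DULL);
    uint64_t r1 = mem_read();
    mem_write(~r1);                 /* 1st complement: write inverse back */
    uint64_t r2 = mem_read();       /* healthy bits now read inverted     */
    uint64_t stuck = ~(r1 ^ r2);    /* bits that failed to invert = stuck */
    mem_write(~r2);                 /* 2nd complement: restore the data   */
    printf("stuck bit mask: %016llx\n", (unsigned long long)stuck);
    return 0;
}
```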
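
The line-sparing sketch: a small remap table redirects accesses from lines
declared permanently bad to one of the ~10 spares. The table layout is a
guess; real hardware would also copy the failing line's data to the spare,
which lives in separate storage (so a spare index doesn't collide with a
normal line number the way it appears to here).

```c
#include <stdio.h>

#define N_SPARES 10               /* ~10 reserved lines per L2 (from notes) */

static struct { unsigned bad_line, spare; } remap[N_SPARES];
static int n_remapped;

/* Called when ECC/scrubbing declares a line permanently bad: record a
 * redirect to the next free spare. */
static int spare_out(unsigned bad_line) {
    if (n_remapped == N_SPARES) return -1;  /* out of spares: mark invalid */
    remap[n_remapped].bad_line = bad_line;
    remap[n_remapped].spare = n_remapped;   /* index into spare storage    */
    return n_remapped++;
}

/* Every access consults the remap table first. */
static unsigned effective_line(unsigned line) {
    for (int i = 0; i < n_remapped; i++)
        if (remap[i].bad_line == line) return remap[i].spare;
    return line;
}

int main(void) {
    spare_out(1234);
    printf("line 1234 -> spare %u\n", effective_line(1234));
    printf("line 1235 -> line  %u\n", effective_line(1235));
    return 0;
}
```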
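
The retry/sparing sketch: one instruction runs on two lockstep copies; a
mismatch restores the R-unit checkpoint and retries, and a second mismatch
scans the checkpoint into a spare CPU.

```c
#include <stdio.h>

typedef struct { long regs[16]; long pc; } CpuState;

/* Stand-in for one instruction; inject lets us model a faulty E-unit. */
static long execute(const CpuState *s, int inject) {
    long r = s->regs[0] + s->regs[1];
    return inject ? r ^ 1 : r;
}

/* Returns 0 on success (possibly after one retry), 1 if spared out.
 * n_faults = how many consecutive attempts the faulty unit corrupts. */
static int step(CpuState *arch, CpuState *r_unit, int n_faults) {
    for (int attempt = 0; attempt < 2; attempt++) {
        long a = execute(arch, attempt < n_faults); /* I/E-unit copy A */
        long b = execute(arch, 0);                  /* I/E-unit copy B */
        if (a == b) {                    /* lockstep copies agree        */
            arch->regs[2] = a;           /* commit the result            */
            arch->pc++;
            *r_unit = *arch;             /* advance the ECC'd checkpoint */
            return 0;
        }
        *arch = *r_unit;                 /* transient? back up one instr */
    }
    /* Disagreed twice: treat as permanent, scan checkpoint into spare. */
    CpuState spare = *r_unit;
    printf("sparing out: spare CPU resumes at pc=%ld\n", spare.pc);
    return 1;
}

int main(void) {
    CpuState arch = { .regs = {2, 3}, .pc = 0 }, r_unit = arch;
    printf("transient fault -> %d (retried ok)\n", step(&arch, &r_unit, 1));
    printf("permanent fault -> %d (spared)\n",     step(&arch, &r_unit, 2));
    return 0;
}
```

Note what the notes emphasize: nothing outside the CPU sees the backup except
latency, and because L1 is store-thru and included in L2, sparing moves only
the register image, not cache contents.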
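
The store-buffer sketch: an L1 store sits in the buffer until its instruction
retires cleanly; a permanent fault squashes the entry so the bad store never
reaches L2.

```c
#include <stdbool.h>
#include <stdio.h>

#define SB_SIZE 8

static struct { unsigned long addr, data, inst; bool valid; } sb[SB_SIZE];

/* L1 store: buffered, not yet visible beyond L1. */
static void sb_insert(unsigned long addr, unsigned long data, unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (!sb[i].valid) {
            sb[i].addr = addr; sb[i].data = data;
            sb[i].inst = inst; sb[i].valid = true;
            return;
        }
}

/* Instruction retired successfully: its stores may now drain to L2. */
static void sb_retire(unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (sb[i].valid && sb[i].inst == inst) {
            printf("drain to L2: addr=%lx data=%lx\n", sb[i].addr, sb[i].data);
            sb[i].valid = false;
        }
}

/* Instruction hit a permanent fault: its stores never leave L1. */
static void sb_squash(unsigned long inst) {
    for (int i = 0; i < SB_SIZE; i++)
        if (sb[i].valid && sb[i].inst == inst) sb[i].valid = false;
}

int main(void) {
    sb_insert(0x100, 42, /*inst=*/7);
    sb_insert(0x108, 43, /*inst=*/8);
    sb_retire(7);   /* committed: reaches L2       */
    sb_squash(8);   /* faulted: silently discarded */
    return 0;
}
```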
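
Back-of-envelope check on the MTTAL numbers (my arithmetic, assuming
independent failure modes with exponential lifetimes, so failure rates add):

  no CPU-logic failover: 1/11 - 1/24 ≈ 0.049 extra app losses/yr
  no array failover:     1/5  - 1/11 ≈ 0.109 extra app losses/yr

By these numbers the arrays (L1, L2, BHT) account for roughly twice the
hard-failure rate of the logic, which is plausible given how much of the
chip is array.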
Relevance
- What else might you do? Move redundancy further down in the logic (e.g.,
parity- or ECC-protected ALU sub-FUBs)? See the parity-predicted adder sketch
below.
- Dave P: IBM mainframes don't always have enough spare CPUs. Could
we in the future exploit Moore's law to build sealed modules that will
"always have more than enough" spares?
Flaws
Assumes the software is correct, and just assures that it is executing as
designed.