George's trip report from the 4th SIGDEB Meeting

The meeting was focused on refining our working document on dependability classes. At this point we have 5 different axes that define the space of dependability for benchmarking purposes: application, availability, data integrity, survivability, and security (we renamed survivability to something else, but I forgot the exact term). In this email I'll summarize the aspects that are unlikely to be covered in the forthcoming summary from the SIG chair.

From my perspective, the highlight of the meeting was a talk given by Scott Swaney, who has been working on IBM's S/390 and z900 mainframe architectures for the past 12 years. Unfortunately, TLAs were all too frequent -- much of the talk was about getting CEs and UEs in the CP and the SAP, which would cause the GFU to dynamically remap the FWT onto the PCM... But once I found my way out of the fog, I found it very exciting (try to guess why)... Overview: their hardware has tons of error checking and checkpointing; when they encounter an error, they restart the instruction stream from the point just before the error, on the assumption that the fault was transient. If the problem repeats more than a predetermined number of times, they give up and either fail over to spare components or, if that's not possible, halt the machine. Scott made the bold claim that they will detect any possible hardware error.
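The retry policy Scott described can be sketched in a few lines. This is purely my own reconstruction (not IBM's implementation); the retry limit, the exception type, and the two callbacks are all hypothetical stand-ins for the hardware mechanisms:

```python
# Sketch of the retry-then-failover policy: re-execute from the last
# checkpoint on a detected error, assume the fault was transient, and
# escalate to a spare component only after a threshold is exceeded.

RETRY_LIMIT = 3  # hypothetical; the real threshold is machine-specific


class SpareExhausted(Exception):
    """Raised when no spare is available -- the machine would halt."""


def execute_with_retry(run_from_checkpoint, fail_over_to_spare):
    """run_from_checkpoint() re-runs the instruction stream from the last
    checkpoint and raises RuntimeError on a detected hardware error.
    fail_over_to_spare() returns True if a spare component took over."""
    for _attempt in range(RETRY_LIMIT):
        try:
            return run_from_checkpoint()
        except RuntimeError:
            continue  # assume a transient fault and simply retry
    # The error persisted: treat it as a hard fault and fail over.
    if fail_over_to_spare():
        return run_from_checkpoint()  # resume on the spare component
    raise SpareExhausted("no spare available; halting the machine")
```

The interesting design point is that transient and permanent faults need no up-front diagnosis: the retry count itself is the classifier.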

Here are some bulleted notes:

- They keep "correctable error" (CE) and "unrecoverable error" (UE) counters for each bit address line. Whenever the CE or UE threshold is exceeded, they map the entire address line to a redundant memory bank. Of course, the core assumption is single-error occurrence; multiple simultaneous errors would take them down (as would centralized failures, such as the system oscillator dying). On the other hand, they do things like interleave bits across memory banks, so that you can blow out 16 adjacent bits in a memory array and still recover the data based on the line ECCs.
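The counting-and-remapping scheme above amounts to a small bookkeeping structure. A minimal sketch, with hypothetical threshold values (the real ones weren't given) and no claim to match IBM's actual logic:

```python
# Per-address-line CE/UE bookkeeping: each line has its own counters,
# and crossing a threshold retires the whole line to a redundant bank.

CE_THRESHOLD = 10  # hypothetical values -- the talk gave no numbers
UE_THRESHOLD = 1   # a single unrecoverable error is enough to retire


class LineScrubber:
    def __init__(self):
        self.ce = {}           # address line -> correctable-error count
        self.ue = {}           # address line -> unrecoverable-error count
        self.remapped = set()  # lines moved to the redundant memory bank

    def record_error(self, line, correctable):
        counts, limit = ((self.ce, CE_THRESHOLD) if correctable
                         else (self.ue, UE_THRESHOLD))
        counts[line] = counts.get(line, 0) + 1
        if counts[line] >= limit and line not in self.remapped:
            self.remapped.add(line)  # retire the line to the spare bank
```

Note the asymmetry: CEs are tolerated up to a budget (they cost only a correction), while a UE retires the line immediately.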

- They run 4 different OSs natively (MVS, Linux, and two others I had never heard of). While most memory errors at the L1 and L2 levels are handled in hardware, L3 errors will propagate to the OS, so the OS needs to know how to recover from them.

- However, they ship all their machines with spare CPUs and have mechanisms to copy system state from a faulty CPU to a spare and restart execution in such a way that the OS never notices. Interesting non-end-to-end approach :-) They claim these machines crash due to hardware once every 30 years.

- The S/390 actually has wires hanging out of the motherboard and component interfaces that allow IBM engineers to inject errors such as flipping voltages, shorting them, etc., using so-called "bug boxes." Moreover, they can set up their fault injectors to inject faults only when the debugging facility is ready to follow the effects. The coolest part, though, is that they can have particular faults injected based on various events, such as a particular sequence of opcodes being executed. The customer cannot activate any of these testing mechanisms, as they are cryptographically protected (a customer could, however, call in an IBM engineer to do it for them). The argument was that IBM mainframes and their applications are provided as highly available end-to-end solutions, not regular computer systems.
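The opcode-triggered injection idea translates naturally into software. A sketch of my own (the real "bug boxes" are hardware, and the trigger sequence and callback here are invented for illustration): arm an injector that fires only when a given opcode sequence has just executed.

```python
from collections import deque


class TriggeredInjector:
    """Fire a fault-injection callback when the most recently observed
    opcodes exactly match a configured trigger sequence."""

    def __init__(self, trigger_sequence, inject):
        self.trigger = tuple(trigger_sequence)
        # Sliding window over the last len(trigger) opcodes executed.
        self.window = deque(maxlen=len(self.trigger))
        self.inject = inject  # callback that performs the injection

    def observe(self, opcode):
        self.window.append(opcode)
        if tuple(self.window) == self.trigger:
            self.inject()  # fault lands at a precisely chosen moment
```

The point, echoing the conclusion further down, is that this kind of trigger controls *when* the fault lands, i.e., the system state at injection time, which matters more than the fault's identity.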

- On the "public availability" of the technology behind all this stuff, Lisa Spainhower commented that S/390 customers are like a "country club": you either belong or you don't. Those who belong know about all these features and the proprietary technology behind them. The others don't have access to it.

- On a final note, Scott thought that all of this is overkill: there are parts of the logic that, to his knowledge, have never been exercised in the field.

- An interesting conclusion was that the particular errors you inject don't really matter as much as WHEN you inject them (i.e., what matters is the state in which your system is when the error happens).

Some interesting tidbits from the other discussions:

- What defines a benchmark is industry agreement, not necessarily perfection. Hence, once you have something that is good enough and industry agrees on it, you have an accepted benchmark.

- When choosing a system for an environment that requires high dependability, the admin and the site where the system is installed matter a great deal. Brendan Murphy (Microsoft Research) claimed that he can make any OS come out the best, depending on where it's installed and who runs it.

- Benchmarking typically needs to rely on mature technologies and techniques, both to get buy-in from industry and because you need confidence in the accuracy and correctness of the technique.

- We all agreed that we don't want to turn SIGDeB into another TPC-like effort. TPC benchmarks are a curse on industry: they hate them, but feel compelled to run them because of competitive pressures. TPC results (in particular TPC-C) are largely irrelevant for comparing machines from different vendors; at best you can compare two different generations of machines from the same family. IBM will not embark on a TPC benchmark unless they are sure they'll come out among the top few. The costs are enormous: Nick Bowen, VP of ... at IBM, has just spent $8 million on a configuration for TPC-C with tons of servers, 7,000 disks, and other hardware that could not fit in a large conference room. No customer ever purchases the systems on which TPC is run, and hardly anyone can claim that TPC-C is remotely related to their own workload.