Robustness in Complex Systems

Steven D. Gribble
Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII)

Summary by
Andy Huang

One-sentence summary:

The author observes that coupling among components in complex systems makes those systems fragile: small perturbations can lead to system-wide failure. The author also presents design techniques that can make systems less fragile.

Overview/Main Points

Most systems are "made robust" by precognition -- predicting operating conditions for the system and then architecting the system to operate well in those conditions. "Any system that attempts to gain robustness solely through precognition is prone to fragility." In the author's experience with DDS, he observed that small perturbations in the operating environment often led to a violation of the design assumptions, which in turn led to system-wide failure.
DDS experience:
- Garbage collection: When DDS operated near saturation, slight differences in load put pressure on the garbage collector of certain bricks (storage nodes that participate in two-phase commit). Once these nodes fell behind, more active objects were left in the heap, and they experienced further performance degradation, dragging down the performance of the entire storage group.
- Deterministic bug: Since the same code runs in all bricks, one deterministic bug led to all bricks failing, violating the independent failures assumption.
- Slow memory leak: When the system was launched, all bricks were started at the same time. Given balanced load across all bricks and a small memory leak, all bricks ran out of heap space at nearly the same time. Once bricks started to run out of memory, failover exacerbated the situation by increasing the load (and therefore the memory leak rate) on the surviving bricks. The uniformity of workload was a source of coupling among the bricks, and it violated the independent failures assumption.
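
The slow-memory-leak scenario can be illustrated with a small simulation. This is only a sketch under invented assumptions, not a model of DDS itself: the heap size, request rate, and leak per request are made-up numbers, and the `simulate` function and its parameters are hypothetical names. Each brick leaks memory in proportion to the load it serves, and when a brick runs out of heap its load is redistributed to the survivors.

```python
def simulate(initial_used, heap=3000, total_load=12, leak_per_request=1):
    """Return, for each brick, the tick at which it runs out of heap.

    All numbers are illustrative assumptions, not measurements from DDS.
    """
    used = list(initial_used)
    alive = set(range(len(used)))
    death_tick = {}
    tick = 0
    while alive:
        tick += 1
        # Failover: surviving bricks split the whole load evenly,
        # so each failure raises the leak rate of the survivors.
        per_brick_load = total_load / len(alive)
        for i in alive:
            used[i] += per_brick_load * leak_per_request
        for i in [i for i in alive if used[i] >= heap]:
            alive.remove(i)
            death_tick[i] = tick
    return [death_tick[i] for i in range(len(used))]


# All bricks start together with empty heaps and balanced load.
uniform = simulate([0, 0, 0, 0])

# Hypothetical contrast: bricks begin with different amounts of leaked memory,
# as if they had been started at different times.
staggered = simulate([0, 750, 1500, 2250])

uniform_spread = max(uniform) - min(uniform)
staggered_spread = max(staggered) - min(staggered)
```

With identical starting conditions every brick fails on the same tick, which is the correlated failure described above. In the staggered run the failures are spread out, but each one still shortens the life of the remaining bricks because failover raises their load. The staggered start is an illustrative contrast only; the summary does not say the author proposed it.
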
Robustification techniques:
- Overprovisioning: Systems tend to be less stable when operating near capacity. The difficult part of overprovisioning is that it's hard to predict the operating regime of the system.
- Admission control: Reject load to avoid near-capacity operation. The difficult part of admission control is being able to identify the saturation point. Further, it is important to remember that rejecting load also consumes some resources (think of it as a minimal service of the system).
- Introspection and adaptation: Create systems that can detect aberrant behavior and adapt control variables dynamically to keep the system operating in a stable regime (e.g., TCP congestion control).
- Plan for failure: Use robust abstractions (e.g., transactions) to avoid data loss, improve recovery time (e.g., through checkpointing), decouple components, and/or proactively "scrub" internal state (e.g., through reboots).
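
Admission control and adaptation can be combined in one small controller. The sketch below is a hypothetical illustration, not code from the paper: the class name `AdaptiveAdmissionController`, the latency target, and the step sizes are all assumptions. It rejects requests once a concurrency limit is reached, and it adjusts that limit with an additive-increase, multiplicative-decrease rule loosely modeled on TCP congestion control, using observed latency as the signal that the system is approaching saturation.

```python
class AdaptiveAdmissionController:
    """Hypothetical admission controller with an AIMD-adjusted limit."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=1000,
                 latency_target=0.1, increase=1, decrease_factor=0.5):
        self.limit = initial_limit            # max requests in flight
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target = latency_target  # seconds; assumed threshold
        self.increase = increase
        self.decrease_factor = decrease_factor
        self.in_flight = 0

    def try_admit(self):
        """Admit a request if below the current limit, otherwise reject it."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def complete(self, latency):
        """Record a finished request and adapt the limit to its latency."""
        self.in_flight -= 1
        if latency > self.latency_target:
            # Slow response: treat as a sign of saturation and back off sharply.
            self.limit = max(self.min_limit,
                             int(self.limit * self.decrease_factor))
        else:
            # Fast response: probe for more capacity one step at a time.
            self.limit = min(self.max_limit, self.limit + self.increase)


ctrl = AdaptiveAdmissionController(initial_limit=4)
admitted = [ctrl.try_admit() for _ in range(5)]  # fifth request is rejected
ctrl.complete(latency=0.05)                      # fast: limit grows to 5
limit_after_fast = ctrl.limit
ctrl.complete(latency=0.5)                       # slow: limit is halved to 2
limit_after_slow = ctrl.limit
```

Because the limit is discovered at run time rather than fixed at design time, this approach sidesteps the problem of identifying the saturation point in advance, at the cost of choosing a latency target instead. The rejection path is kept deliberately cheap, since rejecting load itself consumes resources.
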

Summaries may be used for non-commercial purposes only, provided the summary's author and origin are acknowledged. For all other uses, please contact us.