Robustness in Complex Systems
Steven D. Gribble
Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII)
The author observes that coupling among components in complex
systems makes those systems fragile: small perturbations can lead
to system-wide failure. He also presents design techniques that
can make systems less fragile.
- Most systems are "made robust" by precognition --
predicting operating conditions for the system and then architecting
the system to operate well in those conditions. "Any system that
attempts to gain robustness solely through precognition is prone to
fragility." In the author's experience with DDS (a cluster-based
distributed data structures service), small perturbations in the
operating environment often violated the design assumptions, which
in turn led to system-wide failure.
- DDS experience:
- Garbage collection: When DDS operated near saturation, slight
differences in load put pressure on the garbage collector of
certain bricks (storage nodes that participate in 2-phase-commit).
Once these nodes fell behind, more active objects were left in the
heap, and they experienced further performance degradation,
dragging down the performance of the entire storage group.
- Deterministic bug: Since the same code runs in all bricks, one
deterministic bug led to all bricks failing, violating the
independent failures assumption.
- Slow memory leak: When the system was launched, all bricks were
started at the same time. Given balanced load across all bricks
and a small memory leak, all bricks ran out of heap space at
nearly the same time. Failover exacerbated the situation by
increasing load (and the memory leak rate) of surviving bricks
once bricks started to run out of memory. The uniformity of
workload was a source of coupling among the bricks, and caused a
violation in the independent failures assumption.
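A back-of-the-envelope sketch of this coupling (all numbers are hypothetical) shows why identical start times and balanced load turn a small leak into correlated failure:

```python
def time_to_exhaustion(heap_mb, leak_mb_per_req, reqs_per_sec):
    """Seconds until the heap is exhausted by a per-request memory leak."""
    return heap_mb / (leak_mb_per_req * reqs_per_sec)

# Four bricks started simultaneously with perfectly balanced load:
bricks = [time_to_exhaustion(1024, 0.001, 50) for _ in range(4)]
print(bricks)  # identical lifetimes: failures are coupled, not independent

# After the first brick dies, the survivors absorb its load, so they
# leak faster and fail even sooner than they otherwise would have:
survivors = [time_to_exhaustion(1024, 0.001, 50 * 4 / 3) for _ in range(3)]
print(min(survivors) < min(bricks))  # True
```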
- Robustification techniques:
- Overprovisioning: Systems tend to be less stable when operating
near capacity. The difficult part of overprovisioning is that the
system's operating regime is hard to predict.
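Queueing theory makes the instability near capacity concrete: in an M/M/1 model the mean response time is 1/(mu - lambda), which grows without bound as utilization approaches 1. A small sketch (service rate normalized to 1):

```python
def mm1_response_time(utilization, service_rate=1.0):
    """Mean M/M/1 response time W = 1 / (mu - lambda); it blows up
    as utilization approaches 1."""
    assert 0 <= utilization < 1
    arrival_rate = utilization * service_rate
    return 1.0 / (service_rate - arrival_rate)

for u in (0.5, 0.9, 0.99):
    print(u, mm1_response_time(u))  # ~2.0, ~10.0, ~100.0
```

Going from 50% to 99% utilization costs roughly 50x in latency, which is why a system sized with headroom behaves so much more predictably.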
- Admission control: Reject load to avoid near-capacity
operation. The difficult part of admission control is being able
to identify the saturation point. Further, it is important to
remember that rejecting load also consumes some resources (think
of it as a minimal service of the system).
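A minimal admission-control sketch, where a hypothetical in-flight-request threshold stands in for the (hard-to-identify) saturation point; note that even the rejection path does a little work:

```python
import threading

class AdmissionController:
    """Reject new requests once in-flight work nears the saturation
    point. max_in_flight is a hypothetical, empirically tuned threshold."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self):
        # Rejection is not free: we still take the lock and run this
        # check, a "minimal service" the system always provides.
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self.lock:
            self.in_flight -= 1

ctrl = AdmissionController(max_in_flight=2)
print(ctrl.try_admit())  # True
print(ctrl.try_admit())  # True
print(ctrl.try_admit())  # False: at capacity, load is shed
```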
- Introspection and adaptation: Create systems that can detect
aberrant behavior and adapt control variables dynamically to keep
the system operating in a stable regime (e.g., TCP congestion
control).
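TCP's congestion control is the canonical example of this pattern: additive-increase/multiplicative-decrease gently probes for capacity and backs off sharply when the system signals trouble. A toy version:

```python
def aimd_update(window, congested, increase=1, decrease_factor=0.5):
    """Additive increase, multiplicative decrease: probe for capacity
    slowly, back off sharply on a congestion signal."""
    if congested:
        return max(1, window * decrease_factor)
    return window + increase

window = 8
for congestion_signal in (False, False, True, False):
    window = aimd_update(window, congestion_signal)
print(window)  # 8 -> 9 -> 10 -> 5.0 -> 6.0
```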
- Plan for failure: Use robust abstractions (e.g., transactions)
to avoid data loss, improve recovery time (e.g., through
checkpointing), decouple components, and/or proactively "scrub"
internal state (e.g., through reboots).
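A minimal checkpointing sketch (the file name is hypothetical): write state atomically so a crash mid-checkpoint cannot corrupt it, and reload the last checkpoint on recovery so restart time stays bounded:

```python
import json
import os
import tempfile

STATE_FILE = "brick_state.json"  # hypothetical checkpoint path

def checkpoint(state, path):
    """Write the checkpoint to a temp file, then rename it into place;
    os.replace is atomic, so a crash mid-write cannot corrupt the file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def recover(path, default):
    """On restart (including a proactive state-scrubbing reboot), resume
    from the last checkpoint instead of replaying the full history."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default

checkpoint({"committed_txns": 42}, STATE_FILE)
print(recover(STATE_FILE, {"committed_txns": 0}))  # {'committed_txns': 42}
os.remove(STATE_FILE)  # cleanup for this demo
```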