Back to index
The Synergy Between Non-Blocking Synchronization and Operating System Structure
Michael Greenwald and David Cheriton, Stanford
One-line summary:
Use CAS-2 (double compare and swap, ideally in hardware) instead of blocking
synchronization to safely and efficiently update concurrent data
structures. One CAS is the update itself, the other CAS is a version
number that protects the entire data structure, to detect
inconsistency.
Overview/Main Points
Abstractions of interest:
- Type-specific memory: if T* is initially a pointer to T, it is
guaranteed that if T* is non-null, it points to a T and not some
other type of thing, i.e. if non-null it is always safe and
semantically meaningful to dereference it as a T. The property
persists over a time Tstable
(paper assumes Tstable == infinite).
- C++ "new" and "delete" support this at source level
- more efficient implementation of allocation (free lists of
fixed-size objects)
- Double compare and swap (DCAS or CAS2) atomically performs
if (*t1==v1 && *t2==v2) { *t1=u1; *t2=u2; } where t1 and
t2 are TSM
- Minimizing window of inconsistency during atomic update (read
all, recompute new, atomically conditional-write new): authors
use a version
number that covers the entire data structure (e.g. the queue)
Using DCAS and version numbers for shared data structures and signal delivery:
- Rather than modifying a copy of the data structure and atomically
replacing a pointer, use a version number that covers the whole
data structure. If the version number has changed while the new
data was being computed, the atomic update will fail and the
operation can be retried.
- Optimization: some data structures allow descriptor (entry) to be
removed, modified, then reinserted, if the remove and reinsert
are each atomic. This poses a potential problem in case e.g. a
signal arrives for a thread while the thread's descriptor is not
in the queue (window of inconsistency). In Cache Kernel,
signal delivery is best-effort since authors claim you
need higher-level timeout/retry for distributed programs anyway.
- Possible deadlock scenario: signal handler wanting to modify
descriptor D is called on the stack of a thread that has just
removed (but not yet replaced) D; deadlock results. I.e. must be
careful in how retry is implemented.
Comparison to other synchronization (Proteus simulations):
- With low preemption (# threads =~ # procs), comparable to spin
locks and way better than single-CAS implementation (which requires
substantially larger number of instructions to implement).
- With high preemption, somewhat better than single-CAS and way
better than spin-locks (which suffer from priority inversion due
to preemption).
- Simulations modeled a degree of contention atypical of a
well-designed system; authors claim this shows their technique
works well under stress.
- Synthesis Kernel used similar techniques, but doesn't
generalize because the particular in-kernel data structures had
constrained access patterns; e.g. a single toggle bit to protect
a data structure rather than a version number.
- Transactional Memory is a (heavyweight) generalization to n-way
CAS. Authors'
contribution is to demonstrate that 2-way is enough to improve OS
structure using NBS.
Implementation:
- RISC hardware: LLP (load linked pipelined) and SCP (store
cond. pipelined), which "nest" inside LL and SC (both in
pipelining sense and semantic sense.) I.e. LLP is semantically
linked to preceding LL, and SCP is semantically linked to
subsequent SC. Allows DCAS to be implemented with a nested
LL/LLP/SCP/SC pair. Authors claim to have worked out design for
R4000-like pipeline, but omitted for brevity.
- CISC hardware: M68040 implements CAS2, but it's slow.
- Software: Bershad 1993: use single global OS lock to implement
DCAS sequences; if process holding lock is delayed, OS rolls back
the DCAS instructions performed so far, and releases lock.
Problems: contention on single global lock; multiple memory
accesses required in common case; difficult to prove correct if
rollback must undo memory accesses to paged-out or illegal region.
Relevance
A potentially high-performance and robust (can't die while holding a
lock!) way to increase concurrency of critical OS data structures, and
possibly enable efficient implementation of OS services such as signal
delivery, based on a neat hack.
Extension: for more constrained cases, can use a single bit (or
2-bit) version number by exploiting the fact (with proper compiler
support) that addresses are usually word-aligned.
Flaws
- Code is not easy to get right for mere mortals (e.g. signal
handler deadlock scenario above), though authors
claim that "common cases" could be encapsulated in C++ classes
- Proposed LLP/SCP seems hard to implement in hardware, even for
hairy CPU's, and software solutions don't have good enough
performance to make NBS much better than blocking synchronization
Back to index