Practical Considerations for Non-Blocking Concurrent Objects

Bershad

 

Non-blocking object: can be accessed by many threads at a time; update protocols use atomic compare-and-swap to guarantee consistency instead of critical sections.

 

Non-blocking compare-and-swap:

Can be built on top of test-and-set or atomic exchange

 

Non-blocking concurrent objects are good for parallel processing:

Processes are not forced to queue on locks, so problems such as convoys, priority inversion, and deadlock can be avoided.  (A convoy occurs when a process holding a lock is descheduled, preventing every other process that needs that lock from making forward progress.  Priority inversion occurs when a low-priority thread holds a lock that a high-priority thread needs, but a medium-priority thread preempts the low-priority thread, so the high-priority thread cannot make progress.)

 

Two main problems that need to be solved by implementers of non-blocking concurrent objects:

1)     Few processors (at the time) supported the compare-and-swap instruction needed to implement these objects.

2)     Performance of non-blocking objects degrades in the presence of contention from many concurrent threads.

 

Software compare-and-swap:  (NOT NON-BLOCKING: it can block on acquire_lock(), and the OS can deschedule a thread inside the critical section while the lock is held!)

int Compare_And_Swap (int *address, int old_value, int new_value) {

1:         acquire_lock();                   // BEGIN CRITICAL SECTION

2:         if (*address == old_value) {

3:                 *address = new_value;

4:                 release_lock();           // END CRITICAL SECTION

5:                 return SUCCESS;

6:         }

           else {

7:                 release_lock();           // END CRITICAL SECTION

8:                 return FAILURE;

9:         }

}

 

Bershad solves the problem with roll out: the OS can preempt a thread while it is in the critical section, but the lock is released upon preemption.  If the thread has not yet executed the swap, execution resumes at the beginning (line 1).  If line 3 has already executed, the code is rolled out (run to completion) through line 5.  (Assumption: interrupts are precise—preemption only happens between instructions, not during them.)

 

Roll forward is another alternative.  In roll forward, at the point of preemption, the Compare and Swap is allowed to complete execution.  The disadvantage is that the Compare and Swap completes execution in a different thread.  If the thread getting swapped in is a kernel thread, memory protection needs to be ensured so that kernel data structures are not corrupted.  Also, if the thread stopped due to a page fault, the page fault may need to be resolved before execution can continue.

 

Advantage of software-based Compare-And-Swap (CAS): a CAS-N, in which many values are compared and swapped at once, can be implemented relatively easily, assuming the appropriate roll-out cases can be handled.

 

Two potential disadvantages of roll out: 1) if a processor fails while inside the CAS, the lock may never be released (but most multiprocessors aren’t designed to survive processor failures anyway); 2) pathological scheduling could always preempt the CAS before line 3 and always restart the operation at line 1, so no forward progress is ever made.  Roll forward would be more appropriate in this case.

 

 

Atomic operations in multiprocessors require at least one bus transaction: the processor puts a signal on the bus that 1) prevents other processors from performing atomic operations until the signal is lifted, and 2) potentially invalidates data in other processors’ local caches.

 

 

The trend was that the cycle cost of atomic operations was growing significantly relative to an average “add” instruction.  Hence it becomes critical to reduce the frequency of synchronization, and non-blocking objects will incur relatively more overhead on faster processors.

 

 

 

Using Locking Strategies to implement acquire_lock() in CAS:

 

1)     Spinlocks perform poorly.
Performance degrades quickly as more processors execute concurrent operations on the same lock.

2)     Software queueing is inappropriate for CAS.
Thread 2 waits until Thread 1 releases the lock, Thread 3 waits until Thread 2 releases it, and in general thread n+1 waits on thread n.  The waiters are queued so that the (i+1)st thread gets the lock immediately after the ith thread.  However, performance still degrades quickly with the number of processors: if thread i gets the lock and performs its compare-and-swap, the compare-and-swaps of all n−i threads queued behind it will fail (their old_values are now out of date).  Hence, for n queued threads each taking t cycles, only one successful compare-and-swap takes place every nt cycles.

3)     Have CAS fail sometimes, but reduce the frequency of the failures.
Have each thread poll old_value before attempting the acquire_lock() operation.  If old_value has already changed, the operation fails immediately (Compare-and-Compare-and-Swap).  This way, a thread does not pay the price of acquiring the lock only to fail.  This technique reduces the drop in throughput as more processors are added.

 

Conclusions:

 

High-throughput non-blocking concurrent objects are good and necessary as the cost of synchronization instructions on processors increases.  The paper shows how to implement a Compare-and-Compare-and-Swap (CACAS) primitive that can sometimes fail, but maintains high throughput in the face of contention.