Chapter 11 Coulouris - Coordination and Agreement

Assumptions:

* reliable channels – messages are eventually delivered to their intended recipients, even in the face of router failure

Failure Detectors

- can do either by querying a process to see if it is still responding

- or, a local failure detector can report when a node has failed

- failure detectors can be unreliable/imperfect: they report a process is either suspect or unsuspected of failure. it works simply by sending a message to a process and waiting a timeout period to see if it responds. what is the right timeout value? (especially for an asynchronous system?) if a process respond, just set the timeout period to be the delay time experienced.

- a reliable failure detector reports either unsuspected or failure

Distributed mutual exclusion: multiple processes would like to enter a critical section, and we only want one of the distributed processes to be in a critical section at a given time.

Central server algorithm: processes wanting the enter their critical sections send a message to the server, and the server only gives the lock to one client process. it does not reply to other processes also wanting the enter the critical section until after the first client process is done with the lock. it then gives the lock to the next client process waiting for it.

Ring based algorithm: arrange processes in a ring, and pass around a token; only a process with the token can enter the critical section. If a process wants to enter a critical section, it has to wait until it obtains the token. it holds onto the token until it is done executing the critical section. in the case that no process wants to enter a critical section, the token just keeps getting passed around the members of the ring.

Multicast algorithm: If a process wants to enter a critical section, it broadcasts an “entry” request, and every process is in the state RELEASED, HELD, or WANTED. If every other process responds RELEASED, a process can enter the critical section. If a process is currently in the HELD state (in its critical section), it should not respond until it changes to the RELEASED state.

Maekawa voting algorithm: don’t need to get a RELEASED response from all other processes.... just need to get positive (RELEASED) votes back from a subset where the subsets used by any two processes overlap. Maekawa calculated the optimal size for the subsets.

Fault-tolerance: what is the impact of lost messages and process crashes to the algorithms above?

Election algorithms: designed to answer questions like “who should be the central server in the central server mutual exclusion algorithm above?”

Ring-based election: up to three rounds: round 1 every process send max(received pid, its own id) around the ring, round 2 process with highest pid gets his own pid back and elects himself, round 3 elected process id goes around the ring. if a process dies in the ring, the game is over

Bully algorithm: (due to Hector) deals gracefully with failures. three types of messages: election, answer, coordinator.

[ INCOMPLETE ]