Effective Distributed Scheduling of Parallel Workloads
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler
One-line summary:
Implicit scheduling allows each local scheduler in a distributed system to
make independent decisions that have the bulk effect of coordinating the
scheduling of cooperating processes across processors; the authors show
that implicit scheduling performs nearly as well as coscheduling without
requiring explicit global coordination.
Overview/Main Points
- coscheduling badness:
- hard to design and implement
- ignores the needs of mixed workloads (I/O-intensive or
interactive jobs)
- busy-waiting during I/O, which is cycle-wasteful
- their processing model: bulk-synchronous SPMD
- phases of purely local computation, separated by
barriers
- within a barrier, some cross-processor communication
occurs
- they vary the mean computation time g and the variation in
computation time v (i.e. the imbalance of jobs
across processors)
- examine 3 communication patterns: barrier (no
communication), news (a grid communication pattern;
each process communicates with 4 neighbours), and
transpose (P read phases; on the ith read, process p
reads data from process (p+i) mod P; see the sketch
below).
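A minimal sketch (my illustration, not the paper's code) of the transpose
pattern's partner computation; P and the example process id are arbitrary
values:

    #include <stdio.h>

    #define P 8            /* number of processes; illustrative value */

    int main(void) {
        int p = 3;         /* example process id */
        /* transpose: P read phases; on the ith phase, process p
           reads its data from process (p + i) mod P */
        for (int i = 0; i < P; i++)
            printf("phase %d: process %d reads from process %d\n",
                   i, p, (p + i) % P);
        return 0;
    }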
- implicit scheduling
- communicating processes are dynamically identified and
coordinated via two-phase blocking: a waiting
process spins for some predetermined time; if a
response is received it continues executing, and if
not, the process blocks and yields the processor. The
secret sauce is in knowing how long to spin before
blocking (a sketch follows this item).
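A minimal sketch of two-phase (spin-then-block) waiting, assuming a
POSIX-threads setting; the paper operates at the kernel scheduler level,
and all names here are my own:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <time.h>

    static atomic_int reply_arrived;   /* set when the awaited message lands */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Phase 1: spin for spin_secs, hoping the reply arrives while we
       still hold the processor.  Phase 2: give up and block until the
       communication layer signals us. */
    void wait_for_reply(double spin_secs) {
        double deadline = now() + spin_secs;
        while (now() < deadline)
            if (atomic_load(&reply_arrived))
                return;                      /* reply came while spinning */
        pthread_mutex_lock(&m);              /* spin budget exhausted */
        while (!atomic_load(&reply_arrived))
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    static void *sender(void *arg) {
        (void)arg;
        struct timespec d = { 0, 5 * 1000 * 1000 };   /* reply after ~5 ms */
        nanosleep(&d, NULL);
        pthread_mutex_lock(&m);
        atomic_store(&reply_arrived, 1);
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);
        wait_for_reply(0.001);               /* spin up to 1 ms, then block */
        pthread_join(t, NULL);
        return 0;
    }

The static policies below amount to setting spin_secs to one or two
context-switch times.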
- immediate blocking: bad vs. coscheduling. For
coarse-grained jobs (large g), it does OK. For
fine-grained jobs (small g), it does pitifully
because of large context-switch and idle-processor
time. Up to a factor of 14 worse than coscheduling for
the transpose pattern.
- static blocking:
- spin time = context-switch time: does much
better, about a factor of 3-5 slowdown from
coscheduling. Much time is now spent
spinning at the barrier, little time idle or
context-switching. This counts on the fact
that a process is woken with high priority by
the scheduler when a message arrives for it.
- spin time = 2 x context-switch time: does
significantly better, about a factor of 1-1.2
slowdown. They argue that a skew of 2 x
context-switch time is introduced by distributed
scheduling when returning from barriers, and
that waiting at least this long smooths over
scheduling irregularities. One bad case: if the
variation is higher than 2 x CS, you get
spikes of bad performance because processes end
up not waiting long enough.
- Adaptive algos:
- Load-imbalance oracle: there is a
minimum spin time S that ensures
coordinated processes remain coordinated, and
a load imbalance V past which it is
more beneficial to block at the barrier than to
spin until the barrier completes. We know S is
at least 2 x CS. Some simple cost analysis
shows that V is 10 x CS. For the oracle
(perfect knowledge of imbalance), they get to
within a factor of 1.2 of coscheduling (see the
sketch after this item).
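How the two thresholds might combine into a spin policy; a sketch under
my own reading of the paper, with S, V, and all names my assumptions
rather than the authors' code:

    /* Decide how long a process should spin at a barrier, given the
       context-switch cost cs and the (oracle-supplied) load imbalance. */
    double choose_spin_time(double cs, double imbalance) {
        double S = 2.0 * cs;   /* minimum spin that keeps processes coordinated */
        double V = 10.0 * cs;  /* imbalance beyond which blocking is cheaper */
        if (imbalance <= V)
            return imbalance > S ? imbalance : S;  /* spin through the barrier */
        return S;              /* spin the minimum, then block */
    }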
- Load-imbalance approx: can approximate
knowledge of the load imbalance by measuring
wait times and removing outliers due to
scheduling irregularities (e.g. an interactive
job got scheduled). Does worse than the oracle
for fine-grained programs because the
approximation of the load imbalance is worse than
the actual value, causing processes not to
spin-wait long enough. The approximation is low
because some valid wait times are thrown out
along with the outliers (see the sketch after
this item).
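A rough sketch of such an estimator; the 10% outlier cutoff and all
names are my assumptions, not the paper's:

    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Estimate load imbalance from measured barrier wait times by
       dropping the largest samples as scheduling outliers and
       averaging the rest. */
    double estimate_imbalance(double *waits, int n) {
        qsort(waits, n, sizeof *waits, cmp);
        int keep = n - n / 10;          /* drop the top 10%: an assumption */
        double sum = 0.0;
        for (int i = 0; i < keep; i++)
            sum += waits[i];
        return keep > 0 ? sum / keep : 0.0;
    }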
- Global approx: their barrier
implementation assumes a barrier server which
does a global notification on barrier
completion. That barrier server can
explicitly calculate the load imbalance and
provide that figure along with barrier
completion notifications. Does pretty much
the same as the oracle (sketch below).
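A sketch of how such a barrier server could measure imbalance from
arrival times and piggyback it on the completion notification; the
struct and function names are mine:

    typedef struct {
        int arrived;            /* processes that have reached the barrier */
        int nprocs;             /* total participants */
        double first_arrival;   /* earliest arrival time this episode */
        double last_arrival;    /* latest arrival time this episode */
    } barrier_t;

    /* Record one arrival.  Returns -1 while the barrier is incomplete;
       on the last arrival, returns the measured load imbalance (latest
       minus earliest arrival), to be sent to all waiters along with
       the completion notification. */
    double barrier_arrive(barrier_t *b, double t_arrival) {
        if (b->arrived++ == 0)
            b->first_arrival = t_arrival;
        b->last_arrival = t_arrival;
        if (b->arrived < b->nprocs)
            return -1.0;
        b->arrived = 0;                  /* reset for the next barrier */
        return b->last_arrival - b->first_arrival;
    }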
- Sensitivity to the scheduler
- If timers in the local schedulers are synchronized across
processors, the adaptive blocking algorithm gets better
(from 1.3x to 1.15x worse than coscheduling).
- If round-robin scheduling is used (instead of priority
scheduling, where processes receiving messages get an
immediate boost and are scheduled), performance dives to
3.4x worse than coscheduling.
Relevance
Excellent way to eliminate complexity when coscheduling parallel jobs.
Implicit information is clearly cheap (free!) to obtain and nearly always
right. Sounds a heck of a lot like the BASE philosophy. We should
remember this and talk about it when we chat with Steve McCanne about
soft-state protocols.
Flaws
- Dependence on a priority scheduler that immediately schedules a
process after it receives a message is a little fishy, but I guess
I can live with it.
- Seems extremely sensitive to perturbations in the system - what
if very heavy non-parallel-job workloads are on the system? What
if the communication latencies across processors are high and
variable? These things would blow the validity of implicit
information.