Effective Distributed Scheduling of Parallel Workloads
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler
One-line summary:
Implicit scheduling allows each local scheduler in a distributed system to
make independent decisions that have the bulk effect of coordinating the
scheduling of cooperating processes across processors; the authors show
that implicit scheduling performs nearly as well as coscheduling without
requiring explicit global coordination.
Overview/Main Points
- coscheduling badness:
- hard to design and implement
- ignores the needs of mixed workloads (I/O-intensive or
interactive jobs)
- busy-waiting during I/O, which is cycle-wasteful
- their processing model: bulk-synchronous SPMD
- phases of purely local computation, separated by
barriers
- within a barrier, some cross-processor communication
occurs
- they vary the mean computation time g and the variation in
computation time v (i.e. the imbalance of jobs
across processors)
- examine 3 communication patterns: barrier (no
communication), news (a grid communication pattern;
each process communicates with 4 neighbours), and
transpose (P read phases; on the ith read, process p
reads data from process (p+i) mod P; see the sketch
below).
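A minimal sketch (my illustration, not the paper's code) of the transpose
pattern's partner computation; P and the example process id are arbitrary
values:

    #include <stdio.h>

    #define P 8            /* number of processes; illustrative value */

    int main(void) {
        int p = 3;         /* example process id */
        /* transpose: P read phases; on the ith phase, process p
           reads its data from process (p + i) mod P */
        for (int i = 0; i < P; i++)
            printf("phase %d: process %d reads from process %d\n",
                   i, p, (p + i) % P);
        return 0;
    }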
- implicit scheduling
- communicating processes are dynamically identified and
coordinated via two-phase blocking: a waiting
process spins for some predetermined time; if a
response is received it continues executing, and if
not, the process blocks and yields the processor. The
secret sauce is in knowing how long to spin before
blocking (a sketch follows this item).
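A minimal sketch of two-phase (spin-then-block) waiting, assuming a
POSIX-threads setting; the paper operates at the kernel scheduler level,
and all names here are my own:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <time.h>

    static atomic_int reply_arrived;   /* set when the awaited message lands */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Phase 1: spin for spin_secs, hoping the reply arrives while we
       still hold the processor.  Phase 2: give up and block until the
       communication layer signals us. */
    void wait_for_reply(double spin_secs) {
        double deadline = now() + spin_secs;
        while (now() < deadline)
            if (atomic_load(&reply_arrived))
                return;                      /* reply came while spinning */
        pthread_mutex_lock(&m);              /* spin budget exhausted */
        while (!atomic_load(&reply_arrived))
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    static void *sender(void *arg) {
        (void)arg;
        struct timespec d = { 0, 5 * 1000 * 1000 };   /* reply after ~5 ms */
        nanosleep(&d, NULL);
        pthread_mutex_lock(&m);
        atomic_store(&reply_arrived, 1);
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);
        wait_for_reply(0.001);               /* spin up to 1 ms, then block */
        pthread_join(t, NULL);
        return 0;
    }

The static policies below amount to setting spin_secs to one or two
context-switch times.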
- immediate blocking: bad vs. coscheduling. For
coarse-grained jobs (large g), it does OK. For
fine-grained jobs (small g), it does pitifully
because of large context-switch and idle-processor
time. Up to a factor of 14 worse than coscheduling for
the transpose pattern.
- static blocking:
- spin time = context-switch time: does much
better, about a factor of 3-5 slowdown from
coscheduling. Much time is now spent
spinning at the barrier, little time idle or
context-switching. This counts on the fact
that a process is woken with high priority by
the scheduler when a message arrives for it.
- spin time = 2 x context-switch time: does
significantly better, about a factor of 1-1.2
slowdown. They argue that a skew of 2 x
context-switch time is introduced by distributed
scheduling when returning from barriers, and
that waiting at least this long smooths over
scheduling irregularities. One bad case: if the
variation is higher than 2 x CS, you get
spikes of bad performance because processes end
up not waiting long enough.
- Adaptive algos:
- Load-imbalance oracle: there is a
minimum spin time S that ensures
coordinated processes remain coordinated, and
a load imbalance V past which it is
more beneficial to block at the barrier than to
spin until the barrier completes. We know S is
at least 2 x CS. Some simple cost analysis
shows that V is 10 x CS. For the oracle
(perfect knowledge of imbalance), they get to
within a factor of 1.2 of coscheduling (see the
sketch after this item).
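How the two thresholds might combine into a spin policy; a sketch under
my own reading of the paper, with S, V, and all names my assumptions
rather than the authors' code:

    /* Decide how long a process should spin at a barrier, given the
       context-switch cost cs and the (oracle-supplied) load imbalance. */
    double choose_spin_time(double cs, double imbalance) {
        double S = 2.0 * cs;   /* minimum spin that keeps processes coordinated */
        double V = 10.0 * cs;  /* imbalance beyond which blocking is cheaper */
        if (imbalance <= V)
            return imbalance > S ? imbalance : S;  /* spin through the barrier */
        return S;              /* spin the minimum, then block */
    }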
- Load-imbalance approx: can approximate
knowledge of the load imbalance by measuring
wait times and removing outliers due to
scheduling irregularities (e.g. an interactive
job got scheduled). Does worse than the oracle
for fine-grained programs because the
approximation of the load imbalance is worse than
the actual value, causing processes not to
spin-wait long enough. The approximation is low
because some valid wait times are thrown out
along with the outliers (see the sketch after
this item).
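A rough sketch of such an estimator; the 10% outlier cutoff and all
names are my assumptions, not the paper's:

    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Estimate load imbalance from measured barrier wait times by
       dropping the largest samples as scheduling outliers and
       averaging the rest. */
    double estimate_imbalance(double *waits, int n) {
        qsort(waits, n, sizeof *waits, cmp);
        int keep = n - n / 10;          /* drop the top 10%: an assumption */
        double sum = 0.0;
        for (int i = 0; i < keep; i++)
            sum += waits[i];
        return keep > 0 ? sum / keep : 0.0;
    }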
- Global approx: their barrier
implementation assumes a barrier server which
does a global notification on barrier
completion. That barrier server can
explicitly calculate the load imbalance and
provide that figure along with barrier
completion notifications. Does pretty much
the same as the oracle (sketch below).
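A sketch of how such a barrier server could measure imbalance from
arrival times and piggyback it on the completion notification; the
struct and function names are mine:

    typedef struct {
        int arrived;            /* processes that have reached the barrier */
        int nprocs;             /* total participants */
        double first_arrival;   /* earliest arrival time this episode */
        double last_arrival;    /* latest arrival time this episode */
    } barrier_t;

    /* Record one arrival.  Returns -1 while the barrier is incomplete;
       on the last arrival, returns the measured load imbalance (latest
       minus earliest arrival), to be sent to all waiters along with
       the completion notification. */
    double barrier_arrive(barrier_t *b, double t_arrival) {
        if (b->arrived++ == 0)
            b->first_arrival = t_arrival;
        b->last_arrival = t_arrival;
        if (b->arrived < b->nprocs)
            return -1.0;
        b->arrived = 0;                  /* reset for the next barrier */
        return b->last_arrival - b->first_arrival;
    }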
- Sensitivity to the scheduler
- If timers in the local schedulers are synchronized across
processors, the adaptive blocking algorithm gets better
(from 1.3x to 1.15x worse than coscheduling).
- If round-robin scheduling is used (instead of priority
scheduling, where processes receiving messages get an
immediate boost and are scheduled), performance dives to
3.4x worse than coscheduling.
Relevance
Excellent way to eliminate complexity when coscheduling parallel jobs.
Implicit information is clearly cheap (free!) to obtain and nearly always
right. Sounds a heck of a lot like the BASE philosophy. We should
remember this and talk about it when we chat with Steve McCanne about
soft-state protocols.
Flaws
- Dependence on a priority scheduler that immediately schedules a
process after it receives a message is a little fishy, but I guess
I can live with it.
- Seems extremely sensitive to perturbations in the system - what
if very heavy non-parallel-job workloads are on the system? What
if the communication latencies across processors are high and
variable? These things would blow the validity of implicit
information.