High-Performance Sorting on Networks of Workstations
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler,
Joseph M. Hellerstein, David A. Patterson
Summary by: Armando Fox and Steve Gribble
Single- and multi-node parallel sorting on NOW point up both NOW's
advantages and some (SPARC-specific?) weaknesses, yet still hold the
current sorting record.
- Why NOW: performance isolation (analysis of behavior
  node-by-node, factor-by-factor: CPU, disk, etc.), incremental
  scalability.
- NOW-sort: Split-C (split-phase transactions using Active Messages),
  uniform key distribution (as in previous work). Also as in previous
  work, performance is expected to be dominated by I/O, not by
  in-core sorting.
- New tools developed:
- Diskconf: measures the achieved throughput of each disk,
  to determine the ratio of stripe sizes across them.
- Memconf: measures how much memory is available at runtime,
  after accounting for the OS, buffers, daemons, etc.
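The diskconf idea can be sketched in a few lines: time a sequential
read on each disk and stripe data proportionally to measured
bandwidth. This is an illustrative Python sketch, not the authors'
tool; the paths, sizes, and function names are all hypothetical.

```python
import os
import time

def measure_throughput(path, nbytes=64 * 1024 * 1024, block=1024 * 1024):
    """Time a sequential read of nbytes from path; return MB/s.
    (Hypothetical stand-in for diskconf's measurement loop.)"""
    start = time.perf_counter()
    read = 0
    with open(path, "rb", buffering=0) as f:
        while read < nbytes:
            chunk = f.read(min(block, nbytes - read))
            if not chunk:
                break
            read += len(chunk)
    elapsed = time.perf_counter() - start
    return read / (1024 * 1024) / elapsed

def stripe_ratios(throughputs):
    """Stripe each disk proportionally to its measured bandwidth,
    so faster disks get proportionally larger stripes."""
    total = sum(throughputs)
    return [t / total for t in throughputs]
```

For example, a disk measured at 30MB/s paired with one at 10MB/s
would get 75% and 25% of each stripe, respectively.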
- Buffer management:
- read() does its own (bad) buffer management;
- so does mmap;
- but mmap plus madvise works OK (but see the merge-phase
  prefetching caveat below).
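The mmap-plus-madvise pattern looks roughly like this in modern
Python (a sketch, not the paper's C code; `MADV_SEQUENTIAL` is
Unix-specific, and `mmap.madvise` requires Python 3.8+):

```python
import mmap
import os

def map_input(path):
    """Map a file read-only and hint sequential access, so the
    kernel prefetches ahead of the reader instead of doing its
    own (bad) demand-paged buffer management."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        m = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping stays valid after the fd closes
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)
    return m
```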
- Core sorting algorithms:
- Bucket+Quicksort: group into buckets on high 32 bits,
common case needs to compare only those 32 bits, so a
constant factor better than Quicksort. Buckets sized to
fit in L2 cache (512KB for Ultras).
- Bucket+Radix Sort: best performance; linear in the data size.
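The bucketing step common to both algorithms can be sketched as
follows. This is illustrative Python, not the authors' C/Split-C
code: the bucket count is just a parameter here (the paper sizes
buckets to fit the 512KB L2 cache), and Python's built-in sort
stands in for the per-bucket quicksort or radix sort.

```python
def bucket_sort(keys, bucket_bits=8, key_bits=64):
    """Partition keys into 2**bucket_bits buckets by their high
    bits, then sort each bucket independently. Buckets are already
    ordered relative to one another by those high bits, so the
    common case compares keys only within a bucket."""
    shift = key_bits - bucket_bits
    buckets = [[] for _ in range(1 << bucket_bits)]
    for k in keys:
        buckets[k >> shift].append(k)
    out = []
    for b in buckets:
        b.sort()       # stand-in for per-bucket quicksort/radix sort
        out.extend(b)
    return out
```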
- One-pass parallel sort:
- Buckets spread across nodes; any processor can deliver
keys to any node for bucketing.
- Each processor then sorts its local keys, and then gathers
and writes its records to local disk.
- Synchronous: each of the above phases done serially.
- Interleaved: single thread alternates between reading and
communicating during key bucketing. Allows some overlap
of reading and communicating.
- Threaded: reader and communicator threads are separate.
This gave best performance due to finer-grained thread
scheduling (so better overlap of disk and network).
- Linear scaling, though remote process startup is costly.
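The threaded variant above can be sketched with a bounded queue
between a reader thread and a communicator. This is a minimal
illustration, not the paper's implementation: `read_chunks` stands
in for disk reads and `deliver` for an Active Messages send to the
destination node's bucket.

```python
import queue
import threading

def one_pass_distribute(read_chunks, deliver, depth=4):
    """Overlap reading with communication: a reader thread pulls
    key chunks 'from disk' while the main thread delivers each key
    toward its destination bucket/node. The bounded queue lets the
    reader run only a few chunks ahead."""
    q = queue.Queue(maxsize=depth)

    def reader():
        for chunk in read_chunks:
            q.put(chunk)
        q.put(None)  # end-of-input sentinel

    t = threading.Thread(target=reader)
    t.start()
    while (chunk := q.get()) is not None:
        for key in chunk:
            deliver(key)
    t.join()
```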
- Two-pass, single-node:
- Create sorted sub-runs on disk, using reader and writer threads.
- Merge sorted sub-runs, using reader, merger, and writer threads.
- Note: when reading from multiple disk streams in the above
  step, mmap() does not prefetch enough to amortize seek costs, so
  the merge-phase code explicitly manages prefetching with
  multiple threads and buffers--ugh.
- Pipelining of phases performs much better than synchronous
  phases, regardless of the number of disks used; but since disk
  reads and writes both occur nearly all the time in the pipelined
  version, each disk must be dedicated to either reading or
  writing (not both) to get the best performance.
- Tradeoff between size of sub-runs created (don't want a
single one to be larger than physical memory) and number
of runs (merge phase cost is linear in number of runs).
Authors found that tradeoff doesn't significantly affect
performance until you hit the "memory wall".
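The two-pass structure (create runs, then merge) can be sketched
as a standard external sort. This is an illustrative Python
version, not the paper's code: runs are text files here, Python's
`heapq.merge` stands in for the merger thread, and there is no
explicit prefetching.

```python
import heapq
import os
import tempfile

def external_sort(keys, run_size):
    """Pass 1: write sorted sub-runs of at most run_size keys
    (a run must fit in 'memory'). Pass 2: merge all runs at once;
    merge cost grows with the number of runs, matching the
    tradeoff described above."""
    run_files = []
    for i in range(0, len(keys), run_size):
        run = sorted(keys[i:i + run_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.writelines(f"{k}\n" for k in run)
        f.close()
        run_files.append(f.name)

    def read_run(path):
        with open(path) as fh:
            for line in fh:
                yield int(line)

    merged = list(heapq.merge(*(read_run(p) for p in run_files)))
    for p in run_files:
        os.unlink(p)
    return merged
```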
- Two-pass parallel sort:
- Like two-pass single-node, but buckets are distributed across
  nodes.
- Reader, communicator, and writer threads, to exploit overlap.
- Parallel version on one CPU is slower than the single-node
  version, since all phases must be synchronous in the parallel
  version.
- Since authors' machines are memory-starved, need two
passes. (Previous MinuteSort algorithms worked in a single
pass.) Result: authors need to do twice as much I/O in
the same amount of time, in order to beat standing record.
- Observations from optimizing the various sorts:
- UltraSPARC internal SCSI saturates w/more than one disk.
- mmap() and madvise() allow complete overlapping of key
bucketing with reading more keys from disk.
- SBUS I/O bandwidth is severely limited--advertised at 80MB/s,
  but the authors saw only about 36MB/s. This means a workstation
  can't effectively use 2 disks and the network at the same time!
- (Drastic proposal: communication facilities that use the
  memory bus directly, without coherence.)
- Need OS facilities that do what diskconf and memconf do.
NOW beats enterprise machines for sorting (typically something that
SMPs are better at because of their tightly integrated I/O), even
though NOW's internal I/O buses are too slow and its nodes don't
have enough RAM.
- Flaw: this is a benchmark. May not be representative of typical
sorting behavior, since much optimization was done to get the
benchmark to win.
- How hard is it to write such high-performance code in practice?
  Should programmers even have to deal with things like memconf
  and diskconf, even if the OS provides them?
- mmap() prefetching and similar behaviors should be more
  adjustable, to account for situations such as amortizing seek
  costs when reading from multiple sequential streams.