High-Performance Sorting on Networks of Workstations
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler,
Joseph M. Hellerstein, David A. Patterson
Summary by: Armando Fox and Steve Gribble
Single- and multi-node parallel sorting on NOW point up both NOW's
advantages and some (SPARC-specific?) weaknesses, yet still hold the
current sorting record.
- Why NOW: performance isolation (analysis of behavior
  node-by-node, factor-by-factor: CPU, disk, etc.), incremental
  scalability.
- NOW-sort: Split-C (split-phase transactions using Active Messages),
  uniform key distribution (as in previous work). Also as in previous
  work, performance is expected to be dominated by I/O, not by
  in-core sorting.
- New tools developed:
- Diskconf: measures the achieved throughput of each disk,
  to determine the ratio of stripe sizes across them.
- Memconf: measures how much memory is available at runtime,
  after accounting for the OS, buffers, daemons, etc.
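The diskconf idea can be sketched in a few lines: time a sequential
read on each disk and stripe data proportionally to measured
bandwidth. This is an illustrative Python sketch, not the authors'
tool; the paths, sizes, and function names are all hypothetical.

```python
import os
import time

def measure_throughput(path, nbytes=64 * 1024 * 1024, block=1024 * 1024):
    """Time a sequential read of nbytes from path; return MB/s.
    (Hypothetical stand-in for diskconf's measurement loop.)"""
    start = time.perf_counter()
    read = 0
    with open(path, "rb", buffering=0) as f:
        while read < nbytes:
            chunk = f.read(min(block, nbytes - read))
            if not chunk:
                break
            read += len(chunk)
    elapsed = time.perf_counter() - start
    return read / (1024 * 1024) / elapsed

def stripe_ratios(throughputs):
    """Stripe each disk proportionally to its measured bandwidth,
    so faster disks get proportionally larger stripes."""
    total = sum(throughputs)
    return [t / total for t in throughputs]
```

For example, a disk measured at 30MB/s paired with one at 10MB/s
would get 75% and 25% of each stripe, respectively.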
- Buffer management:
- read() does its own (bad) buffer management;
- so does mmap;
- but mmap plus madvise works OK (but see the merge-phase
  prefetching caveat below).
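The mmap-plus-madvise pattern looks roughly like this in modern
Python (a sketch, not the paper's C code; `MADV_SEQUENTIAL` is
Unix-specific, and `mmap.madvise` requires Python 3.8+):

```python
import mmap
import os

def map_input(path):
    """Map a file read-only and hint sequential access, so the
    kernel prefetches ahead of the reader instead of doing its
    own (bad) demand-paged buffer management."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        m = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping stays valid after the fd closes
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)
    return m
```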
- Core sorting algorithms:
- Bucket+Quicksort: group into buckets on high 32 bits,
common case needs to compare only those 32 bits, so a
constant factor better than Quicksort. Buckets sized to
fit in L2 cache (512KB for Ultras).
- Bucket+Radix Sort: best performance; linear in the data size.
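The bucketing step common to both algorithms can be sketched as
follows. This is illustrative Python, not the authors' C/Split-C
code: the bucket count is just a parameter here (the paper sizes
buckets to fit the 512KB L2 cache), and Python's built-in sort
stands in for the per-bucket quicksort or radix sort.

```python
def bucket_sort(keys, bucket_bits=8, key_bits=64):
    """Partition keys into 2**bucket_bits buckets by their high
    bits, then sort each bucket independently. Buckets are already
    ordered relative to one another by those high bits, so the
    common case compares keys only within a bucket."""
    shift = key_bits - bucket_bits
    buckets = [[] for _ in range(1 << bucket_bits)]
    for k in keys:
        buckets[k >> shift].append(k)
    out = []
    for b in buckets:
        b.sort()       # stand-in for per-bucket quicksort/radix sort
        out.extend(b)
    return out
```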
- One-pass parallel sort:
- Buckets spread across nodes; any processor can deliver
keys to any node for bucketing.
- Each processor then sorts its local keys, and then gathers
and writes its records to local disk.
- Synchronous: each of the above phases done serially.
- Interleaved: single thread alternates between reading and
communicating during key bucketing. Allows some overlap
of reading and communicating.
- Threaded: reader and communicator threads are separate.
This gave best performance due to finer-grained thread
scheduling (so better overlap of disk and network).
- Linear scaling, though remote process startup is costly.
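The threaded variant above can be sketched with a bounded queue
between a reader thread and a communicator. This is a minimal
illustration, not the paper's implementation: `read_chunks` stands
in for disk reads and `deliver` for an Active Messages send to the
destination node's bucket.

```python
import queue
import threading

def one_pass_distribute(read_chunks, deliver, depth=4):
    """Overlap reading with communication: a reader thread pulls
    key chunks 'from disk' while the main thread delivers each key
    toward its destination bucket/node. The bounded queue lets the
    reader run only a few chunks ahead."""
    q = queue.Queue(maxsize=depth)

    def reader():
        for chunk in read_chunks:
            q.put(chunk)
        q.put(None)  # end-of-input sentinel

    t = threading.Thread(target=reader)
    t.start()
    while (chunk := q.get()) is not None:
        for key in chunk:
            deliver(key)
    t.join()
```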
- Two-pass, single-node:
- Create sorted sub-runs on disk, using reader and writer threads.
- Merge sorted sub-runs, using reader, merger, and writer threads.
- Note: when reading from multiple disk streams in the above
  step, mmap() does not prefetch enough to amortize seek costs, so
  the merge-phase code explicitly manages prefetching with
  multiple threads and buffers--ugh.
- Pipelining of phases performs much better than synchronous
  phases, regardless of the number of disks used; but since disk
  reads and writes both occur nearly all the time in the pipelined
  version, each disk must be dedicated to either reading or
  writing (not both) to get the best performance.
- Tradeoff between size of sub-runs created (don't want a
single one to be larger than physical memory) and number
of runs (merge phase cost is linear in number of runs).
Authors found that tradeoff doesn't significantly affect
performance until you hit the "memory wall".
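The two-pass structure (create runs, then merge) can be sketched
as a standard external sort. This is an illustrative Python
version, not the paper's code: runs are text files here, Python's
`heapq.merge` stands in for the merger thread, and there is no
explicit prefetching.

```python
import heapq
import os
import tempfile

def external_sort(keys, run_size):
    """Pass 1: write sorted sub-runs of at most run_size keys
    (a run must fit in 'memory'). Pass 2: merge all runs at once;
    merge cost grows with the number of runs, matching the
    tradeoff described above."""
    run_files = []
    for i in range(0, len(keys), run_size):
        run = sorted(keys[i:i + run_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.writelines(f"{k}\n" for k in run)
        f.close()
        run_files.append(f.name)

    def read_run(path):
        with open(path) as fh:
            for line in fh:
                yield int(line)

    merged = list(heapq.merge(*(read_run(p) for p in run_files)))
    for p in run_files:
        os.unlink(p)
    return merged
```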
- Two-pass parallel sort:
- Like two-pass single-node, but buckets are distributed across
  nodes.
- Reader, communicator, and writer threads, to exploit overlap.
- Parallel version on one CPU is slower than the single-node
  version, since all phases must be synchronous in the parallel
  version.
- Since authors' machines are memory-starved, need two
passes. (Previous MinuteSort algorithms worked in a single
pass.) Result: authors need to do twice as much I/O in
the same amount of time, in order to beat standing record.
- Observations from optimizing the various sorts:
- UltraSPARC internal SCSI saturates w/more than one disk.
- mmap() and madvise() allow complete overlapping of key
bucketing with reading more keys from disk.
- SBUS I/O bandwidth is severely limited--advertised at 80MB/s,
  but the authors saw only about 36MB/s. This means a workstation
  can't effectively use 2 disks and the network at the same time!
- (Drastic proposal: communication facilities that use the
  memory bus directly, without coherence.)
- Need OS facilities that do what diskconf and memconf do.
NOW beats enterprise machines for sorting (typically something that
SMPs are better at because of their tightly integrated I/O), even
though NOW's internal I/O buses are too slow and its nodes don't
have enough RAM.
- Flaw: this is a benchmark. May not be representative of typical
sorting behavior, since much optimization was done to get the
benchmark to win.
- How hard is it to write such high-performance code in practice?
  Should programmers even have to deal with things like memconf
  and diskconf, even if the OS provides them?
- mmap() prefetching and similar behaviors should be more
  adjustable, to account for situations such as amortizing seek
  costs when reading from multiple sequential streams.