The Gamma Database Machine Project
David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan
Bricker, Hui-I Hsiao, Rick Rasmussen, Univ. Wisconsin-Madison.
Summary by Armando Fox and Steve Gribble
One-line summary:
Distributed DB on top of a NOW, to exploit incremental scalability,
commodity parts, and high aggregate disk bandwidth.
Overview/Main Points
- Shared-nothing architecture (i.e. a NOW), for the same reasons NOWs are
  attractive today (in 1988!): commodity parts and high aggregate I/O
  bandwidth to disks. First version: VAX 11/750s, with too little memory
  per node (2MB) and an I/O bottleneck. Second version: Intel iPSC
  hypercube with a DMA-supported interconnect.
- Horizontal partitioning of relations across nodes (declustering) on a
  user-selected attribute. In retrospect, declustering every relation
  across all nodes was a big mistake--the "heat" of a relation (its
  hot-spot access pattern) should determine how widely to decluster it.
- Processes: catalog mgr (schema info), scheduler (for multisite
queries), operator (per-node controllers).
- Use hashes (specifically, a split table) to partition execution of
  query plan steps across nodes. The hash is applied (e.g.) to the join
  attribute of tuples entering a join operation, and the split table maps
  each hash bucket to a destination node (see the split-table sketch
  after this list).
- Similarly, the scheduler process can pipeline different phases of an
  operation by initiating them on different processors and rendezvousing
  the results. Example: to hash-join relation B with A, simultaneously
  initiate the build phase (tuples from the inner relation A are inserted
  into a hash table, in preparation for the probing phase, in which
  tuples from B are looked up in the hash table for matches) and the
  selection operator on A. When both complete, move on to the probing
  phase (see the build/probe sketch after this list).
- Underlying OS: NOSE (lightweight processes, shared memory,
nonpreemptive scheduling, message passing between processes or
processors on the hypercube)
- Operator implementations:
- Select: collect relation rows from across several nodes.
  - Join: Partition the original relations into disjoint subsets
    (buckets) using a hash fn. E.g. parallel hybrid hash join:
    partition the inner relation into N buckets; distribute the buckets
    to different nodes, each of which does a sub-join (sketched after
    this list). (In traditional HHJ, only the first bucket participates
    in a sub-join immediately, while the remaining N-1 buckets are
    stored in temp files.)
  - Scalar Aggregates: each processor computes a partial result over its
    fragment in parallel; the partial results are then combined into the
    final answer (two-phase sketch after this list).
- Update: modified tuple is passed thru split table, since
it may now live on a different partition.
- Concurrency control: traditional multi-granularity (intention) locking
  at file and page level; centralized deadlock detector; write-ahead
  logging with ARIES-style recovery.
- Chained declustering for availability: each disk holds the primary
  copy of tuple bucket k and the backup copy of bucket k+1 (mod the
  number of buckets), so a single-node failure leaves every bucket
  available somewhere (placement sketch after this list).
- Performance: Close to linear (but not perfectly linear) speedup, with
  correspondingly lower response times, as processors are added for the
  measured workload. Disk and network overheads prevent perfectly linear
  scaling. Also, for selections that return only a small number of
  tuples, the overhead of setting up the selection on each node
  dominates execution time and results in sublinear speedup.
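
A minimal Python sketch of the split-table routing mentioned above. The
bucket count, node count, attribute names, and the bucket-to-node assignment
are all invented for illustration; in Gamma the routed tuples are shipped
over the interconnect to operator processes on other nodes rather than
handled by a local function call.

    # Illustrative split table (hypothetical names, not Gamma's data structures).
    # It maps the hash of a partitioning/join attribute to the node (operator
    # process) that should receive the tuple, so that matching tuples from both
    # relations of a join meet on the same node.

    NUM_NODES = 4      # assumed cluster size
    NUM_BUCKETS = 16   # assumed number of hash buckets in the split table

    # Bucket b is assigned to node b mod NUM_NODES (round-robin assignment).
    SPLIT_TABLE = {b: b % NUM_NODES for b in range(NUM_BUCKETS)}

    def route(tup, attr):
        """Pick a destination node by hashing the tuple's join attribute."""
        bucket = hash(tup[attr]) % NUM_BUCKETS
        return SPLIT_TABLE[bucket]

    # Tuples from either relation whose join-attribute values are equal hash
    # to the same bucket, and therefore go to the same node.
    print(route({"name": "alice", "dept": 42}, "dept"))
    print(route({"dno": 42, "dname": "toys"}, "dno"))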
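
A sketch of the build and probe phases of the simple hash join described in
the pipelining bullet. Relation contents and attribute names are made up; the
point is only that the build of A's hash table can overlap the selection
producing A, and the probe with B's tuples starts once both have finished.

    from collections import defaultdict

    def build(inner, attr):
        """Build phase: hash the inner relation on the join attribute."""
        table = defaultdict(list)
        for t in inner:
            table[t[attr]].append(t)
        return table

    def probe(outer, attr, table):
        """Probe phase: stream outer tuples past the table, emitting matches."""
        for t in outer:
            for match in table.get(t[attr], []):
                yield {**match, **t}

    A = [{"dept": 1, "dname": "toys"}, {"dept": 2, "dname": "books"}]
    B = [{"dept": 1, "emp": "alice"}, {"dept": 2, "emp": "bob"}]

    ht = build(A, "dept")              # can overlap the selection feeding A
    print(list(probe(B, "dept", ht)))  # probing begins after the build completes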
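
A sketch of the bucket-wise parallelism behind the parallel hybrid hash join.
The loop below stands in for shipping each bucket pair to a different node;
the data, names, and bucket count are assumptions for illustration.

    def partition(rel, attr, n_buckets):
        """Split a relation into disjoint buckets on a hash of attr."""
        buckets = [[] for _ in range(n_buckets)]
        for t in rel:
            buckets[hash(t[attr]) % n_buckets].append(t)
        return buckets

    def join_bucket(r_bucket, s_bucket, attr):
        """Join one bucket pair with a local build/probe hash join."""
        ht = {}
        for t in r_bucket:                        # build on the inner bucket
            ht.setdefault(t[attr], []).append(t)
        return [{**r, **s} for s in s_bucket for r in ht.get(s[attr], [])]

    N = 4
    R = [{"k": i % 5, "r": i} for i in range(10)]
    S = [{"k": i % 5, "s": i} for i in range(10)]

    # Because both relations are partitioned with the same hash function,
    # bucket i of R can only match bucket i of S, so each pair joins
    # independently (on its own node, in Gamma).
    result = []
    for rb, sb in zip(partition(R, "k", N), partition(S, "k", N)):
        result.extend(join_bucket(rb, sb, "k"))
    print(len(result))                            # 20 matches for this toy data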
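
A tiny sketch of the parallel scalar aggregate: each node produces a partial
(sum, count) over its fragment and a coordinator combines them. The fragment
contents are made up.

    fragments = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]     # one fragment per node

    partials = [(sum(f), len(f)) for f in fragments]  # phase 1: local partials
    total, count = (sum(x) for x in zip(*partials))   # phase 2: combine
    print("avg =", total / count)                     # avg = 5.0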
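
A sketch of chained-declustering placement as summarized above: node i holds
the primary copy of bucket i and the backup copy of bucket (i+1) mod N. The
cluster size N is an assumption.

    N = 4  # assumed number of nodes/disks

    def placement(node, n=N):
        """Buckets stored on a given node under chained declustering."""
        return {"primary": node, "backup": (node + 1) % n}

    for i in range(N):
        print(f"node {i}: {placement(i)}")

    # If node j fails, bucket j is still available as the backup on node
    # (j - 1) mod N, and node j's backup bucket (j + 1) mod N is still
    # available as the primary on node (j + 1) mod N.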
Relevance
Early scalable distributed system on a NOW, with many of the same
motivations and implementation choices, and similar results. Today's fast
networks might alleviate some of the scaling problems the authors saw
(and to a lesser extent, improved disk performance would help as well).
Flaws
- Not surprising that a balanced system would require accounting
for the high startup costs of disk operations. Aggregate disk
bandwidth is not the problem: initiation and similar
latencies are the problem. Surprising that the authors
didn't deal with this up front.
- As authors acknowledge, it's not necessarily always a good idea
to horizontally split relations across disks (pessimal
for some transactions).