The Gamma Database Machine Project
David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan
Bricker, Hui-I Hsiao, Rick Rasmussen, Univ. Wisconsin-Madison.
Summary by Armando Fox and Steve Gribble
One-line summary:
Distributed DB on top of a NOW, to exploit incremental scalability,
commodity parts, and high aggregate disk bandwidth.
Overview/Main Points
- Shared-nothing architecture (i.e. a NOW), for the same reasons NOWs are
  attractive today (in 1988!): commodity parts and high aggregate I/O
  bandwidth to disks. First version: VAX 11/750s, with too little memory
  per node (2MB) and an I/O bottleneck. Second version: Intel iPSC
  hypercube with a DMA-supported interconnect.
- Horizontal partitioning of relations across nodes (declustering) on a
  user-selected attribute. In retrospect, declustering every relation
  across all nodes was a big mistake--the "heat" of a relation (its
  hot-spot access pattern) should determine how widely to decluster it.
- Processes: catalog mgr (schema info), scheduler (for multisite
queries), operator (per-node controllers).
- Use hashes (specifically, a split table) to partition execution of
  query plan steps across nodes. The hash is applied (e.g.) to the join
  attribute of tuples entering a join operation, and the split table maps
  each hash bucket to a destination node (see the split-table sketch
  after this list).
- Similarly, the scheduler process can pipeline different phases of an
  operation by initiating them on different processors and rendezvousing
  the results. Example: to hash-join relation B with A, simultaneously
  initiate the build phase (tuples from the inner relation A are inserted
  into a hash table, in preparation for the probing phase, in which
  tuples from B are looked up in the hash table for matches) and the
  selection operator on A. When both complete, move on to the probing
  phase (see the build/probe sketch after this list).
- Underlying OS: NOSE (lightweight processes, shared memory,
nonpreemptive scheduling, message passing between processes or
processors on the hypercube)
- Operator implementations:
- Select: collect relation rows from across several nodes.
  - Join: Partition the original relations into disjoint subsets
    (buckets) using a hash fn. E.g. parallel hybrid hash join:
    partition the inner relation into N buckets; distribute the buckets
    to different nodes, each of which does a sub-join (sketched after
    this list). (In traditional HHJ, only the first bucket participates
    in a sub-join immediately, while the remaining N-1 buckets are
    stored in temp files.)
  - Scalar Aggregates: each processor computes a partial result over its
    fragment in parallel; the partial results are then combined into the
    final answer (two-phase sketch after this list).
- Update: modified tuple is passed thru split table, since
it may now live on a different partition.
- Concurrency control: traditional multi-granularity (intention) locking
  at file and page level; centralized deadlock detector; write-ahead
  logging with ARIES-style recovery.
- Chained declustering for availability: each disk holds the primary
  copy of tuple bucket k and the backup copy of bucket k+1 (mod the
  number of buckets), so a single-node failure leaves every bucket
  available somewhere (placement sketch after this list).
- Performance: Close to linear (but not perfectly linear) speedup, with
  correspondingly lower response times, as processors are added for the
  measured workload. Disk and network overheads prevent perfectly linear
  scaling. Also, for selections that return only a small number of
  tuples, the overhead of setting up the selection on each node
  dominates execution time and results in sublinear speedup.
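
A minimal Python sketch of the split-table routing mentioned above. The
bucket count, node count, attribute names, and the bucket-to-node assignment
are all invented for illustration; in Gamma the routed tuples are shipped
over the interconnect to operator processes on other nodes rather than
handled by a local function call.

    # Illustrative split table (hypothetical names, not Gamma's data structures).
    # It maps the hash of a partitioning/join attribute to the node (operator
    # process) that should receive the tuple, so that matching tuples from both
    # relations of a join meet on the same node.

    NUM_NODES = 4      # assumed cluster size
    NUM_BUCKETS = 16   # assumed number of hash buckets in the split table

    # Bucket b is assigned to node b mod NUM_NODES (round-robin assignment).
    SPLIT_TABLE = {b: b % NUM_NODES for b in range(NUM_BUCKETS)}

    def route(tup, attr):
        """Pick a destination node by hashing the tuple's join attribute."""
        bucket = hash(tup[attr]) % NUM_BUCKETS
        return SPLIT_TABLE[bucket]

    # Tuples from either relation whose join-attribute values are equal hash
    # to the same bucket, and therefore go to the same node.
    print(route({"name": "alice", "dept": 42}, "dept"))
    print(route({"dno": 42, "dname": "toys"}, "dno"))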
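
A sketch of the build and probe phases of the simple hash join described in
the pipelining bullet. Relation contents and attribute names are made up; the
point is only that the build of A's hash table can overlap the selection
producing A, and the probe with B's tuples starts once both have finished.

    from collections import defaultdict

    def build(inner, attr):
        """Build phase: hash the inner relation on the join attribute."""
        table = defaultdict(list)
        for t in inner:
            table[t[attr]].append(t)
        return table

    def probe(outer, attr, table):
        """Probe phase: stream outer tuples past the table, emitting matches."""
        for t in outer:
            for match in table.get(t[attr], []):
                yield {**match, **t}

    A = [{"dept": 1, "dname": "toys"}, {"dept": 2, "dname": "books"}]
    B = [{"dept": 1, "emp": "alice"}, {"dept": 2, "emp": "bob"}]

    ht = build(A, "dept")              # can overlap the selection feeding A
    print(list(probe(B, "dept", ht)))  # probing begins after the build completes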
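
A sketch of the bucket-wise parallelism behind the parallel hybrid hash join.
The loop below stands in for shipping each bucket pair to a different node;
the data, names, and bucket count are assumptions for illustration.

    def partition(rel, attr, n_buckets):
        """Split a relation into disjoint buckets on a hash of attr."""
        buckets = [[] for _ in range(n_buckets)]
        for t in rel:
            buckets[hash(t[attr]) % n_buckets].append(t)
        return buckets

    def join_bucket(r_bucket, s_bucket, attr):
        """Join one bucket pair with a local build/probe hash join."""
        ht = {}
        for t in r_bucket:                        # build on the inner bucket
            ht.setdefault(t[attr], []).append(t)
        return [{**r, **s} for s in s_bucket for r in ht.get(s[attr], [])]

    N = 4
    R = [{"k": i % 5, "r": i} for i in range(10)]
    S = [{"k": i % 5, "s": i} for i in range(10)]

    # Because both relations are partitioned with the same hash function,
    # bucket i of R can only match bucket i of S, so each pair joins
    # independently (on its own node, in Gamma).
    result = []
    for rb, sb in zip(partition(R, "k", N), partition(S, "k", N)):
        result.extend(join_bucket(rb, sb, "k"))
    print(len(result))                            # 20 matches for this toy data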
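
A tiny sketch of the parallel scalar aggregate: each node produces a partial
(sum, count) over its fragment and a coordinator combines them. The fragment
contents are made up.

    fragments = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]     # one fragment per node

    partials = [(sum(f), len(f)) for f in fragments]  # phase 1: local partials
    total, count = (sum(x) for x in zip(*partials))   # phase 2: combine
    print("avg =", total / count)                     # avg = 5.0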
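
A sketch of chained-declustering placement as summarized above: node i holds
the primary copy of bucket i and the backup copy of bucket (i+1) mod N. The
cluster size N is an assumption.

    N = 4  # assumed number of nodes/disks

    def placement(node, n=N):
        """Buckets stored on a given node under chained declustering."""
        return {"primary": node, "backup": (node + 1) % n}

    for i in range(N):
        print(f"node {i}: {placement(i)}")

    # If node j fails, bucket j is still available as the backup on node
    # (j - 1) mod N, and node j's backup bucket (j + 1) mod N is still
    # available as the primary on node (j + 1) mod N.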
Relevance
Early scalable distributed system on a NOW, with many of the same
motivations and implementation choices, and similar results. Today's fast
networks might alleviate some of the scaling problems the authors saw
(and to a lesser extent, improved disk performance would help as well).
Flaws
- Not surprising that a balanced system would require accounting
for the high startup costs of disk operations. Aggregate disk
bandwidth is not the problem: initiation and similar
latencies are the problem. Surprising that the authors
didn't deal with this up front.
- As authors acknowledge, it's not necessarily always a good idea
to horizontally split relations across disks (pessimal
for some transactions).