Linear Clustering of Objects with Multiple Attributes
H.V. Jagadish, Bell Labs
Summary by Armando Fox and Steve Gribble
One-line summary:
Survey of schemes for mapping multidimensional data into a linear
space, with an investigation of their locality-preserving properties
for row/column and range queries.
Overview/Main Points
- Methods for mapping 2D to 1D:
- Column scan
- Snake scan: reverse the direction of the scan on alternate
columns, so the last element scanned in one column is spatially
adjacent to the first element scanned in the next
- Z-curve scan: obtained by interleaving the bits of the binary
representations of the X,Y coords; scanning order is LL, UL, LR,
UR, recursively.
- Gray coding: like the z-curve, but the interleaved bits are
Gray-coded to obtain the 1D coord.
- Hilbert curve: see fig. 4 in paper; hard to describe but
easy to see.
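The simpler mappings are easy to state precisely. A minimal sketch (my own naming and code, not the paper's; assumes a 2^n x 2^n grid, and omits Hilbert, which needs the recursive rotations of fig. 4):

```python
# Sketch of the 2D -> 1D mappings on a 2^n x 2^n grid (my own naming;
# Hilbert is omitted -- it needs the recursive rotations of fig. 4).

def column_scan(x, y, n):
    """Column-major order: walk each column in the same direction."""
    return x * (1 << n) + y

def snake_scan(x, y, n):
    """Reverse the scan direction on alternate columns."""
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z_curve(x, y, n):
    """Interleave the bits of x and y (x bits in odd positions)."""
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1)
        key |= ((y >> i) & 1) << (2 * i)
    return key

def gray_decode(g):
    """Invert the Gray code: the v such that v ^ (v >> 1) == g."""
    v = 0
    while g:
        v ^= g
        g >>= 1
    return v

def gray_scan(x, y, n):
    """One plausible reading of the Gray-coded curve: order cells so
    that the interleaved keys of consecutive cells differ in one bit."""
    return gray_decode(z_curve(x, y, n))
```

Each function maps a grid cell to a distinct linear address, so each defines a total scan order on the grid.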
- "Goodness" metrics: number of disk blocks fetched, and number of
non-consecutive block pairs in fetch.
- Query types: <*,value> or <value,*>, i.e. horiz. or
vertical line in 2-space; or <Rx,Ry> where Ri is a range on
dimension i, i.e. a rectangle in 2-space.
- expected number of continuous runs to
complete row or column query:
- column scan does terribly except when the query is a
single column
- snake slightly better for horiz. lines near "top" or
"bottom" rows;
- z-curve and Hilbert curve least affected by
horiz. vs. vertical
- No "clear winner" among z, Hilbert, snake
- expected number of continuous runs to complete rectangle query:
all do almost equally well, with snake doing a bit better! (see
tables below)
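The run-counting metric is easy to check on a toy grid. A sketch (mine, not the paper's derivation) for snake vs. z-curve on an 8x8 grid:

```python
# Toy illustration (not the paper's analysis): count the maximal runs of
# consecutive linear addresses needed to cover one row or one column
# query on an 8 x 8 grid, under the snake and z-curve mappings.

def snake(x, y, n):
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z(x, y, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return key

def runs(addrs):
    """Number of maximal runs of consecutive addresses."""
    addrs = sorted(addrs)
    return 1 + sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)

n, side = 3, 8
column = [(3, y) for y in range(side)]   # vertical line x = 3
row = [(x, 3) for x in range(side)]      # horizontal line y = 3
for name, f in (("snake", snake), ("z-curve", z)):
    print(f"{name:8s} column: {runs([f(x, y, n) for x, y in column])} runs, "
          f"row: {runs([f(x, y, n) for x, y in row])} runs")
```

On this grid snake covers the column in 1 run but the row in 8, while the z-curve gives 4 vs. 8: less skewed between orientations, as the bullet above says.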
- expected linear span of mapping a rectangle: O(perimeter) for all
mappings
- So why use the complicated z/Gray/Hilbert curves instead of the
simple snake mapping? Answer: once multiple 2D grid points can be
mapped to a single disk block, the analysis changes.
- Experimental results ("simulations"? hard to say) show that
"locality at every level" (my phrase) in z, Hilbert, Gray makes
them do fewer I/O's to complete query than snake, with Hilbert
and Gray usually doing best.
- Insight: for all mappings except snake, the number of hits per block
when fetching a selection is proportional to the size of the
selected region, but independent of the size of the database.
Snake is always proportional to the size of the database, since
columns are scanned contiguously.
- The same insight explains their superior performance over snake w/r/t
fraction of blocks read sequentially and hits per block.
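That insight can be made concrete with a toy blocking experiment (mine; note the query is deliberately aligned to a z-curve quadrant, the curve's best case, to make "locality at every level" visible):

```python
# Sketch (assumptions mine): pack consecutive linear addresses into disk
# blocks and count the distinct blocks touched by a 4x4 square query on
# a 32 x 32 grid, under the snake and z-curve mappings.

def snake(x, y, n):
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z(x, y, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return key

def blocks_touched(f, query, n, block_bits):
    """Distinct blocks hit if each block holds 2**block_bits addresses."""
    return len({f(x, y, n) >> block_bits for x, y in query})

n = 5                                     # 32 x 32 grid
query = [(x, y) for x in range(8, 12) for y in range(8, 12)]
for bits in (2, 4):                       # 4 or 16 points per block
    print(f"{2**bits:2d} points/block: "
          f"snake {blocks_touched(snake, query, n, bits)}, "
          f"z-curve {blocks_touched(z, query, n, bits)}")
```

With 16 points per block, all 16 z-curve query points land in a single block, while snake still touches one block per query column: snake's block count tracks the column length (database size), not the query size.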
- Extensions to multiple dimensions: 3D "primitives" for Hilbert
and z/Gray are shown.
- Extensions to non-square regions: replicate ("tile") pattern to
map the area, or have a primitive whose interpoint distances are
non-invariant w/r/t rotation. E.g. fig 18, last page of paper.
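For the multi-dimensional extension, the z-curve generalizes mechanically (a sketch of mine; the 3D Hilbert primitive is the harder case shown in the paper):

```python
# Sketch (mine): the z-curve's bit interleaving extends directly to 3D;
# each coordinate contributes every third bit of the key.

def z3(x, y, z, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key
```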
- Previous work result: k-attribute selection on a database
with N records has disk access cost O(N^((k-1)/k)); but
this paper shows that for specific classes of queries we can do
much better, in fact O(k) independent of N for some cases.
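Quick arithmetic (mine) on why that gap matters: the generic bound grows with database size N, while a cost independent of N does not. For k = 2 the generic bound is sqrt(N):

```python
# Arithmetic illustration (mine): the generic k-attribute bound
# N^((k-1)/k) grows with database size N; for k = 2 it is sqrt(N),
# vs. a constant for the paper's favorable query classes.
k = 2
for N in (10**4, 10**6, 10**8):
    print(f"N = {N:>9d}: N^((k-1)/k) = {N ** ((k - 1) / k):,.0f}")
```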
Relevance
Concrete and practical "goodness" metrics for locality ("clustering") in
the database sense (i.e. "few disk I/O's"), and a demonstration of how
to preserve the locality inherent in a multiple-dimension representation
while mapping down to a smaller number of dimensions.
Flaws
- "Experimental results": it seems these could simply have been
computed from first principles, had the distribution of request
sizes and parameters been known or assumed.