Linear Clustering of Objects with Multiple Attributes
H.V. Jagadish, Bell Labs
Summary by Armando Fox and Steve Gribble
One-line summary:
Survey of schemes for mapping multidimensional data into a linear
space, with an investigation of their locality-preserving properties
for row/column and range queries.
Overview/Main Points
- Methods for mapping 2D to 1D:
- Column scan
- Snake scan: reverse the direction of the scan on alternate
columns, so the last element scanned in one column is spatially
adjacent to the first element scanned in the next
- Z-curve scan: obtained by interleaving the bits of the binary
representations of the X,Y coords; scanning order is LL, UL, LR,
UR, recursively.
- Gray coding: like the z-curve, but the interleaved bits are
Gray-coded to obtain the 1D coord.
- Hilbert curve: see fig. 4 in paper; hard to describe but
easy to see.
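The simpler mappings are easy to state precisely. A minimal sketch (my own naming and code, not the paper's; assumes a 2^n x 2^n grid, and omits Hilbert, which needs the recursive rotations of fig. 4):

```python
# Sketch of the 2D -> 1D mappings on a 2^n x 2^n grid (my own naming;
# Hilbert is omitted -- it needs the recursive rotations of fig. 4).

def column_scan(x, y, n):
    """Column-major order: walk each column in the same direction."""
    return x * (1 << n) + y

def snake_scan(x, y, n):
    """Reverse the scan direction on alternate columns."""
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z_curve(x, y, n):
    """Interleave the bits of x and y (x bits in odd positions)."""
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1)
        key |= ((y >> i) & 1) << (2 * i)
    return key

def gray_decode(g):
    """Invert the Gray code: the v such that v ^ (v >> 1) == g."""
    v = 0
    while g:
        v ^= g
        g >>= 1
    return v

def gray_scan(x, y, n):
    """One plausible reading of the Gray-coded curve: order cells so
    that the interleaved keys of consecutive cells differ in one bit."""
    return gray_decode(z_curve(x, y, n))
```

Each function maps a grid cell to a distinct linear address, so each defines a total scan order on the grid.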
- "Goodness" metrics: number of disk blocks fetched, and number of
non-consecutive block pairs in fetch.
- Query types: <*,value> or <value,*>, i.e. horiz. or
vertical line in 2-space; or <Rx,Ry> where Ri is a range on
dimension i, i.e. a rectangle in 2-space.
- expected number of continuous runs to
complete row or column query:
- column scan does terribly except when the query is a
single column
- snake slightly better for horiz. lines near "top" or
"bottom" rows;
- z-curve and Hilbert curve least affected by
horiz. vs. vertical
- No "clear winner" among z, Hilbert, snake
- expected number of continuous runs to complete rectangle query:
all do almost equally well, with snake doing a bit better! (see
tables below)
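The run-counting metric is easy to check on a toy grid. A sketch (mine, not the paper's derivation) for snake vs. z-curve on an 8x8 grid:

```python
# Toy illustration (not the paper's analysis): count the maximal runs of
# consecutive linear addresses needed to cover one row or one column
# query on an 8 x 8 grid, under the snake and z-curve mappings.

def snake(x, y, n):
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z(x, y, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return key

def runs(addrs):
    """Number of maximal runs of consecutive addresses."""
    addrs = sorted(addrs)
    return 1 + sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)

n, side = 3, 8
column = [(3, y) for y in range(side)]   # vertical line x = 3
row = [(x, 3) for x in range(side)]      # horizontal line y = 3
for name, f in (("snake", snake), ("z-curve", z)):
    print(f"{name:8s} column: {runs([f(x, y, n) for x, y in column])} runs, "
          f"row: {runs([f(x, y, n) for x, y in row])} runs")
```

On this grid snake covers the column in 1 run but the row in 8, while the z-curve gives 4 vs. 8: less skewed between orientations, as the bullet above says.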
- expected linear span of mapping a rectangle: O(perimeter) for all
mappings
- So why use the complicated z/Gray/Hilbert curves instead of the
simple snake mapping? Answer: once multiple 2D grid points can be
mapped to a single disk block, the analysis changes.
- Experimental results ("simulations"? hard to say) show that
"locality at every level" (my phrase) in z, Hilbert, Gray makes
them do fewer I/O's to complete query than snake, with Hilbert
and Gray usually doing best.
- Insight: for all mappings except snake, the number of hits per block
when fetching a selection is proportional to the size of the
selected region, but independent of the size of the database.
Snake is always proportional to the size of the database, since
columns are scanned contiguously.
- The same insight explains their superior performance over snake w/r/t
fraction of blocks read sequentially and hits per block.
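That insight can be made concrete with a toy blocking experiment (mine; note the query is deliberately aligned to a z-curve quadrant, the curve's best case, to make "locality at every level" visible):

```python
# Sketch (assumptions mine): pack consecutive linear addresses into disk
# blocks and count the distinct blocks touched by a 4x4 square query on
# a 32 x 32 grid, under the snake and z-curve mappings.

def snake(x, y, n):
    side = 1 << n
    return x * side + (y if x % 2 == 0 else side - 1 - y)

def z(x, y, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return key

def blocks_touched(f, query, n, block_bits):
    """Distinct blocks hit if each block holds 2**block_bits addresses."""
    return len({f(x, y, n) >> block_bits for x, y in query})

n = 5                                     # 32 x 32 grid
query = [(x, y) for x in range(8, 12) for y in range(8, 12)]
for bits in (2, 4):                       # 4 or 16 points per block
    print(f"{2**bits:2d} points/block: "
          f"snake {blocks_touched(snake, query, n, bits)}, "
          f"z-curve {blocks_touched(z, query, n, bits)}")
```

With 16 points per block, all 16 z-curve query points land in a single block, while snake still touches one block per query column: snake's block count tracks the column length (database size), not the query size.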
- Extensions to multiple dimensions: 3D "primitives" for Hilbert
and z/Gray are shown.
- Extensions to non-square regions: replicate ("tile") pattern to
map the area, or have a primitive whose interpoint distances are
non-invariant w/r/t rotation. E.g. fig 18, last page of paper.
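For the multi-dimensional extension, the z-curve generalizes mechanically (a sketch of mine; the 3D Hilbert primitive is the harder case shown in the paper):

```python
# Sketch (mine): the z-curve's bit interleaving extends directly to 3D;
# each coordinate contributes every third bit of the key.

def z3(x, y, z, n):
    key = 0
    for i in range(n):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key
```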
- Previous work result: k-attribute selection on a database
with N records has disk access cost O(N^((k-1)/k)); but
this paper shows that for specific classes of queries we can do
much better, in fact O(k) independent of N for some cases.
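Quick arithmetic (mine) on why that gap matters: the generic bound grows with database size N, while a cost independent of N does not. For k = 2 the generic bound is sqrt(N):

```python
# Arithmetic illustration (mine): the generic k-attribute bound
# N^((k-1)/k) grows with database size N; for k = 2 it is sqrt(N),
# vs. a constant for the paper's favorable query classes.
k = 2
for N in (10**4, 10**6, 10**8):
    print(f"N = {N:>9d}: N^((k-1)/k) = {N ** ((k - 1) / k):,.0f}")
```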
Relevance
Concrete and practical "goodness" metrics for locality ("clustering") in
the database sense (i.e. "few disk I/O's"), and a demonstration of how
to preserve the locality inherent in a multiple-dimension representation
while mapping down to a smaller number of dimensions.
Flaws
- "Experimental results": it seems these could simply have been
computed from first principles, had the distribution of request
sizes and parameters been known or assumed.