Back to index

# Linear Clustering of Objects with Multiple Attributes

Summary by Armando Fox and Steve Gribble

One-line summary: Survey of various schemes for mapping multidimensional data into a linear space, with investigation of locality-preserving properties for column-based and range queries.

## Overview/Main Points

• Methods for mapping 2D to 1D:
• Column scan
• Snake scan: reverse direction of scan on alternate columns, so last element of column 2n and last element of column 2n+1 stay adjacent
• Z-curve scan: obtained by interleaving bits of binary representation X,Y coords; scanning order is LL, UL, LR, UR, recursively.
• Gray coding: like z-curve, but Gray-code to obtain 1D coord.
• Hilbert curve: see fig. 4 in paper; hard to describe but easy to see.
• "Goodness" metrics: number of disk blocks fetched, and number of non-consecutive block pairs in fetch.
• Query types: <*,value> or <value,*>, i.e. horiz. or vertical line in 2-space; or <Rx,Ry> where Ri is a range on dimension i, i.e. a rectangle in 2-space.
• expected number of continuous runs to complete row or column query:
• column scan sucks except when a column is picked
• snake slightly better for horiz. lines near "top" or "bottom" rows;
• z-curve and Hilbert curve least affected by horiz. vs. vertical
• No "clear winner" among z, Hilbert, snake
• expected number of continuous runs to complete rectangle query: all do almost equally well, with snake doing a bit better! (see tables below)
• expected linear span of mapping a rectangle: O(perimeter) for all mappings
• So why use complicated z/Gray/Hilbert instead of simple snake mapping? Answer: if multiple 2D grid points can be mapped to single disk block, analysis changes.
• Experimental results ("simulations"? hard to say) show that "locality at every level" (my phrase) in z, Hilbert, Gray makes them do fewer I/O's to complete query than snake, with Hilbert and Gray usually doing best.
• Insight: for all except snake, number of hits per block to fetch a selection is proportional to size of selected size, but independent of size of database. Snake is always proportional to size of database since columns are scanned contiguously.
• Same insight explains superior performance over snake w/r/t fraction of blocks read sequentially and hits per block.
• Extensions to multiple dimensions: 3D "primitives" for Hilbert and z/Gray are shown.
• Extensions to non-square regions: replicate ("tile") pattern to map the area, or have a primitive whose interpoint distances are non-invariant w/r/t rotation. E.g. fig 18, last page of paper.
• Previous work result: k-attribute selection on database with N records has disk access cost O(N(k-1)/k); but this paper shows that for specific classes of queries we can do much better, in fact O(k) independent of N for some cases.

## Relevance

Concrete and practical "goodness metrics" for locality ("clustering") in database sense (i.e. "few disk I/O's"), and of how to preserve locality inherent in a multiple-dimension representation while mapping down to a smaller number of dimensions.

## Flaws

• "Experimental results" - seems like they could have just been computed from first principles, if the distribution of request sizes and parameters was known or could be assumed.

Back to index