Back to index
Generalized Search Trees (GiST) for Database Systems
J. M. Hellerstein, J.F. Naughton, A. Pfeffer
One-line summary:
GiST's are like R-trees with virtual methods, so they can support a
variety of data types (extensible); special-case optimizations can be
included to increase performance with simple linear-range data.
Overview/Main Points
- Search key: any arbitrary predicate that holds for each datum
below the key.
- Search tree: hierarchy of partitions of a dataset, in which each
partition has categorization that holds for all data in the
partition.
- Unlike R-trees, don't require that p --> q (p is a predicate
below node N, q is a predicate at node N). R-trees are overly
restrictive since this is in fact the case, so we might do better
by allowing the lower predicates to be some property logically
orthogonal to the upper predicates.
- Methods you need to "override":
- Consistent(E,q): given E=(p,ptr), return false if (p^q) can be
guaranteed unsatisfiable (i.e. datum cannot be in this subtree).
- Union(E1,E2,...,En): return some predicate that holds for
all tuples stored below E1 thru En. E.g. find an r such
that (p1 OR p2 OR ... OR pN) --> r.
- Compress
- Decompress
- Penalty(E1,E2): domain-specific penalty to insert E2 into
subtree rooted at E1, e.g. the "area penalty" from
R-trees.
- PickSplit(P): split P into two sets of entries each of
size at least kM. Tree minimum fill factor controlled
here.
- Specialization routines FindMin and numeric compares can be used
to optimize behavior for linearly-ordered domains.
- Issues: key overlap may occur either because of data overlap or
because key compression destroys distinguishing information
(e.g. bounding boxes overlap even though objects don't).
- Hot spot: specific predicate satisfiable by many tuples;
correlation factor measures the likelihood that (p OR
q) --> (p^q). How does GiST behave with hot spots?
Unknown.
- Future work: indexability--is a given collection of data
practically "indexable" for a given set of queries? (need an
"indexability theory"); indexing nonstandard domains; query
optimization should account for (not-well-defined) cost of
searching a GiST; lossy key compression techniques.
Relevance
Unifies B-trees, R-trees, and others into a generalized extensible
structure. (Shades of C++ come to mind.) Also the only publication I
know to correctly use the singular "spaghetto".
Flaws
Extensibility: two examples were pre-existing, third was closely related
to R-trees (set indexing). A "way out" example might have been
interesting. Evidently lots of work yet to be done to flesh this stuff
out (as advertised in
class!).
Back to index