Generalized Search Trees (GiST) for Database Systems

J. M. Hellerstein, J.F. Naughton, A. Pfeffer

One-line summary: GiST's are like R-trees with virtual methods, so they can support a variety of data types (extensible); special-case optimizations can be included to increase performance with simple linear-range data.

Overview/Main Points

Search key: any arbitrary predicate that holds for each datum below the key.
Search tree: hierarchy of partitions of a dataset, in which each partition has categorization that holds for all data in the partition.
Unlike R-trees, don't require that p --> q (p is a predicate below node N, q is a predicate at node N). R-trees are overly restrictive since this is in fact the case, so we might do better by allowing the lower predicates to be some property logically orthogonal to the upper predicates.
Methods you need to "override":
1. Consistent(E,q): given E=(p,ptr), return false if (p^q) can be guaranteed unsatisfiable (i.e. datum cannot be in this subtree).
2. Union(E1,E2,...,En): return some predicate that holds for all tuples stored below E1 thru En. E.g. find an r such that (p1 OR p2 OR ... OR pN) --> r.
3. Compress
4. Decompress
5. Penalty(E1,E2): domain-specific penalty to insert E2 into subtree rooted at E1, e.g. the "area penalty" from R-trees.
6. PickSplit(P): split P into two sets of entries each of size at least kM. Tree minimum fill factor controlled here.
Specialization routines FindMin and numeric compares can be used to optimize behavior for linearly-ordered domains.
Issues: key overlap may occur either because of data overlap or because key compression destroys distinguishing information (e.g. bounding boxes overlap even though objects don't).
Hot spot: specific predicate satisfiable by many tuples; correlation factor measures the likelihood that (p OR q) --> (p^q). How does GiST behave with hot spots? Unknown.
Future work: indexability--is a given collection of data practically "indexable" for a given set of queries? (need an "indexability theory"); indexing nonstandard domains; query optimization should account for (not-well-defined) cost of searching a GiST; lossy key compression techniques.

Relevance

Unifies B-trees, R-trees, and others into a generalized extensible structure. (Shades of C++ come to mind.) Also the only publication I know to correctly use the singular "spaghetto".

Flaws

Extensibility: two examples were pre-existing, third was closely related to R-trees (set indexing). A "way out" example might have been interesting. Evidently lots of work yet to be done to flesh this stuff out (as advertised in class!).

Back to index