Access Path Selection in a Relational DBMS

P.G. Selinger, M.M. Astrahan, D.D> Chamberlin, R.A. Lorie, T.G. Price

Summary by: Steve Gribble and Armando Fox

One-line summary: Optimize queries by selecting from a tree of possible query step orderings, by using statistics of existing relations, selectivity factors that estimate how many tuples will be returned by a predicate, and appropriately weighted CPU and disk costs of accessing tuples.

Overview/Main Points

Degrees of freedom for optimization: Permutations of join orders and method of joining; evaluation order among query blocks.
Accessing tuples via RSS:
- Segment scan: scan group of disk pages, which may contain tuples from many relations; return only those tuples from a desired relation.
- Index scan (if this relation has an index which maps keys to tuple-ID's containing that key): scan the leaf pages of the index B-tree, then get the tuple corresponding to each TID. The index is clustered if this can be done without scanning any tuple page more than once.
Scans can take a set of predicates used to filter out some tuples. A sargable predicate is one that can be put into the form "column-name rel-op value", e.g. "Salary < 2000".
Overall cost being optimized is:
(page fetches) + W * (RSI I/O's), where W is a weighting factor between CPU and disk.
Optimizer uses the following stats, which are kept for each relation T and index I:
- NCARD(T): cardinality of T
- TCARD(T): number of pages that hold tuples of T
- P(T): fraction of data pages in a segment that hold tuples of T, i.e. TCARD(T)/(no. of pages in segment)
- ICARD(I): number of distinct keys in index
- NINDX(I): number of pages in index
Selectivity factor: An estimate of what fraction of input tuples will be returned by a relation. Table 1 shows how to estimate F for various query elements ("column = value", "column between val1 and val2", "expr1 OR expr2", etc.). As shown in the table, if not enough is known about the query element to estimate its F, a "rule of thumb constant" is used (e.g. 1/10).
QCARD(Q): Query cardinality is computed as the product of the cardinalities of the query's FROM list and the selectivity factors of the query block's Boolean factors.
Number of expected RSI calls is product of relation cardinalities and selectivity factors of the sargable predicates.
Join ordering. A tuple order is interesting if it is specified by the query. Some queries may specify no interesting orders (i.e. return results unordered). To optimize join cost, need to choose between cheapest method that produces an interesting order and cheapest method that produces unordered result plus cost of sorting. Note that multiway joins can be pipelined.
Heuristic: find best join order for successively larger subsets of tables. (Justification: once you've joined the first k relations, joining the result to the k+1st relation is independent of the order of joining the first k.)
To reduce search space, only consider join orders that have join pro=edicates relating the inner relation to other relations already joined. E.g. if T1 and T2 are joined on column C1, and T2 and T3 are joined on column C2, then the orderings {T1 T3 T2} and {T3 T1 T2} will be ignored.
Cost of nested loop join or merge join is:
(Outer-cost(path1) + N * Inner-cost(path2)), where outer-cost and inner-cost are the cost of scanning a tuple (applying all applicable predicates) from the respective paths, and N is the product of the cardinalities and selectivity factors of all relations joined so far (i.e. a product-of-products).
Observation: nested-loop and merge join costs are the same formula! Merging is sometimes better because inner-cost may be much less.

Relevance

The seminal work on query optimization relying on table statistics and a model of CPU and disk utilization.

Flaws

Cost models used were adequate for the hardware of the day, but probably simplistic given today's multiheaded CPU's and synchronous nonblocking I/O subsystems. (Compilers have the same problem in trying to optimize code for hairy new architectures; seems like there would be many parallels between query optimization and the back end of a compiler.)

Back to index