Back to index
Access Path Selection in a Relational DBMS
P.G. Selinger, M.M. Astrahan, D.D> Chamberlin, R.A. Lorie, T.G. Price
Summary by: Steve Gribble and Armando Fox
One-line summary:
Optimize queries by selecting from a tree of possible query step
orderings, by using statistics of existing relations, selectivity
factors that estimate how many tuples will be returned by a
predicate, and appropriately weighted CPU and disk costs of accessing
tuples.
Overview/Main Points
- Degrees of freedom for optimization: Permutations of join orders
and method of joining; evaluation order among query blocks.
- Accessing tuples via RSS:
- Segment scan: scan group of disk pages, which may contain
tuples from many relations; return only those tuples from
a desired relation.
- Index scan (if this relation has an index which maps keys
to tuple-ID's containing that key): scan the leaf pages of
the index B-tree, then get the tuple corresponding to each
TID. The index is clustered if this can be done
without scanning any tuple page more than once.
- Scans can take a set of predicates used to filter out some
tuples. A sargable predicate is one that can be put into
the form "column-name rel-op value", e.g. "Salary < 2000".
- Overall cost being optimized is:
(page fetches) + W * (RSI I/O's), where W is a weighting factor
between CPU and disk.
- Optimizer uses the following stats, which are kept for each
relation T and index I:
- NCARD(T): cardinality of T
- TCARD(T): number of pages that hold tuples of T
- P(T): fraction of data pages in a segment that hold tuples
of T, i.e. TCARD(T)/(no. of pages in segment)
- ICARD(I): number of distinct keys in index
- NINDX(I): number of pages in index
- Selectivity factor: An estimate of what fraction of input
tuples will be returned by a relation. Table 1 shows how to
estimate F for various query elements ("column = value", "column
between val1 and val2", "expr1 OR expr2", etc.). As shown in the
table, if not enough is known about the query element to estimate
its F, a "rule of thumb constant" is used (e.g. 1/10).
- QCARD(Q): Query cardinality is computed as the product of the
cardinalities of the query's FROM list and the selectivity
factors of the query block's Boolean factors.
- Number of expected RSI calls is product of relation cardinalities
and selectivity factors of the sargable predicates.
- Join ordering. A tuple order is interesting if it
is specified by the query. Some queries may specify no
interesting orders (i.e. return results unordered). To optimize
join cost, need to choose between cheapest method that produces
an interesting order and cheapest method that produces unordered
result plus cost of sorting. Note that multiway joins can be
pipelined.
- Heuristic: find best join order for successively larger subsets
of tables. (Justification: once you've joined the first k
relations, joining the result to the k+1st relation is
independent of the order of joining the first k.)
- To reduce search space, only consider join orders that have join
pro=edicates relating the inner relation to other relations
already joined. E.g. if T1 and T2 are joined on column C1,
and T2 and T3 are joined on column C2, then the orderings
{T1 T3 T2} and {T3 T1 T2} will be ignored.
- Cost of nested loop join or merge join is:
(Outer-cost(path1) + N * Inner-cost(path2)), where outer-cost and
inner-cost are the cost of scanning a tuple (applying all
applicable predicates) from the respective paths, and N is the
product of the cardinalities and selectivity factors of all
relations joined so far (i.e. a product-of-products).
- Observation: nested-loop and merge join costs are the same
formula! Merging is sometimes better because inner-cost may be
much less.
Relevance
The seminal work on query optimization relying on table
statistics and a model of CPU and disk utilization.
Flaws
Cost models used were adequate for the hardware of the day, but probably
simplistic given today's multiheaded CPU's and synchronous nonblocking
I/O subsystems. (Compilers have the same problem in trying to optimize
code for hairy new architectures; seems like there would be many
parallels between query optimization and the back end of a compiler.)
Back to index