Back to index
Optimization of Nonrecursive Queries
R. Krishnamurthy, H. Boral, C. Zaniolo, MCC
Summary by Armando Fox and Steve Gribble
One-line summary: In executing a rooted join tree, sub-joins
should be done in rank order (to the extent that the reordering
respects the partial ordering inherent in the tree). Authors show how
to compute join cost based on rank for such PPT's (pipelined processing
query trees, i.e n-ary join bushy trees) and LPT (linear processing
trees, i.e. left or right deep trees).
Overview/Main Points
- Observation: it's more important for an optimizer to avoid the
worst execution than to find the best one, since for many
queries, pathological inputs or datasets could cause the worst
case behavior to be orders of magnitude worse than the average
case.
- Authors work with join trees: queries where the join graph
is acyclic. For a given rooted join tree, each relation has a
unique pareant and a well-defined join selectivity (see Selinger) with respect
to the parent. (The selectivity of the root is defined to be 1.)
- For nested-loop and selective-access join, the cost function of
join(R1,R2) is
C=n1*g(n2), where n1,n2 are the number of tuples in R1 and R2 and
g(n2) is the (join-method-specific) differential cost incurred
per tuple of n1
(Sort-merge, which has cost function
C=n1+n2+n1 log n1+n2 log n2, doesn't fit this
model; we return to it later.)
- Processing (execution) tree types:
- LPT: linear processing tree; at most 1 temporary relation
is used as input to any join. (I.e. a left- or right-deep
tree, not a bushy one.)
- Binary LPT (BLPT): all joins are 2-way.
- PPT: pipelined processing tree; multiway joins are
pipeline-executed left to right (nested loop n-ary join).
- Cost equations:
- The cardinality n1,2 of joining R1
with R2m is Product(i=1...j)
(si*ni)
- The cost of a pipelined processing tree execution (PPT) is
the sum of the costs of each pairwise join, that is,
Sum(j=2...k)
(nj-1,j*gj(nj).
- The cost of a BLPT is the same as the above but
incorporates additional cost of storing intermediate join
results, i.e. replace the term gj(nj
with
(gj(nj) + Cnjsj).
- Authors claim that the above equations (in 2. and 3.) have
"the same structure". I disagree, since the additional
dependence on sj in equation 3 is nonlinear.
- Strategy for tree queries: use the ASI (adjacent sequence
interchange) property: for sequences A,B,U,V, the execution AUVB
costs less than AVUB iff the rank(U)<rank(V). I.e. the order
in which to do U and V can be decided irrespective of A and B.
- So: optimize a PPT by optimizing each subquery such that the
optimization respects the tree's partial ordering. I.e. child
joins must be done before their parents, but the order in which
children are done may not matter; so order them by rank. In
other words, convert each subtree to a chain using the rank function.
- Optimal sequence based on the "rank" cost function can be
produced in O(N log N) time if N joins are to be sequenced.
Relevance
Flaws
- Assumption throughout is that database is entirely RAM-resident
-- no disk costs. Assumption is later "relaxed" by claiming that
disk I/O cost is subsumed into the g(n) component of the "generic
cost equation". Given how much work has been done on minimizing
I/O's and optimizing their execution, I find this bogus.
- The authors claim to present a formalism for optimizing
nonrecursive queries. What they actually present is a bunch of
equations (which I still don't believe really have the same
structure), not at all the same thing. The framework they
describe doesn't seem general enough to be called a formalism,
and does not provide any new powers of reasoning about optimizing
query trees, just cost models.
- The authors need to learn to organize ideas, to write good prose,
and to typeset; or they need to delegate these duties to a
capable human or program.
Back to index