Optimization of Nonrecursive Queries

R. Krishnamurthy, H. Boral, C. Zaniolo, MCC

Summary by Armando Fox and Steve Gribble

One-line summary: In executing a rooted join tree, sub-joins should be done in rank order (to the extent that the reordering respects the partial ordering inherent in the tree). Authors show how to compute join cost based on rank for such PPT's (pipelined processing query trees, i.e n-ary join bushy trees) and LPT (linear processing trees, i.e. left or right deep trees).

Overview/Main Points

Observation: it's more important for an optimizer to avoid the worst execution than to find the best one, since for many queries, pathological inputs or datasets could cause the worst case behavior to be orders of magnitude worse than the average case.
Authors work with join trees: queries where the join graph is acyclic. For a given rooted join tree, each relation has a unique pareant and a well-defined join selectivity (see Selinger) with respect to the parent. (The selectivity of the root is defined to be 1.)
For nested-loop and selective-access join, the cost function of join(R1,R2) is C=n1*g(n2), where n1,n2 are the number of tuples in R1 and R2 and g(n2) is the (join-method-specific) differential cost incurred per tuple of n1 (Sort-merge, which has cost function C=n1+n2+n1 log n1+n2 log n2, doesn't fit this model; we return to it later.)
Processing (execution) tree types:
- LPT: linear processing tree; at most 1 temporary relation is used as input to any join. (I.e. a left- or right-deep tree, not a bushy one.)
- Binary LPT (BLPT): all joins are 2-way.
- PPT: pipelined processing tree; multiway joins are pipeline-executed left to right (nested loop n-ary join).
Cost equations:
1. The cardinality n_1,2 of joining R₁ with R₂m is Product(i=1...j) (s_i*n_i)
2. The cost of a pipelined processing tree execution (PPT) is the sum of the costs of each pairwise join, that is, Sum(j=2...k) (n_j-1,j*g_j(n_j).
3. The cost of a BLPT is the same as the above but incorporates additional cost of storing intermediate join results, i.e. replace the term g_j(n_j with (g_j(n_j) + Cn_js_j).
4. Authors claim that the above equations (in 2. and 3.) have "the same structure". I disagree, since the additional dependence on s_j in equation 3 is nonlinear.
Strategy for tree queries: use the ASI (adjacent sequence interchange) property: for sequences A,B,U,V, the execution AUVB costs less than AVUB iff the rank(U)<rank(V). I.e. the order in which to do U and V can be decided irrespective of A and B.
So: optimize a PPT by optimizing each subquery such that the optimization respects the tree's partial ordering. I.e. child joins must be done before their parents, but the order in which children are done may not matter; so order them by rank. In other words, convert each subtree to a chain using the rank function.
Optimal sequence based on the "rank" cost function can be produced in O(N log N) time if N joins are to be sequenced.

Relevance

Flaws

Assumption throughout is that database is entirely RAM-resident -- no disk costs. Assumption is later "relaxed" by claiming that disk I/O cost is subsumed into the g(n) component of the "generic cost equation". Given how much work has been done on minimizing I/O's and optimizing their execution, I find this bogus.
The authors claim to present a formalism for optimizing nonrecursive queries. What they actually present is a bunch of equations (which I still don't believe really have the same structure), not at all the same thing. The framework they describe doesn't seem general enough to be called a formalism, and does not provide any new powers of reasoning about optimizing query trees, just cost models.
The authors need to learn to organize ideas, to write good prose, and to typeset; or they need to delegate these duties to a capable human or program.

Back to index