Randomized Algorithms for Optimizing Large Join Queries

Yannis E. Ioannidis and Younkyung Cha Kang

One-line summary: Two randomized algorithms (iterative improvement and simulated annealing), as well as a hybrid, are presented, analyzed, and measured; the hybrid scheme seems to win.

Overview/Main Points

State space: each state in query optimization corresponds to an access plan (strategy) of the query to be optimized. Here, plans are join processing trees, where leaves are base relations, internal nodes are join operators, and edges indicate the flow of data.
Neighbours: neighbours of a state are determined by a set of transformation rules. Applying a rule usually changes the cost of the tree only in a local spot of the tree, so recomputing costs is cheap.
1. the join method choice (hash, nested loops, sort-merge) can be changed
2. the join can be commuted: a join b -> b join a
3. associativity can be applied: (a join b) join c -> a join (b join c)
4. a left join exchange: (a join b) join c -> (a join c) -> join b
5. a right join exchange: a join (b join c) -> b join (a join c)
Iterative Improvement (II):
- pick a random state
- while not at a local minima,
  - pick a random neighbour to the current state
  - if the neighbour has a lower cost, move there
- repeat until a stopping condition is achieved (a timeout for II)
Simulated annealing (SA): (frozen is defined as temp 0)
- while not frozen,
  - while not equilibrium,
    - pick random neighbour
    - if neighbour is lower cost, go there
    - if not, go there anyway with a probability directly related to temperature and inversely related to cost
  - reduce temperature
- stop when frozen
Hybrid - Two phase optimization (2PO):
- Run II for some period of time to find a number of local optima
- take the best local optima, and use it as the starting state of SA, but run SA with a low starting temperature so it doesn't jump around too much
Performance:
- experiments run with random queries (start and tree queries, ranging between 5 and 100 joins). Three different relation catalogs of varying relation cardinalities, unique values, and selectivities were used.
- in nearly all cases, 2PO beats SA, and SA beats II. Although II initially decreases in cost quite quickly, it does not converge to the lowest global cost very quickly at all. SA eventually does converge to lowest global cost, but takes a while to get there. 2PO is best of both worlds.
State space analysis:
- Some (rather cheesy) experiments were done to try and quantify the shape of the very large dimensional state space.
- Their conclusion was that state space should be visualized as a cup, in which the floor of the cup has many local minima close to each other.

Relevance

Randomized algorithms work well for query optimization; they produce query plans that are competitive in quality with exhaustive search, but at faster rates.

Flaws

The assumptions made for the cost functions seem highly unlifelike: no pipelining of joins, minimal buffering, no duplicate elimination.
Scant support for the claim that the execution time savings of using r-local minima outweigh the potential misses of true local minima.
The state space analysis was just totally unbelievable. They leap to conclusions about a potentially very hairy topology with nearly no data at all.
Are these the only randomized algorithms that are applicable? What others should be explored?
The workloads they ran the optimizations against were generated with healthy doses of randomness, leading me to suspect that they had little to do with the real world. How do these algorithms perform with real-world workloads?

Back to index