Back to index

# Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.

Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh.

Summary by: Steve Gribble and Armando Fox

One-line summary: Aggregates and group-by operators produce 0- or 1-dimensional answers; cross-tabs produce 2-d answers; the data-cube produces N-dimensional aggregates and answers by treating each of the N aggregation attributes as a dimension of N-space.

## Relevance

Cool new operator - proven to be very, very useful while playing with data, and has become widely adopted. Good to read about.

## Flaws

• I cannot easily imagine the visualization techniques necessary to extend this to N>3 dimensions. Sure, color and motion could be used, but that is not at all intuitive for a human. 3-d shadows, slices, or extrusions could be used, but that gets cluttered and confusing (e.g. after many years of churning away at it, I still can't totally grok a hypercube. Maybe I'm just bad at this sort of a thing.)
• The section discussing algebraic vs. distributive vs. holistic wasn't clear about the exact cost of computing the power set. Give me the gory details!
• Can the power set be lazily generated somehow? Probably don't need all dimentionalities of aggregations, and could compute on demand.

## Overview/Main Points

• aggregation operators:
• 5 functions in SQL: ```count, sum, min, max, and avg ```; these return a single value.
• group by operator creates a table of aggregates indexed by some set of attributes (e.g. ```max(salary) group by department```)
• proprietary extensions to SQL: ```cumulative sum, running sum, running average``` (Red Brick systems).
• problems with group by:
• does not easily permit histograms (need to construct table-valued expression, then aggregate over the resulting table, and even then histogram does not necessarily have evenly spaced categories)
• roll-up totals and sub-totals for drill-downs. Requires N unions for an N-dimensional roll-up.
• roll-up total: reports aggregate data at a coarse level (e.g. ```average(sales) group by model```), then at a finer level (e.g. ```average(sales) group by model,year```), etc. for successively finer levels
• going from finer to coarser is rolling-up the data, from coarser to finer is drilling-down the data (I think - paper was ambiguous about which term corresponds to which direction.)
• cross-tabulations. imagine data:
ModelYearColorsales
Chevy1994black50
Chevy1994white40
Chevy1995black85
Chevy1995white115
The cross-tabulation for this adds rows and colums to give symmetric aggregation results across each dimension:
Chevy19941995total(ALL)
black5085135
white40115155
total(ALL)90200290
The data in italics is the original data from the table. The extra column on the right is the aggregation across years grouped by color, The extra row on the bottom is the aggregation across color grouped by year (i.e. 1-dimensional aggregate of 2-d data). The extra point in the bottom-right is the aggregation across all data (i.e. 0-dimensional aggregate of the 2-d data).
• data-cube: simply a generalization of the cross-tab to N-dimensions (usually 3). Core cube is data. Extra planes are added for 2-d aggregations of the 3-d data. Extra edges are added for 1-d aggregations of the 3-d data. An extra vertex is added for 0-d aggregation of the 3-d data.
• end up with 2^n aggregates for the N-d data cube, one for each set in the power set that represents the cube.
• issues:
• to implement as a relation, need the power set of the relation in question - use the special token ALL to indicate fields that have been aggregated across (e.g. if a tuple from the sales table is `(Chevy, 1990, blue, 64)`, then you could imagine a new field in the power set `(Chevy, ALL, blue, 182)` which is the aggregation across all years of blue Chevys.)
• ALL token greatly complicates SQL code. Think of ALL as being a set-value, e.g. above, ALL would represent multiple years `{1990, 1991, 1992}`.
• How do you efficiently compute this power set (which is really the data cube)? (If data aggregation function has special properties, e.g. is Distributive, Algebraic, or Holistic, can exploit those properties in the algorithm for computing the data cube.

Back to index