Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.

Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh.

Summary by: Steve Gribble and Armando Fox

One-line summary: Aggregates and group-by operators produce 0- or 1-dimensional answers; cross-tabs produce 2-d answers; the data-cube produces N-dimensional aggregates and answers by treating each of the N aggregation attributes as a dimension of N-space.

Relevance

Cool new operator - proven to be very, very useful while playing with data, and has become widely adopted. Good to read about.

Flaws

I cannot easily imagine the visualization techniques necessary to extend this to N>3 dimensions. Sure, color and motion could be used, but that is not at all intuitive for a human. 3-d shadows, slices, or extrusions could be used, but that gets cluttered and confusing (e.g. after many years of churning away at it, I still can't totally grok a hypercube. Maybe I'm just bad at this sort of a thing.)
The section discussing algebraic vs. distributive vs. holistic wasn't clear about the exact cost of computing the power set. Give me the gory details!
Can the power set be lazily generated somehow? Probably don't need all dimentionalities of aggregations, and could compute on demand.

Overview/Main Points

aggregation operators:
- 5 functions in SQL: count, sum, min, max, and avg; these return a single value.
- group by operator creates a table of aggregates indexed by some set of attributes (e.g. max(salary) group by department)
- proprietary extensions to SQL: cumulative sum, running sum, running average (Red Brick systems).
problems with group by:
- does not easily permit histograms (need to construct table-valued expression, then aggregate over the resulting table, and even then histogram does not necessarily have evenly spaced categories)
- roll-up totals and sub-totals for drill-downs. Requires N unions for an N-dimensional roll-up.
  - roll-up total: reports aggregate data at a coarse level (e.g. average(sales) group by model), then at a finer level (e.g. average(sales) group by model,year), etc. for successively finer levels
  - going from finer to coarser is rolling-up the data, from coarser to finer is drilling-down the data (I think - paper was ambiguous about which term corresponds to which direction.)
- cross-tabulations. imagine data:
  
  Model Year Color sales
  
  Chevy 1994 black 50
  
  Chevy 1994 white 40
  
  Chevy 1995 black 85
  
  Chevy 1995 white 115
  
  The cross-tabulation for this adds rows and colums to give symmetric aggregation results across each dimension:
  
  Chevy 1994 1995 total(ALL)
  
  black 50 85 135
  
  white 40 115 155
  
  total(ALL) 90 200 290
  
  The data in italics is the original data from the table. The extra column on the right is the aggregation across years grouped by color, The extra row on the bottom is the aggregation across color grouped by year (i.e. 1-dimensional aggregate of 2-d data). The extra point in the bottom-right is the aggregation across all data (i.e. 0-dimensional aggregate of the 2-d data).
data-cube: simply a generalization of the cross-tab to N-dimensions (usually 3). Core cube is data. Extra planes are added for 2-d aggregations of the 3-d data. Extra edges are added for 1-d aggregations of the 3-d data. An extra vertex is added for 0-d aggregation of the 3-d data.
end up with 2^n aggregates for the N-d data cube, one for each set in the power set that represents the cube.
issues:
- to implement as a relation, need the power set of the relation in question - use the special token ALL to indicate fields that have been aggregated across (e.g. if a tuple from the sales table is (Chevy, 1990, blue, 64), then you could imagine a new field in the power set (Chevy, ALL, blue, 182) which is the aggregation across all years of blue Chevys.)
- ALL token greatly complicates SQL code. Think of ALL as being a set-value, e.g. above, ALL would represent multiple years {1990, 1991, 1992}.
- How do you efficiently compute this power set (which is really the data cube)? (If data aggregation function has special properties, e.g. is Distributive, Algebraic, or Holistic, can exploit those properties in the algorithm for computing the data cube.

Model	Year	Color	sales
Chevy	1994	black	50
Chevy	1994	white	40
Chevy	1995	black	85
Chevy	1995	white	115

Chevy	1994	1995	total(ALL)
black	50	85	135
white	40	115	155
total(ALL)	90	200	290

Back to index