Back to index
Data Cube: A Relational Aggregation Operator Generalizing Group-By,
Cross-Tab, and Sub-Totals.
Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh.
Summary by: Steve Gribble and Armando Fox
One-line summary:
Aggregates and group-by operators produce 0- or 1-dimensional answers;
cross-tabs produce 2-d answers; the data-cube produces N-dimensional
aggregates and answers by treating each of the N aggregation attributes as
a dimension of N-space.
Relevance
Cool new operator - proven to be very, very useful while playing with data,
and has become widely adopted. Good to read about.
Flaws
- I cannot easily imagine the visualization techniques necessary to
extend this to N>3 dimensions. Sure, color and motion could be
used, but that is not at all intuitive for a human. 3-d shadows,
slices, or extrusions could be used, but that gets cluttered and
confusing (e.g. after many years of churning away at it, I still
can't totally grok a hypercube. Maybe I'm just bad at this sort
of a thing.)
- The section discussing algebraic vs. distributive vs. holistic
wasn't clear about the exact cost of computing the power set.
Give me the gory details!
- Can the power set be lazily generated somehow? Probably don't
need all dimentionalities of aggregations, and could compute on
demand.
Overview/Main Points
- aggregation operators:
- 5 functions in SQL:
count, sum, min, max, and avg
;
these return a single value.
- group by operator creates a table of aggregates indexed
by some set of attributes (e.g.
max(salary) group by
department
)
- proprietary extensions to SQL:
cumulative sum,
running sum, running average
(Red Brick systems).
- problems with group by:
- does not easily permit histograms (need to construct
table-valued expression, then aggregate over the
resulting table, and even then histogram does not
necessarily have evenly spaced categories)
- roll-up totals and sub-totals for drill-downs.
Requires N unions for an N-dimensional roll-up.
- roll-up total: reports aggregate data at a
coarse level (e.g.
average(sales) group
by model
), then at a finer level
(e.g. average(sales) group by
model,year
), etc. for successively
finer levels
- going from finer to coarser is rolling-up the
data, from coarser to finer is drilling-down
the data (I think - paper was ambiguous
about which term corresponds to which
direction.)
- cross-tabulations. imagine data:
Model | Year | Color | sales |
Chevy | 1994 | black | 50 |
Chevy | 1994 | white | 40 |
Chevy | 1995 | black | 85 |
Chevy | 1995 | white | 115 |
The cross-tabulation for this adds rows and colums to
give symmetric aggregation results across each
dimension:
Chevy | 1994 | 1995 | total(ALL) |
black | 50 | 85 | 135 |
white | 40 | 115 | 155 |
total(ALL) | 90 | 200 | 290 |
The data in italics is the original data from the
table. The extra column on the right is the
aggregation across years grouped by color, The extra
row on the bottom is the aggregation across color
grouped by year (i.e. 1-dimensional aggregate of 2-d
data). The extra point in the bottom-right
is the aggregation across all data (i.e. 0-dimensional
aggregate of the 2-d data).
- data-cube: simply a generalization of the cross-tab to
N-dimensions (usually 3). Core cube is data. Extra planes are
added for 2-d aggregations of the 3-d data. Extra edges are
added for 1-d aggregations of the 3-d data. An extra vertex is
added for 0-d aggregation of the 3-d data.
- end up with 2^n aggregates for the N-d data cube, one for each
set in the power set that represents the cube.
- issues:
- to implement as a relation, need the power set of the
relation in question - use the special token ALL to
indicate fields that have been aggregated across
(e.g. if a tuple from the sales table is
(Chevy, 1990, blue, 64)
, then you could
imagine a new field in the power set
(Chevy, ALL, blue, 182)
which is the
aggregation across all years of blue Chevys.)
- ALL token greatly complicates SQL code. Think of ALL
as being a set-value, e.g. above, ALL would represent
multiple years
{1990, 1991, 1992}
.
- How do you efficiently compute this power set (which is
really the data cube)? (If data
aggregation function has special properties, e.g. is
Distributive, Algebraic, or
Holistic, can exploit those properties in the
algorithm for computing the data cube.
Back to index