BIRCH: An Efficient Data Clustering Method for Very Large Databases
Tian Zhang, Raghu Ramakrishnan, Miron Livny, UW Madison
Summary by: Armando Fox and Steve Gribble
Scan data and build balanced trees representing data clusters; each
point is inserted into a cluster based on a local decision, and may
cause the cluster to split/merge (like GiST); when done, do a little
polishing, toss some outliers (all using heuristics with no statistical
basis), and voila, you have data clustering.
- Clustering: identification of densely populated regions of a
large sparse multidimensional dataset. This paper (and most
related work) does clustering based on metric attributes,
i.e. those that satisfy the requirements of a Euclidean metric
(self-identity and the triangle inequality).
- Goal: minimize running time and space requirements
- Goal: handle "noise" (outliers of clusters) well
- Formally: Given desired number of clusters K and N points,
and distance-based measurement function that measures the
cluster sizes (see below), partition the dataset
such that the measurement function is minimized.
- Related work
- Probability-based approaches: typically assume that
prob. distributions on different attributes are
statistically independent, which is often false.
- Distance-based approaches inspect all data points or
existing clusters equally for each clustering decision,
rather than looking only at "best" candidates; therefore
don't enjoy linear scaling.
- CLARANS, a randomized-search approach: search a graph in
which each node is a K-partition represented by a set of K
medoids (the point closest to each partition's center); for
each node, check at most maxneighbor neighboring nodes that
differ from it by only 1 medoid; if a better neighbor is found,
move to it; otherwise record this node as a local min. Continue
until numlocal local mins have been found and
return the best of these.
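The CLARANS loop described above can be sketched roughly as follows. The parameter names numlocal and maxneighbor come from the description; the cost function and everything else are illustrative assumptions, not the paper's actual implementation:

```python
import random

def clarans(points, k, numlocal, maxneighbor, dist):
    """Rough sketch of CLARANS: randomized search over k-medoid sets."""
    def cost(medoids):
        # total distance from each point to its nearest medoid
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):            # find numlocal local minima
        current = random.sample(points, k)
        current_cost = cost(current)
        tries = 0
        while tries < maxneighbor:       # examine random neighbors
            # a neighbor differs from the current node in exactly 1 medoid
            neighbor = list(current)
            neighbor[random.randrange(k)] = random.choice(
                [p for p in points if p not in current])
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                tries = 0                # moved; restart neighbor count
            else:
                tries += 1
        if current_cost < best_cost:     # record this local min
            best, best_cost = current, current_cost
    return best, best_cost
```

Note the non-linear scaling the summary complains about: every cost evaluation touches every data point.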
- BIRCH properties:
- Local decisions at each step give scalability
- Points in sparse regions treated as outliers, thrown away
- Incremental method: doesn't require entire dataset in
advance, and only scans dataset once
- Definitions of distances for clusters:
- Centroid: Euclidean center (mean) of the member points
- Radius: avg distance from any member point to centroid
- Diameter: avg pairwise distance in cluster
Any of the following can be used in the BIRCH algorithm as the
distance metric to compare a new data point to existing clusters:
- D0=Euclidean distance from centroid
- D1=Manhattan distance from centroid (motion only along
the dimension axes)
And for deciding whether to merge clusters:
- D2=Average Inter-cluster distance between 2 clusters
- D3=Average intra-cluster distance inside a cluster
(geometrically, the diameter of the new cluster if the 2
clusters were merged)
- D4=Variance increase distance: (?) amount by which the
intracluster distance variance changes if 2 clusters are merged
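Several of these metrics fall straight out of the clustering features (CF vectors) summarized in the next section, without touching the raw points. A minimal sketch of D0 and D2, assuming each CF is a (N, LS, SS) triple with SS simplified to a scalar sum of squared norms (the paper keeps a vector of squares, but the scalar suffices here):

```python
import math

def d0(cf1, cf2):
    """D0: Euclidean distance between the two centroids, from CFs alone."""
    n1, ls1, _ = cf1
    n2, ls2, _ = cf2
    return math.dist([x / n1 for x in ls1], [x / n2 for x in ls2])

def d2(cf1, cf2):
    """D2: average inter-cluster distance (root mean squared pairwise
    distance between points of the two clusters), from CFs alone.
    Uses sum_{i,j} ||xi-xj||^2 = N2*SS1 + N1*SS2 - 2*(LS1 . LS2)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    dot = sum(a * b for a, b in zip(ls1, ls2))
    return math.sqrt((n2 * ss1 + n1 * ss2 - 2 * dot) / (n1 * n2))
```

E.g. for the clusters {(0,0),(2,0)} and {(4,0)}, the CFs are (2,[2,0],4) and (1,[4,0],16), giving D0 = 3 (centroids (1,0) and (4,0)) and D2 = sqrt(10).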
- Clustering features and CF trees:
- CF is a 3-vector (N,LS,SS) where N is num of data
points in this cluster, LS is a vector sum of the N data points,
and SS is the sum of the (vector) squares of the points.
- Trivial theorem: if two clusters are merged, the CF of the
merged cluster is the sum of the original CF's.
- CF tree: a height-balanced tree where each node represents
the cluster that would be created by merging its
children. Leaves are current clusters.
- Leaf requirement: cluster diameter or radius must be below
some threshold T; if adding new point to the cluster would
violate this, must split the leaf.
- Optional periodic merging alleviates effects of splits
caused by page size.
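The "trivial theorem" is what makes the whole scheme incremental: merging clusters (or absorbing a point) is just component-wise addition of CFs, and the radius test for the leaf threshold T needs only the CF. A sketch, again with SS kept as a scalar sum of squared norms rather than the paper's vector of squares:

```python
import math

def cf_of(points):
    """Build a clustering feature (N, LS, SS) for a list of points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(dim)]
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    """Additivity theorem: the CF of a merged cluster is the
    component-wise sum of the two original CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

def cf_radius(cf):
    """Radius (RMS distance of members from centroid) from the CF:
    R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))
```

So an insertion only ever adds a point's CF into one leaf entry and checks cf_radius against T; no stored points are revisited.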
- At last, the BIRCH algorithm:
- Linear-scan all data and insert in CF-tree
- Condense into desirable range by building a smaller
CF-tree, removing some outliers and merging
- Global clustering over the leaves (authors used the D2 and
D4 distance metrics for this step, which can be calculated
from CF vectors; the step takes O(N²) in the number of
leaf entries)
- (optional and offline) further refinement and
outlier elimination; provably converges to a minimum
- Phase 1: initialize T, scan data values, insert points
into CF-tree. If we run out of memory before we're done,
we need to increase T and begin again by reinserting the
existing leaves into a new tree.
- Authors prove that if T is increased, the new tree will be
no larger than the old tree.
- How to choose T in phase 1 to prevent memory overflow but still
get good clustering? "Beyond scope of paper", but authors offer
several heuristics:
- Choose next T in proportion to data seen so far.
- Want to increase threshold based on some measure of
cluster volume. Can use either average volume of cluster,
or actual volume of leaves. Either way, maintain record
of leaf radii as function of number of points, use
least-squares regression to estimate growth order, then
extrapolate r and compute the volume from it.
- Make T greater than the distance between the two closest
entries on the most crowded leaf, so that these
clusters will be merged.
- Multiply current T by some heuristic expansion factor.
- Wave a dead chicken and ask the spirits for a value of T.
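The volume-based heuristic above (track leaf radii as a function of points seen, least-squares fit the growth order, extrapolate) can be sketched like this; the function name, the history format, and the log-log fit are illustrative assumptions about what the paper leaves vague:

```python
import math

def extrapolate_threshold(history, n_next):
    """Fit log r = slope * log n + intercept by least squares over
    past (points_seen, avg_leaf_radius) pairs, then extrapolate the
    radius expected after n_next points; use it to pick the next T."""
    xs = [math.log(n) for n, _ in history]
    ys = [math.log(r) for _, r in history]
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    intercept = ybar - slope * xbar
    # extrapolated radius at n_next; caller would scale this into T
    return math.exp(slope * math.log(n_next) + intercept)
```

E.g. if observed radii grow like r = sqrt(n), the fit recovers growth order 0.5 and extrapolates accordingly; none of this carries the statistical guarantees the summary complains are missing.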
- Outlier handling:
- Old leaf entry considered a potential outlier if it
has "far fewer" points than the average leaf; "far fewer"
is defined heuristically.
- If disk space runs out, rescan data to see if outliers can
be reabsorbed (e.g., due to a recent increase in T). If they
can't, there's a good chance they are true outliers.
- Synthetic data generator generates clusters centered on
grid points, on sine wave, or randomly; points are
zero-mean normally-distributed around cluster centers.
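The grid variant of that generator is easy to sketch; the function and parameter names here are illustrative, not the paper's:

```python
import random

def grid_clusters(k_side, points_per_cluster, spacing=10.0,
                  sigma=1.0, seed=0):
    """Cluster centers on a k_side x k_side grid; points are
    zero-mean normally distributed (stddev sigma) around each center."""
    rng = random.Random(seed)
    data = []
    for gx in range(k_side):
        for gy in range(k_side):
            cx, cy = gx * spacing, gy * spacing
            for _ in range(points_per_cluster):
                data.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
    return data
```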
- BIRCH is faster, produces better clusters, etc. than its
competitors... boring section, not much in the way of
proving the actual performance assumptions.
- Interesting example: clustering colors to characterize
images. Sounds useful to me.
Single-pass, sort-of-linear-time algorithm that results in a
sort-of-clustering of a large number of data points, with most
outliers discarded.
- No statistical bounds on anything.
- O(N²) global-clustering step is not linearly scalable,
contrary to (misleading?) claim in opening section.
- Performance analysis unconvincing.
- Heuristics for choosing T and finding outliers are a hack.
Heuristics are reasonable but need more justification that they
actually work.
- A nice idea but needs more hardcore analysis.