BIRCH: An Efficient Data Clustering Method for Very Large Databases
Tian Zhang, Raghu Ramakrishnan, Miron Livny, UW Madison
Summary by: Armando Fox and Steve Gribble
Scan data and build balanced trees representing data clusters; each
point is inserted into a cluster based on a local decision, and may
cause the cluster to split/merge (like GiST); when done, do a little
polishing, toss some outliers (all using heuristics with no statistical
basis), and voila, you have data clustering.
- Clustering: identification of densely populated regions of a
large sparse multidimensional dataset. This paper (and most
related work) does clustering based on metric attributes,
i.e. those that satisfy the requirements of a Euclidean metric
(self-identity and the triangle inequality).
- Goal: minimize running time and space requirements
- Goal: handle "noise" (outliers of clusters) well
- Formally: Given desired number of clusters K and N points,
and distance-based measurement function that measures the
cluster sizes (see below), partition the dataset
such that the measurement function is minimized.
- Related work
- Probability-based approaches: typically assume that
prob. distributions on different attributes are
statistically independent, which is often false.
- Distance-based approaches inspect all data points or
existing clusters equally for each clustering decision,
rather than looking only at "best" candidates; therefore
don't enjoy linear scaling.
- CLARANS, a randomized-search approach: search a graph in
which each node is a K-partition represented by a set of K
medoids (the point closest to each partition's center); for
each node, check at most maxneighbor neighboring nodes that
differ from it by only 1 medoid; if a better neighbor is found,
move to it; otherwise record this node as a local min. Continue
until numlocal local mins have been found and
return the best of these.
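The CLARANS loop described above can be sketched roughly as follows. The parameter names numlocal and maxneighbor come from the description; the cost function and everything else are illustrative assumptions, not the paper's actual implementation:

```python
import random

def clarans(points, k, numlocal, maxneighbor, dist):
    """Rough sketch of CLARANS: randomized search over k-medoid sets."""
    def cost(medoids):
        # total distance from each point to its nearest medoid
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):            # find numlocal local minima
        current = random.sample(points, k)
        current_cost = cost(current)
        tries = 0
        while tries < maxneighbor:       # examine random neighbors
            # a neighbor differs from the current node in exactly 1 medoid
            neighbor = list(current)
            neighbor[random.randrange(k)] = random.choice(
                [p for p in points if p not in current])
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                tries = 0                # moved; restart neighbor count
            else:
                tries += 1
        if current_cost < best_cost:     # record this local min
            best, best_cost = current, current_cost
    return best, best_cost
```

Note the non-linear scaling the summary complains about: every cost evaluation touches every data point.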
- BIRCH properties:
- Local decisions at each step give scalability
- Points in sparse regions treated as outliers, thrown away
- Incremental method: doesn't require entire dataset in
advance, and only scans dataset once
- Definitions of distances for clusters:
- Centroid: Euclidean center (mean) of the member points
- Radius: avg distance from any member point to centroid
- Diameter: avg pairwise distance in cluster
Any of the following can be used in the BIRCH algorithm as the
distance metric to compare a new data point to existing clusters:
- D0=Euclidean distance from centroid
- D1=Manhattan distance from centroid (motion only along
the dimension axes)
And for deciding whether to merge clusters:
- D2=Average Inter-cluster distance between 2 clusters
- D3=Average intra-cluster distance inside a cluster
(geometrically, the diameter of the new cluster if the 2
clusters were merged)
- D4=Variance increase distance: (?) amount by which the
intracluster distance variance changes if 2 clusters are merged
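Several of these metrics fall straight out of the clustering features (CF vectors) summarized in the next section, without touching the raw points. A minimal sketch of D0 and D2, assuming each CF is a (N, LS, SS) triple with SS simplified to a scalar sum of squared norms (the paper keeps a vector of squares, but the scalar suffices here):

```python
import math

def d0(cf1, cf2):
    """D0: Euclidean distance between the two centroids, from CFs alone."""
    n1, ls1, _ = cf1
    n2, ls2, _ = cf2
    return math.dist([x / n1 for x in ls1], [x / n2 for x in ls2])

def d2(cf1, cf2):
    """D2: average inter-cluster distance (root mean squared pairwise
    distance between points of the two clusters), from CFs alone.
    Uses sum_{i,j} ||xi-xj||^2 = N2*SS1 + N1*SS2 - 2*(LS1 . LS2)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    dot = sum(a * b for a, b in zip(ls1, ls2))
    return math.sqrt((n2 * ss1 + n1 * ss2 - 2 * dot) / (n1 * n2))
```

E.g. for the clusters {(0,0),(2,0)} and {(4,0)}, the CFs are (2,[2,0],4) and (1,[4,0],16), giving D0 = 3 (centroids (1,0) and (4,0)) and D2 = sqrt(10).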
- Clustering features and CF trees:
- CF is a 3-vector (N,LS,SS) where N is num of data
points in this cluster, LS is a vector sum of the N data points,
and SS is the sum of the (vector) squares of the points.
- Trivial theorem: if two clusters are merged, the CF of the
merged cluster is the sum of the original CF's.
- CF tree: a height-balanced tree where each node represents
the cluster that would be created by merging its
children. Leaves are current clusters.
- Leaf requirement: cluster diameter or radius must be below
some threshold T; if adding new point to the cluster would
violate this, must split the leaf.
- Optional periodic merging alleviates effects of splits
caused by page size.
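The "trivial theorem" is what makes the whole scheme incremental: merging clusters (or absorbing a point) is just component-wise addition of CFs, and the radius test for the leaf threshold T needs only the CF. A sketch, again with SS kept as a scalar sum of squared norms rather than the paper's vector of squares:

```python
import math

def cf_of(points):
    """Build a clustering feature (N, LS, SS) for a list of points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[i] for p in points) for i in range(dim)]
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    """Additivity theorem: the CF of a merged cluster is the
    component-wise sum of the two original CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

def cf_radius(cf):
    """Radius (RMS distance of members from centroid) from the CF:
    R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))
```

So an insertion only ever adds a point's CF into one leaf entry and checks cf_radius against T; no stored points are revisited.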
- At last, the BIRCH algorithm:
- Linear-scan all data and insert in CF-tree
- Condense into desirable range by building a smaller
CF-tree, removing some outliers and merging
- Global clustering over the leaves (authors used the D2 and
D4 distance metrics for this step, which can be calculated
from CF vectors; the step takes O(N²) in the number of
leaf entries)
- (optional and offline) further refinement and
outlier elimination; provably converges to a minimum
- Phase 1: initialize T, scan data values, insert points
into CF-tree. If we run out of memory before we're done,
we need to increase T and begin again by reinserting the
existing leaves into a new tree.
- Authors prove that if T is increased, the new tree will be
no larger than the old tree.
- How to choose T in phase 1 to prevent memory overflow but still
get good clustering? "Beyond scope of paper", but authors offer
several heuristics:
- Choose next T in proportion to data seen so far.
- Want to increase threshold based on some measure of
cluster volume. Can use either average volume of cluster,
or actual volume of leaves. Either way, maintain record
of leaf radii as function of number of points, use
least-squares regression to estimate growth order, then
extrapolate r and compute the volume from it.
- Make T greater than the distance between the two closest
entries on the most crowded leaf, so that these
clusters will be merged.
- Multiply current T by some heuristic expansion factor.
- Wave a dead chicken and ask the spirits for a value of T.
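The volume-based heuristic above (track leaf radii as a function of points seen, least-squares fit the growth order, extrapolate) can be sketched like this; the function name, the history format, and the log-log fit are illustrative assumptions about what the paper leaves vague:

```python
import math

def extrapolate_threshold(history, n_next):
    """Fit log r = slope * log n + intercept by least squares over
    past (points_seen, avg_leaf_radius) pairs, then extrapolate the
    radius expected after n_next points; use it to pick the next T."""
    xs = [math.log(n) for n, _ in history]
    ys = [math.log(r) for _, r in history]
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    intercept = ybar - slope * xbar
    # extrapolated radius at n_next; caller would scale this into T
    return math.exp(slope * math.log(n_next) + intercept)
```

E.g. if observed radii grow like r = sqrt(n), the fit recovers growth order 0.5 and extrapolates accordingly; none of this carries the statistical guarantees the summary complains are missing.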
- Outlier handling:
- Old leaf entry considered a potential outlier if it
has "far fewer" points than the average leaf; "far fewer"
is defined heuristically.
- If disk space runs out, rescan data to see if outliers can
be reabsorbed (e.g., due to a recent increase in T). If they
can't, there's a good chance they are true outliers.
- Synthetic data generator generates clusters centered on
grid points, on sine wave, or randomly; points are
zero-mean normally-distributed around cluster centers.
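The grid variant of that generator is easy to sketch; the function and parameter names here are illustrative, not the paper's:

```python
import random

def grid_clusters(k_side, points_per_cluster, spacing=10.0,
                  sigma=1.0, seed=0):
    """Cluster centers on a k_side x k_side grid; points are
    zero-mean normally distributed (stddev sigma) around each center."""
    rng = random.Random(seed)
    data = []
    for gx in range(k_side):
        for gy in range(k_side):
            cx, cy = gx * spacing, gy * spacing
            for _ in range(points_per_cluster):
                data.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
    return data
```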
- BIRCH is faster, produces better clusters, etc. than its
competitors... boring section, not much in the way of
proving the actual performance assumptions.
- Interesting example: clustering colors to characterize
images. Sounds useful to me.
Single-pass, sort-of-linear-time algorithm that results in a
sort-of-clustering of a large number of data points, with most
outliers discarded.
- No statistical bounds on anything.
- O(N²) global-clustering step is not linearly scalable,
contrary to (misleading?) claim in opening section.
- Performance analysis unconvincing.
- Heuristics for choosing T and finding outliers are a hack.
Heuristics are reasonable but need more justification that they
actually work.
- A nice idea but needs more hardcore analysis.