
# Fast Algorithms for Mining Association Rules

Rakesh Agrawal and Ramakrishnan Srikant

One-line summary: Two new algorithms for discovering association rules between items in large databases of sales transactions are presented; empirical evidence shows that these algorithms outperform previously known algorithms by factors of 3 to 10, and that they scale linearly with database size.

## Relevance

Data mining is a hot topic, popular in industry and now in research. This algorithm speeds up the process of discovering association rules: if you want to know how many people who buy toothpaste also buy suntan lotion at the same time, this is the algorithm to use.

## Flaws

• No explicit details on number of I/Os, memory requirements, CPU requirements, temporary storage requirements, etc. are given for this (or any of the competing) algorithms.
• Their synthetic workload has a decent amount of thought put into it, but it is unclear how realistic it really is.
• Scaling experiments only go up to a database of 10,000,000 transactions. Not at all realistic: databases tend to be much, much larger for interesting problems.

## Overview/Main Points

• Association Rules:
• Given a set of items `I = {i1, i2, ..., im}` and a set of transactions `D`, where each transaction `T` is a set of items with `T subset I`, we say:
• `T` contains `X` if `X subset T`.
• `X => Y` is an implication where `X strictsubset I`, `Y strictsubset I`, and `X intersect Y = empty`.
• We say that the rule `X => Y` holds in the transaction set `D` with confidence c if c% of transactions in `D` that contain `X` also contain `Y`.
• The rule `X => Y` has support s in the transaction set `D` if s% of transactions in D contain `X union Y`.
• Problem at hand: mine all association rules out of database `D` that have support and confidence greater than some user-specified minimum support and confidence. The problem decomposes into two subproblems:
1. Find all sets of items (itemsets) that have transaction support above the minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets. An itemset of size k is a k-itemset.
2. Use the large itemsets to generate the desired rules. Easy algorithm: for every large itemset `l` and every subset `a` of `l`, output a rule of the form `a => (l - a)` if the ratio of support(l) to support(a) is at least the specified minimum confidence.
The authors claim subproblem 2 is easy, and concentrate on subproblem 1.
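The definitions and the rule-generation step above can be sketched in Python. The toy transaction database and the item names here are made up for illustration, and support is computed as a fraction of transactions rather than an absolute count:

```python
from itertools import combinations

# Hypothetical toy transaction database (items chosen only for illustration).
D = [
    {"toothpaste", "suntan lotion", "soap"},
    {"toothpaste", "suntan lotion"},
    {"toothpaste", "soap"},
    {"suntan lotion"},
]

def support(itemset, D):
    """Fraction of transactions in D that contain the itemset."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y, D):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, D) / support(X, D)

def rules_from_large_itemset(l, D, minconf):
    """Subproblem 2: output a => (l - a) whenever
    support(l) / support(a) >= minconf."""
    rules = []
    for r in range(1, len(l)):
        for a in combinations(sorted(l), r):
            a = frozenset(a)
            if support(l, D) / support(a, D) >= minconf:
                rules.append((a, frozenset(l) - a))
    return rules
```

For example, `confidence({"suntan lotion"}, {"toothpaste"}, D)` is 2/3 here: three of the four transactions contain suntan lotion, and two of those also contain toothpaste.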
• Some notation:
• `Lk`: set of large k-itemsets (i.e., those with minimum support). Each member of the set has an itemset and a support count.
• `Ck`: set of candidate k-itemsets (potentially large). Each member has an itemset and a support count.
• `C̄k`: set of candidate k-itemsets for which the TIDs of the generating transactions are kept associated with the candidates.
• Apriori algorithm:
• First pass: counts item occurrences to determine the large 1-itemsets.
• 2nd and subsequent passes:

```
for (k = 2; Lk-1 not empty; k++) {
    Ck = apriori-gen(Lk-1);      // new candidates
    forall transactions t in D {
        Ct = subset(Ck, t);      // candidates contained in t
        forall candidates c in Ct
            c.count++;
    }
    Lk = { c in Ck | c.count >= minsupport };
}
```
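The pass structure above, together with a join-and-prune `apriori-gen` (the paper's candidate generation: join `Lk-1` with itself, then prune any candidate with a (k-1)-subset not in `Lk-1`), can be sketched in Python. Representing itemsets as `frozenset`s and taking `minsup_count` as an absolute transaction count are simplifying assumptions of this sketch:

```python
from itertools import combinations
from collections import defaultdict

def apriori_gen(L_prev, k):
    """Join L_{k-1} with itself on the first k-2 items, then prune
    candidates that have a (k-1)-subset not in L_{k-1}."""
    L_prev = set(L_prev)
    sorted_sets = [tuple(sorted(s)) for s in L_prev]
    candidates = set()
    for p in sorted_sets:
        for q in sorted_sets:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:  # join step
                c = frozenset(p) | {q[-1]}
                # prune step: every (k-1)-subset must be large
                if all(frozenset(s) in L_prev
                       for s in combinations(sorted(c), k - 1)):
                    candidates.add(c)
    return candidates

def apriori(D, minsup_count):
    """Return the list [L1, L2, ...] of sets of large itemsets."""
    # First pass: count item occurrences to get the large 1-itemsets.
    counts = defaultdict(int)
    for t in D:
        for i in t:
            counts[frozenset([i])] += 1
    L = {s for s, n in counts.items() if n >= minsup_count}
    large = [L]
    k = 2
    while L:
        Ck = apriori_gen(L, k)
        counts = defaultdict(int)
        for t in D:
            for c in Ck:          # subset(Ck, t): candidates contained in t
                if c <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup_count}
        if L:
            large.append(L)
        k += 1
    return large
```

The prune step is what makes Apriori fast in practice: any subset of a large itemset must itself be large, so candidates with a small (k-1)-subset can be discarded before the database is scanned.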