One-line summary: Two new algorithms for discovering association rules between items in large databases of sales transactions are presented; empirical evidence shows that these algorithms outperform previously known algorithms by factors of 3 to 10, and that they scale linearly with database size.
Given a set of items I = {i1, i2, ..., im} and a set of transactions D, where each transaction T is a set of items with T subset I, we say T contains X if X subset T.
X => Y is an implication where X strictsubset I, Y strictsubset I, and X intersect Y = empty.
X => Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
X => Y has support s in the transaction set D if s% of transactions in D contain X union Y.
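As a sketch, both measures can be computed directly from these definitions; the transaction set and integer item names below are made up for illustration:

```python
# Toy transaction set D over hypothetical items 1..5 (not from the paper)
D = [{1, 2, 3}, {1, 2}, {2, 4}, {1, 2, 4}, {3, 5}]

def support(itemset, D):
    """Fraction of transactions in D that contain every item of itemset."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y, D):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, D) / support(X, D)

# For the rule {1} => {2}:
print(support({1, 2}, D))        # support counts transactions with X union Y
print(confidence({1}, {2}, D))   # confidence = support(X union Y) / support(X)
```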
For every large itemset l and every subset a of l, output the rule a => (l - a) if the ratio of support(l) to support(a) is at least the specified minimum confidence.
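This rule-generation step can be sketched as follows; the support values here are invented for illustration, standing in for the counts recorded during the support-counting passes:

```python
from itertools import combinations

# Hypothetical supports for a large itemset l = {1, 2, 3} and its subsets
support = {
    frozenset({1, 2, 3}): 0.4,
    frozenset({1, 2}): 0.5,
    frozenset({1, 3}): 0.4,
    frozenset({2, 3}): 0.8,
    frozenset({1}): 0.6,
    frozenset({2}): 0.9,
    frozenset({3}): 0.8,
}

def rules_from(l, min_conf):
    """Emit a => (l - a) whenever support(l)/support(a) >= min_conf."""
    out = []
    for k in range(1, len(l)):
        for a in map(frozenset, combinations(sorted(l), k)):
            if support[l] / support[a] >= min_conf:
                out.append((set(a), set(l - a)))
    return out

l = frozenset({1, 2, 3})
print(rules_from(l, 0.7))   # only antecedents with high enough support ratio
```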
The support-counting step of each pass of Apriori:

    forall transactions t in D do
        Ct = subset(Ck, t)            // candidates in Ck contained in t
        forall candidates c in Ct do
            c.count++
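A minimal sketch of this counting loop, assuming integer items and a made-up candidate set Ck; here Ct is built by enumerating each transaction's size-k subsets, whereas the paper uses a hash tree for this step:

```python
from itertools import combinations

def count_pass(D, Ck, k):
    """One pass over the data: count the support of each size-k candidate."""
    counts = {c: 0 for c in Ck}
    for t in D:                                  # forall transactions t in D
        # Ct: candidates in Ck contained in t
        Ct = [c for c in map(frozenset, combinations(sorted(t), k))
              if c in counts]
        for c in Ct:                             # forall candidates c in Ct
            counts[c] += 1                       # c.count++
    return counts

D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]
Ck = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
print(count_pass(D, Ck, 2))
```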
In other words, Apriori starts with a seed set of itemsets found to be large in the previous pass and uses it to generate new potentially large itemsets, called candidate itemsets. The actual support for these candidates is counted during the pass over the data, and non-large candidates are discarded.
The magic of Apriori is in how candidates are generated and counted. In earlier algorithms (AIS and SETM), candidates are generated on the fly during the pass as the data is read: the itemsets found large in the previous pass that are present in a transaction are extended with other items in that transaction to generate new candidates. In Apriori, the candidate itemsets to be counted are generated using only the itemsets found large in the previous pass, without considering the transactions in the database.
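A sketch of this candidate generation (the join and prune steps), assuming integer items and a made-up set of large 2-itemsets: two large (k-1)-itemsets sharing their first k-2 items are joined, and any candidate with a (k-1)-subset that is not large is pruned.

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Generate size-k candidates using only the large (k-1)-itemsets."""
    Lset = set(Lk_1)
    candidates = set()
    items = sorted(tuple(sorted(l)) for l in Lk_1)
    for p in items:
        for q in items:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join step
                c = frozenset(p + (q[-1],))
                # prune step: every (k-1)-subset of c must itself be large
                if all(frozenset(s) in Lset
                       for s in combinations(sorted(c), k - 1)):
                    candidates.add(c)
    return candidates

L2 = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 3}, {2, 4})]
print(apriori_gen(L2, 3))   # {2,3,4} is pruned because {3,4} is not large
```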