One-line summary: Two new algorithms for discovering association rules between items in large databases of sales transactions are presented; empirical evidence shows that these algorithms outperform previously known algorithms by factors of 3 to 10, and that they scale linearly with database size.
Given a set of items I = {i1, i2, ..., im} and a set of transactions D, where each transaction T is a set of items with T subset I, we say T contains X if X subset T.
X => Y is an implication where X strictsubset I, Y strictsubset I, and X intersect Y = empty.
X => Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
X => Y has support s in the transaction set D if s% of transactions in D contain X union Y.
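As a sketch, both measures can be computed directly from these definitions; the transaction set and integer item names below are made up for illustration:

```python
# Toy transaction set D over hypothetical items 1..5 (not from the paper)
D = [{1, 2, 3}, {1, 2}, {2, 4}, {1, 2, 4}, {3, 5}]

def support(itemset, D):
    """Fraction of transactions in D that contain every item of itemset."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y, D):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, D) / support(X, D)

# For the rule {1} => {2}:
print(support({1, 2}, D))        # support counts transactions with X union Y
print(confidence({1}, {2}, D))   # confidence = support(X union Y) / support(X)
```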
For every large itemset l and every subset a of l, output the rule a => (l - a) if the ratio of support(l) to support(a) is at least the specified minimum confidence.
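This rule-generation step can be sketched as follows; the support values here are invented for illustration, standing in for the counts recorded during the support-counting passes:

```python
from itertools import combinations

# Hypothetical supports for a large itemset l = {1, 2, 3} and its subsets
support = {
    frozenset({1, 2, 3}): 0.4,
    frozenset({1, 2}): 0.5,
    frozenset({1, 3}): 0.4,
    frozenset({2, 3}): 0.8,
    frozenset({1}): 0.6,
    frozenset({2}): 0.9,
    frozenset({3}): 0.8,
}

def rules_from(l, min_conf):
    """Emit a => (l - a) whenever support(l)/support(a) >= min_conf."""
    out = []
    for k in range(1, len(l)):
        for a in map(frozenset, combinations(sorted(l), k)):
            if support[l] / support[a] >= min_conf:
                out.append((set(a), set(l - a)))
    return out

l = frozenset({1, 2, 3})
print(rules_from(l, 0.7))   # only antecedents with high enough support ratio
```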
The support-counting step of each pass of Apriori:

    forall transactions t in D do
        Ct = subset(Ck, t)            // candidates in Ck contained in t
        forall candidates c in Ct do
            c.count++
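A minimal sketch of this counting loop, assuming integer items and a made-up candidate set Ck; here Ct is built by enumerating each transaction's size-k subsets, whereas the paper uses a hash tree for this step:

```python
from itertools import combinations

def count_pass(D, Ck, k):
    """One pass over the data: count the support of each size-k candidate."""
    counts = {c: 0 for c in Ck}
    for t in D:                                  # forall transactions t in D
        # Ct: candidates in Ck contained in t
        Ct = [c for c in map(frozenset, combinations(sorted(t), k))
              if c in counts]
        for c in Ct:                             # forall candidates c in Ct
            counts[c] += 1                       # c.count++
    return counts

D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]
Ck = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
print(count_pass(D, Ck, 2))
```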
In other words, Apriori starts with a seed set of itemsets found to be large in the previous pass and uses it to generate new potentially large itemsets, called candidate itemsets. The actual support for these candidates is counted during the pass over the data, and non-large candidates are discarded.
The magic of Apriori is in how candidates are generated and counted. In earlier algorithms (AIS and SETM), candidates are generated on the fly during the pass as the data is read: the itemsets found large in the previous pass that are present in a transaction are extended with other items in that transaction to generate new candidates. In Apriori, the candidate itemsets to be counted are generated using only the itemsets found large in the previous pass, without considering the transactions in the database.
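A sketch of this candidate generation (the join and prune steps), assuming integer items and a made-up set of large 2-itemsets: two large (k-1)-itemsets sharing their first k-2 items are joined, and any candidate with a (k-1)-subset that is not large is pruned.

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Generate size-k candidates using only the large (k-1)-itemsets."""
    Lset = set(Lk_1)
    candidates = set()
    items = sorted(tuple(sorted(l)) for l in Lk_1)
    for p in items:
        for q in items:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join step
                c = frozenset(p + (q[-1],))
                # prune step: every (k-1)-subset of c must itself be large
                if all(frozenset(s) in Lset
                       for s in combinations(sorted(c), k - 1)):
                    candidates.add(c)
    return candidates

L2 = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 3}, {2, 4})]
print(apriori_gen(L2, 3))   # {2,3,4} is pruned because {3,4} is not large
```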