Apriori Algorithm for Generating Frequent Itemsets

A

Apriori Algorithm is used in finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangement if they know which items are purchased together frequently. The challenge is that given a dataset D having T transactions each with n number of attributes, how to find itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets (and checking each of the candidate itemset against support threshold.) which is computationally expensive.

Apriori algorithm effectively eliminates majority of itemsets without counting their support value. Non-frequent itemsets are known prior to calculation of support count, thus the name Apriori alogrithm. It works on the following principle which is also known as apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

Suppose{c, d, e}is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

The algorithm works on bottom-up approach, that is generate 1-itemsets, eliminate those below support threshold then generate 2-itemsets, eliminate … and so on. This way only frequent k-itemsets are used to generate k+1-itemsets significantly reducing number of candidates. Let’s see an example:

Our dataset contains four transactions with following particulars:

 TID Items 100 1, 3, 4 200 2, 3, 5 300 1, 2, 3, 5 400 2, 5

Support value is calculated as support (a -> b) = Number of transactions a and b appear in / total transactions. But for the sake of simplicity we use support value as number of times each transaction appears. We also assume support threshold = 2.

Step 1 : Generate 1-itemsets, calculate support for each and mark itemsets below support threshold

 Itemset Support {1} 2 {2} 3 {3} 3 {4} 1 {5} 3

Now, mark itemsets where support is below threshold

 Itemset Support {1} 2 {2} 3 {3} 3 {4} 1 {5} 3

Step 1 : Generate 2-itemsets, calculate support for each and mark itemsets below support threshold.

Remember for generating 2-itemsets we do not consider eliminated 1-itemsets.

 Itemset Support {1, 2} 1 {1, 3} 2 {1, 5} 1 {2, 3} 2 {2, 5} 3 {3, 5} 2

Marking 2-itemsets below support threshold

 Itemset Support {1, 2} 1 {1, 3} 2 {1, 5} 1 {2, 3} 2 {2, 5} 3 {3, 5} 2

Step 3 : Generate 3-itemsets, calculate support for each and mark itemsets below support threshold

 Itemset Support {2, 3, 5} 2

Stop: We stop here because 4-itemsets cannot be generated as there are only three items left! Following are the frequent itemsets that we have generated and which are above support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}.

Itemsets generated but not qualified are : {4}, {1, 2} and {1, 5}.

If we do not use Apriori and instead rely on brute-force approach then number of candidate itemsets generated will be much larger!

This site uses Akismet to reduce spam. Learn how your comment data is processed.

• Dhavan says:

Here is an implementation to get all subsets of a set in Scheme:

(define (subsets s)
(if (null? s)
(list ‘())
(let ((rest (subsets (cdr s))))
(append rest (map (lambda (x) (cons (car s) x)) rest)))))

A very nice and clear explanation. I will try to implement the algorithm in Scheme.

I have a question:
What is the mathematical reason of dividing by “total transactions” while calculating the support value?

• In this example only absolute support value is used, that is if an itemset appears 3 times support is stated as 3. But if you want to generalize support calculation, you need to use this formula. Obviously, denominator is total number of TXNs because we want to figure out how frequent an itemset is!