Apriori Algorithm for Generating Frequent Itemsets

apriori algorithm

Apriori Algorithm is used in finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangement if they know which items are purchased together frequently. The challenge is that given a dataset D having T transactions each with n number of attributes, how to find itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets (and checking each of the candidate itemset against support threshold.) which is computationally expensive.

Apriori algorithm effectively eliminates majority of itemsets without counting their support value. Non-frequent itemsets are known prior to calculation of support count, thus the name Apriori alogrithm. It works on the following principle which is also known as apriori property:

 

If an itemset is frequent, then all of its subsets must also be frequent.

 

Suppose{c, d, e}is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

The algorithm works on bottom-up approach, that is generate 1-itemsets, eliminate those below support threshold then generate 2-itemsets, eliminate … and so on. This way only frequent k-itemsets are used to generate k+1-itemsets significantly reducing number of candidates. Let’s see an example:

Our dataset contains four transactions with following particulars:

TID Items
100 1, 3, 4
200 2, 3, 5
300 1, 2, 3, 5
400 2, 5

Support value is calculated as support (a -> b) = Number of transactions a and b appear in / total transactions. But for the sake of simplicity we use support value as number of times each transaction appears. We also assume support threshold = 2.

Step 1 : Generate 1-itemsets, calculate support for each and mark itemsets below support threshold

Itemset Support
{1} 2
{2} 3
{3} 3
{4} 1
{5} 3

Now, mark itemsets where support is below threshold

Itemset Support
{1} 2
{2} 3
{3} 3
{4} 1
{5} 3

Step 1 : Generate 2-itemsets, calculate support for each and mark itemsets below support threshold.

Remember for generating 2-itemsets we do not consider eliminated 1-itemsets.

Itemset Support
{1, 2} 1
{1, 3} 2
{1, 5} 1
{2, 3} 2
{2, 5} 3
{3, 5} 2

Marking 2-itemsets below support threshold

Itemset Support
{1, 2} 1
{1, 3} 2
{1, 5} 1
{2, 3} 2
{2, 5} 3
{3, 5} 2

Step 3 : Generate 3-itemsets, calculate support for each and mark itemsets below support threshold

Itemset Support
{2, 3, 5} 2

 

Stop: We stop here because 4-itemsets cannot be generated as there are only three items left! Following are the frequent itemsets that we have generated and which are above support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}.

Itemsets generated but not qualified are : {4}, {1, 2} and {1, 5}.

If we do not use Apriori and instead rely on brute-force approach then number of candidate itemsets generated will be much larger!

Feel free to ask your questions in comments section below!

2
Comments

avatar
1 Comment threads
1 Thread replies
0 Followers
 
Most reacted comment
Hottest comment thread
2 Comment authors
Devji ChhangaDhavan Recent comment authors
  Subscribe  
newest oldest most voted
Notify of
Dhavan
Guest
Dhavan

Here is an implementation to get all subsets of a set in Scheme:

(define (subsets s)
(if (null? s)
(list ‘())
(let ((rest (subsets (cdr s))))
(append rest (map (lambda (x) (cons (car s) x)) rest)))))

A very nice and clear explanation. I will try to implement the algorithm in Scheme.

I have a question:
What is the mathematical reason of dividing by “total transactions” while calculating the support value?