Apriori Algorithm is used in finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangement if they know which items are purchased together frequently. The challenge is that given a dataset D having T transactions each with n number of attributes, how to find itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets (and checking each of the candidate itemset against support threshold.) which is computationally expensive.

Apriori algorithm effectively eliminates majority of itemsets without counting their support value. Non-frequent itemsets are known prior to calculation of support count, thus the name Apriori alogrithm. It works on the following principle which is also known as apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

Suppose{c, d, e}is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

The algorithm works on bottom-up approach, that is generate 1-itemsets, eliminate those below support threshold then generate 2-itemsets, eliminate … and so on. This way only frequent k-itemsets are used to generate k+1-itemsets significantly reducing number of candidates. Let’s see an example:

Our dataset contains four transactions with following particulars:

 TID Items 100 1, 3, 4 200 2, 3, 5 300 1, 2, 3, 5 400 2, 5

Support value is calculated as support (a -> b) = Number of transactions a and b appear in / total transactions. But for the sake of simplicity we use support value as number of times each transaction appears. We also assume support threshold = 2.

Step 1 : Generate 1-itemsets, calculate support for each and mark itemsets below support threshold

 Itemset Support {1} 2 {2} 3 {3} 3 {4} 1 {5} 3

Now, mark itemsets where support is below threshold

 Itemset Support {1} 2 {2} 3 {3} 3 {4} 1 {5} 3

Step 1 : Generate 2-itemsets, calculate support for each and mark itemsets below support threshold.

Remember for generating 2-itemsets we do not consider eliminated 1-itemsets.

 Itemset Support {1, 2} 1 {1, 3} 2 {1, 5} 1 {2, 3} 2 {2, 5} 3 {3, 5} 2

Marking 2-itemsets below support threshold

 Itemset Support {1, 2} 1 {1, 3} 2 {1, 5} 1 {2, 3} 2 {2, 5} 3 {3, 5} 2

Step 3 : Generate 3-itemsets, calculate support for each and mark itemsets below support threshold

 Itemset Support {2, 3, 5} 2

Stop: We stop here because 4-itemsets cannot be generated as there are only three items left! Following are the frequent itemsets that we have generated and which are above support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}.

Itemsets generated but not qualified are : {4}, {1, 2} and {1, 5}.

If we do not use Apriori and instead rely on brute-force approach then number of candidate itemsets generated will be much larger!

Feel free to ask your questions in comments section below! ## Want to Learn More? Signup in a Click. 0 Followers

Most reacted comment
2 Comment authors  Recent comment authors
Subscribe
Notify of Guest
Dhavan

Here is an implementation to get all subsets of a set in Scheme:

(define (subsets s)
(if (null? s)
(list ‘())
(let ((rest (subsets (cdr s))))
(append rest (map (lambda (x) (cons (car s) x)) rest)))))

A very nice and clear explanation. I will try to implement the algorithm in Scheme.

I have a question:
What is the mathematical reason of dividing by “total transactions” while calculating the support value?