
What does the Apriori algorithm do? The Apriori algorithm was proposed in 1994 by R. Agrawal and R. Srikant. The term ‘Apriori’ means ‘with prior knowledge’, which reflects how the algorithm works: it uses what it already knows about frequent itemsets to narrow the search for larger ones. The algorithm performs a level-wise search in which the frequent k-itemsets (itemsets containing k items) found in one pass are used to generate the candidate (k+1)-itemsets for the next pass. Its most popular use is finding frequent itemsets for association rule mining.

The central idea of the Apriori algorithm is that the support measure is anti-monotone: adding items to an itemset can never increase its support. It follows that every non-empty subset of a frequent itemset is also frequent. This is called the Apriori property. Equivalently, if an itemset is infrequent, then all its supersets are infrequent too.

The Apriori property can be summed up as follows:

- All non-empty subsets of a frequent itemset must be frequent.
- If an itemset is infrequent, then all its supersets are also infrequent.
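The Apriori property can be checked directly on a small dataset. The sketch below uses a hypothetical nine-transaction dataset (assumed for illustration; its counts agree with the supports used in the confidence calculations later in this article) and a helper named `support_count`, which is also an assumption of this sketch. It verifies that every proper subset of {I1, I2, I3} is contained in at least as many transactions as the itemset itself.

```python
from itertools import combinations

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


itemset = ("I1", "I2", "I3")
whole = support_count(itemset, TRANSACTIONS)

# Anti-monotonicity: a subset can never be rarer than its superset.
subset_counts = {}
for size in range(1, len(itemset)):
    for subset in combinations(itemset, size):
        subset_counts[subset] = support_count(subset, TRANSACTIONS)
        assert subset_counts[subset] >= whole

print(whole, subset_counts)
```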

The Apriori property is important because it shrinks the search space and makes the level-wise generation of frequent itemsets more efficient. It is also the source of many of the algorithm's advantages and disadvantages.

Let’s use the example below to understand the algorithm, its definitions, its steps, and how association rules are generated.

The dataset below is the one we walk through to find the frequent itemsets needed to generate the association rules.

Here the minimum support count is 2 and the minimum confidence is 60%.

This is how the algorithm works.

Step-1: Suppose K=1

(I) Create a table with the support count of each item in the dataset. This table is called the candidate set C1, shown below.

(II) Next, compare the support count of each item in the candidate set with the minimum support count, which is 2 in this case. Any item whose support_count is less than min_support is removed. This gives us the frequent itemset L1 below.
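Step 1 can be sketched in a few lines of Python. The transactions here are a hypothetical dataset assumed for illustration, and the function name `frequent_one_itemsets` is likewise an assumption of this sketch, not part of any library.

```python
from collections import Counter

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
MIN_SUPPORT = 2


def frequent_one_itemsets(transactions, min_support):
    """Return (C1, L1): all item counts, and those meeting min_support."""
    c1 = Counter(item for t in transactions for item in t)
    l1 = {item: n for item, n in c1.items() if n >= min_support}
    return dict(c1), l1


c1, l1 = frequent_one_itemsets(TRANSACTIONS, MIN_SUPPORT)
print(sorted(c1.items()))
print(sorted(l1.items()))
```

With this assumed data every item reaches the minimum support count, so L1 contains the same five items as C1.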

Step-2: We now use K=2 and repeat the process.

- Generate the next candidate set C2 from L1 in what is called the ‘join step’. The condition for joining L_{k-1} with L_{k-1} is that the two itemsets have (K-2) elements in common.
- Check whether all subsets of each candidate itemset are frequent. If not, remove that itemset. For example, the subsets of {I1, I2} are {I1} and {I2}, and both are frequent, so {I1, I2} is kept. Every itemset is checked in this way.
- Calculate the support count of the remaining itemsets by searching the dataset, and tabulate the results as below.

(II) Now compare the support count of each candidate in C2 with the minimum support count, which is min_support=2 in the example. If an itemset's support count is less than min_support, remove it. The resulting itemset is L2, shown below.
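A minimal sketch of Step 2 follows, again on a hypothetical dataset assumed for illustration. For K=2 the join condition requires zero common elements, so C2 is simply every pair of items from L1. The helper `support_count` is an assumption of this sketch.

```python
from itertools import combinations

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
MIN_SUPPORT = 2


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


# L1: items whose support count reaches the minimum.
all_items = sorted({item for t in TRANSACTIONS for item in t})
l1 = [i for i in all_items if support_count((i,), TRANSACTIONS) >= MIN_SUPPORT]

# Join step for K=2: every pair of frequent items is a candidate.
c2 = {pair: support_count(pair, TRANSACTIONS) for pair in combinations(l1, 2)}

# Keep only the candidates that meet the minimum support count.
l2 = {pair: n for pair, n in c2.items() if n >= MIN_SUPPORT}

for pair, n in sorted(l2.items()):
    print(pair, n)
```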

Step-3: We now generate the candidate set C3 from L2 with a join step. The condition for joining L_{k-1} with L_{k-1} is that the two itemsets have (K-2) elements in common, so here the first element of the two L2 itemsets must match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.

Next, check whether all subsets of each itemset are frequent, and remove the itemset if any subset is not. For example, the 2-item subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, all of which are frequent, so it is kept. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so the itemset is removed. Every itemset is checked in the same way, and the support count of each remaining itemset is then found by searching the dataset.

(II) Compare the support count of each candidate in C3 with the minimum support count set at the beginning (here, 2). Any candidate whose support count is less than the minimum support count is removed, giving the new itemset L3.
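The join and prune operations of Step 3 can be sketched as two small functions. The names `join` and `prune` are assumptions of this sketch, and the list `L2` is an assumed set of frequent 2-itemsets, chosen to be consistent with the six joined candidates listed in Step 3. Itemsets are stored as sorted tuples so that "sharing the first K-2 elements" is a simple prefix comparison.

```python
from itertools import combinations

# Assumed frequent 2-itemsets, consistent with the candidates listed in Step 3.
L2 = [
    ("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
    ("I2", "I3"), ("I2", "I4"), ("I2", "I5"),
]


def join(prev_level):
    """Join sorted (k-1)-itemsets that agree on their first k-2 items."""
    prev = sorted(prev_level)
    return [a + (b[-1],) for a, b in combinations(prev, 2) if a[:-1] == b[:-1]]


def prune(candidates, prev_level):
    """Drop candidates that have any infrequent (k-1)-item subset."""
    prev = set(prev_level)
    return [
        c for c in candidates
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    ]


joined = join(L2)
c3 = prune(joined, L2)
print(joined)
print(c3)
```

Only the pruned candidates need to be counted against the dataset, which is where the Apriori property saves work.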

Step-4: Generate the candidate set C4 from L3 with the join step. The condition for joining L_{k-1} with L_{k-1} (where K=4) is that the two itemsets have (K-2) elements in common, so for L3 the first 2 elements must match. Then check whether all subsets of each candidate are frequent. Joining L3 gives {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent. Thus C4 contains no itemsets.

Because C4 is empty, no more frequent itemsets can be found and the algorithm stops.
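Putting the steps together, the whole level-wise search fits in one short function. This is a minimal sketch rather than a tuned implementation: the function name `apriori` and the dataset are assumptions made for illustration, and the database is rescanned for every candidate.

```python
from itertools import combinations

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]


def apriori(transactions, min_support):
    """Return {itemset as sorted tuple: support count} for all frequent itemsets."""

    def count(itemset):
        items = set(itemset)
        return sum(1 for t in transactions if items <= t)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_support}

    # Level 1: frequent single items.
    singles = sorted({item for t in transactions for item in t})
    level = keep_frequent((item,) for item in singles)

    frequent = {}
    while level:
        frequent.update(level)
        prev = sorted(level)
        k = len(prev[0])  # size of the itemsets in the current level
        # Join: merge itemsets that share their first k-1 items.
        joined = [a + (b[-1],) for a, b in combinations(prev, 2) if a[:-1] == b[:-1]]
        # Prune: every k-item subset of a candidate must already be frequent.
        pruned = [c for c in joined if all(s in level for s in combinations(c, k))]
        level = keep_frequent(pruned)
    return frequent


frequent = apriori(TRANSACTIONS, min_support=2)
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

The loop ends as soon as a level comes back empty, which matches the stopping condition reached in Step 4.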

To find the association rules, we need to check the confidence of each candidate rule. For a rule such as {milk, bread} => {butter}, a confidence of 60% means that 60% of the customers who buy milk and bread also buy butter. The confidence of a rule is calculated as below.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)

Going back to the example, take any frequent itemset found above. For the itemset {I1, I2, I3} from L3, the candidate rules are:

[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%

[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%

[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%

[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%

[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100≈29%

[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%

Since the minimum confidence in this example is 60%, none of these rules qualifies as a strong association rule. If the minimum confidence were set at 50% instead, the first 3 rules would be strong association rules.
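The confidence calculations above can be reproduced with a short script. The support counts are the ones used in the calculations above; the function name `rules_for` and the tuple layout of its result are assumptions of this sketch.

```python
from itertools import combinations

# Support counts taken from the confidence calculations above.
SUPPORT = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}


def rules_for(itemset, support, min_confidence):
    """List (antecedent, consequent, confidence, is_strong) for one itemset."""
    full = frozenset(itemset)
    result = []
    for size in range(1, len(full)):
        for antecedent in combinations(sorted(full), size):
            a = frozenset(antecedent)
            # Confidence(A -> B) = Support_count(A u B) / Support_count(A)
            confidence = support[full] / support[a]
            consequent = tuple(sorted(full - a))
            result.append((antecedent, consequent, confidence, confidence >= min_confidence))
    return result


for antecedent, consequent, confidence, strong in rules_for({"I1", "I2", "I3"}, SUPPORT, 0.6):
    print(antecedent, "=>", consequent, f"{confidence:.0%}", "strong" if strong else "weak")

strong_at_60 = sum(r[3] for r in rules_for({"I1", "I2", "I3"}, SUPPORT, 0.6))
strong_at_50 = sum(r[3] for r in rules_for({"I1", "I2", "I3"}, SUPPORT, 0.5))
```

Changing the threshold passed to `rules_for` shows how sensitive the set of strong rules is to the minimum confidence: at 60% no rule from this itemset passes, while at 50% the three rules with two-item antecedents do.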

One of the biggest limitations of the Apriori algorithm is that it is slow. Its running time is driven by:

- A large number of itemsets in the dataset.
- A low minimum support threshold.
- The time and memory needed to hold a large number of candidate sets when there are many frequent itemsets.

Thus it is inefficient when used with large datasets.

As an example, suppose there are 10^4 frequent 1-itemsets. The Apriori algorithm then needs to generate more than 10^7 candidate 2-itemsets, each of which must be tested and counted. To detect a frequent pattern of size 100 (containing v1, v2… v100), the algorithm has to generate on the order of 2^100 candidate itemsets in total.
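These figures are easy to verify. The short calculation below counts the pairs that can be formed from 10^4 frequent items and the non-empty subsets of a 100-item pattern; the variable names are assumptions of this sketch.

```python
from math import comb

n_frequent_items = 10**4

# Every pair of frequent 1-itemsets is a candidate 2-itemset.
candidate_pairs = comb(n_frequent_items, 2)

# A frequent 100-itemset has 2^100 - 1 non-empty subsets, all of them frequent.
subsets_of_pattern = 2**100 - 1

print(f"{candidate_pairs:,}")
print(f"{subsets_of_pattern:.3e}")
```

The pair count comes to roughly 5 × 10^7, comfortably above the 10^7 lower bound, and the subset count is on the order of 10^30.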

Hence, the cost escalates and a lot of time is spent on candidate generation, which dominates the time complexity of the Apriori algorithm. Also, to check the many candidate itemsets, the algorithm scans the database many times, which is expensive. Performance degrades further when system memory is insufficient and there are a large number of transactions. That’s why the algorithm becomes inefficient and slow with large databases.

The Apriori algorithm is used to discover frequent itemsets in a dataset for mining Boolean association rules. It uses prior knowledge of frequent itemset properties, hence the name Apriori. The Apriori property is important because it shrinks the search space and makes the level-wise generation of frequent itemsets more efficient. Each rule generated from the frequent itemsets carries a confidence percentage. However, with large datasets the Apriori algorithm is slow, inefficient, and resource-hungry, because it has to scan the database many times, generate a large number of candidate sets, and check each of them.

