Finding Frequent Itemsets using Hadoop-MapReduce Model
Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns in databases, such as association rules, correlations, sequences, episodes, classifiers and clusters. Of these, mining association rules is one of the most popular problems. Identifying sets of items, products, symptoms or characteristics that often occur together in a given database can be seen as one of the most basic tasks in data mining.
Apriori is the most established algorithm for finding frequent itemsets in a transactional dataset; however, it has to scan the dataset many times and generate many candidate itemsets. When the dataset is huge, both memory use and computational cost become very expensive. In addition, a single processor’s memory and CPU resources are limited, which makes the algorithm inefficient. Furthermore, because of the exponential growth of worldwide information, enterprises and organizations have to deal with an ever-growing amount of data. As these data grow past hundreds of gigabytes towards a terabyte or more, it becomes nearly impossible to mine them on a single sequential machine. The solution to these problems is parallel and distributed computing, here in the form of the Hadoop-MapReduce framework.
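To make the repeated scans and candidate generation concrete, here is a minimal sketch in plain Python that imitates one MapReduce pass per itemset size. It is not the code from the jar used in this post; the function names (`map_phase`, `reduce_phase`, `next_candidates`, `apriori`) and the `min_count` threshold are assumptions chosen for illustration, and everything runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, candidates):
    """Mapper: emit (candidate, 1) for each candidate contained in the transaction."""
    items = set(transaction)
    return [(c, 1) for c in candidates if set(c) <= items]


def reduce_phase(pairs, min_count):
    """Reducer: sum the counts per candidate and keep those reaching min_count."""
    totals = defaultdict(int)
    for candidate, one in pairs:
        totals[candidate] += one
    return {c: n for c, n in totals.items() if n >= min_count}


def next_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori pruning)."""
    items = sorted({i for itemset in frequent for i in itemset})
    return [c for c in combinations(items, k)
            if all(s in frequent for s in combinations(c, k - 1))]


def apriori(transactions, min_count):
    """Run one simulated map/reduce pass over the data per itemset size."""
    candidates = sorted({(i,) for t in transactions for i in t})
    result, k = {}, 1
    while candidates:
        # Each iteration is a full scan of the dataset, which is the costly part.
        pairs = [p for t in transactions for p in map_phase(t, candidates)]
        frequent = reduce_phase(pairs, min_count)
        result.update(frequent)
        k += 1
        candidates = next_candidates(frequent, k)
    return result


if __name__ == "__main__":
    # Hypothetical toy data, not taken from groceries.csv.
    baskets = [["milk", "bread"], ["milk", "bread", "eggs"],
               ["bread", "eggs"], ["milk", "eggs"]]
    for itemset, count in sorted(apriori(baskets, min_count=2).items()):
        print(itemset, count)
```

On a real cluster, the mappers would each see only a split of the transactions and the reducers would receive the emitted pairs grouped by key, but the counting logic per pass would look much the same.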
Data flow diagram of the Apriori algorithm in the Hadoop-MapReduce framework:
Download the code for finding frequent itemsets below:
Run this command in a terminal: hadoop jar /mraprior.jar /groceries.csv /output1 /output2
In output1, we’ll see the frequent itemsets of size 1 to n.
In output2, we’ll see the final results (association rules).
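The post does not show how the jar turns frequent itemsets into association rules. A common approach is to keep every rule X -> Y whose confidence, count(X and Y together) / count(X), reaches a threshold. The sketch below illustrates that idea only; the function name `association_rules`, the `min_confidence` parameter and the sample counts are assumptions, and it expects a dictionary holding the count of every frequent itemset, keyed by sorted tuples.

```python
from itertools import combinations


def association_rules(support_counts, min_confidence):
    """Return (antecedent, consequent, confidence) for rules reaching min_confidence."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                consequent = tuple(i for i in itemset if i not in antecedent)
                # Every subset of a frequent itemset is frequent, so its count is present.
                confidence = count / support_counts[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


if __name__ == "__main__":
    # Hypothetical counts, not taken from the actual output1 directory.
    counts = {("bread",): 3, ("milk",): 3, ("bread", "milk"): 2}
    for lhs, rhs, conf in association_rules(counts, min_confidence=0.6):
        print(lhs, "->", rhs, round(conf, 2))
```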