Hadoop MapReduce is a programming model for processing large data sets that reside on a cluster (hundreds of computers). It has been popularized in recent years by Google, the Hadoop project, and many others. The paradigm is extraordinarily powerful, but it does not provide a general solution to every problem that people call "big data".
Many people do not realize that Hadoop MapReduce is more of a framework than a tool. You have to fit your solution into the framework of map and reduce, which in some situations can be challenging. Hadoop MapReduce is not a feature but a constraint. The tradeoff of being confined to the MapReduce framework is that you can process your data with distributed computing without having to deal with concurrency, fault tolerance, scaling, and other common challenges yourself.
Hadoop MapReduce has two phases:
During the Map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework gets input data from the Hadoop Distributed File System (HDFS).
The Reduce phase uses results from map tasks as input to a set of parallel reduce tasks. The reduce tasks consolidate the data into final results. By default, the MapReduce framework stores results in HDFS. Although the reduce phase depends on output from the map phase, map and reduce processing is not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes.
MapReduce operates on key-value pairs. Conceptually, a Hadoop MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through map and reduce functions. The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input.
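The flow described above can be sketched in a few lines of plain Python. This is a single-process simulation for illustration only, not the Hadoop API: real jobs are written against Hadoop's Java MapReduce classes and run across a cluster, with the framework handling the shuffle between the phases. The function names here (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, and word count is used as the classic example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    return intermediate

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}

# Classic word count: the mapper emits (word, 1) for every word in a line,
# and the reducer sums the counts for each word.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Note that the mapper never sees the whole data set and the reducer only sees the values for one key at a time; that restriction is exactly what lets Hadoop run many map and reduce tasks in parallel.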
Flow Diagram of Hadoop MapReduce:
Example diagram for finding frequent itemsets using Hadoop MapReduce:
Input (one transaction per line):
1 2 3
4 2 3
1 3 2
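A candidate-counting pass over these three transactions can be expressed in MapReduce terms: the mapper emits (itemset, 1) for every 2-item combination in a transaction, the shuffle groups the counts by itemset, and the reducer sums them and keeps the itemsets that meet a support threshold. The sketch below simulates that in plain Python; the `min_support` value of 2 is an assumption chosen for this tiny example, not something stated in the text.

```python
from collections import defaultdict
from itertools import combinations

# The three example transactions from the input above.
transactions = [
    {1, 2, 3},
    {4, 2, 3},
    {1, 3, 2},
]

min_support = 2  # assumed threshold: itemset must appear in >= 2 transactions

# Map: for each transaction, emit ((a, b), 1) for every 2-item combination.
pairs = []
for t in transactions:
    for combo in combinations(sorted(t), 2):
        pairs.append((combo, 1))

# Shuffle: group the emitted 1s by itemset.
grouped = defaultdict(list)
for key, one in pairs:
    grouped[key].append(one)

# Reduce: sum the counts and keep itemsets meeting min_support.
frequent = {k: sum(v) for k, v in grouped.items() if sum(v) >= min_support}
```

On this input the pair (2, 3) occurs in all three transactions, (1, 2) and (1, 3) in two, while the pairs involving item 4 fall below the threshold and are dropped.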