The MapReduce framework provides two core operations: map and reduce. Its implementation has evolved considerably since its introduction, but the processing model is unchanged: operations are kept in a format suited to parallel execution. A problem is divided into smaller sub-problems that are executed independently, and the sub-problem solutions are then combined to produce the output. MapReduce is a core component of Apache Hadoop.
Features of MapReduce
Scheduling
Computing resources work in parallel on a job by dividing it into sub-problems; this process of breaking a task into sub-tasks is known as mapping. Task scheduling is therefore important in MapReduce: the framework distributes map tasks according to the number of nodes available in the cluster.
Error Handling
MapReduce is robust and fault-tolerant. Failures can occur on any of the cluster nodes running parts of the program, so the framework must be able to detect a failure, identify its cause, and recover as quickly as possible, typically by re-running the affected task on another node.
Synchronization
Execution of concurrent steps requires synchronization, and the framework provides it. A mechanism known as shuffle and sort collects the mapped data and prepares it for the reduce step, grouping intermediate results by key so that each reducer sees all values for a given key together; a short sketch of this follows. MapReduce also tracks the timing of all tasks, which enables it to start the reduce phase only after the map phase has completed.
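To illustrate, here is a minimal sketch in plain Java of what shuffle and sort does to map output. The data and the class name are made up for illustration; nothing here is part of the Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ShuffleSortSketch {
    public static void main(String[] args) {
        // Hypothetical map output: unordered (page, hit) pairs from several map tasks.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("/home", 1), Map.entry("/about", 1),
            Map.entry("/home", 1), Map.entry("/docs", 1));

        // Shuffle and sort: group the values by key and order the keys, so each
        // reduce call receives one key together with all of its values.
        Map<String, List<Integer>> shuffled = mapOutput.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        System.out.println(shuffled); // prints {/about=[1], /docs=[1], /home=[1, 1]}
    }
}
```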
Data Locality
Where the MapReduce code runs relative to the data it needs plays an important role in the efficiency of the framework. The best results are obtained when code and data are co-located on the same node, because moving the computation to the data is far cheaper than moving large volumes of data across the network.
Scale-Out Architecture
The MapReduce model is built on a scale-out architecture that accommodates additional resources to meet the growing computational demands of Big Data. MapReduce engines are therefore designed to enlist more machines whenever required, rather than relying on a single ever-larger server.
How does MapReduce work?
The MapReduce model is used for data-analysis operations. When multiple operations are performed on a dataset, the original dataset remains unchanged, so every operation sees the same input. MapReduce executes a task by dividing it into two functions, map and reduce. The map function runs first, in parallel across the systems in the cluster; the reduce function then takes the output of the map function and aggregates it into the final result.
The MapReduce algorithm works through the following steps (a minimal sketch follows the list):
- Take a large dataset or collection of records as input.
- Iterate over the data.
- Use the map function to extract the interesting data as initial preparation for the output.
- Optimize the intermediate data (shuffle and sort it) for further processing.
- Compute the results using the reduce function.
- Collect the final output.
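As a minimal sketch of these six steps in plain Java (all names here are illustrative; this is not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Steps 1-2: a small dataset; each element is one record to iterate over.
        List<String> records = List.of("the quick brown fox", "the lazy dog", "the fox");

        // Step 3 (map): extract the interesting data as intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }

        // Step 4 (optimize): group and sort the pairs by key, as shuffle and sort would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Step 5 (reduce): sum the values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        // Step 6: the final output.
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

A real MapReduce engine performs each of these phases across many machines, but the structure of the computation is the same.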
The flow of data in the MapReduce framework is:
Input -> Split -> Map -> Combine -> Shuffle and Sort -> Reduce -> Output
- The input for the operation is provided as key-value pairs.
- The input data is divided into small splits.
- Master and slave nodes are assigned as required.
- The master schedules tasks, preferring the machines where the data is located.
- Slaves that cannot run locally to their data work on it remotely.
- Map operations are performed simultaneously; each map function extracts the relevant data and generates intermediate key-value pairs.
- The master then instructs the slaves running the reduce function to take further action on the intermediate data.
- The intermediate data is sorted on the basis of the key field; this process of collecting the map output and sorting it is known as shuffling.
- The reduce function processes every unique key and its associated values.
- The reduce function generates the final output, and the master hands control back to the user.
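In Hadoop's Java API, these steps map directly onto a Mapper class and a Reducer class. Below is a sketch of the canonical word-count pair of classes, following the standard org.apache.hadoop.mapreduce API; the framework itself performs the split, shuffle, and sort between the two.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit an intermediate (word, 1) key-value pair.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce: the framework has already shuffled and sorted by key, so each
// call receives one unique word together with all of its counts.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // final (word, total) pair
    }
}
```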
Uses of MapReduce
MapReduce is useful across many sectors and domains. Some common use cases are given below:
- Website traffic and visitor analysis
- Word frequency and count (the classic example; a driver sketch follows this list)
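Word counting is the canonical MapReduce example. Here is a sketch of the Hadoop driver that wires the mapper and reducer classes from the previous section into a runnable job; the input and output paths come from the command line and are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);  // map phase
        job.setCombinerClass(IntSumReducer.class);  // optional local combine step
        job.setReducerClass(IntSumReducer.class);   // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would be run with something like `hadoop jar wordcount.jar WordCount /input /output`, where /input and /output are HDFS paths (placeholders here).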
We will discuss MapReduce with MongoDB in detail in further tutorials. HBase can also be used to perform these functions in a more effective way.