The MapReduce framework is built around two operations: map and reduce. Its implementations have grown far beyond the original design since its introduction, but the core idea is unchanged: it organizes processing for parallel execution. A problem is divided into smaller sub-problems that are executed independently, and the sub-problem solutions are then combined to produce the final output. MapReduce is a core component of Apache Hadoop.
Features of MapReduce
Computing resources work in parallel on a problem by dividing it into sub-problems. This process of breaking a task into sub-problems is known as Mapping. Task scheduling is important in MapReduce, and how the work is mapped depends on the number of nodes in the cluster.
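The mapping idea above can be sketched in a few lines of Python. This is a minimal, single-process illustration (the function names and the word-length map task are assumptions for this example, not part of any MapReduce API); on a real cluster each sub-problem would run on a separate node.

```python
def split_input(records, num_splits):
    """Divide the records into roughly equal sub-problems, one per node."""
    return [records[i::num_splits] for i in range(num_splits)]

def map_word_lengths(chunk):
    """A sample map task: emit a (word, length) pair for each word in one chunk."""
    return [(word, len(word)) for word in chunk]

records = ["hadoop", "mapreduce", "cluster", "node"]
chunks = split_input(records, num_splits=2)
# Each chunk is processed independently; here we simply loop over them.
mapped = [map_word_lengths(chunk) for chunk in chunks]
```

Because each chunk is processed without reference to the others, the per-chunk calls could be scheduled on any available node.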
MapReduce is also robust and fault-tolerant. Failures can occur on the cluster nodes running parts of the program, so MapReduce must be able to identify the cause of an error and recover from it as quickly as possible.
Execution of concurrent steps requires synchronization, and the framework provides it. A mechanism known as shuffle and sort collects the mapped data and prepares it for the reduction step. MapReduce also tracks the timing of all tasks, which lets it start the reduce phase only after the mapping process has completed.
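A minimal sketch of the shuffle-and-sort step described above (an illustration of the concept, not Hadoop's actual implementation): the key-value pairs emitted by all map tasks are grouped by key and sorted, ready for the reduce step.

```python
from collections import defaultdict

def shuffle_and_sort(mapped_pairs):
    """Group map output by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    # Sorting by key gives each reducer a deterministic, ordered input.
    return sorted(groups.items())

pairs = [("b", 1), ("a", 1), ("b", 1)]
print(shuffle_and_sort(pairs))  # [('a', [1]), ('b', [1, 1])]
```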
The location of the code relative to the data plays an important role in MapReduce, as it determines the efficiency of the framework. The best results are obtained when the code and the data are co-located on the same system, since this avoids moving large volumes of data across the network.
Scale Out Architecture
The MapReduce model is built on an architecture that can accommodate more resources to meet the higher computational demands of Big Data. MapReduce engines are therefore built so that more machines can be added whenever required.
How does MapReduce work?
The MapReduce model can run analysis operations on data. If multiple operations are performed on a dataset, the original dataset must remain unchanged throughout. MapReduce executes a task by dividing it into two functions, map and reduce. The map function runs first, in parallel across systems; the reduce function then takes the map output and produces the final result in an aggregated format.
Here are the steps in which MapReduce algorithm for map and reduce operations works:
- Take a large dataset or set of records as input
- Iterate over the data
- Extract interesting patterns using the map function, as initial preparation for the output
- Optimize the map output for further processing
- Compute the results using the reduce function
- Produce the final output
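The steps above can be sketched as a single-process word count in Python (illustrative only; Hadoop distributes each phase across nodes, and the function and variable names here are assumptions for this example):

```python
from collections import defaultdict

def map_fn(record):
    # Step 3: extract interesting patterns — here, one (word, 1) pair per word.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Step 5: compute the aggregated result for one key.
    return key, sum(values)

dataset = ["deer bear river", "car car river", "deer car bear"]   # Step 1
mapped = [pair for record in dataset for pair in map_fn(record)]  # Steps 2-3
grouped = defaultdict(list)                                        # Step 4
for key, value in mapped:
    grouped[key].append(value)
result = dict(reduce_fn(k, v) for k, v in grouped.items())         # Steps 5-6
print(result)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```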
The Flow of data in MapReduce framework is:
Input -> Split -> Map -> Combine -> Shuffle and Sort -> Reduce -> Output
- The input for the operation is provided as key-value pairs.
- The input data is divided into small splits.
- Master and slave nodes are created as required.
- The master executes on the machine where the data is located.
- The slaves work remotely on the data.
- Map operations are performed simultaneously; the map function extracts the relevant data and generates key-value pairs.
- The master then instructs the reduce function to perform the reduce operation on the mapped data.
- The reduce function sorts the data on the basis of the key field. This process of collecting the map output and then sorting it is also known as Shuffling.
- Every unique key is processed by the reduce function.
- The reduce function generates the output, and the master hands control back to the user.
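One step in the flow above that the earlier sketches did not show is Combine: a combiner pre-aggregates the map output locally on each node before the shuffle, which reduces the amount of data sent across the network. A minimal sketch (the names and logic are assumptions for illustration, not Hadoop's API):

```python
from collections import Counter

def map_fn(record):
    return [(word, 1) for word in record.split()]

def combine(pairs):
    """Local, per-split aggregation — the same logic as reduce, run early."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

split = ["car car river", "car bear"]  # one input split held by one node
mapped = [pair for record in split for pair in map_fn(record)]
print(combine(mapped))  # [('bear', 1), ('car', 3), ('river', 1)]
```

Instead of shipping five `(word, 1)` pairs to the reducers, this node ships three pre-summed pairs; the reducers then merge the partial counts from all nodes.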
Uses of MapReduce
MapReduce is useful across many sectors and domains. Some common use cases are given below:
- Website Traffic and Visitors
- Word Frequency and Count
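The website-traffic use case follows the same pattern as word counting: map each log line to a `(url, 1)` pair and reduce to per-URL hit counts. A hedged sketch (the log format and field positions are assumptions for this example):

```python
from collections import defaultdict

def map_log_line(line):
    # Assumed log format: "<ip> <url> <status>"
    _, url, _ = line.split()
    return (url, 1)

def reduce_hits(pairs):
    """Sum the counts for each URL."""
    hits = defaultdict(int)
    for url, n in pairs:
        hits[url] += n
    return dict(hits)

logs = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /about 200",
    "10.0.0.1 /home 404",
]
print(reduce_hits(map(map_log_line, logs)))  # {'/home': 2, '/about': 1}
```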
We will discuss these use cases in detail using MongoDB in further tutorials. HBase can also be used to perform these functions in a more effective way.