The Hadoop ecosystem is organised into four main layers of interlinked components within the Hadoop architecture. It comprises a range of complex and evolving tools used for big data. These tools help big data engineers deliver big data solutions quickly and cost-effectively.
Hadoop Ecosystem Components
The components of the Hadoop ecosystem are described below:
DATA STORAGE STAGE
HDFS
HDFS (Hadoop Distributed File System) is a resilient, flexible file system designed specifically for Hadoop applications. It follows a master–slave architecture in which a central NameNode manages the file system metadata and multiple DataNodes store the actual data blocks. HDFS together with HBase gives a strong structure to big data systems; both are core components of the Hadoop ecosystem, provide integration services, and are used in real-time big data applications. In general, HDFS is the file system used to store and retrieve big data in Hadoop, and it is implemented in Java.
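As a rough illustration, the short Java sketch below writes and then inspects a small file through the HDFS FileSystem API. It is only a minimal example: the file path is a placeholder, and it assumes a Hadoop client configuration (core-site.xml) is available on the classpath.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the path is a placeholder for this example.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Confirm the file exists and print its size.
        System.out.println("Size: " + fs.getFileStatus(file).getLen() + " bytes");
    }
}

The same operations can also be performed from the command line with hdfs dfs commands; the API is what applications use under the hood.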
HBase
HBase is used as a column-oriented data store and is also a core component of the Hadoop ecosystem. It allows dynamic schema changes, which makes it suitable for a wide variety of applications. It was developed based on Google's BigTable architecture, another storage system for unstructured data. Apache HBase is built on top of Hadoop and HDFS and provides strong capabilities such as consistency, high availability and support for IT operations. Tech giants such as Facebook, Adobe and Twitter have used HBase.
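A minimal sketch of the HBase Java client is shown below: it stores a single cell and reads it back by row key. The table name "users" and the column family "info" are assumptions for this example; the table must already exist, and hbase-site.xml must be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Insert one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}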
DATA PROCESSING STAGE
MapReduce
Hadoop MapReduce is Apache's implementation of the MapReduce programming model within the Hadoop project. It is based on parallel processing and is used to process large amounts of data spread over different systems in different locations. As the name suggests, MapReduce works in two main phases, Map and Reduce, and these operations are carried out in parallel by the worker nodes.
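The classic word count program below shows what these two phases look like in Java: the mapper emits (word, 1) pairs and the reducer sums the counts for each word. The input and output paths are passed as arguments and are placeholders here.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Many mapper tasks run in parallel over different blocks of the input, and the framework groups their output by key before the reducers run.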
Hadoop YARN
Yet Another Resource Negotiator (YARN) handles resource management for Hadoop and is a core component of the Hadoop service. Its components include the Resource Manager, Node Manager and Application Master. YARN enables Hadoop to run purpose-built data processing systems other than MapReduce, allowing several different frameworks to share the same hardware on which Hadoop is deployed.
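As a small sketch of how applications talk to the Resource Manager, the Java snippet below uses the YarnClient API to list the applications currently known to the cluster. It assumes yarn-site.xml is available on the classpath so the Resource Manager address can be resolved.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from yarn-site.xml on the classpath.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for all applications it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " : " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}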
DATA ACCESS STAGE
Pig
Pig is a platform for analysing large datasets using a high-level language for data analysis, and its structure lends itself well to data parallelization. Pig can also be seen as an ETL tool that makes Hadoop more approachable and easier to understand for non-technical users. Pig Latin is Pig's scripting language; it provides an interactive, script-oriented environment that is easier for non-programmers to follow. A Pig Latin script loads the input data and processes it through a series of operations to transform it and produce the required output.
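The sketch below embeds a small Pig Latin word count in Java through the PigServer API, which is one way to run Pig programmatically; the input file and output directory names are placeholders. The same statements could be typed directly into the Grunt shell or saved as a .pig script.

import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; pass "mapreduce" instead to run on the cluster.
        PigServer pigServer = new PigServer("local");

        // Each registerQuery call adds one Pig Latin statement to the plan.
        pigServer.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pigServer.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pigServer.registerQuery("grouped = GROUP words BY word;");
        pigServer.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Nothing executes until an output is requested; store() triggers the job.
        pigServer.store("counts", "wordcount_output");
        pigServer.shutdown();
    }
}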
Hive
Hive is a data warehousing layer built on top of HDFS and MapReduce that is used to support batch-oriented processes. HiveQL, a SQL-like query language, is used to query Hive. Queries are translated into mappers and reducers, like the rest of the Hadoop ecosystem, which gives SQL-like access to structured data and makes Hive feel very similar to a traditional database. However, it does not offer very quick database access, so it is best suited to data mining and batch analysis rather than real-time processing.
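Because HiveServer2 exposes a JDBC interface, Hive can be queried from Java like any other database. The sketch below is a minimal example; the host, port, credentials and the "employees" table are all placeholders assumed for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver; connection details are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL but is compiled into batch jobs under the hood.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getLong(2));
            }
        }
    }
}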
Sqoop
Sqoop is a tool built specifically for data transfer: it moves data between Hadoop and relational databases. It uses a command-line interface and executes queries sequentially, so it can easily be used by non-programmers as well. It relies on other Hadoop technologies such as MapReduce and HDFS.
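The sketch below drives a Sqoop 1 import from Java by passing the same arguments that would follow "sqoop import" on the command line. It is only an illustration: the JDBC URL, credentials, table name and target directory are placeholders, and it assumes the Sqoop 1 client library and a MySQL JDBC driver are on the classpath.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Same arguments as the "sqoop import" command line; all values are placeholders.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "-m", "1"                       // number of parallel map tasks
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}

In practice most users simply run the equivalent sqoop import command from the shell; Sqoop generates MapReduce jobs that copy the rows into HDFS in parallel.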
DATA MANAGEMENT STAGE
Oozie
Oozie is another open source Apache Hadoop service. It is used to manage and schedule submitted Hadoop jobs, and it is both highly scalable and extensible. It is essentially a workflow scheduler that handles cross-platform dependencies between jobs such as HDFS, Pig and MapReduce actions.
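A minimal sketch of submitting a workflow through the Oozie Java client is shown below. The Oozie URL, NameNode and ResourceManager addresses and the HDFS application path are placeholders, and the workflow.xml is assumed to already exist at that path.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitOozieWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server; host and port are placeholders.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow.xml lives in HDFS, plus cluster addresses.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then check its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Submitted " + jobId + " with status " + job.getStatus());
    }
}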
Zookeeper
ZooKeeper is used to coordinate the working elements of Hadoop's distributed components, taking a divide-and-conquer approach once a problem is assigned. It is open source and provides a centralized service for configuration information, synchronization and group services over large clusters in distributed systems. Its primary goal is to make these systems easier to manage, with reliable dissemination of changes.
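The Java sketch below shows the centralized-configuration idea in its simplest form: one process publishes a value as a znode and any other process in the cluster can read it. The ensemble address, znode path and value are placeholders for this example.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; address and session timeout are placeholders.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> { });

        // Publish a small piece of configuration as a znode.
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch-size=500".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can now read the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println("config = " + new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}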
Flume
Apache Flume is used for transferring big data from many distributed sources into a single centralized resource or repository. It is very robust and fault tolerant and is used to collect, aggregate and transfer data, capturing it in Hadoop in real time. It is well suited to a wide variety of data such as social networking feeds, emails and business transactions, and it can also feed online data analytics. It is most often used to move large amounts of streaming data into HDFS; a typical example is collecting log data from different systems into HDFS for analysis.
Together, these Hadoop ecosystem components handle all the tasks required in big data management.