Big Data architecture comprises all the components on which a big data model is built. A big data environment must perform the following functions:
- Capturing data from different sources
- Sorting and organizing data
- Analysing data
- Identifying relationships and patterns
- Deriving conclusions from the results of analysis
Big Data Architecture: Components and Layers
The Big Data architecture has the following layers and components:
Data Sources Layer
The Data Sources Layer is responsible for absorbing and integrating the raw data arriving from the sources. The velocity and variety of the incoming data can differ. This data must be validated and cleaned of noise before an enterprise can use it in any meaningful way.
Ingestion Layer
All data cleansing and noise removal takes place in this layer, which separates information from noise in the raw data. The steps of the Ingestion Layer are discussed below:
Identification
Incoming data is categorized into the various known data formats; data that matches no known format is assigned a default format for further processing.
Filtration
Filtration selects only the data that is relevant to the enterprise, based on the Master Data Management (MDM) repository.
Validation
Validation checks the filtered data against the MDM metadata to make sure it is in the correct format before it is analysed.
Noise Reduction
Noise reduction eliminates irrelevant data that remains in the big data after the validation step.
Transformation
Transformation splits, combines, and maps the data according to its type, its content, and the requirements of the enterprise.
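Pulling these cleansing steps together, here is a minimal sketch of an ingestion pipeline in Python; the record layout, the MDM_SCHEMA rules, and the relevance filter are hypothetical stand-ins for a real MDM repository:

```python
import json

# Hypothetical stand-ins for an MDM repository: expected fields/types and rules.
MDM_SCHEMA = {"user_id": int, "event": str, "amount": float}
RELEVANT_EVENTS = {"purchase", "refund"}  # assumed filtration rule

def identify(raw_line):
    """Identification: detect the incoming format and parse to a common shape."""
    raw_line = raw_line.strip()
    if raw_line.startswith("{"):                   # looks like JSON
        return json.loads(raw_line)
    user_id, event, amount = raw_line.split(",")   # default format: CSV
    return {"user_id": user_id, "event": event, "amount": amount}

def filtrate(record):
    """Filtration: keep only records relevant to the enterprise."""
    return record if record.get("event") in RELEVANT_EVENTS else None

def validate(record):
    """Validation: coerce fields to the types the MDM metadata expects."""
    try:
        return {k: t(record[k]) for k, t in MDM_SCHEMA.items()}
    except (KeyError, TypeError, ValueError):
        return None                                # fails validation, drop it

def denoise(record):
    """Noise reduction: drop records that are valid but useless for analysis."""
    return record if record["amount"] > 0 else None

def transform(record):
    """Transformation: reshape to the layout the storage layer expects."""
    return (record["user_id"], record["event"], round(record["amount"], 2))

def ingest(lines):
    for line in lines:
        record = identify(line)
        for step in (filtrate, validate, denoise):
            record = step(record)
            if record is None:
                break                              # rejected at this step
        else:
            yield transform(record)

print(list(ingest(['{"user_id": 7, "event": "purchase", "amount": "19.99"}',
                   "8,refund,5.00",
                   "9,page_view,0.00"])))
# -> [(7, 'purchase', 19.99), (8, 'refund', 5.0)]
```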
Compression
Compression is the stage where the size of the data is reduced without degrading the quality of its contents; the compressed data must not compromise the analysis results, which in practice means using lossless compression.
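For instance, lossless compression with Python's standard gzip module shrinks the data while guaranteeing the original bytes, and hence the analysis, are unchanged:

```python
import gzip

data = b"user_id,event,amount\n" * 10_000   # repetitive records compress well
compressed = gzip.compress(data)

print(len(data), "->", len(compressed), "bytes")
assert gzip.decompress(compressed) == data   # lossless: content fully intact
```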
Integration
In this stage, the compressed and refined data is integrated into the Storage Layer, which includes HDFS and NoSQL databases.
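One way to carry out this hand-off from Python is the third-party HdfsCLI client (pip install hdfs), which writes over WebHDFS; the NameNode URL, user, and paths below are assumptions:

```python
from hdfs import InsecureClient   # third-party HdfsCLI package

# Assumed WebHDFS endpoint and user; adjust to your cluster.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Load the refined, compressed ingestion output into HDFS.
with open("events.csv.gz", "rb") as local_file:
    client.write("/data/ingest/events.csv.gz", data=local_file, overwrite=True)

print(client.list("/data/ingest"))  # confirm the file landed
```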
Storage Layer
The Storage Layer stores large volumes of data in a distributed format. The Hadoop Storage Layer is fault tolerant and supports parallel processing. Its core components are the Hadoop Distributed File System (HDFS) and the MapReduce engine, which processes data in batches.
HDFS is a file system that stores high volumes of data across a cluster. It stores data as blocks of files under a write-once-read-many model, which lets it manage large streaming reads more effectively than comparable technologies.
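The MapReduce engine can run Python code through Hadoop Streaming, where the mapper and reducer read stdin and write stdout while Hadoop handles the distribution. A classic word-count sketch (the script name and the map/reduce switch are assumptions):

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: pass this script as both the
-mapper ('wordcount.py map') and -reducer ('wordcount.py reduce')."""
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                 # emit (word, 1) pairs

def reducer():
    current, total = None, 0
    for line in sys.stdin:                      # input arrives sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```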
Physical Infrastructure Layer
A robust and inexpensive physical infrastructure, based on a distributed computing model, can be implemented to handle the big data. The principles of the Physical Infrastructure Layer are:
Performance
The infrastructure must deliver high performance along with low latency. Performance is measured in terms of the transactions executed against the data.
Availability
The infrastructure must be available at all times to guarantee uninterrupted service; businesses need 24×7 access to it without failures.
Scalability
The big data infrastructure must be scalable so that storage and computing resources can grow with demand, and it must remain fault tolerant as it scales.
Cost
The infrastructure must be affordable across all areas, such as networking, hardware, and storage. All budget parameters must be considered before committing to a Hadoop infrastructure.
Flexibility
Flexibility means being able to add more resources to the existing infrastructure and to recover from failures. Cloud computing elements can also be used when costs need to be minimized.
Platform Management Layer
This layer provides the tools and query languages used to work with the NoSQL databases and with HDFS, and it sits on top of the Hadoop physical infrastructure layer. The Hadoop ecosystem contains various tools that help store, access, and analyse large volumes of data, including streaming data handled by real-time analysis tools.
Key elements of the Platform Management Layer are:
- Pig, a dataflow scripting language for transforming large data sets
- ZooKeeper, a coordination service for distributed applications
- Hive, a SQL-like query layer over data in HDFS (see the sketch after this list)
- Sqoop, a tool for moving data between Hadoop and relational databases
- MapReduce, the batch-processing programming model
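As an example of working at this layer, here is a sketch using the third-party PyHive client to query Hive; the server address, table, and columns are assumptions:

```python
from pyhive import hive   # third-party client (pip install pyhive)

# Assumed HiveServer2 endpoint; adjust to your cluster.
conn = hive.connect(host="hiveserver", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL compiles down to batch jobs over the data stored in HDFS.
cursor.execute(
    "SELECT event, COUNT(*) AS n FROM events GROUP BY event ORDER BY n DESC"
)
for event, n in cursor.fetchall():
    print(event, n)
```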
Security and Reliability Layer
This layer handles the security measures that must be built into the big data model and architecture. Because big data runs on distributed systems, security must cover the nodes and the communication between them. Some security checks that are prerequisites for security in big data are:
- It must authenticate nodes, using protocols such as Kerberos.
- Secure communication between nodes must be maintained.
- It must provide file-layer encryption (a sketch follows this list).
- It must log failures and abnormalities in communication between nodes.
- It must subscribe to a key management service for the encryption keys.
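A minimal sketch of file-layer encryption using the third-party cryptography package; in production the key would come from the key management service rather than being generated in place:

```python
from cryptography.fernet import Fernet   # pip install cryptography

# In production, fetch this key from the key management service.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"user_id,event,amount\n7,purchase,19.99\n"
ciphertext = fernet.encrypt(plaintext)           # what lands on disk

assert fernet.decrypt(ciphertext) == plaintext   # round-trips exactly
```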
Monitoring Layer
The Monitoring Layer uses monitoring systems that provide machine-to-machine communication and keep watch over the cluster. These systems stay aware of the configurations and functions of the operating system and the hardware on each node. Machine communication relies on high-level, open formats such as XML. Nagios is a popular example of a big data monitoring tool.
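As an illustration, a node might publish its status to a monitoring system as an XML message built with Python's standard library; the element and attribute names here are assumptions:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical node-status message for a monitoring system.
status = ET.Element("node-status", host="datanode-03")
ET.SubElement(status, "timestamp").text = datetime.now(timezone.utc).isoformat()
ET.SubElement(status, "disk-used-percent").text = "71"
ET.SubElement(status, "hdfs-blocks").text = "182340"

print(ET.tostring(status, encoding="unicode"))
```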
Analytics Layer
The Analytics Layer contains the analytics engines used to analyse huge amounts of data, usually unstructured. The analysis can be text analysis, statistical analysis, and so on.
Big data analytics engines are classified into two types:
Search Engines
Analysing tremendous volumes of data requires very fast search engines and cognitive data discovery systems. The data must be indexed so that it can be searched for analytical processing.
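A toy inverted index illustrates why indexing matters: each term maps to the documents that contain it, so a query never scans the full corpus:

```python
from collections import defaultdict

docs = {
    1: "hadoop stores data in blocks",
    2: "search engines index data",
    3: "index blocks for fast search",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Query: documents containing both terms, without scanning every document.
print(index["index"] & index["search"])   # -> {2, 3}
```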
Real-time Engines
Real-time applications generate high volumes of data at very high speed. Real-time engines analyse this data as it arrives, rather than storing it for a later batch job.
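A sketch of this style of processing: a tumbling one-second window that counts events as they stream in; the event format and timings are assumptions:

```python
import time
from collections import Counter

def windowed_counts(events, window_seconds=1.0):
    """Tumbling window: emit per-event-type counts every window_seconds."""
    window_end = time.monotonic() + window_seconds
    counts = Counter()
    for event_type in events:
        counts[event_type] += 1
        if time.monotonic() >= window_end:
            yield dict(counts)                 # results leave as data arrives
            counts.clear()
            window_end = time.monotonic() + window_seconds
    if counts:
        yield dict(counts)

def simulated_stream():
    for event_type in ["click", "view", "click", "purchase"] * 1000:
        time.sleep(0.0005)                     # stand-in for network arrival
        yield event_type

for window in windowed_counts(simulated_stream()):
    print(window)
```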
Visualization Layer
The Visualization Layer is concerned with visualizing and interpreting big data. It is crucial because it conveys a great deal of information about the data at a glance. Data visualization offers many techniques and methods, usable both for simulations and for deriving conclusions from big data. It works on top of data warehouses and Operational Data Stores (ODS). Popular visualization and dashboard tools include Tableau, Spotfire, D3, and Datawrapper.
Visualization in this layer can be carried out with the help of the following approaches: server visualization, network visualization, data and storage visualization, and application visualization.
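As a minimal example of the layer's job, a bar chart built with matplotlib summarizes event counts at a glance; the numbers are made up:

```python
import matplotlib.pyplot as plt

# Hypothetical aggregated output from the Analytics Layer.
events = {"click": 5200, "view": 12400, "purchase": 830}

plt.bar(list(events), list(events.values()))
plt.title("Events per type (sample data)")
plt.ylabel("Count")
plt.savefig("events.png")   # or plt.show() for interactive use
```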