‘Big Data’ refers to data whose volume, variety, veracity, and velocity are so great that traditional computing systems struggle to handle it. To cope, such data is stored and managed in a distributed file system, which must be fault-tolerant, scalable, and flexible. It is often said that we create about 2.5 quintillion bytes of data every day. This data comes from many sources, including videos, images, and audio files. Hadoop is a widely used tool for managing such data.
Big data must be collected and stored before it can be put to productive use. As data becomes a new form of currency, managing it is more important than ever. Extracting meaningful insights from big data requires substantial processing power and capable analytics platforms. It is also a growing challenge, because data is being generated faster and in larger volumes than ever before.
Extracting meaningful information from big data helps businesses and organizations identify the trends that affect their profitability. In medicine, analysing patient data is becoming important for managing diseases and finding the best treatments. Big data supports many kinds of analytics.
Today, we live in an era of information explosion: an unprecedented increase in the volume of data, together with the effects that follow. Using big data for pattern recognition is becoming common practice across industries. Steps such as data extraction, preparation, and integration are used to meet analytical goals.
Types
Unstructured Data
Unstructured data is data or metadata with no consistent, predefined format. Examples include audio, video, text, and emails.
Structured Data
Structured data is organized into a specific format, usually a fixed table of rows and columns. Each column has a predetermined data type. Examples include relational databases and flat files (such as .csv and .tsv).
Semi-Structured Data
Also known as schema-less data, semi-structured data uses tags to identify the hierarchies among records, but it has no fixed structure. Examples include data exchange formats (such as JSON and XML) and website cookies.
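The difference between the three types is easier to see side by side. The following is a minimal sketch using only Python's standard library; the sample values (names, ids, and the note text) are invented purely for illustration.

```python
import csv
import io
import json

# Structured: a fixed set of columns, and every row has the same fields.
csv_text = "id,name,age\n1,Asha,34\n2,Ben,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as tags and can nest, but records may differ.
json_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "profile": {"city": "Pune"}}]'
records = json.loads(json_text)

# Unstructured: free text with no predefined fields at all.
note = "Patient reported mild headache after the second dose."
word_count = len(note.split())

print(list(rows[0].keys()))                 # column names shared by all rows
print([sorted(r.keys()) for r in records])  # keys vary from record to record
print(word_count)                           # only crude measures come for free
```

Note that the CSV reader returns every value as a string, so assigning real data types to the columns is a separate step.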
Big Data growth can be described in terms of the 4Vs:
Volume
Volume refers to the amount of data generated by a given source or set of sources. Many exabytes of data have been produced over the years, and the internet is one of the largest producers of big data.
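To put the often-quoted figure of 2.5 quintillion bytes per day into more familiar units, a quick conversion helps. This sketch assumes decimal (SI) units, where 1 terabyte is 10^12 bytes and 1 exabyte is 10^18 bytes, and it assumes the data arrives evenly across the day.

```python
BYTES_PER_DAY = 25 * 10**17        # 2.5 quintillion bytes, as quoted
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

TERABYTE = 10**12                  # decimal (SI) units, an assumption
EXABYTE = 10**18

exabytes_per_day = BYTES_PER_DAY / EXABYTE
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / TERABYTE

print(f"{exabytes_per_day} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

Under these assumptions the figure works out to 2.5 exabytes per day, or roughly 29 terabytes every second.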
Velocity
Velocity is the rate at which data is generated, shared, stored, and managed. Current processing systems need to keep up with the extreme volumes of data being generated today. In most systems, big data is processed in parallel batches or at intervals of a few hours. High-velocity data generators include mobile devices, firewalls, and social media, among others.
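Batch processing can be sketched in a few lines. The example below is a simplified, single-machine illustration rather than how any particular big data platform works: the `batches` and `count_by_source` helpers and the sample events are hypothetical, and a real system would distribute the batches across many workers.

```python
from itertools import islice

def batches(events, size):
    """Yield successive lists of at most `size` events from any iterable."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def count_by_source(batch):
    """Summarise one batch: how many events came from each source."""
    counts = {}
    for event in batch:
        counts[event["source"]] = counts.get(event["source"], 0) + 1
    return counts

# Hypothetical incoming events from high-velocity generators.
events = [{"source": s} for s in ["mobile", "firewall", "mobile", "social", "mobile"]]

# Each batch is independent, so batches could be handed to separate workers.
summaries = [count_by_source(b) for b in batches(events, 2)]
print(summaries)
```

Because each batch is summarised independently, the same pattern extends naturally to running batches in parallel.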
Variety
Variety covers all the different sources from which data is generated. Sometimes even a single generator can produce multiple types of data files. For example, social media applications generate text, audio, and video data.
Veracity
Veracity refers to the consistency and accuracy of the data being generated. Only accurate data is useful for analysis. A lack of veracity is one reason why processing semi-structured and unstructured data can be tedious and time-consuming.
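In practice, checking veracity often starts with simple validation rules applied before analysis. The sketch below is illustrative only: the `is_valid` rule, the field names, and the 0 to 120 age range are assumptions made for the example, not a standard.

```python
def is_valid(record):
    """Accept a record only if it has a non-empty id and a plausible age."""
    age = record.get("age")
    return (
        bool(record.get("id"))
        and isinstance(age, int)
        and 0 <= age <= 120   # assumed plausible range for this example
    )

# Hypothetical patient readings with typical quality problems.
readings = [
    {"id": "p1", "age": 42},   # fine
    {"id": "p2", "age": -5},   # inaccurate value
    {"id": "", "age": 30},     # missing identifier
    {"id": "p4"},              # missing field
]

clean = [r for r in readings if is_valid(r)]
rejected = [r for r in readings if not is_valid(r)]

print(len(clean), "usable,", len(rejected), "rejected")
```

Rejected records are kept rather than silently dropped, so they can be inspected or repaired later.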
Application Areas
Here are some of the key areas where large volumes of data play an important role.
- Advertising
- Finance and Management
- Medicine and Drug Discovery
- Manufacturing and Consumer Markets
- Social Media Analytics
- Marketing Analytics
- Business Intelligence
- Cyber Security