Big Data: a prerequisite for the data industry!



Traditional Decision Making

The traditional decision-making process is based on what we think. It relies on past experience, personal instincts, and rules of thumb. In traditional decision making, decisions are made on preset guidelines rather than on facts.

Challenges of Traditional Decision Making

  1. It takes a long time to arrive at a decision, losing the competitive advantage.
  2. It requires human intervention at various stages.
  3. It lacks systematic linkage among strategy, planning, execution, and reporting.
  4. It provides limited scope for data analytics, that is, only a bird’s eye view.
  5. It obstructs the company’s ability to make fully informed decisions.

Big Data Analytics

The solution to the shortcomings of traditional decision making is big data analytics. Let’s see how:

  1. Decisions are based on what you know, which in turn is based on data analytics.
  2. It provides a comprehensive view of the overall picture, which is the result of analyzing data from various sources.
  3. It streamlines decision making from top to bottom.
  4. Big data analytics helps in analyzing unstructured data.
  5. It enables faster decision making, thus improving the competitive advantage and saving time and energy.

To understand this more easily, let’s consider the example of Google’s self-driving car. The car collects a lot of data from its sensors, such as cameras, lidar, and radar. According to research, the car produces around 1 GB of data per second, which amounts to roughly 2 PB of data per year assuming the car is driven around 600 hours per year. This data is very important and needs to be stored, which currently requires dedicated servers. Most of the data arrives in real time, and the car needs to make decisions every second using this large amount of data.
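As a quick sanity check on those figures, here is the back-of-the-envelope arithmetic in Python. The 1 GB/s and 600 driving hours per year are the figures quoted above; decimal units are assumed (1 PB = 1,000,000 GB):

```python
# Back-of-the-envelope check of the figures quoted above.
GB_PER_SECOND = 1                      # data produced by the car's sensors
DRIVING_HOURS_PER_YEAR = 600           # assumed yearly driving time

seconds_per_year = DRIVING_HOURS_PER_YEAR * 3600
total_gb = GB_PER_SECOND * seconds_per_year       # 2,160,000 GB
total_pb = total_gb / 1_000_000                   # 1 PB = 1,000,000 GB (decimal units)

print(f"{total_gb:,} GB per year ≈ {total_pb:.2f} PB per year")
# -> 2,160,000 GB per year ≈ 2.16 PB per year
```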

What is Big Data?

Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends and associations, especially relating to human behavior and interactions.

These data sets are so voluminous that traditional database management systems can’t handle them. They can be used to address business problems we wouldn’t have been able to tackle before.

Big data is growing exponentially because of the internet and fast-growing technological advancements. In real time, every 60 seconds we have 98,000+ tweets, 695,000 status updates, 11 million instant messages, 698,445 Google searches, 168 million+ emails sent, 1,820 TB of new data created, and so on.

Different types of data

All these types of data are growing at an ever-increasing rate.

Structured data: Data with a defined data model, format, and structure. E.g.: databases.

Semi-structured data: Textual data files with an apparent pattern that enables analysis. E.g.: spreadsheets and XML files.

Quasi-structured data: Textual data with erratic formats that can be formatted with effort and software tools. E.g.: clickstream data from web browsers.

Unstructured data: Data that has no inherent structure and is usually stored as different types of files. E.g.: text documents, PDFs, images, etc.

As technology grows, we are creating an increasing proportion of unstructured data.
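To make the distinction concrete, here is a minimal Python sketch of how each kind of data typically looks in code. All sample values, field names, and the log format are made up purely for illustration:

```python
import csv, io, json, re

# Structured: fixed schema, every record has the same fields (e.g. a database table / CSV).
structured = list(csv.DictReader(io.StringIO("id,name,amount\n1,Alice,20\n2,Bob,35\n")))

# Semi-structured: self-describing structure such as XML or JSON.
semi_structured = json.loads('{"order_id": 7, "items": ["book", "pen"]}')

# Quasi-structured: erratic text (e.g. a clickstream log line) parsed with some effort.
log_line = '203.0.113.7 - [12/Mar/2023:10:01:44] "GET /products/42 HTTP/1.1" 200'
clicked_path = re.search(r'"GET (\S+)', log_line).group(1)

# Unstructured: no inherent structure; free text, images, PDFs, etc.
unstructured = "Hi team, the delivery was late again and the package was damaged..."

print(structured[0]["name"], semi_structured["items"], clicked_path)
```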

Four V’s of Big Data

Big Data is a collection of data from various sources and is often characterized by what have become known as the 4 V’s: Volume, Variety, Velocity, and Veracity. Each of them is a hard problem for a relational database.

Volume:

The ability to ingest, process, and store very large datasets. The data can be generated by machines, networks, human interaction with various systems, etc.

The data generated can be measured in petabytes or even exabytes.

The overall amount of information produced every day is rising exponentially; around 2.3 trillion gigabytes of data are generated on the internet every day.

“Can you find the information you are looking for?”

Variety:

Variety refers to data from different sources and of different types, which may be structured or unstructured. Unstructured data creates problems for storage, data mining, and analysis.

As the volume of data grows, the types of data are multiplying fast as well.

A wide variety of data is produced by social media, CRM systems, e-mails, audio, etc. Handling such complex data is a challenge for companies. To handle it, analytics tools are used to segregate the data into groups based on its type.

“Is the picture worth a thousand words? Is your information balanced?”

Velocity:

Velocity is the speed of data generation and the frequency of delivery. It is the speed at which the data comes in and how quickly it is analyzed and utilized.

The data flow is massive and continuous, which is valuable to researchers as well as businesses.

To process data arriving at high velocity, data-processing tools known as streaming analytics were introduced, as sketched below.
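The sketch below illustrates the basic idea behind streaming analytics with a simple tumbling-window count written in plain Python; the event format and the 60-second window are assumptions for illustration, not any particular tool’s API:

```python
# Aggregate events on the fly instead of storing everything and querying later.
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, key) pairs, assumed time-ordered."""
    counts = defaultdict(int)
    window_start = None
    for ts, key in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_seconds:       # window closed: emit and reset
            yield window_start, dict(counts)
            counts.clear()
            window_start = ts
        counts[key] += 1
    if counts:                                        # flush the final partial window
        yield window_start, dict(counts)

stream = [(0, "search"), (12, "tweet"), (45, "search"), (70, "email"), (95, "search")]
for start, totals in tumbling_window_counts(stream):
    print(f"window starting at t={start}s -> {totals}")
```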

“Is data generation fast enough?”

Veracity:

Veracity refers to the biases, noise, and abnormalities in data. This is where we need to be able to identify the relevance of the data and ensure that data cleansing is done so that only valuable data is stored.

You need to verify that the data is suitable for its intended purpose and usable within the analytic model. The data is tested against a set of defined criteria.

Inherent discrepancies in the collected data result in inaccurate predictions.
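As an illustration of the data-cleansing step described above, here is a minimal Python sketch that tests records against a set of defined criteria and keeps only the valid ones; the field names and validation rules are hypothetical:

```python
# Reject records that fail the defined criteria before they reach the analytic model.
records = [
    {"sensor_id": "A1", "temperature": 21.4},
    {"sensor_id": "",   "temperature": 19.8},     # missing id -> rejected
    {"sensor_id": "B7", "temperature": 480.0},    # out of plausible range -> rejected
    {"sensor_id": "C3", "temperature": None},     # missing value -> rejected
]

def is_valid(record):
    return (
        bool(record.get("sensor_id"))                 # id must be present
        and record.get("temperature") is not None     # value must be present
        and -50.0 <= record["temperature"] <= 60.0    # plausible range check
    )

clean = [r for r in records if is_valid(r)]
print(f"kept {len(clean)} of {len(records)} records:", clean)
```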

“Does it convey a message that can be shared with a large audience?”

Common problems of traditional systems

  1. The size of the data is unimaginably large.
  2. Data comes from heterogeneous systems, i.e., many different kinds of systems.
  3. Traditional systems do not scale up.
  4. Relational databases are costly.
  5. Building a single system to handle all of this is complex and not cost effective.

Possible solutions are either scaling up or scaling out.

But which should we choose: scaling up or scaling out?

Scaling Up:

In this process we increase the configuration of a single system, e.g., its disk capacity, RAM, data transfer speed, etc.

This is a very complex, costly, and time-consuming process.

Scaling Out:

In this method we use multiple commodity (economical) machines and distribute the storage and processing load among them. This approach is quick to implement because it focuses on distributing the load, and it is an example of a distributed system.

Instead of having a single system with 10 TB of storage and 80 GB of RAM, we use 40 machines, each with 256 GB of storage and 2 GB of RAM.
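The sketch below shows one common way such a scale-out cluster decides where each record lives: hash the record’s key and map it to one of the commodity machines. The node names and keys are made up for illustration:

```python
# Distribute records across many machines so no single machine holds the whole data set.
import hashlib

NODES = [f"node-{i:02d}" for i in range(40)]     # the 40 commodity machines above

def node_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]   # same key always lands on the same node

for user_id in ["user-17", "user-18", "user-19"]:
    print(user_id, "->", node_for(user_id))
```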

When compared, scaling out is more effective than scaling up.

But scaling out also has its own challenges.

Need for a new system

We need new databases, rather than relational databases, that are capable of handling unstructured as well as structured data, and that can process huge data sets on large clusters (groups of nodes in a network) of computers rather than on a single system.

To manage clusters

In a cluster, nodes fail frequently and new nodes get added, so the number of nodes keeps changing. We also need to take care of the communication between the nodes.

During analysis, we need to take results from different machines and then merge and aggregate them accordingly.
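Here is a minimal Python sketch of that merge-and-aggregate step: each machine returns a partial result (word counts over its own share of the data, in this made-up example), and the partials are combined into one final answer:

```python
# Combine partial results produced independently on different machines.
from collections import Counter

partial_results = [                        # one Counter per machine
    Counter({"error": 12, "login": 40}),   # from node-00
    Counter({"error": 3,  "logout": 25}),  # from node-01
    Counter({"login": 17, "error": 9}),    # from node-02
]

final = Counter()
for partial in partial_results:
    final.update(partial)                  # element-wise sum of the counts

print(final)   # Counter({'login': 57, 'logout': 25, 'error': 24})
```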

Common infrastructure

You will need a common infrastructure for all your nodes that is efficient, easy to use, and reliable.

 

Big data technology has to use commodity hardware for data storage and analysis. Furthermore, it has to maintain copies of the same data across the cluster.

Big data technology has to analyze data across different machines and then merge the data.

The solution for Big Data is to use one of the most important tools in this space: Hadoop.

This article was a detailed introduction to the Big Data landscape: what Big Data is, why it is used, the issues faced, and the possible solutions. In the coming articles we will study the Big Data pipeline and architecture, and dive deep into Hadoop.

Stay tuned !

