To understand what Hadoop is, we must first understand the problems associated with big data and traditional processing systems. Moving forward, we will discuss what Hadoop is and how Hadoop solves problems related to big data. We will also examine the CERN case study to highlight the benefits of using Hadoop.
In our previous blog, "Big Data Tutorial", we discussed Big Data and its challenges in detail. In this blog, we will discuss:
1. Problems with Traditional Approaches
2. Evolution of Hadoop
3. CERN Case Study
Big data is becoming an opportunity for organizations. Organizations now realize that they can gain many benefits from big data analytics, as shown below. They examine large data sets to discover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
These analytics help organizations achieve more effective marketing, new revenue opportunities and better customer service. They also improve operational efficiency, provide a competitive advantage over rival organizations, and deliver other business benefits.
[Figure: Benefits of Big Data Analytics]
So let's move on and understand the issues associated with traditional approaches to cashing in on these big data opportunities.
Problems with the Traditional Approach
In the traditional approach, the main problem is handling the heterogeneity of data, i.e. structured, semi-structured and unstructured. RDBMS mainly focuses on structured data such as bank transactions and operational data, whereas Hadoop focuses on semi-structured and unstructured data such as text, video, audio, Facebook posts and logs. RDBMS technology is a proven, highly consistent and mature system supported by many companies. Hadoop, on the other hand, is in demand because of big data, which consists mainly of unstructured data in different formats.
Now let's understand the major issues associated with big data, so that, moving on, we can see how Hadoop provides the solution.
[Figure: Big Data Problems]
The first problem is storing large amounts of data.
It is not possible to store such large amounts of data in a traditional system. The reason is obvious: storage is limited to a single system, while data is growing at an alarming rate.
The second problem is storing heterogeneous data.
Now, we know that storage is an issue, but it is only part of the problem. As discussed, the data is not only huge but also exists in various formats: unstructured, semi-structured and structured. Therefore, you need a system that can store all these kinds of data generated from various sources.
The third issue is access and processing speed.
Hard drive capacity is increasing, but disk transfer (access) speeds are not increasing at a similar rate. Let me explain this with an example: if you have only one 100 MB/s I/O channel and you are processing 1 TB of data, it takes about 2.91 hours. If you instead have four machines, each with its own 100 MB/s I/O channel, the same amount of data takes about 43 minutes. Thus, access and processing speed are bigger issues than merely storing big data.
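To see where those numbers come from, here is a minimal back-of-the-envelope calculation. It is illustrative only and assumes 1 TB = 1,024 × 1,024 MB and one 100 MB/s I/O channel per machine:

```java
// Rough transfer-time estimate for reading 1 TB over 100 MB/s channels.
public class TransferTime {
    public static void main(String[] args) {
        double totalMb = 1024.0 * 1024.0;   // 1 TB expressed in MB (binary units)
        double channelMbPerSec = 100.0;     // one I/O channel at 100 MB/s

        double oneMachineHours = totalMb / channelMbPerSec / 3600.0;
        double fourMachinesMinutes = (totalMb / 4) / channelMbPerSec / 60.0;

        System.out.printf("1 machine : %.2f hours%n", oneMachineHours);      // ~2.91 hours
        System.out.printf("4 machines: %.1f minutes%n", fourMachinesMinutes); // ~43.7 minutes
    }
}
```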
Before we understand what Hadoop is, let us first understand how Hadoop has evolved over time.
Evolution of Hadoop
In 2003, Doug Cutting started the Nutch project to process billions of searches and index millions of web pages. In late October 2003, Google released the GFS (Google File System) paper, and in December 2004 it released the MapReduce paper. In 2005, Nutch was operating with GFS and MapReduce. In 2006, Yahoo worked with Doug Cutting and his team to create Hadoop, based on GFS and MapReduce. You might be surprised to learn that Yahoo started using Hadoop on a 1,000-node cluster as early as 2007.
In late January 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation. In July 2008, Apache successfully tested a 4,000-node cluster with Hadoop. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours to handle billions of searches and index millions of web pages. In December 2011, Apache Hadoop released version 1.0, and in late August 2013, version 2.0.6 was released.
While discussing these problems, we realized that a distributed system could be the solution, and Hadoop provides exactly that. Now, let's understand what Hadoop is.
What is Hadoop?
Hadoop is a framework that allows you to first store big data in a distributed environment so that you can process it in parallel. There are basically two components in Hadoop:
1. HDFS (Hadoop Distributed File System)
2. YARN (Yet Another Resource Negotiator)
[Figure: Hadoop Framework]
The first is HDFS (Hadoop Distributed File System), which is used for storage and allows you to store data of various formats across a cluster. The second is YARN, which handles resource management in Hadoop and allows the data stored in HDFS to be processed in parallel.
Let's start by understanding HDFS.
HDFS
HDFS creates an abstraction layer; let me simplify it for you. Similar to virtualization, you can logically view HDFS as a single unit for storing big data, but in reality you are storing your data across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture.
[Figure: HDFS]
In HDFS, the NameNode is the master node and the DataNodes are the slave nodes. The NameNode contains metadata about the data stored in the DataNodes, such as which data block is stored on which DataNode, where the replicas of each block are located, and so on. The actual data is stored in the DataNodes.
I would also like to add that the data blocks residing in the DataNodes are replicated, with a default replication factor of 3. Since we are using commodity hardware, which has a high failure rate, HDFS will still have a copy of the lost data blocks if one of the DataNodes fails. You can also configure the replication factor as needed. You can read the HDFS tutorial to learn more about HDFS.
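As a rough illustration of these ideas, here is a minimal sketch (not code from this blog) of how a client can use the HDFS Java API to write a file, ask the NameNode which DataNodes hold its blocks, and adjust the replication factor. The file path and replication values are assumptions chosen for illustration only:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml from the classpath
        conf.set("dfs.replication", "3");           // default replication factor (configurable)
        FileSystem fs = FileSystem.get(conf);

        // Write a file once; HDFS splits it into blocks and replicates each block.
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Ask the NameNode (via its metadata) which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block stored on DataNodes: " + Arrays.toString(block.getHosts()));
        }

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(file, (short) 2);
    }
}
```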
Hadoop as a Solution
Let's understand how Hadoop can provide a solution to the big data problems just discussed.
[Figure: Hadoop as a Solution]
The first problem is storing big data.
HDFS provides a distributed way of storing big data. Your data is stored in blocks across the DataNodes, and you can specify the block size. If you have 512 MB of data and HDFS is configured with a 128 MB block size, HDFS divides the data into 512/128 = 4 blocks, stores them on different DataNodes, and also replicates the data blocks across DataNodes. Now, since we are using commodity hardware, storage is no longer a problem.
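As a quick sanity check of that block arithmetic, here is a tiny illustrative helper (with assumed sizes); note that when the file size is not an exact multiple of the block size, the last block is simply smaller:

```java
// Illustrative block-count arithmetic; sizes are examples, not cluster defaults.
public class BlockMath {
    static long blockCount(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(blockCount(512, 128)); // 4 blocks of 128 MB each
        System.out.println(blockCount(600, 128)); // 5 blocks; the last one holds only 88 MB
    }
}
```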
It also solves the scaling problem, because HDFS focuses on horizontal rather than vertical scaling. Instead of upgrading the resources of your DataNodes, you can simply add more DataNodes to the HDFS cluster whenever you need them. To summarize: to store 1 TB of data, you don't need a 1 TB system; you can spread it across multiple systems with 128 GB or less.
The next issue is storing various data.
With HDFS, you can store all kinds of data, whether structured, semi-structured or unstructured, because HDFS performs no pre-dump schema validation. It also follows the write-once, read-many model: you write the data once and read it multiple times to find insights.
The third challenge is accessing and processing the data faster.
Yes, this is one of the major challenges with big data. To solve it, we move the processing to the data instead of moving the data to the processing. What does that mean? Instead of moving the data to the master node and processing it there, in MapReduce the processing logic is sent to the various slave nodes, and the data is processed in parallel across those slave nodes. The processed results are then sent to the master node, where they are merged, and the response is sent back to the client.
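The classic example of this model is word counting. Below is a minimal MapReduce sketch (the standard WordCount pattern, not code from this blog): the mapper runs on the slave nodes against the blocks stored locally on each node, and the reducer merges the partial results into the final counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Runs on the slave nodes, against the HDFS blocks stored locally on each node.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);         // emit (word, 1) for every word seen
            }
        }
    }

    // Merges the partial results from all mappers into the final counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum)); // final (word, total count) pair
        }
    }
}
```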
In the YARN architecture, we have the ResourceManager and the NodeManager. The ResourceManager may or may not be configured on the same machine as the NameNode. However, a NodeManager should be configured on each machine where a DataNode is present.
YARN performs all your processing activities by allocating resources and scheduling tasks.
[Figure: YARN]
It has two main components, ResourceManager and NodeManager.
The ResourceManager is again the master node. It receives processing requests and passes the various parts of each request to the appropriate NodeManagers, where the actual processing takes place. The NodeManager is installed on every DataNode and is responsible for executing tasks on that individual DataNode.
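To tie MapReduce and YARN together, here is a sketch of a driver that submits the WordCount job from the earlier example (it reuses those mapper and reducer classes; the input and output paths are placeholders). With a YARN-enabled configuration on the classpath, the ResourceManager accepts the job and schedules its map and reduce tasks on the NodeManagers:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up yarn-site.xml etc. from the classpath
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // mapper/reducer from the sketch above
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        // The ResourceManager accepts the job and schedules its tasks on NodeManagers,
        // which run them on the nodes holding the input blocks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```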
I hope now you have an understanding of what Hadoop is and its major components. Let's move on and understand when to use and when not to use Hadoop.
When to use Hadoop?
Hadoop is used for:
1. Search - Yahoo, Amazon, Zvents
2. Log processing - Facebook, Yahoo
3. Data Warehousing - Facebook, AOL
4. Video and Image Analytics - New York Times, Eyealike
So far, we have seen how Hadoop makes big data processing possible. But in some cases, Hadoop is not recommended.