What is Hadoop?

Though Hadoop has generated a lot of conversation in the past several years, there still seems to be general confusion about the technology. In fact, it’s still so new and confusing that 1,900 people search Google for “What is Hadoop?” every month.

If you’re in this camp, wondering about Hadoop, don’t worry. Today, we will explain Hadoop, why it emerged, and how it works. You’ll be part of the ‘in the Hadoop know’ crowd in no time. Won’t your colleagues be impressed?

 

 

How Did Hadoop Come About?

As the Internet grew, there was more data than ever before. This created a problem, however: the existing systems couldn’t handle the volume of information. A normal computer took too long to process it, and even the most expensive machines were only slightly quicker. There was no adequate technology for taking the vast amount of incoming information and making it accessible to businesses.

Google was one of the companies frustrated by the inaccessibility of its data. It knew it was receiving useful information, but had no way of analyzing it and using it to make its own systems more successful. So Google built its own platform and published papers describing its distributed file system and its MapReduce programming model. Nutch, an open source web search project, implemented those ideas, and Hadoop grew out of that work. One of the people most responsible for the project was Doug Cutting, who named the system Hadoop after his son’s toy elephant.

 

What is Hadoop?

In a nutshell, Hadoop is an open source framework that many organizations now use to store, process, and analyze large volumes of data. A cluster of commodity servers processes incoming information in parallel and presents it in such a way that organizations can pull from it, analyze it, and use the results to improve their business practices. The form of the data is largely irrelevant: it can be structured or unstructured, emails or log files. Because the work is spread across a cluster, capacity can grow by adding more servers, and the failure of a single machine does not have to bring processing to a halt. Hadoop changed the game by making it possible to process large quantities of information continuously, and by making the results accessible to organizations.

 

 

The Hadoop Architecture

Traditionally, if you wanted to analyze data, you needed to have an expensive computer with equally expensive programs. Even with the amount of money put into this technology, it wouldn’t necessarily be able to handle the incoming onslaught of information.

With Hadoop, you don’t have to use expensive supercomputers, just normal computers. Rather than attaching many processors to one central disk, a Hadoop cluster is made up of many ordinary servers, each with its own disks and its own CPUs. Each server stores a portion of the data and does the processing for that portion locally.

When information is uploaded to a Hadoop system, it is split into pieces that are distributed to the different servers. Each server analyzes its own piece of the information, and the partial results are then combined into a whole picture for the organization. This process, of first using several computers to analyze data where it is stored and then bringing the results back together, is called MapReduce. It is central to Hadoop.
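To make the split, process, and combine flow above more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical, and the "servers" are simulated by a simple loop on one machine. It only illustrates the general shape of a MapReduce word count.

```python
from collections import defaultdict


def split_into_chunks(lines, num_workers):
    """Deal the input lines out round-robin, one chunk per simulated server."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this server's chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group the emitted values by key across all servers' outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_workers=3):
    chunks = split_into_chunks(lines, num_workers)
    # On a real cluster each chunk would be mapped on a different machine,
    # in parallel; here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input standing in for a large log file.
    sample_logs = [
        "error disk full",
        "warning disk slow",
        "error network down",
    ]
    print(word_count(sample_logs))
```

The final counts do not depend on how many chunks the input is split into, which is what lets a cluster add servers without changing the answer.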

Google, Twitter, Yahoo, and Facebook all use MapReduce to sort through their information. In fact, Yahoo used this process to sort a terabyte of data in 62 seconds.
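A quick back-of-the-envelope calculation shows why a result like a terabyte in 62 seconds requires many machines working in parallel. The sketch below is only arithmetic: the 100 MB/s per-disk read speed is an assumed, illustrative figure rather than a measurement from the Yahoo cluster, and the function names are hypothetical. It also ignores the fact that sorting has to write data as well as read it.

```python
import math

TERABYTE_BYTES = 10**12        # one terabyte, decimal definition
SORT_SECONDS = 62              # figure quoted in the text
ASSUMED_DISK_MB_PER_S = 100    # assumption: sequential read speed of one disk


def aggregate_throughput_gb_per_s(total_bytes, seconds):
    """Data rate the whole cluster must sustain, in gigabytes per second."""
    return total_bytes / seconds / 10**9


def single_disk_read_hours(total_bytes, disk_mb_per_s):
    """Time for one disk just to read the data once, in hours."""
    return total_bytes / (disk_mb_per_s * 10**6) / 3600


def disks_needed(total_bytes, seconds, disk_mb_per_s):
    """Minimum number of disks reading in parallel to finish within `seconds`."""
    return math.ceil(total_bytes / seconds / (disk_mb_per_s * 10**6))


if __name__ == "__main__":
    rate = aggregate_throughput_gb_per_s(TERABYTE_BYTES, SORT_SECONDS)
    hours = single_disk_read_hours(TERABYTE_BYTES, ASSUMED_DISK_MB_PER_S)
    disks = disks_needed(TERABYTE_BYTES, SORT_SECONDS, ASSUMED_DISK_MB_PER_S)
    print(f"Cluster-wide rate: about {rate:.1f} GB/s")
    print(f"One disk reading alone: about {hours:.1f} hours")
    print(f"Disks needed in parallel: at least {disks}")
```

Under these assumptions, a single disk would need hours just to read the data once, so finishing in about a minute is only plausible when the work is spread across well over a hundred disks at the same time.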

 

Hadoop has changed the way that information is processed. We can now take the seemingly endless information coming in from the Internet and from day-to-day business operations, process it quickly, and analyze it. Businesses can adjust their practices based on the results and keep the data on file for future reference. Hadoop is revolutionizing how we think about big data.

 



Thanks to Rob Young and webtreats for the use of their respective photographs.