MapReduce and Hadoop - Key Technologies for Big Data

This is the transcript of a podcast hosted by InetSoft on the topic of "Big Data: Its Definition and an Overview." The speaker is Mark Flaherty, CMO at InetSoft.

So we’re approaching big data in a couple of ways. MapReduce and Hadoop are key technologies associated with big data. There are a lot of questions out there surrounding these technologies. For instance, is it better to use MapReduce or a data warehouse for big data?

That’s a good question. A lot of people are struggling with it, and there’s a lot of religious fervor on both sides of the debate. Should we use commodity hardware with a parallel MapReduce framework for our analytics, or should we use a traditional data warehouse with relational tools and relational capabilities? And the answer is simple.

If you had a screwdriver, you could pound in a nail with it, but you might choose a different tool that would be more effective. MapReduce and Hadoop are two sides of the same coin. They provide process-oriented parallelism, and they rely heavily on procedural languages.

In contrast, relational databases have parallelism built into the data, and they speak SQL. So you use the correct tool for the job. And ultimately the two approaches are very complementary.
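
To make that contrast concrete, here is the classic word-count job written against Hadoop’s Java MapReduce API. It is a minimal sketch rather than production code: the developer supplies map and reduce procedures and the framework parallelizes them across the cluster, whereas a relational warehouse would express the same aggregation declaratively as a single GROUP BY query.

    // Classic Hadoop MapReduce word count, illustrating process-oriented
    // parallelism: map and reduce are procedures the framework runs in
    // parallel across the cluster.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in the input split given to this task.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word; the relational equivalent is
      // SELECT word, COUNT(*) FROM words GROUP BY word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The job would typically be packaged as a JAR and submitted with the hadoop jar command, with the input and output arguments pointing at HDFS directories.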

Now they do overlap some. You can run reports with MapReduce; a lot of people do that. You can do data mining with MapReduce; a lot of people do that. And there are some forms of data mining that work best in Hadoop and don’t work as well in a SQL data warehouse. Conversely, you can take a tool like SAS or R and run it inside the data warehouse, speeding it up enormously, so it becomes very effective and valuable.

So the consumer of these tools, the customer, has a decision to make: what is the optimal place to run each of these tools, each of these functions, for their business? It’s never a simple, one-way answer, but what we’ve decided is that these two technologies should work together. They should exchange data, and they should interact so that it’s easy for the customer to make that choice and to change their mind when they want to.

History of Hadoop

Here's some of the history of Hadoop:

Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large scale computations to be trivially parallelized across large clusters of servers. Cutting, realizing the importance of this paper to extending Lucene into the realm of extremely large (web-scale) search problems, created the open-source Hadoop framework that allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware. Cutting was an employee of Yahoo!, where he led the Hadoop project full-time; he has since moved on to Cloudera.

Architecture

Hadoop consists of the Hadoop Common package, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.

For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is located. Hadoop applications can use this information to run work on the node where the data resides and, failing that, on the same rack or switch, thereby reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.
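
As a rough illustration of that location awareness, the sketch below (assuming the standard Hadoop Java client in org.apache.hadoop.fs; the file path is hypothetical) asks HDFS where the replicas of a file’s blocks live. A scheduler can use the reported hosts and topology paths, such as /rack-1/node-7, to place work on, or at least near, the data.

    // List, for each block of an HDFS file, the hosts holding a replica and
    // their position in the network topology.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured filesystem
        Path file = new Path(args[0]);              // e.g. /data/events.log (hypothetical path)

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
          System.out.printf("offset=%d length=%d%n", block.getOffset(), block.getLength());
          System.out.println("  hosts:    " + String.join(", ", block.getHosts()));
          System.out.println("  topology: " + String.join(", ", block.getTopologyPaths()));
        }
        fs.close();
      }
    }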
