How to conquer ‘big data’ with MapReduce & MPP

This is a guest post by Walt Maguire, Analytics Director at ParAccel

The emphasis on “big data” has grown mightily over the last year, as more companies strive to draw useful intelligence out of increasingly massive data volumes from web clickstreams, sensor data, social media data and other large datasets.

One technology approach has dominated the discussion: MapReduce. MapReduce is a programming model for distributed data processing, and its best-known open-source implementation, Hadoop (named for a stuffed elephant belonging to its creator’s son), has been trumpeted as the new solution on the scene, the silver bullet for getting value from big data.

But while MapReduce and Hadoop are interesting and useful, the approach is neither new nor a panacea. While Hadoop is often cost-effective for inexpensive data storage and lightweight data processing, running analytics on Hadoop data has been challenging. Early adopters report that analytics in Hadoop are very slow to process, a big problem when analyzing giant data sets, and complex to write, because Hadoop does not support SQL (Structured Query Language, the lingua franca of analysts).

Newsflash: Other technologies that solve many of the same problems have existed for decades, namely Massively Parallel Processing (MPP) databases, which are known for speedy processing of analytics and robust SQL support.

However, the MPP and Hadoop approaches are not mutually exclusive. Hadoop and MPP databases are increasingly used together by forward-thinking companies for a complete big data infrastructure that is cost-effective and leverages the best of both technologies.

Let’s compare the two approaches and look at a few specific examples of how they can be combined.

MapReduce and Hadoop evolution

At the heart of MapReduce are two functions called, unsurprisingly, Map and Reduce. A Map function takes some input data, such as a list of words, applies a function to each item and emits the results, typically as key-value pairs. A Reduce function takes the outputs of the Map step, grouped by key, and combines them into a smaller set of usable output data, such as a count for each word.

In the world of big data, divide and conquer is a must if we’re to cope with the data volumes generated today.

[Figure: example of a Map/Reduce function]
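
To make the Map and Reduce roles described above concrete, here is a minimal single-machine sketch of the classic word count. It is an illustration only, not Hadoop code: the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run Map over every line, group by word, then Reduce each group."""
    mapped = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big elephant"]))
```

Because each line is mapped independently and each word is reduced independently, both steps can be split across machines, which is the divide-and-conquer property that matters for large data volumes.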

MapReduce takes its name and its core idea from the map and reduce operations of functional programming. Google described its use of MapReduce, including to speed up its indexing of the World Wide Web, in a paper published in 2004, and the open source MapReduce platform, Hadoop, was developed not long after. Hadoop delivered a reasonably complete way to develop distributed MapReduce programs. It had numerous gaps, but for those analyzing 10,000x as much data as they were five years ago, it helped.

The recent uptake of Hadoop has been driven in part by necessity. With the exponential growth of the Internet, machine data and the trend toward “saving everything,” organizations have more data than ever before, much of it in unstructured forms. So an innovation first created as a programming technique has been pressed into service as a specialized platform for distributed data processing. While it’s good at functions traditionally performed by ETL tools, it’s not as good at providing fast answers to questions.

Organizations risk finding themselves with a large repository of data in Hadoop that they can’t analyze very well.

Massively Parallel Processing (MPP) evolution

For many of the tasks necessary in processing and analyzing big data today, the Massively Parallel Processing (MPP) database is better. MPP databases also split up complex, large volume jobs into units processed across multiple nodes. While they don’t act exactly like MapReduce, they accomplish many of the same things, and are far better at some things.
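
As a rough sketch of how an MPP database splits one job into units processed across nodes, the example below simulates a query like "total sales by store". Everything here is an assumption made for illustration: the function names, the choice of store as the distribution key, and the use of a CRC32 hash to place rows are hypothetical and do not describe any particular vendor's product. The loop over nodes runs sequentially here; a real system would run the per-node work in parallel.

```python
import zlib
from collections import Counter


def node_for(key, num_nodes):
    """Pick a node for a row by hashing its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Spread (store, amount) rows across nodes so each store lives on one node."""
    slices = [[] for _ in range(num_nodes)]
    for store, amount in rows:
        slices[node_for(store, num_nodes)].append((store, amount))
    return slices


def local_sum(node_rows):
    """Work done on a single node: aggregate only the rows it holds."""
    totals = Counter()
    for store, amount in node_rows:
        totals[store] += amount
    return totals


def total_by_store(rows, num_nodes=4):
    """Leader step: merge the partial results returned by every node."""
    merged = Counter()
    for node_rows in distribute(rows, num_nodes):
        merged.update(local_sum(node_rows))
    return dict(merged)


if __name__ == "__main__":
    sales = [("north", 10), ("south", 5), ("north", 7)]
    print(total_by_store(sales))
```

The analyst would only write the SQL; the distribution and merging shown here are handled by the database, which is a large part of the appeal.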

MPP databases provide things database users have taken for granted for decades, such as ACID compliance (atomicity, consistency, isolation and durability), meaning you will get predictable answers to questions. This isn’t enforced in Hadoop. MPP databases also include cost-based optimizers and monitor the distribution of data within the system. As a result, they are generally an order of magnitude more efficient than Hadoop, so you can do things ten times more quickly, or do the same thing with one-tenth the infrastructure.

MPP databases do not solve every problem. For example, when the structure of incoming data is unknown or variable, an MPP database requires that this be structured at load time. So a measure of data manipulation must take place to prepare it. Also, appliance-based MPP systems can be difficult and costly to expand, whereas Hadoop is designed to run on any hardware. Software-based MPP database solutions don’t have this problem.
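
The kind of preparation needed before loading variable data can be sketched as follows. This is a hypothetical example: the key=value log format, the SCHEMA columns and the function names are all assumptions chosen for illustration, and a production load process would also need rules for rejected or malformed records.

```python
# Hypothetical target table: column name -> type the database expects.
SCHEMA = {"ts": str, "user_id": int, "action": str}


def structure_record(raw_line):
    """Turn one loosely formatted 'key=value' line into a row matching SCHEMA."""
    fields = dict(part.split("=", 1) for part in raw_line.split() if "=" in part)
    row = {}
    for column, cast in SCHEMA.items():
        value = fields.get(column)
        try:
            # Missing columns become None (NULL); unknown extra keys are dropped.
            row[column] = cast(value) if value is not None else None
        except ValueError:
            # A value that cannot be converted is loaded as NULL in this sketch.
            row[column] = None
    return row


def prepare_for_load(raw_lines):
    """Structure every non-empty line so it fits the fixed table definition."""
    return [structure_record(line) for line in raw_lines if line.strip()]


if __name__ == "__main__":
    raw = ["ts=2012-01-01 user_id=42 action=login extra=1", "user_id=abc action=view"]
    for row in prepare_for_load(raw):
        print(row)
```

With Hadoop, the raw lines could simply be stored as they arrive and interpreted later; with an MPP database, a step like this has to happen before or during the load.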

The following table compares and contrasts Hadoop/MapReduce with MPP databases.

| | Hadoop/MapReduce | MPP Databases |
| --- | --- | --- |
| Why invented? | Expand existing programming technology into large scale processing | Expand existing database technology into large scale processing |
| Who invented? | Open source community | Teradata, Netezza, Greenplum, Vertica, ParAccel, etc. |
| What does it do? | Divide a single large problem into smaller units for processing across a distributed system | Divide a single large problem into smaller units for processing across a distributed system |
| Language | Java, Pig, HQL, etc. | SQL |
| Pluses | You can control everything; can run on low-cost hardware; good at unstructured data | Easy to deploy and use; uses well-known SQL syntax and supports SQL-based BI tools; high performance |
| Minuses | Lower performance; programming requirements; doesn’t support SQL-based BI tools; open source ownership | Upfront investment; unstructured data requires pre-processing |

Comparisons from the real world

Many firms have brought in both technologies for their big data infrastructure.

One large retailer found that Hadoop and an MPP platform are complementary. The company ingests large amounts of unstructured data and archives it at low cost with Hadoop; it then loads the data needed for analytics into the MPP platform via the vendor’s proprietary, high-speed integration module. Now, this retailer can run jobs 200x faster than it could on its previous data warehouse, enabling more granular market basket analysis and customer segmentation. By using Hadoop for low-cost storage and an analytic platform for the actual analysis, it cost-effectively solved a number of key problems.

This is a common model these days. Evernote, a Redwood City, Calif.-based developer of note-taking and organization software, has a similar architecture, using Hadoop for low-cost data storage and processing of web application log data, combined with an MPP platform for analytics. For Evernote, it was faster to move the data to a platform purpose-built for analytics than to try to run the analytics within Hadoop.

Evernote CTO Dave Engberg provides much more detail and a summary in the company’s Tech Blog:

Hadoop is great for cheaply storing a ton of data and performing parallel batch processing jobs in minutes instead of hours (or days)…But it’s not particularly quick for more complicated analyses that combine multiple different sets of data.…

Overall, the new infrastructure has met our goals. We can load and transform hundreds of millions of records in two hours instead of 10+, we’re generating far more (and far better) reports, and we can safely perform much more complex analyses of user trends than we could before.

In summary, there is a useful place for MapReduce and Hadoop in the big data landscape, but MPP technologies also offer significant advantages. Companies should strongly consider using both together to deliver big data infrastructures.

