Mass-scale computing: Why Hadoop is hot but Java is not

With the massive amount of data proliferating the Web, companies such as Google and many others are building new technologies to sort it all. Core to that movement is something called MapReduce, a software technique that breaks down huge amounts of data into smaller bits. Operating on the smaller bits, and then piecing results together to form the big picture again has proven extremely successful.

MapReduce was introduced by a paper from Google. Although Google's implementation (written in C++) remains proprietary, their paper inspired the open source Hadoop implementation (written in Java), which has become quite popular. In fact, Hadoop has become such a hot item, that Amazon recently announced an elastic Hadoop web service! And Yahoo is Hadoop's biggest contributor, using it in its web search and advertising businesses.

One company called Cloudera (which VentureBeat recently profiled) is commercializing the Java-based open source Hadoop implementation, but that may not be the most efficient.

Having a look at these different approaches, is very timely, given the upcoming Hadoop World NYC, where the focus is broadening Hadoop's use in many other industries such as finance, telecom, biotech, and retail. It's become such a critical component in the expected skill-set of today's engineers that IBM and Google teamed up to form the Academic Cluster Computing Initiative (ACCI) to get the next crop of engineers quickly up to speed on the new programming model.

When we talk about mass-scale, we need to think about cluster sizes of 10,000 servers and beyond. Hadoop runs on up to several-thousand server clusters, with a goal of scaling to 10,000. Google reportedly has many sub-clusters, each perhaps of this magnitude, but some estimate their total cluster size to be greater than 500,000 servers. And they're looking ahead to when they'll have on the order of 10 million machines! At this scale, one really needs to pay attention to efficiency, and Google is certainly on a relentless pursuit of incremental gains in efficiency.

So what if I told you that the Hadoop platform and Java-based Hadoop applications are 30% less efficient by nature than anything Google has, purely from a software standpoint? That'd be a big deal for a company in any industry, aspiring to scale massively right? Yet, this is what I found, and it's due to the inefficiencies of Java vs C++. First, Java hogs extra resources (cache, memory, CPU cycles), a fact that doesn't always show up well in single benchmark tests, but does show up clearly when multiple Java benchmarks compete for resources at the same time. This alone loses Java 15%, as tests I ran showed.

On top of that, on average, Java loses to C++ by about 15%, especially when apps can be compiled for mass-scale computing (use profile-guided compilation, etc). So you're down 30% to start with by implementing a Java Hadoop app on top of the Java-based Hadoop infrastructure. Now, you don't have to write your app in Java; you can use C++ or even script languages. See for example, why the Hypertable project chose C++. But unfortunately, the choice of Java for infrastructure and the bevy of available libraries is driving many people to use Java for Hadoop. Let's look at what that means financially as we scale out increasingly.

Even for current Hadoop cluster sizes, 30% inefficiency adds up per-year. But somewhere between where the Google server-count may be currently and where it's heading, the yearly cost of that level of inefficiency crosses well beyond the billion dollar per year mark! Plug in your own numbers; at large scales, inefficiency is costly however you slice it. If your competitors are more efficient, you either lose in terms of dollars, processing capacity, the ability to extract higher premiums, or otherwise. Plan very carefully. Inefficiency scales extremely well...

Kevin Lawton is serial entrepreneur. He wrote the first academic paper on x86 virtualization, which sparked the multi-billion dollar x86 virtualization industry.

More