Mass-scale computing: Why Hadoop is hot but Java is not

kevinlawtonWith the massive amount of data proliferating the Web, companies such as Google and many others are building new technologies to sort it all. Core to that movement is something called MapReduce, a software technique that breaks down huge amounts of data into smaller bits. Operating on the smaller bits, and then piecing results together to form the big picture again has proven extremely successful.

MapReduce was introduced by a paper from Google. Although Google’s implementation (written in C++) remains proprietary, their paper inspired the open source Hadoop implementation (written in Java), which has become quite popular. In fact, Hadoop has become such a hot item, that Amazon recently announced an elastic Hadoop web service! And Yahoo is Hadoop’s biggest contributor, using it in its web search and advertising businesses.

One company called Cloudera (which VentureBeat recently profiled) is commercializing the Java-based open source Hadoop implementation, but that may not be the most efficient.

Having a look at these different approaches, is very timely, given the upcoming Hadoop World NYC, where the focus is broadening Hadoop’s use in many other industries such as finance, telecom, biotech, and retail. It’s become such a critical component in the expected skill-set of today’s engineers that IBM and Google teamed up to form the Academic Cluster Computing Initiative (ACCI) to get the next crop of engineers quickly up to speed on the new programming model.

When we talk about mass-scale, we need to think about cluster sizes of 10,000 servers and beyond. Hadoop runs on up to several-thousand server clusters, with a goal of scaling to 10,000. Google reportedly has many sub-clusters, each perhaps of this magnitude, but some estimate their total cluster size to be greater than 500,000 servers. And they’re looking ahead to when they’ll have on the order of 10 million machines! At this scale, one really needs to pay attention to efficiency, and Google is certainly on a relentless pursuit of incremental gains in efficiency.

So what if I told you that the Hadoop platform and Java-based Hadoop applications are 30% less efficient by nature than anything Google has, purely from a software standpoint? That’d be a big deal for a company in any industry, aspiring to scale massively right? Yet, this is what I found, and it’s due to the inefficiencies of Java vs C++. First, Java hogs extra resources (cache, memory, CPU cycles), a fact that doesn’t always show up well in single benchmark tests, but does show up clearly when multiple Java benchmarks compete for resources at the same time. This alone loses Java 15%, as tests I ran showed.

On top of that, on average, Java loses to C++ by about 15%, especially when apps can be compiled for mass-scale computing (use profile-guided compilation, etc). So you’re down 30% to start with by implementing a Java Hadoop app on top of the Java-based Hadoop infrastructure. Now, you don’t have to write your app in Java; you can use C++ or even script languages. See for example, why the Hypertable project chose C++. But unfortunately, the choice of Java for infrastructure and the bevy of available libraries is driving many people to use Java for Hadoop. Let’s look at what that means financially as we scale out increasingly.

Even for current Hadoop cluster sizes, 30% inefficiency adds up per-year. But somewhere between where the Google server-count may be currently and where it’s heading, the yearly cost of that level of inefficiency crosses well beyond the billion dollar per year mark! Plug in your own numbers; at large scales, inefficiency is costly however you slice it. If your competitors are more efficient, you either lose in terms of dollars, processing capacity, the ability to extract higher premiums, or otherwise. Plan very carefully. Inefficiency scales extremely well…

Kevin Lawton is serial entrepreneur. He wrote the first academic paper on x86 virtualization, which sparked the multi-billion dollar x86 virtualization industry.

Next Story:
Previous Story:

  • Andrew
    I generally agree with the analysis of Hadoop versus C++ implementations of massively distributed systems. C++ is substantially more efficient in material ways for these kinds of applications at scale, and that advantage is not going away. I would even agree that Google's implementation is substantially ahead of the open source variants by a significant margin. It comports with my experience and I think it makes sense in the abstract.

    However, there is an elephant in the room that is ignored here. MapReduce, C++ or Java, is incredibly inefficient for most types of analytics or complex search. That 15-30% improvement amounts to naught if the analytics are intractable by virtue of using a vanilla MapReduce model. It only fits a very limited application space; if your application is outside that scope, the efficiency of MapReduce is very, very poor.

    I think there is an assumption that MapReduce is the future in some sense, but it really isn't. It simply does not do enough to be that piece. I work in the application-oriented analytics space and the number of those problems that fit into a traditional MapReduce model is a tiny minority, and none of those applications are the interesting ones. Its repertoire is too limited no matter the efficiency of the implementation.
  • sean
    Agreed totally.

    Those who boast MR or Hadoop seem not to understand computing very much.

    MR is only good at one thing, that is SIMD.
  • kevinlawton
    Andrew, I agree that MapReduce is oriented for only some types of work. That's a great observation, and an article all in its own right. The intent/gist of this article was that given MapReduce/Hadoop are so hot, for those who are using it, it's well worth considering efficiencies at large scales.
  • sean
    From your figure, can I say cloudere is doomed because of its inherent cost level?
  • Althrough I agree 100% that Java is not as fast as C++ I don't think saying that Hadoop wastes 30% efficiency is fair. Google's implementation is not available for outsiders and Hadoop provides improved efficiency over other solutions which are available.
    It's a bit like saying that not owning a Ferrari causes that you to lose efficiency by slowing you down on your way to work. If you can't afford/are not allowed to own a Ferrari than riding a bike is still an advantage compared to walking.
  • kevinlawton
    Robert, this is an article about opportunity as much as inefficiency. Hadoop is open source, and there's no reason it can't be re-created (and improved) in C++ with APIs to access from any kind of apps above. One big improvement while doing so, would be to as a standard, compile programs to LLVM instead of x86. That way they can be re-targeted to whatever generation x86 you have, independently on each server, and take advantage of the features on that server (rather than compiling to a lowest common denominator). That will boost performance even more. These LLVM-to-x86 mappings can be cached of course.

    If programs and libraries are stored this way, it makes for another potential, to create a portable ecosystem of open source and proprietary modules. This would dove-tail with a commercialization effort of a C++ version of Hadoop.
  • CPU-level performance is not very important with jobs usually run using Hadoop or Google's Map Reduce. The idea behind such solutions is to distribute workload efficiently rather than maximize CPU usage. Basically processors can process data faster than read it from disks. Reimplementing Hadoop in faster language won't help with that.
  • I have to agree with Robert on this one. I understand your counterpoint of opportunity, but it is being stretched here. We can stretch your argument even more if we say that implementing a Hadoop clone in pure x86 assembly could potentially be faster than Google's implementation and thus improves our efficiency and lowers our costs. In the end, with todays resources and alternatives, Hadoop is a cost saver.
  • Tom
    I would very much like to hear more about Andrew's point, regarding the solution space that MapReduce can or cannot address. Hadoop is being touted as "the future" and if it's not, do we know what is?
  • Urgh.

    1. Can someone point to me the study that proves that C++ is 30% more efficient than Java?

    2. Even if that were the case (which it's not -- bytecode has been proven to perform better than native code in several cases) by far the biggest bottlenecks in a distributed data processing system are network and disk I/O.

    3. Within Google's MapReduce system, they let you run jobs in other languages including Python and Java.
  • Kevin is spot on in his comments. Inefficiency does scale pretty well and anyone who claims Java is as good at processing a data stream as C/C++ is smoking dope frankly. The best proof is that the most obvious performance critical sections in Hadoop - the compression libraries - are themselves coded in C and invoked via JNI interfaces. The Java compression codecs are unusable (even when they are highly desirable - like the bzip codec). The even bigger elephant in the room is memory - Java sucks memory like crazy. Try building an object cache in C vs. Java and you would know (and yes - this is a relevant data structure in Hadoop - it's extensively used by Hive (written on top of Hadoop) for aggregation operators). Inefficient memory usage means one has to page to disk more often - but as badly - a large memory footprint causes higher CPU utilization (as the cpu stalls for memory requests since the cpu cache less ineffective).

    However - this is the almost logical outcome of the way open source software works. The first implementation with a strong community wins (no matter how inefficient). And development in Java is faster and less buggy. A proprietary software company will have about 50-60% margin at least (over hardware - i am using Netapp as the poster boy here). Open source being free - even if it is inefficient by 30% - it's still a bargain for the user. If you look at the data warehousing market (where Hadoop is a legitimate contender) - companies like Aster charge $50K per terabyte! - whereas the cost of a terabyte in a hadoop cluster is less than $500.

    Bottomline - inefficiency scales pretty well - but inefficiency can be tried to be fixed over time (expect more parts of Hadoop to move to C). But proprietary software costs scale as well and are unfixable. The choice, however unfortunate, is clear.
  • Ahem..looks like some random comparisons without any repeatable benchmarks.

    To think that we had progressed beyond Java/C++ flame wars and into solving actual customer problems...

    Hadoop is about mobile code doing a great job at distributed computation rather than a vast virtual storage device. If you are worried about storage footprint, just buy an SSD and put MySQL on top of it.

    It's also the reason why some of the best Compute Grids in the industry are written in Java. It's not just about beating micro-benchmarks. Case in point - Coherence, GigaSpaces, Gemstone, DataSynapse, GridGain and then Amazon Dynamo, Voldermort, Cassandra.

    If you need dumb caching + storage, use Memcached.

    Cheers!
  • Some people measure the virtualization overhead of zen to be about 5%. VMware could be worse. Linux VServer has almost no overhead at all. Q: Why did Amazon chose XEN? A: Zen is sexy.
blog comments powered by Disqus