When we deployed the first production Hadoop cluster in 2006, we were looking to build a more efficient and cost-effective web search index at Yahoo. Very quickly, other groups at Yahoo started using Hadoop for research jobs and revenue-driving applications, such as click prediction for sponsored search. Since then, Hadoop has evolved to become an essential tool for many of the world’s largest companies, including Facebook, GE, Visa, and Wal-Mart.

But as adoption has grown and companies increasingly rely on Hadoop, a few shortcomings have limited its potential business value. As we built Pepperdata, we worked with many of the largest users to understand how they use their clusters and what needed improving. We found that three pain points cause the most trouble:

1. Mixed workloads and multi-tenant environments cause jobs to fight for resources.

While Hadoop schedulers have improved over the years, they are still based on pre-allocating resources when a job starts. The problem is that jobs use a varying mix of hardware resources over the course of their lifetimes. In addition, some hardware resources (such as disk I/O and network) aren’t limited in standard Hadoop. Both factors lead to competition for resources that ideally should be arbitrated at runtime, and the result is work that finishes late or not at all.
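The gap between what a job reserves at start and what it uses over time can be sketched with a toy model. This is not Hadoop code: the function `reserved_vs_used`, the job names, and every number below are hypothetical, chosen only to illustrate the mismatch described above.

```python
# Toy model, not Hadoop code: every name and number here is hypothetical.

def reserved_vs_used(jobs, capacity):
    """Compare resources reserved up front with resources used per time step.

    jobs: mapping of job name -> (reserved_units, [usage at each time step])
    capacity: total units the scheduler may hand out
    """
    total_reserved = sum(reserved for reserved, _ in jobs.values())
    if total_reserved > capacity:
        raise ValueError("scheduler would not admit all jobs at once")
    steps = max(len(usage) for _, usage in jobs.values())
    report = []
    for t in range(steps):
        # Sum what each job actually consumes at this step.
        used = sum(usage[t] for _, usage in jobs.values() if t < len(usage))
        report.append({
            "step": t,
            "reserved": total_reserved,
            "used": used,
            "idle": total_reserved - used,  # reserved but unused capacity
        })
    return report


jobs = {
    "etl": (40, [10, 40, 15, 5]),
    "adhoc_query": (40, [35, 10, 10, 30]),
}
for row in reserved_vs_used(jobs, capacity=100):
    print(row)
```

In this sketch the two jobs hold 80 of 100 units for their whole lifetime, so a third job asking for 40 units would have to wait even though combined usage never exceeds 50. The model leaves out unmetered resources such as disk I/O and network entirely.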

2. Troubleshooting is difficult and can take hours.

Although many tools allow users to monitor their clusters, administrators are often left with an incomplete view of the factors affecting cluster health. We’ve seen that the lack of granular tools makes it difficult to isolate the root cause of problems and drives a lot of inefficient behavior, such as guess-and-check restarting and asking users about jobs they submitted. As cluster size grows and businesses increasingly rely on Hadoop, such methods will become (and for many have already become) unsustainable.

3. Organizations buy more hardware than they need.

To compensate for their lack of control over cluster resources, organizations usually size their clusters based on anticipated peak loads. The goal is to ensure that jobs don’t overload the cluster and lead to massively degraded performance, job failures, or worse. However, because of Hadoop’s inefficient, up-front allocation of resources, this strategy is expensive and leaves capacity unused much of the time. It can also still fail to prevent those outcomes, because workloads are often unpredictable.
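A short piece of arithmetic shows why sizing for the peak leaves capacity idle. The load figures and the helper `peak_sizing_utilization` are made up for illustration and do not come from any real cluster.

```python
# Illustrative arithmetic only; the hourly load figures are made up.

def peak_sizing_utilization(hourly_load, headroom=0.0):
    """Return (cluster_size, average_utilization) when sizing for peak load.

    hourly_load: demand per hour, in the same units as cluster capacity
    headroom: extra fraction of capacity added on top of the peak
    """
    if not hourly_load:
        raise ValueError("hourly_load must not be empty")
    cluster_size = max(hourly_load) * (1 + headroom)
    average_load = sum(hourly_load) / len(hourly_load)
    return cluster_size, average_load / cluster_size


load = [20, 20, 30, 100, 30, 20, 20, 0]  # one short spike dominates
size, utilization = peak_sizing_utilization(load)
print(f"size={size:.0f} units, average utilization={utilization:.0%}")
```

With these hypothetical numbers, a cluster sized for the single spike averages 30% utilization, and any headroom added on top of the peak lowers that figure further. An unexpected spike above the anticipated peak would still overload the cluster.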

Workarounds vs. solutions

In our research, we found that many administrators of large-scale production clusters implement their own workarounds. We suggest administrators who are experiencing pain consider the following, bearing in mind that these are mitigations, rather than wholesale solutions:

  • Deploying separate clusters to isolate production from R&D and groups from each other. While this provides a high degree of separation for important jobs, it is highly inefficient and can cause challenges in keeping data in sync.
  • Extensively optimizing Hadoop via existing tuning methods to tweak specific jobs, slots, etc. While tuning is a natural part of deployment, current options are still limited: they don’t address the fundamental limitation of up-front allocation, and they ignore critical factors such as disk I/O and network utilization.
  • Building business processes to reduce the risk of surprises – e.g., one company has a Hadoop review committee to approve all production jobs prior to submission. Such an approach acts as a gating mechanism, slowing users, reducing innovation, and decreasing the usefulness of the cluster.
  • Orienting processes to reduce the impact of delays. This strategy treats Hadoop as an unreliable system and assumes any given job might not deliver timely outputs, severely limiting the positive business impact companies can get from Hadoop.
  • Logging into a node when a problem occurs, as a last resort, to attempt to catch the offending job in the act. This requires a high degree of human intervention and doesn’t always work.

Again, these suggestions are workarounds for Hadoop’s limitations rather than solutions, and most come with their own tradeoffs. Therefore, you should be aware that business-critical applications may need significantly more attention than merely setting up out-of-the-box Hadoop.

Hadoop’s evolution

The adoption of YARN is increasing the flexibility and efficiency of Hadoop clusters and bringing whole new classes of users to the platform. While we (and the rest of the industry) are excited about the wide-scale adoption of YARN, it still suffers from some of the fundamental problems of Hadoop 1, like up-front allocation of resources and imperfect visibility, and thus represents an evolution, not a magic bullet.

Hadoop is a fantastic platform, and it’s no surprise that it has been so widely adopted. Over the past eight years, we’ve seen tremendous growth in its flexibility, scalability, and reliability. At the same time, some of the fundamental challenges we faced with Hadoop in the early days at Yahoo, such as its unpredictability in multi-tenant clusters, continue to be issues. As Hadoop continues to evolve, however, we will not only overcome these challenges, but unleash its true business value.

Sean Suchter, founder and chief executive of Pepperdata.