Why Hadoop projects fail -- and how to make yours a success

This is a guest post by Dell Software executive Guy Harrison.

Without doubt, "big data" is the hottest topic in enterprise IT since cloud computing came to prominence five years ago. And the most concrete technology behind the big data trend is Hadoop.

Most enterprises are at least experimenting with Hadoop, and the potential for transformative business improvement is real. But just as real is the chance of what I call a “Hadoop hangover” if the project fails to meet expectations and instead results in costly failure.

To help you make the most of Hadoop, let's look at the promise of big data analytics, and how to avoid expensive, disillusioning failure.

Getting from big data to smart algorithms

For most businesses, big data is an attempt to emulate the advanced data-driven business techniques that propelled Amazon and Google to the forefront of their respective industries.

This is not business intelligence as we have known it in the past: the primary aim is not to facilitate executive decision making through charts and reports, but to entwine data-driven algorithms directly into the business processes that drive customer experience.

Hadoop -- essentially an open source implementation of core Google technologies -- is the most concrete technology behind big data. Hadoop enables big data projects by providing an economical way to store and process masses of raw data. Hadoop has been proven at scale at Facebook and Yahoo, and was the basis of the most impressive artificial intelligence project to date: IBM’s Watson, the supercomputer that won Jeopardy! in 2011.

Most -- if not all -- Fortune 500 companies have at least a Hadoop pilot project in place. Many are still in the initial data capture stage: setting up the workflows to capture raw business data, demographics and the “data exhaust” flowing from websites and social media. These data capture projects entail significant risk in their own right.

Of course, collecting the data is only the beginning. There’s an old adage: “data is not knowledge and knowledge isn’t information” -- and this remains true even if you have “big” data. Indeed, we might add a new clause for our big data world: “information isn’t action”. In other words, determining the meaning of the data is no longer enough: we have to establish the mechanisms -- implemented as complex adaptive algorithms -- that drive a more effective business.

It’s a tenet of big data analytics that the more data you have, the less complex your algorithms need be. It’s the difference between predicting the outcome of an election from a polling sample and counting the votes on election night. The election night count is always more accurate.
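To make the polling analogy concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the electorate is simulated, and the function names, the 52% share and the sample size are all hypothetical. It simply contrasts an estimate drawn from a sample with a count of every vote.

```python
import random

def simulate_votes(n_voters, share_a, seed=0):
    """Simulated electorate: True means a vote for candidate A (illustrative data only)."""
    rng = random.Random(seed)
    return [rng.random() < share_a for _ in range(n_voters)]

def share_for_a(votes):
    """Fraction of the given votes that went to candidate A."""
    return sum(votes) / len(votes)

votes = simulate_votes(100_000, 0.52)

# "Election night": count every vote. No statistical model is needed.
full_count = share_for_a(votes)

# "Polling": estimate the same figure from a random sample of 1,000 voters.
poll = random.Random(1).sample(votes, 1_000)
poll_estimate = share_for_a(poll)

print(f"full count: {full_count:.4f}  poll estimate: {poll_estimate:.4f}")
```

The full count is exact by construction, while the poll carries sampling error that a pollster must model and correct for. That extra modelling is the "algorithmic complexity" that more data lets you avoid.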

Furthermore, machine learning techniques allow algorithms to be “trained” from the data itself. Essentially the data drives and refines the algorithms.

So having lots of data is an advantage. But, at the end of the day, it still requires a lot of human intelligence to come up with the best answers. Indeed, sometimes it’s a matter of asking the right question. Collecting the data is necessary but not sufficient. Getting from big data to smart algorithms is a unique challenge in its own right.

With all that in mind, let’s look at the key challenges facing successful big data analytics projects:

Data scientists are critical, but in short supply

The Googles and Amazons of this world succeeded in their big data projects largely because they were able to attract and retain some of the world’s most gifted computer scientists. These were individuals who brought to the table not just programming skills; they were also able to bring to bear complex statistical analysis techniques, business insight, cognitive psychology and incredible innovative problem solving abilities.

We’ve come to call these types of people “data scientists”, and it’s well understood that the base skills -- statistics, algorithms, parallel programming, and so on -- are in short supply. Academia is only just responding with curricula to produce suitably qualified graduates. It will be years before we see a significant increase in qualified data scientists.

If and when we see the supply of data scientists increase, we will still be faced with a more fundamental issue. This stuff is hard. It requires the ability to think across at least three fundamentally complex specializations, including competitive business strategy, machine learning algorithms, and massively parallel data programming. This unique combination of skills is likely to be the limiting factor for big data in the enterprise for the foreseeable future.

At the core of any big data project is the data scientist: acquiring or developing data science capability is critical to the project's success.

The shortage of big data tools

Compounding the problem of the data science talent gap -- but perhaps also offering a possible solution -- is the lack of suitable tools for the data scientist.

Hadoop and other data stores supply a brute force engine for computation and data storage. Hadoop clusters can consist of potentially thousands of commodity servers -- each with its own disk storage and CPUs. Data is stored redundantly across nodes in the cluster. The MapReduce algorithm allows processing to be distributed across all the nodes in the cluster. The result is an amazingly cost-effective way of distributing processing across potentially thousands of CPUs and disks.
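For readers who have not seen the MapReduce pattern, the following is a minimal single-process sketch in Python. It is not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce steps on many nodes in parallel. The sketch only shows the three logical phases (map, shuffle, reduce) using the classic word-count example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "big value"]))
```

Because each line can be mapped independently and each key can be reduced independently, the work divides naturally across machines. Even this trivial task, though, needs three hand-written functions.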

But programming in MapReduce is akin to programming in assembly language -- it’s not a practical way of creating big data algorithms. To turn big data into big value, the data scientist needs tools that can support statistical hypothesis testing, creating and training predictive models, as well as reporting and visualization. Open source projects such as Mahout, Weka and R provide a starting point, but none are easy to use, and often they are insufficiently scalable or otherwise unsuitable to be at the core of big data enterprise solutions.

Higher level toolkits -- which might leverage Mahout, R and the like, but which make them accessible to a wider audience and allow them to be used as building blocks in more complex workflows -- are the next stage of evolution for data science products. Without these big data analytic platforms, fully leveraging big data will only be possible in the largest enterprises, which have the budget and reputation sufficient to attract the limited supply of truly capable data scientists.

Data scientists need a more effective analysis framework and toolkit than is provided by Hadoop and its ecosystem. Producing these tools should be a priority for the software community.

The reduction in data quality

Hadoop succeeds as the basis for so many big data projects not just because it can economically store and process large quantities of data, but also because it can accept data in any form. In a traditional database, data must be converted to a pre-defined structure (a schema) before being loaded.

These ETL (Extract-Transform-Load) projects are typically expensive and time consuming. Furthermore, the economics of data warehousing typically required that the data be aggregated and pruned before loading, and therefore lost the granularity necessary for big data solutions.

Hadoop allows for “schema on read” -- you need only define the structure of the data when you come to read it. This allows data to be loaded in its most raw form, without needing to analyze or define the data ahead of time. You load everything at low cost, and then only “pay” for the schemas you need.
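As a rough illustration of schema on read, consider the Python sketch below. The raw events, the field names and the `read_with_schema` helper are all hypothetical. The raw lines are stored exactly as they arrived, and a structure is imposed only at the moment a particular analysis reads them.

```python
import json

# Raw events are stored untouched; note that the records do not share one fixed structure.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": "120"}',
    '{"user": "u2", "page": "/cart", "ms": "340", "referrer": "ad"}',
]

# A schema defined at read time: only the fields this analysis needs, with their types.
PAGE_TIMING_SCHEMA = {"user": str, "ms": int}

def read_with_schema(raw_lines, schema):
    """Parse each raw line and keep only the schema's fields, cast to the declared types."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

for row in read_with_schema(RAW_EVENTS, PAGE_TIMING_SCHEMA):
    print(row)
```

A different analysis could read the same raw lines with a different schema (for example, one that includes `referrer`) without anything being reloaded.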

However, this approach has some fairly obvious risks -- machine-generated data in particular might be changing structure rapidly and by the time you come to mine the data it might be very hard to determine its structure. Furthermore, any errors in the generated data might not be picked up until it is too late.

So despite the promise of schema on read, success in a big data project may depend on careful vetting of incoming data -- not to the extent of a full ETL process to be sure, but more than simply “load and hope”. After all, one of the first lessons of the computer age was GIGO: Garbage In, Garbage Out.
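One lightweight way to vet incoming data, short of a full ETL process, is to check each record against a few minimal expectations and set aside anything that fails. The Python sketch below assumes JSON-formatted input; the required fields and the `vet` function are hypothetical and would differ for every data feed.

```python
import json

# Minimal expectations for an incoming record (illustrative; adapt to the real feed).
REQUIRED_FIELDS = {"user": str, "ms": int}

def vet(raw_lines, required=REQUIRED_FIELDS):
    """Split raw lines into accepted records and quarantined (line, reason) pairs."""
    accepted, quarantined = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            quarantined.append((line, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            quarantined.append((line, "not a JSON object"))
            continue
        # Collect every required field that is missing or has the wrong type.
        problems = [name for name, kind in required.items()
                    if not isinstance(record.get(name), kind)]
        if problems:
            quarantined.append((line, "bad or missing fields: " + ", ".join(problems)))
        else:
            accepted.append(record)
    return accepted, quarantined

accepted, quarantined = vet(['{"user": "u1", "ms": 120}', '{"user": "u2"}', 'garbage'])
print(len(accepted), "accepted;", len(quarantined), "quarantined")
```

Quarantining rather than discarding keeps the raw data available, so a change in format can be noticed and handled early instead of being discovered months later at analysis time.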

Pay attention to the quality and format of data streaming into Hadoop. Make sure you’ve identified the structure and assured the quality of that data.

Hadoop has proven its scalability at places like Yahoo and Facebook, and proven an ability to power the most complex analytics as the basis for IBM’s Watson AI. However, it lacks some key features that the enterprise regards as important:

Security in Hadoop is weak. Once authenticated to a Hadoop cluster, a user can typically access all the data in that cluster. Although it’s possible to limit a user’s access to specific files in a Hadoop cluster, it’s not possible to restrict access to individual records within a file. Furthermore, because of the cumbersome nature of Hadoop security and the interaction with external tools such as Hive (Hadoop’s native SQL interface), the most common practice is to allow everybody access to everything.
Backup is also difficult. Hadoop is inherently fault tolerant, but enterprises still want to have a disaster recovery plan, or to restore to a point-in-time backup should some human error result in data corruption. Most distributions do not have these capabilities (the MapR distribution does provide a snapshot capability).
Integration with enterprise monitoring systems is lacking. Hadoop generates metrics, and each Hadoop vendor offers an “Enterprise” console, but these do not integrate properly with enterprise monitoring systems such as OpenView or Foglight.
Resource management is primitive. The ability to manage resources to prevent ad hoc requests from blocking mission-critical operations is only just emerging.
Real-time query is not a feature of Hadoop. While an emerging set of SQL-based languages and caching layers has been created, Hadoop is not a suitable basis for real-time computing.

None of these issues are show stoppers for Hadoop, but failure to acknowledge these limitations may lead to unrealistic expectations for your Hadoop project that cannot be fulfilled.

Make sure you understand the technical strengths and limitations of Hadoop. Avoid unrealistic expectations for your Hadoop solution.

Organizational challenges

Big data is a complex and potentially disruptive challenge to many organizations. Globalization and e-commerce have flattened the world so much that for many businesses simply competing on price or store locality is no longer an option. Competitive differentiation will derive increasingly from personalization, targeting, predictive recommendations and so on. For many businesses, achieving some form of data-driven operation will be survival itself.

History has shown that when faced with this sort of disruptive threat, many companies “freeze” -- clinging ever tighter to outmoded business models and hoping for a return to the competitive landscape of the past.

Big data analytics is an over-hyped, poorly-defined and over-used term. Despite that, and despite the challenges outlined above, I believe that for many businesses, the opportunities presented by the big data revolution are as significant and fundamental as those presented by e-commerce 15 years ago. Companies (particularly retailers) should be bold and determined in reacting to these challenges.

Organizational resistance and scepticism toward big data are understandable. But don’t let big data risks blind you to the benefits -- and sometimes necessity -- of a big data project. Indeed, drinking sensibly seems to be the best way to avoid the hangover without missing the party altogether.

Guy Harrison is an Executive Director of Research and Development for Dell Software. He is the author of several books and articles on database and data management, and writes a monthly column for Database Trends and Applications.

Guy’s work can be found online, and he can be reached by e-mail at guy.harrison@software.dell.com or on Twitter at @guyharrison.