Big data's little secret: Hadoop isn't the end-all-be-all

This is a guest post by enterprise technology executive Jeff Carr

"Big data" is without a doubt the hottest trend in technology today, possibly surpassing social media, which has held the tech hype crown for years. At its broadest, the definition of big data includes any aspect of harnessing, analyzing and monetizing the massive amounts of data being generated by web and mobile based applications. The sheer scale of the data being generated dwarfs what was considered ‘large’ amounts of data as recently as ten years ago, and all indications are that this trend line will continue.

Most observers would agree that the era of big data started around 2007, when Google’s MapReduce programming framework was integrated with Apache Hadoop, an open source project founded a couple of years earlier to help developers efficiently and cheaply process large amounts of data. Used together, Hadoop and MapReduce made it faster, easier and cheaper to process and analyze massive volumes of data than ever before.

At this juncture companies started adopting various forms of Hadoop/MapReduce to capture and filter their data. Companies like Yahoo and later Facebook were some of the earliest to announce petabyte stores of data in Hadoop.

Rapid commercialization of the Hadoop ecosystem, however, has only occurred in the last two or three years, as the revenue opportunity began to reveal itself. As with any big trend in technology, including the RDBMS/client server, internet, and web security trends that preceded it, big data has correspondingly evolved into the technological equivalent of a gold rush. Hundreds of companies have entered the fray with hopes to quickly cash in.

The majority of these companies, which include pre-big data enterprise technology incumbents and a slew of data-focused technology startups, are positioning themselves as the suppliers to the miners of big data. Instead of picks, axes and gold pans, they supply the tools, technologies and services that will help companies monetize huge amounts of data. Needless to say, there is a lot of data to be mined and a lot of money to be made.

A slightly closer look at the big data market demonstrates an obvious, yet often overlooked, truth about where we are in the big data innovation and maturity cycle. Most big data products available today are UIs, management tools and integration tools built around open source projects being developed within the Apache Hadoop ecosystem, including Hive, Pig and ZooKeeper. Big data revenue, on the other hand, is being driven primarily by services that help design, architect and implement big data solutions using the Hadoop ecosystem. In fact, many of the largest and fastest growing companies in big data today are pure services companies (Opera Solutions and Think Big Analytics come to mind).

I’m not suggesting that there is anything wrong with a services approach to the market. Most of these companies are providing solid value for their customers, which can translate into healthy revenue streams. It does help, however, to have some historical perspective to understand where we are in the big data innovation cycle, and what comes next.

Open source services and support as a primary revenue stream originated in the 1990s, when there was a similar gold rush around the commercialization of Linux. Early companies such as VA Linux and Red Hat capitalized on this. Nearly 20 years after the open source movement started, however, there is exactly one company with “pure” open source roots that has more than $1B in annual revenue, and it achieved this in 2012. By contrast, there are many billion dollar technology companies that have innovated new IP-based solutions that monetize major technology trends.

In that sense, it’s clear that we remain in the earliest days of the big data movement. Larger companies looking to monetize their big data assets lack the expertise and know-how to do so, and they are turning to services companies to help them bridge those gaps. As the market evolves, those skills will become more widely available and broader product innovation will begin to take hold, reducing reliance on services-centric companies. To relate this back to my gold rush analogy, the early winners were the people selling tools and mining expertise, but the long term winners were the people who actually found the gold!

So what’s next for big data? For any developer team that has felt the pain of building a big data infrastructure, one clear next step is simplification. The diagram below is a basic flow most companies go through to leverage big data:

Building this solution requires a small army of vendors and consultants to combine solutions and technologies in various ways so that a company can analyze and (hopefully) monetize its data. It takes months, and in some cases, years. It’s expensive. In short, it’s a pain in the butt, and the result often does not help monetize big data directly; it’s just the first step in the process.

Yet, no one can argue that this is not where the “action” is in big data today.

Simply put, the current state of big data is great for service vendors, and not always so great for big data buyers. It’s a market ripe for innovation. In the immediate future, we can expect an increasing number of product-centric companies to begin to disrupt the patchwork of services-centric solutions that currently exist.

In summary, the “secret” of big data is that today it suffers from a dearth of expertise, so the majority of the revenue is coming from a services-centric approach combined with open source technologies. I am in no way diminishing the importance and value of open source projects like Hadoop. On the contrary, I’m a huge supporter and always have been.

What I’m pointing out is that the market will evolve beyond an open source, services-driven revenue model when companies begin developing highly disruptive technologies that solve the hardest problems of big data. While open source solutions like those from Apache may play a role in this, history indicates that the most innovation will come from companies engineering entirely new ways to solve the most difficult problems.

Jeff Carr is COO of Precog. Precog is a data science platform designed for developers and data scientists to turn data assets into data-driven features and products inside an application.

Jeff has worked in technology for 25 years with a focus on business development, market assessment, strategy and operations. For the past 11 years he has worked exclusively with early stage companies in markets including network security (Vericept, CipherTrust), VoIP (Borderware SIPassure), and big data (Precog).
