Hadoop, big data, and the elephant in the room

In the 1800s, John Godfrey Saxe wrote a poem about six blind men and an elephant based on an old Indian story. In an effort to discover what the elephant is, each man touches a different part of the creature and subsequently draws his own unique -- and incorrect -- conclusion about what the beast is. Saxe charitably observes that "each was partly in the right, and all were in the wrong." Fast forward to today, and the elephant may as well have been called Hadoop.

Once again, with Hadoop, we have people trying to describe a puzzling animal. The opinions are as varied -- and sometimes as incorrect -- as they were in the poem. Hadoop has been variously described as the ideal way to do transaction processing, the ideal way to do search, and the ideal way to do analysis, all of which are quite different use cases. If that were not unlikely enough, it is also claimed to be the best way to analyze structured data, semi-structured data and unstructured data. In fact, we are led to believe that it is everything to everyone.

How is this possible? Hadoop is a primitive, undifferentiated technology that can be molded in various ways. In the evolutionary tree, it's far closer to low-level programming languages like C and Java than it is to function-specific programs like database management systems, let alone higher-level user applications like spreadsheets. When people look at Hadoop and describe it in widely varying ways, they're all correct, because it is like clay that can, in theory, be molded into whatever shape is required. The problem is that they are also wrong, in that it really is just a lump of clay. Turning it into something useful requires a lot of skill, time and effort. Hadoop 2.0 has done nothing to change this.

Now, I'm not suggesting that there is anything wrong with clay or that Hadoop 2.0 isn't a high-quality version of it. You definitely need lower-level technologies on which to build the higher-level ones. It's just that the current hype seems misplaced. When people praise Frank Lloyd Wright's Fallingwater, how often do they emphasize the chemical composition of the concrete? The important thing about a piece of software is how easy it is to use and apply productively.

Like the traditional analytical stack that employs things like data integration, data warehousing, and business intelligence, Hadoop -- and Hadoop 2.0 -- has given us a new stack that is equally complex and acronym-rich: from HDFS to YARN, from HBase to various flavors of business intelligence. In this new Hadoop world, data still needs to be moved from place to place. Too many layers separate users from their data. Too much time and know-how are required to prepare data. The result: gainfully employed technologists, frustrated business managers, and a lost opportunity to remove the barriers separating business users from insights.

The only other happy parties in this new world are recruiters, who are able to reap rich rewards for bagging unicorns -- the fabled data scientists who possess a mastery of statistics, PhDs in computer science, and untold experience with Python, Hadoop, MapReduce, JSON and Hive -- literally the stuff of legends.

Business managers don't want to worry about how to take advantage of YARN. They don't want to learn the meaning of new terms like Hive, Spark, data reservoirs and data lakes, all of which now populate the tech discourse. They don't want to have to ask IT to write a query or merge data sets. In short, they don't want another system with lots of moving parts. They want a simple tool they can use to get answers, as quickly and painlessly as possible.

At the end of the day, Hadoop 2.0 remains a framework for programmers. As necessary as low-level technologies are, it's time we shift our attention -- and ink -- to the end users, for whom low-level technologies are as interesting as the wiring inside office walls. Hadoop may ultimately enable a renaissance in the user experience, but it hasn't so far. After years of hype, you can't blame business users who sometimes feel that Hadoop is a white elephant.

Let's focus on business user-oriented software -- software that allows people to easily access and analyze unlimited amounts of data from various sources, on their own and without the overhead of the traditional stack. Only then will business users see the full value of their data.

Sandy Steier is chief executive and a co-founder of 1010data. With more than a quarter century of industry experience, Sandy is recognized as an innovator behind the adoption of advanced analytic technologies for big data. Before co-founding 1010data, Sandy was a vice president and manager of research and technology at UBS North America.
