5 big data implementation mistakes to avoid

In recent years, few terms have been as overused and misunderstood as "big data." From predicting massive flu outbreaks with Google Flu Trends, to tracking shopping trends and directing savings to customers, to making real-time trading decisions that affect the bottom line of companies and individuals, data has become the key to staying competitive in today’s global economy. To understand what the industry means by big data, and why it has gotten so much attention, we need to break down the aspects of the database industry that have led to some of the challenges we face when managing and analyzing data today.

For the purposes of this article, I will define big data as well as I can from the perspective of an executive who helps companies recognize what it means to them. Big data is simply the current generation of database management requirements, along with the technology needed to meet that demand in the database marketplace. In alignment with Gartner and others, it is commonly characterized by the attributes that make it different: volume, variety, velocity, and complexity.

This data includes complex text, large video and audio files, real-time feeds, and ever-changing business processes that require flexible data schemas from various sources. Problems arose when technologists realized that legacy systems and traditional relational database management system (RDBMS) solutions weren’t capable of handling or processing these types of data in a way that drove toward real business outcomes. It wasn’t just about storing the information anymore. Technologists and business leaders needed to make better use of available data: to access, manage, and use it in real time. To meet the new demands, new players arrived on the scene, seemingly solving some of the challenges that come with the incessant growth of data but creating new problems of their own.

So, which mistakes do we see most when organizations try to implement plans to use their big data and fail? A recent survey indicated that more than 75 percent of big data/IT projects in the broader industry were incomplete. Clearly, there are still challenges and obstacles standing in the way of the most effective solutions to tapping into our big data and making it work for us.

Let’s break down a few.

1. You aren’t doing enough with the data

Perhaps the most obvious reason for any organization to take on the challenge of big data is the ability to remain competitive by using available data to drive business intelligence that supports decision-making.

If an online publisher has a better understanding of when and why readers are clicking on the content and engaging longer, it can customize content for the current and future visitor demand. Driving value from existing data is one of the most common challenges faced in industry. While many technologies can help meet these challenges, most database technology lacks the ability to quickly and easily do so without a tremendous amount of data transformation, making the goal of accurate business intelligence that much more difficult to reach.

Most database technologies require some sort of data definition or a schema that can slow projects down if some requirements aren't known at the start with respect to data needs. This, by the way, describes almost every IT project I have had the pleasure of working on in the past 15 years.

NoSQL databases solve this problem very effectively. NoSQL databases can be (and often are) implemented so that a schema is unnecessary, or less necessary. This is a primary value proposition for NoSQL databases and a key driver of growing popularity among the players in the NoSQL market.
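To make the schema point concrete, here is a minimal Python sketch. It uses a plain in-memory list of dictionaries as a stand-in for a document store; the `insert` and `find` helpers and the sample records are hypothetical and do not reflect any particular NoSQL product's API. The point is only that records with different fields can be stored and queried without a table definition agreed on up front.

```python
import json

# Hypothetical in-memory "collection" standing in for a document store.
collection = []

def insert(doc):
    """Store a document as-is; no table definition is required up front."""
    # A JSON round trip gives us an independent copy of the document.
    collection.append(json.loads(json.dumps(doc)))

def find(field, value):
    """Return documents that have the field set to the given value."""
    return [d for d in collection if d.get(field) == value]

# Three records with three different shapes.
insert({"type": "article", "title": "Big data mistakes", "clicks": 120})
insert({"type": "article", "title": "LDW primer", "clicks": 45, "tags": ["warehouse"]})
insert({"type": "video", "title": "Intro", "duration_sec": 300})

articles = find("type", "article")
tagged = [d for d in collection if "tags" in d]
```

A new field such as `tags` appears on one record without any migration of the others. In a relational design, the same change would typically start with an `ALTER TABLE` or a new join table.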

Complex data modeling, middle-tier object mapping, and iterative rework, all of which are associated with the older RDBMS model, have opened the door for this "new" way of managing big data.

2. You’ve bet the company on free software

Through the hype cycle of the past few years, every organization has come to think it must deploy the latest and greatest solution, like Apache Hadoop or Pig, while feeling that traditional RDBMS solutions are obsolete or outdated. While it is true that relational databases are inherently incapable of addressing the needs met by NoSQL databases, a growing number of failures in the open source big data ecosystem have prevented the elephant from taking flight, and many have endured the cost of mongo-sized failure.

The free software movement has largely become a debunked myth shared mainly by inexperienced software developers holding vigil over their version of being the next greatest thing or the only one who can manage it. The industry has spent the past decade coming to grips with the physics of enterprise software (the unabridged version of “You’ll Never Get More Than What You Pay For” -- and don’t forget the sequel “If It Sounds Too Good to Be True...”).

The reality is that most open-source database software is not viable or realistic for solving the needs of the enterprise. Most open-source packages are built to appeal to the web developer for simple consumer-based applications. Those products typically don’t scale well, aren’t secure, and are known to lose data. Yes, they lose data because the transaction processor is not designed to verify each autonomous data write.

3. You’ve abandoned your expensive legacy data systems altogether

I believe data warehouses have a long future. This is not such a bold prediction, but what about the future of the RDBMS? Certainly we won’t see the end of the Oracle database anytime soon.

My data shows a growing trend toward the logical data warehouse (LDW): a warehouse that is really built on two or more physical databases integrated into a single access view. For the same reason that industry is adopting NoSQL for application development, it needs a new way to construct and host data warehouses. Using one RDBMS, it’s too hard to get it right the first time -- and it takes too long (and too much money) to do it iteratively.

An LDW uniquely consolidates the indexes and data from almost any data source and makes it possible to build a customized view enabling any client to perform transactions or analytical queries. While the RDBMS is becoming old school, the cost of abandoning an existing implementation could be too great. The LDW allows the enterprise to cut its losses with respect to the sunk cost of legacy systems and move on to a more efficient, versatile, and scalable database platform. An enterprise NoSQL database can be the integration point between an old RDBMS and a failing Hadoop project to deal with structured databases, document stores, files, and media. This has tremendous value to wayward IT shops that have struggled with the wrong software in the past.
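The idea of a single access view over two or more physical databases can be illustrated with a toy Python sketch. It uses SQLite's `ATTACH DATABASE` purely as a stand-in: a real logical data warehouse federates separate database products, and the database names (`legacy`, `newstore`), the `orders` table, and the sample rows here are all hypothetical.

```python
import sqlite3

# One connection, two separate in-memory databases attached to it.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS legacy")
conn.execute("ATTACH DATABASE ':memory:' AS newstore")

# Each physical database keeps its own copy of an orders table.
conn.execute("CREATE TABLE legacy.orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("CREATE TABLE newstore.orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO legacy.orders VALUES (?, ?, ?)",
    [(1, "acme", 100.0), (2, "globex", 250.0)],
)
conn.executemany(
    "INSERT INTO newstore.orders VALUES (?, ?, ?)",
    [(3, "acme", 75.0)],
)

# A TEMP view may reference tables in any attached database, so it can
# act as the single logical view over both stores.
conn.execute("""
    CREATE TEMP VIEW all_orders AS
    SELECT id, customer, total, 'legacy' AS source FROM legacy.orders
    UNION ALL
    SELECT id, customer, total, 'newstore' AS source FROM newstore.orders
""")

# Clients query the view without knowing where each row physically lives.
totals = dict(
    conn.execute("SELECT customer, SUM(total) FROM all_orders GROUP BY customer")
)
```

Here the legacy data stays where it is, and the analytical query runs against the combined view. That is the sense in which the sunk cost of the older system is preserved rather than abandoned.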

4. You don’t know your data

As with any industry, an evolution can quickly create a knowledge gap, in which our understanding of the challenges and solutions hasn’t caught up with those faced by any specific organization.

Some believe big data has created the need for new roles. Most recently, I’ve seen the emergence of the chief data officer (CDO) and the data scientist. Many have scoffed at the need for, or the cost of, bringing new experts into organizations, but organizations without appropriate expertise struggle to understand their own data, what it all means, and the best way to use it. According to Gartner, 25 percent of all large global organizations will have appointed a CDO by 2015.

But quite frankly, you don’t need a data scientist. You need better software.

5. You’ve bitten off more than you can chew

Perhaps one of the easiest mistakes to avoid in your foray into big data is simply taking on too much. Most of the time, this happens for technology reasons. Tackling the whole of an enterprise from a big data perspective is nearly impossible, so why not start with low-hanging fruit and grow the project quickly with successes? Using flexible technology, like enterprise NoSQL, iterative warehouse development can happen quickly with little to no rework and even lower upfront engineering costs.

At a time when companies succeed based on the ability to move quickly and decisively with all available data, pressure is high to increase each corporation’s competitive advantage. Too many organizations take on more than they can handle successfully. There is a mistaken notion that all big data issues have to somehow be solved together, like one big monolithic problem requiring a single monolithic solution. Leading with the end game in mind, IT managers and chief information officers should be asking what business decision they're trying to affect, rather than how to integrate new technology into existing technology. Asking the right questions can mean the difference between success and failure for any data project.

Starting small and scaling fast, once teams are comfortable with the solutions and associated patterns being put forth, will help keep future projects on budget, reach completion in a timely manner, and, most importantly, yield the desired results.

Whether one is dealing with financial data, health care-specific information, shopping analytics, published work, or government intelligence, the only consistencies in data are its ever-changing complexity and variety, as well as its increasing volume and demand. To deal with the massive and continued influx of data in a way that drives business value, organizations need to understand the reasons so many big data projects fail, so those failings can be avoided. Knowing what not to do is just as important as knowing what to do. With this knowledge, organizations can quickly achieve their near- and long-term objectives.

Jon Bakke is executive vice president of worldwide field operations at NoSQL database company MarkLogic.