Two days at VentureBeat’s DataBeat conference this week have convinced me that “big data” is real and potentially transformative — but, like “the cloud” before it, the term is badly overused.
In fact, the ease with which the marketing term gets slung around obscures the complexity and awesomeness of the real work going on in the field.
With the cloud, a simple term — once used by engineers to refer to “stuff on the Internet that is too complicated to explain at this exact moment” — quickly became shorthand for “servers you don’t have to worry about.”
But the simplicity of “cloud” as a sales proposition masked what is, in fact, a major architectural shift in the way companies think about and organize their data centers. That shift is profound, and was brought about by the spread of open Internet standards and open-source software, the wide variation in client devices and browsers, and the potential for massive and unpredictable spikes in traffic. All of these changes have led companies to build enterprise technologies (including their server farms and the applications that run on them) in a way that’s very different from the way they did 10 years ago.
Big data, similarly, is an attractive term because it’s shorthand for “data analysis.” And who doesn’t want data, the bigger the better? The phrase has become so popular that the editors of Merriam-Webster added it to their dictionary just this week. (That dictionary, by the way, defines big data as “an accumulation of data that is too large and complex for processing by traditional database management tools.”)
But in reality, a few significant shifts are going on behind that term.
The increasingly large quantity of data. Companies have access to, and the ability to collect, far more data than ever before. Sometimes that means tracking every potential customer’s every click across your own website as well as, in some cases, other sites. Sometimes that means understanding how current customers are actually using your product, day in and day out. It could mean collecting data on how people move through a city in order to facilitate better urban planning. Sensors can pick up information and send data into databases at a dizzying rate. All of that is putting a strain on traditional database technologies.
The lack of structure in much of this data. It used to be easy to tell what data was: It was the stuff that you could pigeonhole into specific database fields, like name, address line 1, address line 2, and so on, and then query with SQL statements. Now we have lots of data like this, but we also have enormous amounts of unstructured data: video and audio files, huge amounts of social networking texts, emails, the transcripts of customer support calls, and more. How do you manage data if you don’t even know how to categorize it, or what buckets to put it in? Emerging machine learning technologies, like IBM’s Watson, are one approach for handling such a mess of data as it comes in on the fly.
A shift in the underlying storage technologies. Many companies are starting to move away from data warehouses, storage area networks, and other network storage technologies and toward more distributed, clustered, scalable storage. Hadoop is the poster child of this shift, but it is not the only one. Besides, as it turns out, Hadoop itself has some significant limitations. It can be extremely slow to run jobs in Hadoop, for instance. And it needs better security capabilities.
The ability to get useful information out of this data easily. With the right tools, ordinary, non-data-scientist types have the ability to get meaningful answers out of huge quantities of data. Increasingly, they also have the desire to do this. Most people don’t want to have to learn SQL. They want to look at pretty charts that show them how their business is doing, right now. They want the ability to look at different facets of their data or drill down into details so they can figure out how to make the business run better. This has always been the promise of business intelligence (BI) software, though BI initiatives have a reputation for getting bogged down in long, drawn-out, incredibly expensive projects that deliver less than promised. Maybe today’s visualization and data integration tools will achieve what last decade’s BI tools could not.
Now, not all of this data collection and analysis will be good. The flip side of massive data collection is the potential loss of privacy. Do people really want companies tracking and analyzing their every click and their every movement through the world? It’s clear we need protections to ensure that this kind of data isn’t abused, and that people have a strong, well-defended right to opt out of tracking. This is especially true for kids.
Another potential shortcoming is that companies risk drowning in a sea of data. You can get paralyzed by an inability to make decisions without relying on A/B tests, detailed market analysis charts, or fancy dashboards. Sometimes data is not the answer, and you just need to use your judgment.
Still, the benefits of data are clear, and that’s pushing a wide range of tech companies to come up with new tools for collecting, aggregating, analyzing, and presenting it. This week’s DataBeat conference was just a sampler. There’s a lot more to come.
I’m not sure if the term “big data” will fade away into meaninglessness or if it will become as ubiquitous as the term “cloud.” But I am sure of one thing: Behind the hype, there’s a technological revolution brewing.