The small print: You're going to have to clean up all that big data

In a survey last year of more than 2,000 IT leaders in the U.S. and Canada, 60 percent believed their organizations lacked accountability for data quality, and more than 50 percent questioned the validity of their data.

Recent reports have also uncovered that much of the data collected by the U.S. Department of Education is riddled with errors and missing information.

As the number and variety of raw data sources grow exponentially, data-quality issues can wreak havoc on an organization if the data isn't vetted at every point of the analytics workflow, from ingest to final visualization.

Danger: Look out for dirty data

For example, consider the problems bad data can cause for a typical retailer.

Plenty of data problems can crop up as the retailer gathers information, such as missing product IDs or inaccurate product descriptions. When the product data isn’t standardized, different systems will contain inconsistent information, leading to problems with the retailer’s inventory, fulfillment and logistics.
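To make the retailer example concrete, here is a minimal sketch of how inconsistencies between two systems might be surfaced. The catalog names, product IDs and descriptions are hypothetical, and the normalization rule (collapse whitespace, ignore case) is an assumption rather than anything the retailer is known to use.

```python
def find_inconsistencies(system_a, system_b):
    """Compare two product catalogs, each a dict of product ID -> description.

    Reports IDs present in only one system and IDs whose descriptions
    still differ after simple normalization.
    """
    def normalize(text):
        # Collapse runs of whitespace and ignore case (an assumed rule).
        return " ".join(text.split()).lower()

    ids_a, ids_b = set(system_a), set(system_b)
    return {
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
        "mismatched": sorted(
            pid for pid in ids_a & ids_b
            if normalize(system_a[pid]) != normalize(system_b[pid])
        ),
    }


# Hypothetical records from an inventory system and a fulfillment system.
inventory = {"A100": "Blue Mug 12oz", "A101": "Red mug", "A102": "Green mug"}
fulfillment = {"A100": "blue mug  12oz", "A101": "Red cup", "A103": "Tea pot"}

result = find_inconsistencies(inventory, fulfillment)
print(result)
```

A report like this only flags candidates for review; deciding which system holds the correct value is still a business question.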

Inconsistent data about product inventory can lead to overproduction -- resulting in write-downs -- or underproduction, which can cause late deliveries and out-of-stock notices. Bad distribution data can lead to duplicate shipping orders, returns and reshipments. These basic data issues translate to significant wasted time and money for the company.

Because of such risks, organizations need to be smart about how they’re approaching data from the very beginning of the process and each time new data is added.

While companies have been able to monitor the quality of small data sets for some time now, the increasing size and scope of the data organizations deal with on a daily basis has made this task much more complicated. This is where new big data analytics technologies that enable data profiling at every step of the analytics cycle become critical, helping organizations pick out anomalies in enormous data sets from the get-go. That helps them avoid wasting resources on bad data, and it frees up time for businesses to discover additional analytics use cases.
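Data profiling of the kind described above can start very simply: count what is missing and how many distinct values each field holds. The sketch below assumes records arrive as plain dictionaries; the field names and sample rows are hypothetical, and real profiling tools report far more than this.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumed definition)."""
    return value is None or str(value).strip() == ""


def profile(records, fields):
    """Return row, missing and distinct-value counts for each field."""
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if not is_missing(v)]
        report[field] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical product records with a blank ID and a missing description.
products = [
    {"product_id": "A100", "description": "Blue mug"},
    {"product_id": "", "description": "Blue mug"},
    {"product_id": "A101", "description": None},
]

report = profile(products, ["product_id", "description"])
for field, stats in report.items():
    print(field, stats)
```

Running the same profile at ingest and again after each transformation step makes it easier to see where missing or unexpected values first appear.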

Gauge data quality first

There are plenty of instances of companies using technology to measure the quality of their data early on, ultimately saving resources and reducing problems down the road.

One bank uses a self-service big data analytics tool to identify loans that have high risk and quantify risk exposure. The bank’s analysis identified loans made to borrowers whose credit scores fell below the normal range for the borrower’s zip code (credit scores often correlate closely with zip codes, with more affluent areas tending to have higher-than-average scores). This helped the bank highlight risky loans and better track its loan portfolios’ overall exposure to defaults, which amounted to over $13 million.
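The bank's method is not described in detail, so the following is only a sketch of the general idea. It assumes "below the normal range" means a credit score more than 1.5 standard deviations below the average for the borrower's zip code; that threshold, the field names and the sample loans are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_risky_loans(loans, k=1.5):
    """Flag loans whose credit score is more than k standard deviations
    below the mean score for the same zip code (an assumed rule)."""
    scores_by_zip = defaultdict(list)
    for loan in loans:
        scores_by_zip[loan["zip"]].append(loan["score"])

    # Lower bound of the "normal range" for each zip code.
    thresholds = {
        zip_code: mean(scores) - k * pstdev(scores)
        for zip_code, scores in scores_by_zip.items()
    }
    return [loan for loan in loans if loan["score"] < thresholds[loan["zip"]]]


# Hypothetical loan records.
loans = [
    {"id": "L1", "zip": "94000", "score": 720, "balance": 200_000},
    {"id": "L2", "zip": "94000", "score": 740, "balance": 150_000},
    {"id": "L3", "zip": "94000", "score": 730, "balance": 180_000},
    {"id": "L4", "zip": "94000", "score": 735, "balance": 220_000},
    {"id": "L5", "zip": "94000", "score": 560, "balance": 90_000},
    {"id": "L6", "zip": "10000", "score": 648, "balance": 60_000},
    {"id": "L7", "zip": "10000", "score": 650, "balance": 75_000},
    {"id": "L8", "zip": "10000", "score": 660, "balance": 50_000},
    {"id": "L9", "zip": "10000", "score": 655, "balance": 40_000},
]

risky = flag_risky_loans(loans)
exposure = sum(loan["balance"] for loan in risky)
print([loan["id"] for loan in risky], exposure)
```

Because a single extreme score inflates the standard deviation for its own zip code, a production version would likely use a more robust measure, such as percentiles computed over a much larger sample.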

A telecommunications company took an entirely different approach to data quality analysis to more accurately plan its spending on new infrastructure. The company analyzed its customer information to find incorrect subscriber data (invalid email addresses, for example) that skewed results on usage in different areas. By correctly correlating subscriber information with network performance data, the company was able to keep up with existing and forecasted demand. And by knowing exactly what infrastructure it needed, the company said, it was able to avoid wasting an estimated $140 million on unnecessary capital expenditures.
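As a rough illustration of the telecom example, the sketch below sets aside subscriber records with obviously invalid email addresses before totaling usage by area. The pattern is a simple sanity check, not full email validation, and the field names, sample data and the choice to exclude rather than repair bad records are assumptions.

```python
import re
from collections import defaultdict

# Loose sanity check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def split_by_email_validity(subscribers):
    """Separate records into (valid, invalid) by email format."""
    valid, invalid = [], []
    for record in subscribers:
        email = record.get("email") or ""
        target = valid if EMAIL_PATTERN.fullmatch(email) else invalid
        target.append(record)
    return valid, invalid


def usage_by_area(subscribers):
    """Total usage in gigabytes for each area."""
    totals = defaultdict(float)
    for record in subscribers:
        totals[record["area"]] += record["usage_gb"]
    return dict(totals)


# Hypothetical subscriber records.
subscribers = [
    {"email": "ann@example.com", "area": "north", "usage_gb": 10.0},
    {"email": "bob@example", "area": "north", "usage_gb": 50.0},
    {"email": "", "area": "south", "usage_gb": 5.0},
    {"email": "cy@example.org", "area": "south", "usage_gb": 7.5},
]

valid, invalid = split_by_email_validity(subscribers)
raw_totals = usage_by_area(subscribers)
clean_totals = usage_by_area(valid)
print(raw_totals, clean_totals, len(invalid))
```

Comparing the raw and cleaned totals shows how much a handful of suspect records can shift the picture of demand in a given area.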

Data quality has become an important, if sometimes overlooked, piece of the big data equation. Until companies rethink their analytics workflow and ensure that data quality is considered at every step of the process -- from integration all the way through to the final visualization -- the benefits of big data will only be partly realized.

Stefan Groschupf is the chief executive of Datameer. Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a $20 billion business.
