What is dirty data? Sources, impact, key strategies

Enterprise data is critical to business success. Companies around the world understand this and leverage platforms such as Snowflake to make the most of information streaming in from various sources. However, more often than not, this data becomes "dirty." In essence, at any stage of the pipeline it can lose key attributes such as accuracy, accessibility and completeness (among others), making it unsuitable for the downstream use the organization originally intended.

“Some data can be objectively wrong. Data fields can be left blank, misspelled or inaccurate names, addresses, phone numbers can be provided and duplicate information…are some examples. However, whether that data can be classed as dirty very much depends on context.

For example, a missing or incorrect email address is not required to complete a retail store sale, but a marketing team who wishes to contact customers via email to send promotional information will classify that same data as dirty,” Jason Medd, research director at Gartner, told VentureBeat.

In addition, the untimely and inconsistent flow of information can add to the problem of dirty data within an organization. Inconsistency is particularly common when merging information from two or more systems that use different standards. For instance, if one system stores names as a single field while the other divides them into two, only one format will be considered valid, with the other requiring cleansing.
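
The name-field mismatch described above is usually resolved by mapping both layouts onto one canonical form before records are compared. A minimal Python sketch follows; the `Name` type and both helper functions are hypothetical, and splitting on the last space is a simplifying assumption that will not suit every naming convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Canonical two-part name (hypothetical target format)."""
    first: str
    last: str


def from_single_field(full_name: str) -> Name:
    """Convert a 'First Last' string from the single-field system.

    Assumption: the last whitespace-separated token is the family name.
    """
    cleaned = " ".join(full_name.split())  # collapse stray whitespace
    first, _, last = cleaned.rpartition(" ")
    if not first:
        # Only one token was supplied; keep it as the first name.
        return Name(first=last, last="")
    return Name(first=first, last=last)


def from_two_fields(first: str, last: str) -> Name:
    """Convert a record from the system that already splits the name."""
    return Name(first=" ".join(first.split()), last=" ".join(last.split()))
```

Once both sources produce the same `Name` value, records from either system can be compared or de-duplicated directly.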

Sources of dirty data

Overall, the entire issue boils down to five key sources:

People

As Medd explained, dirty data can occur due to human errors upon entry. This could be the outcome of shoddy work by the person entering the data, a lack of training or poorly defined roles and responsibilities. Many organizations do not even consider establishing a data-focused collaborative culture.

Processes

Process oversights can also lead to cases of dirty data. For instance, poorly defined data lifecycles could lead to the use of outdated information across systems (people change phone numbers and addresses over time). There could also be issues due to the lack of data quality firewalls for critical data capture points or the lack of clear cross-functional data processes.

Technology

Technology glitches such as programming errors or poorly maintained internal/external interfaces can affect data quality and consistency. Many organizations also fail to deploy data quality tools, or end up keeping multiple, differing copies of the same data due to system fragmentation.

Organization

Among other things, activities at the broader organization level, such as acquisitions and mergers, can also disrupt data practices. This issue is particularly common in large enterprises. Moreover, due to the complexity of such organizations, the heads of many functional areas could resort to keeping and managing data in silos.

Governance

Gaps in governance, which ensures authority and control over data assets, could be another reason for quality issues. Organizations that fail to set data entry standards or appoint data owners/stewards, or that put broken policies in place for the scale, pace and distribution of data, could end up with botched first- and third-party data.

“Data governance is the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption and control of data. It also defines a policy management framework to ensure data quality throughout the business value chains. Managing dirty data is not simply a technology problem. It requires the application and coordination of people, processes and technology. Data governance is a key pillar to not just identifying dirty data, but also for ensuring issues are remediated and monitored on an ongoing basis,” Medd added.

Enterprise-wide impact

Whatever the source, data quality issues can have a significant impact on downstream analytics, resulting in poor business decisions, inefficiencies, missed opportunities and reputational damage. There can also be smaller problems such as sending the same communication message multiple times to a customer whose name was recorded differently in the same system.

All this eventually translates into additional costs, attrition and bad customer experiences. In fact, Medd pointed out that poor data quality can cost organizations an average of $12.9 million every year. Stewart Bond, the director of data integration and intelligence research at IDC, shared the same opinion, noting that his organization’s recent data trust survey found that low levels of data quality and trust impact operational costs the most.

Key measures to tackle data quality challenges

In order to keep the data pipeline clean, organizations should set up a scalable and comprehensive data quality program covering the tactical data quality problems as well as strategic aspects of the alignment of resources and business objectives. This, as Medd explained, can be done by building a strong foundation bolstered by modern technology, metrics, processes, policies, roles and responsibilities.

“Organizations have typically solved data quality problems as point solutions in individual business units, where the problems are manifested most. This could be a good starting point for a data quality initiative. However, the solutions frequently focus on specific use cases and often overlook the broader business context, which may involve other business units. It’s critical for organizations to have scalable data quality programs so that they can build on their successes in experience and skills,” Medd said.

In a nutshell, a data quality program has to have five main layers:

Definition

As part of this, the organization has to define the broader goal of the program, detailing what data it plans to keep under the scanner, which business processes can lead to bad data (and how) and which departments can ultimately be impacted by that data. Based on this information, the organization can then define data rules and appoint data owners and stewards for accountability.

A good example is the case of customer records. An organization whose goal is to ensure unique and accurate customer records for use by marketing teams can set rules such as: each combination of name and address gathered from fresh orders should be unique, and addresses should be verified against an authorized database.
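
The two example rules above could be expressed as simple checks. The Python sketch below is illustrative only: the record layout (`name` and `address` keys), the helper names and the in-memory set standing in for an authorized address database are all assumptions, and the sample values are made up.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())


def record_key(record: dict) -> tuple:
    """Rule 1 key: name and address taken together."""
    return (normalize(record["name"]), normalize(record["address"]))


def find_duplicates(records: list) -> list:
    """Return every record whose name + address combination is not unique."""
    counts = Counter(record_key(r) for r in records)
    return [r for r in records if counts[record_key(r)] > 1]


def find_unverified(records: list, authorized_addresses: set) -> list:
    """Rule 2: return records whose address is absent from the reference set."""
    allowed = {normalize(a) for a in authorized_addresses}
    return [r for r in records if normalize(r["address"]) not in allowed]


# Made-up sample data for illustration.
orders = [
    {"name": "Ada Lovelace", "address": "1 Example St"},
    {"name": "ada lovelace", "address": "1  Example St"},
    {"name": "Alan Turing", "address": "2 Sample Rd"},
]
authorized = {"1 Example St"}

duplicates = find_duplicates(orders)
unverified = find_unverified(orders, authorized)
```

Here the first two orders are flagged as duplicates of each other, and the third is flagged because its address is not in the reference set. A real program would replace the in-memory set with a lookup against whichever address database the organization trusts.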

Assessment

Once the rules are defined, the organization has to use them to check new (at source) and existing data records for key quality attributes, starting from accuracy and completeness to consistency and timeliness. The process usually involves leveraging qualitative/quantitative tools, as most enterprises deal with a large variety and volume of information from different systems.
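
As one illustration of a quantitative check, completeness can be measured as the share of records in which a required field is filled in. The sketch below is written in Python under stated assumptions: the field names, the `is_filled` rule (a value counts as missing when it is `None` or blank) and the sample records are hypothetical.

```python
def is_filled(value) -> bool:
    """Treat None and blank strings as missing; anything else counts as filled."""
    return value is not None and str(value).strip() != ""


def completeness(records: list, fields: list) -> dict:
    """Return, per field, the fraction of records with a usable value."""
    total = len(records)
    scores = {}
    for field in fields:
        filled = sum(1 for r in records if is_filled(r.get(field)))
        scores[field] = filled / total if total else 0.0
    return scores


# Made-up sample data for illustration.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Alan Turing", "email": ""},
    {"name": "Grace Hopper", "email": "grace@example.com"},
    {"name": "Edsger Dijkstra", "email": "edsger@example.com"},
]

scores = completeness(customers, ["name", "email"])
```

A score below an agreed threshold for a field such as `email` would mark it for the analysis stage. Accuracy, consistency and timeliness need their own checks and are not covered by this sketch.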

“There are many data quality solutions available in the market, that range from domain-specific (customers, addresses, products, locations, etc.) to software that finds bad data based on the rules that define what good data is. There is also an emerging set of software vendors that are using data science and machine learning techniques to find anomalies in data as possible data quality issues. The first line of defense though is having data standards in place for data entry,” IDC’s Bond told VentureBeat.

Analysis

Following the assessment, the results have to be analyzed. At this stage, the team responsible for the data has to understand the quality gaps (if any) and determine the root cause of the problems (faulty entry, duplication or anything else). This shows how far off the current data is from the original goal targeted by the organization and what needs to be done moving ahead.

Cleanup

With the root cause in sight, the organization has to develop and implement plans for solving the problem at hand. This should include steps to correct the issue as well as policy, technology or process-related changes to make sure that the problem doesn’t occur again. Note that these steps should be planned with resources and costs in mind, and some changes might take longer to implement than others.

Control

Finally, the organization has to ensure that the changes remain in effect and the data quality is in line with the data rules. The information around the current standards and status of the data should be promoted across the organization, cultivating a collaborative culture to ensure data quality on an ongoing basis.