Data downtime almost doubles as professionals struggle with quality issues, survey finds

Data is critical to every business, but as the volume of information and the complexity of pipelines grow, things are bound to break.

According to a new survey of 200 data professionals working in the U.S., instances of data downtime (periods when enterprise data is missing, inaccurate or inaccessible) have nearly doubled year over year, driven by a surge in the number of quality incidents and in the time teams spend firefighting them.

The poll, commissioned by data observability company Monte Carlo and conducted by Wakefield Research in March 2023, highlights a critical gap that needs to be addressed as organizations race to pull in as many data assets as they can to build downstream AI and analytics applications for business-critical functions and decision-making.

“More data plus more complexity equals more opportunities for data to break. A higher proportion of data incidents are also being caught as data is becoming more integral to the revenue-generating operations of organizations. This means business users and data consumers are more likely to catch incidents that data teams miss,” Lior Gavish, cofounder and CTO of Monte Carlo, told VentureBeat.

The drivers of data downtime

At the core, the survey attributes the rise in data downtime to three key factors: a growing number of incidents, more time being taken to detect them, and more time being taken to resolve the problems.

Of the 200 respondents, 51% said they witness somewhere between one and 20 data incidents in a typical month; 20% reported 20 to 99 incidents; and 27% said they see at least 100 data incidents every month. This is consistently higher than the figures from last year, with the average number of monthly incidents witnessed by an organization growing to 67 this year from 59 in 2022.

As instances of bad data continue to increase, teams are also taking more time to find and fix the issues. Last year, 62% of the respondents said they typically took four hours or more on average to detect a data incident, while this year that percentage has gone up to 68%.

Similarly, to resolve the incidents after discovery, 63% said they typically take four hours or more — up from 47% last year. Here, the average time to resolution for a data incident has gone from nine hours to 15 hours year over year.
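The survey's averages can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that monthly downtime is approximately the number of incidents multiplied by the hours each one stays open (detection plus resolution). That multiplicative model and the function name `monthly_downtime_hours` are illustrative assumptions, not the survey's stated methodology. The figures above give no average detection time, so it is left at zero.

```python
def monthly_downtime_hours(incidents, resolve_hours, detect_hours=0.0):
    """Rough estimate: incidents per month x hours each incident stays open."""
    return incidents * (detect_hours + resolve_hours)

# Averages reported in the survey; detection time is not given, so it stays 0.
downtime_2022 = monthly_downtime_hours(incidents=59, resolve_hours=9)
downtime_2023 = monthly_downtime_hours(incidents=67, resolve_hours=15)
growth = downtime_2023 / downtime_2022

print(f"2022: {downtime_2022:.0f} h, 2023: {downtime_2023:.0f} h, growth: {growth:.2f}x")
```

Under these assumptions the estimate grows from 531 to 1,005 hours a month, roughly 1.9 times, which is in line with the "nearly doubled" finding.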

Manual approaches are to blame, not engineers

While it's easy to blame data engineers for failing to ensure quality and taking too long to fix things, it's important to understand that the problem is not talent but the task at hand. As Gavish noted, engineers are dealing not only with large quantities of fast-moving data but also with constantly changing approaches to how it's emitted by sources and consumed by the organization, which cannot always be controlled.

“The most common mistake teams are making in that regard is relying exclusively on manual, static data tests. It’s the wrong tool for the job. That type of approach requires your team to anticipate and write a test for all the ways data can go bad in each dataset, which takes a ton of time and doesn’t help with resolution,” he explained.

Instead of these tests, the CTO said, teams should look at automating data quality by deploying machine learning monitors to detect data freshness, volume, schema and distribution issues wherever they happen in the pipeline.

This can give enterprise data analysts a holistic view of data reliability for critical business and data product use cases in near-real time. Plus, when something goes wrong, the monitors can send alerts, allowing teams to address the issue quickly, well before it has a significant impact on the business.
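To make the four monitor types concrete, here is a minimal sketch of what checks for freshness, volume, schema and distribution might look like. It deliberately uses fixed thresholds and a simple z-score in place of machine learning, and every function name, threshold and sample value is hypothetical rather than taken from any vendor's product.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def check_freshness(last_loaded_at, now, max_age=timedelta(hours=6)):
    """Pass if the latest load is no older than the allowed age."""
    return (now - last_loaded_at) <= max_age

def check_volume(history, today_rows, z_threshold=3.0):
    """Pass if today's row count is within z_threshold std devs of recent counts."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today_rows == mu
    return abs(today_rows - mu) / sigma <= z_threshold

def check_schema(expected_columns, actual_columns):
    """Pass if no columns were added or dropped."""
    return set(expected_columns) == set(actual_columns)

def check_distribution(history_null_rates, today_null_rate, tolerance=0.05):
    """Pass if today's null rate stays within a tolerance of its recent mean."""
    return abs(today_null_rate - mean(history_null_rates)) <= tolerance

def run_monitors(checks):
    """Return the names of failed checks; a real system would send alerts here."""
    return [name for name, passed in checks.items() if not passed]

# Hypothetical table state, invented for illustration.
now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
failed = run_monitors({
    "freshness": check_freshness(now - timedelta(hours=9), now),
    "volume": check_volume([1000, 1020, 980, 1010, 990], today_rows=400),
    "schema": check_schema(["id", "amount", "ts"], ["id", "amount", "ts"]),
    "distribution": check_distribution([0.01, 0.02, 0.01], today_null_rate=0.02),
})
print(failed)
```

In this example the table was last loaded nine hours ago and today's row count is far below its recent range, so the freshness and volume checks fail while schema and distribution pass. A learned monitor would replace the hard-coded thresholds with limits inferred from each dataset's history.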

Sticking to basics remains important

In addition to ML-driven monitors, teams should stick to certain basics to avoid data downtime, starting with focus and prioritization.

“Data generally follows the Pareto principle: 20% of datasets provide 80% of the business value and 20% of those datasets (not necessarily the same ones) are causing 80% of your data quality issues. Make sure you can identify those high-value and problematic datasets and be aware of when they change over time,” Gavish said.
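One way to find the problematic datasets Gavish describes is to rank them by incident count and keep the smallest group that accounts for 80% of the total. The sketch below assumes incident counts per dataset are already available. The helper `pareto_head` and the dataset names and counts are invented for illustration.

```python
def pareto_head(counts, share=0.8):
    """Return the smallest set of top items whose counts reach `share` of the total."""
    total = sum(counts.values())
    head, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        head.append(name)
        running += count
    return head

# Hypothetical monthly incident counts per dataset.
incidents = {
    "orders": 50, "events": 34, "users": 4, "billing": 3, "inventory": 2,
    "sessions": 2, "refunds": 2, "emails": 1, "geo": 1, "flags": 1,
}
print(pareto_head(incidents))
```

With these made-up numbers, two of the ten datasets account for 84 of 100 incidents. Running the same ranking on a measure of business value would produce the high-value list, which, as the quote notes, need not overlap with the problematic one.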

Further, tactics like creating data SLAs (service level agreements), establishing clear lines of ownership, writing documentation and conducting post-mortems can also come in handy, he added.

Currently, Monte Carlo and Bigeye are major players in the fast-maturing AI-driven data observability space. Other players in the category include upstarts such as Databand, Datafold, Validio, Soda and Acceldata.

That said, teams don't necessarily need to bring in a third-party ML observability solution to ensure quality and reduce data downtime. They can also choose to build in-house if they have the required time and resources. According to the Monte Carlo-Wakefield survey, it takes an average of 112 hours (about two weeks) to develop such a tool in-house.

While the market for specific data observability tools is still developing, Future Market Insights’ research suggests that the broader observability platform market is expected to grow from $2.17 billion in 2022 to $5.55 billion by 2032, with a CAGR of 8.2%.
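For readers who want to check growth figures like these, the standard compound annual growth rate formula is (end / start) ^ (1 / years) - 1. The sketch below applies it to the two market sizes quoted above, assuming a ten-year span from 2022 to 2032. The function names are illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1

def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years

# Market sizes in billions of dollars, as quoted; the 10-year span is an assumption.
implied = cagr(2.17, 5.55, years=10)
projected = project(2.17, 0.082, years=10)
print(f"implied CAGR: {implied:.1%}, 2032 value at 8.2%: ${projected:.2f}B")
```

Taken at face value over ten years, the two endpoints imply a rate closer to 10% than 8.2%, which suggests the quoted CAGR may be measured from a different base year or over a different period.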