Imagine, for a moment, that you lead a customer success operations team that is responsible for compiling a weekly report for the CEO outlining data on customer churn and analytics.
Over and over, you deliver the report only to be notified minutes later about problems with the data. It doesn’t matter how strong the ETL pipelines are or how many times the team reviews the SQL queries: the data is just not reliable. This puts you in the awkward position of repeatedly going back to leadership to tell them that the information you just provided was wrong. These interactions erode the CEO’s trust not only in the data but also in the conclusions you draw from it. Something has to change.
In today’s business landscape, many companies manage petabytes of data. This is a larger volume than most humans can comprehend — let alone manage — without a methodology for thinking about dataset health.
Observability is a familiar concept
So how do you think about managing the health of such large datasets? Think about a car. A car is a complex system, and the actions you would take to deal with a flat tire are different from those for engine trouble. Fortunately, you don’t need to inspect the entire vehicle every time it breaks down. Instead, you rely on tire pressure or check-engine lights to warn you, usually in advance of serious consequences, not only that an issue exists but also which part of the car is affected. This kind of automatic surfacing of problems is called observability.
In software engineering, this concept exists up and down the stack. In DevOps, for example, an alert and an easily consumable dashboard give the engineer a head start on fixing a problem. Companies like New Relic, Datadog, and Dynatrace help software engineers quickly get to the root of issues in complex software systems. This is infrastructure observability. Up the stack, in the AI and machine learning model layer, other companies give machine learning engineers visibility into how their production models perform in ever-changing environments. This is machine learning observability.
In short, what infrastructure observability does for software, and what machine learning observability does for machine learning models, data observability does for dataset health. These disciplines work in concert, and often you need to rely on more than one of them to solve a problem.
What is data observability?
Data observability is the discipline of automatically surfacing the health of your data and repairing any issues as quickly as possible.
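To make "automatically surfacing the health of your data" concrete, here is a minimal sketch in Python. It is an illustration only, not how any vendor named in this article works: the function name, field names, and thresholds are all assumptions chosen for the example. It runs three common kinds of checks (volume, completeness, and freshness) over a batch of records and returns a list of issues.

```python
from datetime import datetime, timedelta


def check_dataset_health(rows, now,
                         required_field="customer_id",   # hypothetical column
                         timestamp_field="updated_at",   # hypothetical column
                         min_rows=1,
                         max_null_rate=0.05,
                         max_age=timedelta(days=7)):
    """Return a list of human-readable issues; an empty list means healthy."""
    issues = []

    # Volume: did at least the expected number of rows arrive?
    if not rows or len(rows) < min_rows:
        issues.append(f"volume: {len(rows)} rows, expected at least {min_rows}")
        return issues  # the remaining checks need at least one row

    # Completeness: how often is the required field missing?
    nulls = sum(1 for row in rows if row.get(required_field) is None)
    null_rate = nulls / len(rows)
    if null_rate > max_null_rate:
        issues.append(
            f"completeness: {null_rate:.0%} of {required_field} values are null"
        )

    # Freshness: how old is the most recent record?
    timestamps = [row[timestamp_field] for row in rows
                  if row.get(timestamp_field) is not None]
    if not timestamps:
        issues.append(f"freshness: no {timestamp_field} values found")
    else:
        age = now - max(timestamps)
        if age > max_age:
            issues.append(f"freshness: newest record is {age.days} days old")

    return issues
```

In the weekly churn report scenario above, a check like this could run before the report goes out, so that a stale or incomplete table is flagged to the team rather than discovered by the CEO.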
It is a fast-maturing area with major players like Monte Carlo and Bigeye as well as a coterie of upstarts like Acceldata, Databand, and Soda. The software infrastructure observability market, which is more mature than the data observability market, was estimated to be worth over $5 billion in 2020 and has likely grown significantly since. While the data observability market is not as well-developed at this point, it has plenty of room to grow since it caters to different personas (data engineers versus software engineers) and solves different problems (datasets versus web applications). In all, companies focused on data observability have collectively raised over $250 million to date.
Why enterprises need to care
Today, every company is a data company. This can take many forms: a technology company collecting user data to better recommend content, a manufacturing company maintaining large internal datasets on safety systems, or a finance company making major investment decisions based on data from third-party providers. Today’s technology trends, from digital transformation to the shift to cloud compute and data storage, only serve to amplify the influence of data.
Given organizations’ heavy reliance on data, any problems with that data can permeate deep into the enterprise, impacting customer service, marketing, operations, sales, and ultimately revenue. When data powers automated systems or mission-critical decisions, the stakes can multiply.
If data is the new oil, it is critical to monitor and maintain the integrity of this precious resource. Just as most of us would not tape over the check-engine light, businesses that rely heavily on data, infrastructure, and AI need to pay attention to data observability alongside infrastructure and AI observability.
As datasets become bigger and data systems become more complex, data observability will be a critical tool for realizing maximum business value and sustainability.
Aparna Dhinakaran is Cofounder and CPO at machine learning observability provider Arize AI. She was recently named to the 2022 Forbes 30 Under 30 in Enterprise Technology and is a member of the Cognitive World think tank on enterprise AI.