The year is 1999 and the internet has begun to hit its stride. Near the top of the list of its most trafficked sites, eBay suffers an outage — considered to be the first high-profile instance of downtime in the history of the world wide web as we know it today.
At the time, CNN described eBay’s response to the outage this way: “The company said on its site that its technical staff continues to work on the problem and that the ‘entire process may still take a few hours yet.’”
It almost sounds like a few folks in a server room pushing buttons until the site comes back online, doesn’t it?
Now, nearly 25 years later, in a far more complex digital landscape where software powers business at the highest of stakes, companies rely on software engineering teams to track, resolve and, most importantly, prevent downtime issues. They do this by investing heavily in observability solutions like Datadog, New Relic, AppDynamics and others.
Why? Beyond the engineering resources it takes to respond to a downtime incident, and the trust lost among the company’s customers and stakeholders, the financial impact of an outage can be catastrophic.
Preventing data downtime
As we turn the page on another year in this massive digital evolution, the world of data analytics is primed for a similar journey. Just as tackling application downtime became the job of large teams of software engineers armed with application observability solutions, so too will it be the job of data teams to track, resolve and prevent instances of data downtime.
Data downtime refers to periods of time when data is missing, inaccurate or otherwise “bad.” It can cost companies millions of dollars per year in lost productivity, wasted staff hours and eroded customer trust.
While there are plenty of commonalities between application observability and data observability, there are clear differences, too — including use cases, personas and other key nuances. Let’s dive in.
What is application observability?
Application observability refers to the end-to-end understanding of application health across a software environment to prevent application downtime.
Application observability use cases
Common use cases include detection, alerting, incident management, root cause analysis, impact analysis and resolution of application downtime. In other words, these are measures taken to improve the reliability of software applications over time and to make resolving software performance issues easier and more streamlined when they arise.
The key personas leveraging and building application observability solutions include software engineer, infrastructure administrator, observability engineer, site reliability engineer and DevOps engineer.
Companies with lean teams or relatively simple software environments will often employ one or a few software engineers whose responsibility it is to obtain and operate an application observability solution. As companies grow, both in team size and in application complexity, observability is often delegated to more specialized roles like observability managers, site reliability engineers or application product managers.
Application observability responsibilities
Application observability solutions monitor across three key pillars:
- Metrics: Numeric representations of data measured over intervals of time. Mathematical modeling and prediction can be applied to metrics to understand how a system behaves over time, both in the present and in the future.
- Traces: A representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system. Traces are a representation of logs; the data structure of traces looks almost like that of an event log.
- Logs: An immutable, timestamped record of discrete events that happened over time.
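To make the three pillars concrete, here is a minimal Python sketch of how each might be represented in memory. All class, field and function names are hypothetical illustrations, not the data model of any particular observability product, and `request_path` assumes each step of a request has at most one child.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """Log: an immutable, timestamped record of one discrete event."""
    timestamp: datetime
    message: str


@dataclass(frozen=True)
class Span:
    """Trace step: parent_id links causally related events of one request."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


@dataclass
class Metric:
    """Metric: a numeric series sampled over intervals of time."""
    name: str
    samples: list = field(default_factory=list)  # (timestamp, value) pairs

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def mean(self):
        return sum(value for _, value in self.samples) / len(self.samples)


def request_path(spans, trace_id):
    """Rebuild the end-to-end flow of one request, root span first.

    Assumes a linear chain: every span has at most one child.
    """
    by_parent = {s.parent_id: s for s in spans if s.trace_id == trace_id}
    path, current = [], by_parent.get(None)
    while current is not None:
        path.append(current.name)
        current = by_parent.get(current.span_id)
    return path
```

Freezing `LogRecord` and `Span` mirrors the idea that recorded events are not edited after the fact, while `Metric` stays mutable because new samples keep arriving.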
High-quality application observability possesses the following characteristics that help companies ensure the health of their most critical applications:
- End-to-end coverage across applications (particularly important for microservice architectures).
- Fully automated, out-of-the-box integration with existing components of your tech stack — no manual inputs needed.
- Real-time data capture through metrics, traces and logs.
- Traceability/lineage to highlight relationships between dependencies and where issues occur for quick resolution.
What is data observability?
Like application observability, data observability tackles system reliability, but for a slightly different kind of system: analytical data.
Data observability is an organization’s ability to fully understand the health of the data in its systems. Tools use automated monitoring, automated root cause analysis, data lineage and data health insights to detect, resolve and prevent data anomalies. This leads to healthier pipelines, more productive teams and happier customers.
Common use cases for data observability include detection, alerting, incident management, root cause analysis, impact analysis and resolution of data downtime.
At the end of the day, data reliability is everyone’s problem, and data quality is a responsibility shared by multiple people on the data team. Smaller companies may have one or a few individuals who maintain data observability solutions; however, as companies grow in both size and the quantity of data they ingest, the following more specialized personas tend to become the tactical managers of data pipeline and system reliability.
- Data designer: Works closely with analysts to help them tell stories about that data through business intelligence visualizations or other frameworks. Data designers are more common in larger organizations and often come from product design backgrounds.
- Data product manager: Responsible for managing the life cycle of a given data product and is often in charge of managing cross-functional stakeholders, product road maps and other strategic tasks.
- Analytics engineer: Sits between a data engineer and analysts and is responsible for transforming and modeling the data such that stakeholders are empowered to trust and use that data.
- Data reliability engineer: Dedicated to building more resilient data stacks through data observability, testing and other common approaches.
Data observability solutions monitor across five key pillars:
- Freshness: Seeks to understand how up-to-date data tables are, as well as the cadence at which they are updated.
- Distribution: A function of the data’s possible values; it tells you whether the data falls within an accepted range.
- Volume: Refers to the completeness of data tables and offers insights on the health of data sources.
- Schema: Changes in the organization of your data often indicate broken data.
- Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it.
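As a rough illustration, each of the five pillars can be reduced to a small check. The Python sketch below is a simplification under stated assumptions: every function name, parameter and default threshold is hypothetical, thresholds are hand-set for readability, and lineage is modeled as a plain dictionary mapping each asset to the assets that read from it.

```python
from datetime import datetime, timedelta


def check_freshness(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: was the table updated recently enough?"""
    return now - last_updated <= max_age


def check_volume(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Volume: is the row count within a tolerance of the expected count?"""
    return abs(row_count - expected) <= tolerance * expected


def check_distribution(values, low, high, max_null_rate=0.05) -> bool:
    """Distribution: are values inside the accepted range, with few nulls?"""
    if not values:
        return False
    nulls = sum(1 for v in values if v is None)
    in_range = all(low <= v <= high for v in values if v is not None)
    return in_range and nulls / len(values) <= max_null_rate


def schema_changes(expected: dict, actual: dict) -> dict:
    """Schema: report columns that were added, removed or retyped."""
    shared = expected.keys() & actual.keys()
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }


def downstream_of(lineage: dict, source: str) -> list:
    """Lineage: every asset reachable downstream of a broken source."""
    seen, stack = set(), [source]
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen)
```

For example, calling `downstream_of` on a lineage map with a broken raw table as the source returns every staging table, model and dashboard that could have been affected, which is the "where?" question in practice.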
High-quality data observability solutions possess the following characteristics that help companies ensure the health, quality and reliability of their data and reduce data downtime:
- Connects to an existing stack quickly and seamlessly, and does not require modifying data pipelines, writing new code or using a particular programming language.
- Monitors data at rest and does not require extracting data from where it is currently stored.
- Requires minimal configuration and practically no threshold-setting. Data observability tools should use machine learning (ML) models to automatically learn an environment and its data.
- Requires no prior mapping of what needs to be monitored and in what way. Helps identify key resources, key dependencies and key invariants to provide broad data observability with little effort.
- Provides rich context that enables rapid triage, troubleshooting and effective communication with stakeholders impacted by data reliability issues.
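The idea of learning thresholds rather than setting them by hand can be sketched with a simple statistical baseline. The Python example below is an assumption-laden stand-in for the ML models mentioned above, not how any specific tool works: it derives an acceptable range from a table's past daily row counts (mean plus or minus `k` standard deviations) and flags new values that fall outside it. The function names and the default `k` are hypothetical.

```python
from statistics import mean, stdev


def learned_bounds(history, k=3.0):
    """Derive an acceptable range from past observations: mean +/- k std devs."""
    if len(history) < 2:
        raise ValueError("need at least two past observations")
    center, spread = mean(history), stdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(history, new_value, k=3.0):
    """Flag a new observation that falls outside the learned range."""
    low, high = learned_bounds(history, k)
    return not (low <= new_value <= high)


# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 980, 1010, 990]
print(is_anomalous(daily_row_counts, 1005))  # close to the usual volume
print(is_anomalous(daily_row_counts, 400))   # a sudden drop in volume
```

A production system would also need to account for seasonality and trends, which a fixed mean and standard deviation cannot capture.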
The future of data and application observability
Since the internet became truly mainstream in the late 1990s, application observability has grown in importance, with corresponding technological advances, as a way to minimize downtime and improve trust in software.
More recently, we’ve seen a similar boom in the importance and growth of data observability as companies put more and more of a premium on trustworthy, reliable data. Just as organizations were quick to realize the impact of application downtime a few decades ago, companies are coming to understand the business impact that analytical data downtime incidents can have, not only on their public image, but also on their bottom line.
For instance, a May 2022 data downtime incident involving the gaming software company Unity Technologies sank its stock by 36% after bad data caused its advertising monetization tool to cost the company upwards of $110 million in lost revenue.
I predict that this same sense of urgency around observability will continue to expand to other areas of tech, such as ML and security. In the meantime, the more we know about system performance across all axes, the better — particularly in this macroeconomic climate.
After all, with more visibility comes more trust. And with more trust comes happier customers.
Lior Gavish is CTO and cofounder of Monte Carlo.