Data warehouses and lakes will merge

My first prediction relates to the foundation of modern data systems: the storage layer. For decades, data warehouses and lakes have enabled companies to store (and sometimes process) large volumes of operational and analytical data. While a warehouse stores data in a structured state, via schemas and tables, lakes primarily store raw data, much of it unstructured or semi-structured.

However, as technologies mature and companies seek to “win” the data storage wars, companies like AWS, Snowflake, Google and Databricks are developing solutions that marry the best of both worlds, blurring the boundaries between data warehouse and data lake architectures. Additionally, more and more businesses are adopting both warehouses and lakes — either as one solution or a patchwork of several.

Primarily to keep up with the competition, major warehouse and lake providers are developing new functionalities that bring either solution closer to parity with the other. While data warehouse software expands to cover data science and machine learning use cases, lake companies are building out tooling to help data teams make more sense out of raw data.

But what does this mean for data quality? In our opinion, this convergence of technologies is ultimately good news. Kind of.

On the one hand, a way to better operationalize data with fewer tools means there are, in theory, fewer opportunities for data to break in production. The lakehouse demands greater standardization of how data platforms work, and therefore opens the door for a more centralized approach to data quality and observability. ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees, along with storage frameworks like Delta Lake that provide them, make data contracts and change management much more manageable at scale.

On the other hand, while we predict that this convergence will be good for consumers (both financially and in terms of resource management), it will also likely introduce additional complexity to your data pipelines.

Emergence of new roles on the data team

In 2012, the Harvard Business Review named “data scientist” the sexiest job of the 21st century. Shortly thereafter, in 2015, DJ Patil, a PhD and former data science lead at LinkedIn, was hired as the United States’ first-ever Chief Data Scientist. And in 2017, Apache Airflow creator Maxime Beauchemin predicted the “downfall of the data engineer” in a canonical blog post.

Long gone are the days of siloed database administrators or analysts. Data is emerging as its own company-wide organization with bespoke roles like data scientists, analysts and engineers. In the coming years, we predict even more specializations will emerge to handle the ingestion, cleaning, transformation, translation, analysis, productization and reliability of data.

This wave of specialization is not unique to data, of course. Specialization is common to nearly every industry and signals a market maturity indicative of the need for scale, improved speed and heightened performance.

The roles we predict will come to dominate the data organization over the next decade include:

So, how will the rise in specialized data roles — and bigger data teams — affect data quality?

As the data team diversifies and use cases increase, so will stakeholders. Bigger data teams and more stakeholders mean more eyeballs are looking at the data. As one of my colleagues says: “The more people look at something, the more likely they’ll complain about [it].”

Rise of automation

Ask any data engineer: More automation is generally a positive thing.

Automation reduces manual toil, scales repetitive processes and makes large-scale systems more fault-tolerant. When it comes to improving data quality, there is a lot of opportunity for automation to fill the gaps where testing, cataloging and other more manual processes fail.
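As a rough illustration of the kind of gap automation can fill, here is a minimal sketch of an automated volume check: it flags a table load whose row count deviates sharply from recent history, something a hand-written test with a fixed threshold tends to miss. The function name, the threshold of three standard deviations and the sample counts are all assumptions made for this example, not a description of any particular tool.

```python
from statistics import mean, stdev
from typing import Sequence


def is_volume_anomaly(history: Sequence[int], latest: int, threshold: float = 3.0) -> bool:
    """Return True if `latest` is far from the historical row counts.

    Hypothetical helper: uses a simple z-score against past loads.
    """
    if len(history) < 2:
        # Not enough history to estimate spread, so do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly stable history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Illustrative daily row counts for one table (made-up numbers).
recent_row_counts = [1000, 1020, 980, 1010, 990]

print(is_volume_anomaly(recent_row_counts, 1005))  # ordinary load
print(is_volume_anomaly(recent_row_counts, 400))   # sudden drop in volume
```

A z-score is only one possible choice here; seasonal tables or fast-growing tables would likely need a different baseline.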

We foresee that over the next several years, automation will be increasingly applied to several different areas of data engineering that affect data quality and governance:

While this list just scratches the surface of areas where automation can benefit our quest for better data quality, I think it’s a decent start.

More distributed environments and the rise of data domains

Distributed data paradigms like the data mesh make it easier and more accessible for functional groups across the enterprise to leverage data for specific use cases. The potential of domain-oriented ownership applied to data management is high (faster data access, greater data democratization, more informed stakeholders), but so are the potential complications.

Data teams need look no further than the microservice architecture for a sneak peek of what’s to come after data mesh mania calms down and teams begin their implementations in earnest. Such distributed approaches demand more discipline at both the technical and cultural levels when it comes to enforcing data governance.

Generally speaking, siphoning off technical components can increase data quality issues. For instance, a schema change in one domain can cause a data fire drill in another area of the business, or duplication of a critical table that is regularly updated or augmented for one part of the business can cause pandemonium if used by another. Without proactively generating awareness and creating context about how to work with the data, it can be challenging to scale the data mesh approach.
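The schema-change scenario can be made concrete with a small sketch: a consuming domain records the columns and types it expects from a producing domain, then compares them against what the producer currently publishes before running its own jobs. Everything named here (the SchemaChange class, the diff_schemas and breaking_changes helpers, and the example "orders" columns) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SchemaChange:
    column: str
    kind: str    # "removed", "added" or "type_changed"
    detail: str


def diff_schemas(expected: Dict[str, str], actual: Dict[str, str]) -> List[SchemaChange]:
    """Compare the schema a consumer expects with the one a producer publishes."""
    changes: List[SchemaChange] = []
    for column, expected_type in expected.items():
        if column not in actual:
            changes.append(SchemaChange(column, "removed", f"expected {expected_type}"))
        elif actual[column] != expected_type:
            changes.append(
                SchemaChange(column, "type_changed", f"{expected_type} -> {actual[column]}")
            )
    for column, actual_type in actual.items():
        if column not in expected:
            changes.append(SchemaChange(column, "added", actual_type))
    return changes


def breaking_changes(changes: List[SchemaChange]) -> List[SchemaChange]:
    """Assumption: removed or retyped columns break consumers; added columns do not."""
    return [c for c in changes if c.kind in ("removed", "type_changed")]


# Hypothetical contract held by the finance domain for the sales domain's orders table.
expected_orders = {"order_id": "string", "amount": "decimal", "created_at": "timestamp"}
# Hypothetical schema the sales domain publishes after a change.
published_orders = {"order_id": "string", "amount": "float", "channel": "string"}

for change in breaking_changes(diff_schemas(expected_orders, published_orders)):
    print(f"{change.kind}: {change.column} ({change.detail})")
```

Treating added columns as non-breaking is a design choice for this sketch; a stricter contract could reject any difference at all.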

So, where do we go from here?

I predict that in the coming years, achieving data quality will become both easier and harder for organizations across industries, and it’s up to data leaders to help their organizations navigate these challenges as they drive their business strategies forward.

On the harder side, increasingly complicated systems and higher volumes of data beget complication; on the easier side, innovations and advancements in data engineering technologies mean greater automation and an improved ability to “cover our bases” when it comes to preventing broken pipelines and products. Regardless of how you slice it, however, striving for some measure of data reliability will become table stakes for even the most novice of data teams.

I anticipate that data leaders will start measuring data quality as a vector of data maturity (if they haven’t already), and in the process, work towards building more reliable systems.

Until then, here’s wishing you no data downtime.

Barr Moses is the CEO and co-founder of Monte Carlo.
