Discovering, accessing and incorporating new datasets for use in data analytics, data science and other data pipeline tasks is typically a slow process in large and complex organizations. Such organizations generally have hundreds of thousands of datasets that are actively managed across a variety of internal data stores, and have access to orders of magnitude more external datasets. Simply finding relevant data for a particular process is an almost overwhelming task.
Even once relevant data has been identified, going through the approval, governance and staging processes required for actual use of that data can take several months in practice. It is often a massive impediment to organizational agility. Data scientists and analysts are pushed to use pre-approved, pre-staged data found in centralized repositories, such as data warehouses, instead of being encouraged to use a broader array of datasets in their analysis.
Furthermore, even once the data from new datasets become available for use within analytical tasks, the fact that they come from different data sources typically implies that they have different data semantics, which makes unifying and integrating these datasets a challenge. For example, they may refer to the same real-world entities using different identifiers than existing datasets do, or may associate different attributes (and different types for those attributes) with the real-world entities modeled in existing datasets. In addition, data about those entities are likely to be sampled in a different context than the data in existing datasets. The semantic differences across the datasets make it hard to incorporate them together in the same analytical task, thereby reducing the ability to get a holistic view of the data.
Addressing the challenges to data integration
Nonetheless, despite all these challenges, it is critical that these data discovery, integration and staging tasks are performed in order for data analysts and scientists within an organization to be successful. Today this typically requires significant human effort, some of it from the person doing the analysis, but most of it from centralized teams, especially with respect to data integration, cleaning and staging. The problem, of course, is that centralized teams become organizational bottlenecks, which further hinders agility. The status quo is not acceptable to anyone, and several proposals have emerged to fix this problem.
Two of the best-known proposals are the “data fabric” and “data mesh.” Rather than focusing on an overview of these ideas, this article instead focuses on the application of the data fabric and data mesh specifically to the problem of data integration, and how they approach the challenge of eliminating reliance on an enterprise-wide centralized team to perform this integration.
Let’s take the example of an American car manufacturer that acquires another car manufacturer in Europe. The American car manufacturer maintains a parts database, detailing information about all the different parts that are required to manufacture a car — supplier, price, warranty, inventory, etc. This data is stored in a relational database — e.g., PostgreSQL. The European car manufacturer also maintains a parts database, stored in JSON inside a MongoDB database. Obviously, integrating these two datasets would be very valuable, since it’s much easier to deal with a single parts database than two separate ones, but there are many challenges. They are stored in different formats (relational vs. nested), by different systems, use different terms and identifiers, and even different units for various data attributes (e.g., feet vs. meters, dollars vs. euros). Performing this integration is a lot of work, and if done by an enterprise-wide central team, could take years to complete.
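To make the format and unit mismatch concrete, here is a minimal Python sketch that flattens one nested European parts document into a row shaped like the American relational table. Every field name, the document layout and the exchange rate passed in are hypothetical, chosen only for illustration; the article does not describe either company's actual schema.

```python
METERS_TO_FEET = 3.28084  # approximate conversion factor


def to_us_row(eu_doc, eur_to_usd):
    """Flatten a nested parts document into a flat row using feet and dollars.

    The keys read from eu_doc and written to the result are assumed names,
    and eur_to_usd is a caller-supplied rate, not a real market value.
    """
    return {
        "part_no": eu_doc["id"],
        "supplier": eu_doc["supplier"]["name"],
        "length_ft": round(eu_doc["dimensions"]["length_m"] * METERS_TO_FEET, 3),
        "price_usd": round(eu_doc["price"]["amount_eur"] * eur_to_usd, 2),
    }


eu_doc = {
    "id": "E-7",
    "supplier": {"name": "Valeo"},
    "dimensions": {"length_m": 2.0},
    "price": {"amount_eur": 100.0},
}
row = to_us_row(eu_doc, eur_to_usd=1.1)
print(row)
```

Even this toy version needs a human decision for every field (which key maps to which column, which unit applies), which is why the full integration is so labor-intensive.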
Automating with the data fabric approach
The data fabric approach attempts to automate as much of the integration process as possible with little to no human effort. For example, it uses machine learning (ML) techniques to discover overlap in the attributes (e.g., they both contain supplier and warranty information) and values of the datasets (e.g., many of the suppliers in one dataset appear in the other dataset as well) to flag these two datasets as candidates for integration in the first place.
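As a rough illustration of the overlap detection described above, the following Python sketch scores two datasets by how many attribute names they share and, for each shared attribute, how many values they share. It uses plain Jaccard set similarity as a simple stand-in for the ML techniques a data fabric product would use; the sample rows and column names are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_report(rows_a, rows_b):
    """Compare two datasets, each given as a list of flat dicts."""
    attrs_a = {key for row in rows_a for key in row}
    attrs_b = {key for row in rows_b for key in row}
    # For every attribute name both datasets use, compare the sets of values.
    value_overlap = {
        attr: jaccard(
            (row[attr] for row in rows_a if attr in row),
            (row[attr] for row in rows_b if attr in row),
        )
        for attr in attrs_a & attrs_b
    }
    return {
        "attribute_overlap": jaccard(attrs_a, attrs_b),
        "value_overlap": value_overlap,
    }


# Invented sample data for the two parts databases.
us_parts = [
    {"part_no": "A-100", "supplier": "Bosch", "warranty_months": 24},
    {"part_no": "A-200", "supplier": "Denso", "warranty_months": 12},
]
eu_parts = [
    {"part_id": "E-7", "supplier": "Bosch", "warranty_months": 24},
    {"part_id": "E-9", "supplier": "Valeo", "warranty_months": 36},
]
report = overlap_report(us_parts, eu_parts)
print(report)
```

A pair of datasets whose scores exceed some chosen threshold could then be flagged for closer inspection. Exact name matching would miss attributes that are named differently in each dataset, which is one place where learned models are expected to do better than this heuristic.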
ML can also be used to convert the JSON dataset into a relational model: soft functional dependencies that exist within the JSON dataset are discovered (e.g., whenever we see a value for supplier_name of X, we see supplier_address of Y) and used to identify groups of attributes that are likely to correspond to an independent semantic entity (e.g., a supplier entity), and create tables for these entities and associated foreign keys in parent tables. Entities with overlapping domains can be merged, with the end result being a complete relational schema. (Much of this can actually be done without ML, such as with the algorithm described in this SIGMOD 2016 research paper.)
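The idea of a soft functional dependency can be sketched in a few lines of Python. The function below measures how consistently one attribute determines another by counting the records that agree with the most common right-hand value for their left-hand value. This is an illustrative heuristic, not the algorithm from the paper mentioned above, and the sample records are invented.

```python
from collections import Counter, defaultdict


def fd_strength(records, lhs, rhs):
    """Fraction of records consistent with a dependency lhs -> rhs.

    A score of 1.0 means every lhs value always appears with the same rhs
    value; lower scores mean the dependency holds only softly or not at all.
    """
    groups = defaultdict(Counter)
    for rec in records:
        if lhs in rec and rhs in rec:
            groups[rec[lhs]][rec[rhs]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    # For each lhs value, count the records that carry its most common rhs value.
    agreeing = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return agreeing / total


# Invented sample records; the last one contains a misspelled address.
parts = [
    {"part": "radiator", "supplier_name": "Bosch", "supplier_address": "Stuttgart"},
    {"part": "grille", "supplier_name": "Bosch", "supplier_address": "Stuttgart"},
    {"part": "alternator", "supplier_name": "Valeo", "supplier_address": "Paris"},
    {"part": "wiper", "supplier_name": "Valeo", "supplier_address": "Paris"},
    {"part": "starter", "supplier_name": "Bosch", "supplier_address": "Stutgart"},
]
strong = fd_strength(parts, "supplier_name", "supplier_address")
weak = fd_strength(parts, "supplier_name", "part")
print(strong, weak)
```

Attribute pairs with a high score, such as supplier_name and supplier_address here, are candidates for being split out into their own table. One caveat of this simple measure is that any attribute whose values are all unique trivially scores 1.0 against every other attribute, so a real implementation would need to account for that.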
This relational schema produced from the European dataset can then be integrated with the existing relational schema from the American dataset. ML can be used in this process as well. For example, query history can be used to observe how analysts access these individual datasets in relation to other datasets and discover similarities in access patterns. These similarities can be used to jump-start the data integration process. Similarly, ML can be used for entity mapping across datasets. At some point, humans must get involved in finalizing the data integration, but the more that data fabric techniques can automate key steps within the process, the less work the humans have to do, ultimately making them less likely to become a bottleneck.
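Entity mapping across the two datasets might start from something like the Python sketch below, which pairs each supplier name in one dataset with its closest match in the other. It uses the standard library's `difflib.SequenceMatcher` for string similarity as a simple substitute for a learned matcher; the 0.8 threshold and the supplier names are arbitrary assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop punctuation so formatting differences matter less."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def map_entities(left, right, threshold=0.8):
    """Map each name in left to its most similar name in right, or to None.

    The threshold is an assumed cutoff; below it a name is left unmatched
    so that a human can review it.
    """
    mapping = {}
    for left_name in left:
        best, best_score = None, 0.0
        for right_name in right:
            score = SequenceMatcher(
                None, normalize(left_name), normalize(right_name)
            ).ratio()
            if score > best_score:
                best, best_score = right_name, score
        mapping[left_name] = best if best_score >= threshold else None
    return mapping


# Invented supplier names from the two parts databases.
us_suppliers = ["Bosch Automotive", "Denso Corp", "Magna International"]
eu_suppliers = ["bosch automotive gmbh", "Valeo SA", "DENSO Corp."]
mapping = map_entities(us_suppliers, eu_suppliers)
print(mapping)
```

Names that fall below the threshold are deliberately left unmapped rather than guessed, which reflects the point that humans still finalize the integration while automation narrows down what they need to look at.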
The human-centric data mesh approach
The data mesh takes a totally different approach to this same data integration problem. Although ML and automated techniques are certainly not discouraged in the data mesh, fundamentally, humans still play a central role in the integration process. Nonetheless, these humans are not a centralized team, but rather a set of domain experts.
Each dataset is owned by a particular domain team that has expertise in that dataset. This team is charged with making that dataset available to the rest of the enterprise as a data product. If another dataset comes along that would increase the utility of the existing dataset once the two are integrated, then performing that integration increases the value of the original data product.
To the extent that these teams of domain experts are incentivized when the value of the data product they produce increases, they will be motivated to perform the hard work of the data integration themselves. Ultimately then, the integration is performed by domain experts who understand car parts data well, instead of a centralized team that does not know the difference between a radiator and a grille.
Transforming the role of humans in data management
In summary, the data fabric still requires a central human team that performs critical functions for the overall orchestration of the fabric. Nonetheless, in theory, this team is unlikely to become an organizational bottleneck because much of their work is automated by the artificial intelligence processes in the fabric.
In contrast, in the data mesh, the human team is never on the critical path for any task performed by data consumers or producers. However, there is much less emphasis on replacing humans with machines, and instead, the emphasis is on shifting the human effort to the distributed teams of domain experts who are the most competent in performing it.
In other words, the data fabric fundamentally is about eliminating human effort, while the data mesh is about smarter and more efficient use of human effort.
Of course, it would initially seem that eliminating human effort is always better than repurposing it. However, despite the incredible recent advances we’ve made in ML, we are still not at the point today where we can fully trust machines to perform these key data management and integration activities that are today performed by humans.
As long as humans are still involved in the process, it is important to ask how they can be used most efficiently. Furthermore, some ideas from the data fabric are quite complementary to the data mesh and can be used in conjunction (and vice versa). Thus it is not obvious which one to use today (data mesh or data fabric), or whether the choice is even one versus the other in the first place. Ultimately, an optimal solution will likely take the best ideas from each of these approaches.
Daniel Abadi is a Darnell-Kanal professor of computer science at University of Maryland, College Park and chief scientist at Starburst.