What is a data fabric? How it helps organize complex, disparate data

Enterprise IT departments and the data scientists in them use a variety of metaphors to describe how they collect and analyze information, from the data warehouse to the data lake and sometimes even a data ocean. All the metaphors capture some aspect of how the data is gathered, stored and processed before it is analyzed and presented.

The idea of a data fabric emphasizes how the bits can take different paths that eventually form a useful whole. To extend the metaphor, they follow, connect and unite different threads that are woven or knitted together into something that captures what is going on throughout the enterprise. They build a bigger picture.

The metaphor is often used in contrast to other ideas like a data pipeline or a data silo. A good data fabric is not a single pathway, nor is it isolated. The information should come from many sources in a complex network.

The breadth and complexity of the network can be large. The data comes from different sources, perhaps spread out across the globe, before being stored and analyzed by different local computers. There are often many data collection machines like point-of-sale terminals or sensors embedded in an assembly line. Local computers aggregate the data and then pass on the information to other computers that continue the analysis. Eventually, the results are passed on as reports or screens on dashboards used by everyone in the enterprise.

The goal of the metaphor is to emphasize how a complete and useful product is constructed out of many sources. The scientists may end up using other metaphors should they store the information in a data lake or a big data system. However, this metaphor of a data fabric is meant to express how complex and integrated the data gathering process may be.

What are some hallmarks of a data fabric?

Data scientists use a number of other terms alongside the data fabric that also emphasize some of the most important features. Some of the most commonly found are the following:

What are some challenges for building a data fabric?

Many of the biggest problems for information and data architects involve low-level integration. Enterprises are flooded with different computer systems that were created at various times using different languages and standards. Because of this, much of the work involves finding a way to create connections, gather data and then transform it into a consistent format.
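To make the idea of a consistent format concrete, here is a minimal sketch in Python. The two record layouts, the field names and the converter functions are all hypothetical and invented for illustration; real systems will have their own schemas.

```python
from datetime import datetime, timezone

# Hypothetical records from two systems that describe the same kind of sale.
POS_RECORD = {"store": "berlin-01", "amt_cents": 1299, "ts": 1700000000}
ERP_RECORD = {"Site": "Austin 02", "Amount": "15.50",
              "Date": "2023-11-14T22:13:20+00:00"}


def from_pos(rec):
    """Convert a point-of-sale record (integer cents, Unix seconds)."""
    return {
        "site": rec["store"].lower(),
        "amount": rec["amt_cents"] / 100,
        "time": datetime.fromtimestamp(rec["ts"], tz=timezone.utc),
    }


def from_erp(rec):
    """Convert an ERP record (decimal string, ISO 8601 timestamp)."""
    return {
        "site": rec["Site"].lower().replace(" ", "-"),
        "amount": float(rec["Amount"]),
        "time": datetime.fromisoformat(rec["Date"]),
    }


# Both sources now share one shape: site, amount, timezone-aware time.
unified = [from_pos(POS_RECORD), from_erp(ERP_RECORD)]
```

Once every source has a converter like these, the rest of the fabric only needs to understand the one shared shape.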

One conceptual challenge is distributing the workload throughout the network. Designs can benefit when some of the analysis is done locally before it is reported and passed along. The timely use of analysis and aggregation can save time and network bandwidth charges.
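A rough sketch of local aggregation, assuming each node holds a non-empty list of numeric readings, might look like the following. The function names and the sample readings are made up for this example.

```python
def summarize(readings):
    """Run on a local node: reduce raw readings to a small summary.

    Assumes `readings` is a non-empty list of numbers.
    """
    return {
        "count": len(readings),
        "total": sum(readings),
        "low": min(readings),
        "high": max(readings),
    }


def merge(summaries):
    """Run centrally: combine summaries without seeing the raw readings."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return {
        "count": count,
        "mean": total / count,
        "low": min(s["low"] for s in summaries),
        "high": max(s["high"] for s in summaries),
    }


# Two hypothetical assembly lines, each summarized where the data is collected.
line_a = [20, 22, 21, 25]
line_b = [30, 28]
report = merge([summarize(line_a), summarize(line_b)])
```

Each node sends four numbers no matter how many readings it collected, which is where the bandwidth savings come from. The trade-off is that the central system can only compute statistics that the summaries support.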

Architects must also anticipate and design around any problems caused by machine failures and network delays. A data fabric can include hundreds, thousands or even millions of different parts, and a poorly designed system can stall while it waits for results from just one of them. The best data fabrics can sense failures, work around them, and still generate useful reports and dashboards from the working nodes.
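One simple way to keep reporting when a node is down is to record the failure and move on. The sketch below assumes each node is represented by a hypothetical fetch function that either returns a number or raises an exception. It only covers failures that surface as errors; slow nodes would also need timeouts, which are not shown here.

```python
def collect(nodes):
    """Ask every node for its value and keep going when one of them fails."""
    results, failed = {}, []
    for name, fetch in nodes.items():
        try:
            results[name] = fetch()
        except Exception:  # a real system would catch narrower error types
            failed.append(name)
    return results, failed


def offline():
    """Stand-in for a node that cannot be reached."""
    raise ConnectionError("node unreachable")


# Hypothetical nodes: two answer, one is offline.
nodes = {"plant-1": lambda: 120, "plant-2": offline, "plant-3": lambda: 95}
results, failed = collect(nodes)

# The report is built from the working nodes and says which ones are missing.
report = {
    "total": sum(results.values()),
    "nodes_reporting": len(results),
    "nodes_missing": failed,
}
```

Listing the missing nodes in the report matters as much as surviving the failure, because a total built from partial data can otherwise look complete.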

However, not all challenges are technical. Simply organizing the various sections can be politically challenging. The managers of different parts of the enterprise may want control over the data they produce and they might not want to share it. Persuading them to do so could require negotiations.

Additionally, when the different parts of the data fabric are controlled by different companies, legal teams may need to join the negotiations. Occasionally, these different sections are also in different countries with contrasting regulatory frameworks and rules for compliance. All of these issues can make it frustrating to build a data fabric that connects a global enterprise.

Some data fabric developers create special layers of control or governance which establish and enforce rules on how the data flows. Some reports and dashboards are only available to those with the right authorization. This control infrastructure can be especially useful when a data fabric spans several companies or organizations.
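At its simplest, a governance rule of this kind is a lookup from a role to the reports that role may see. The roles, report names and functions below are hypothetical, and commercial products offer far richer policy models than this sketch.

```python
# Hypothetical policy: which roles may view which reports.
POLICY = {
    "finance-analyst": {"revenue", "inventory"},
    "plant-manager": {"inventory", "sensor-health"},
}


def can_view(role, report):
    """Return True only when the policy explicitly grants access."""
    return report in POLICY.get(role, set())


def visible_reports(role, reports):
    """Filter a list of reports down to those the role is authorized to see."""
    return [r for r in reports if can_view(role, r)]


all_reports = ["revenue", "inventory", "sensor-health"]
manager_view = visible_reports("plant-manager", all_reports)
```

Unknown roles get an empty set, so access is denied by default rather than granted by accident.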

One particular area of concern is the privacy of the information. Organizations often want to protect the personal information of their members and employees. A good data fabric architecture includes security and privacy protections to combat inadvertent disclosure or malicious actors. Lately, governments have also imposed strict regulations on personally identifiable information (PII) and data fabrics must be able to handle compliance for all regions.

How are the major players approaching data fabrics?

Large cloud companies are optimized for creating data warehouses and lakes from information gathered around the globe. While they don’t always use the term 'data fabric' to describe their tools, their business model is ideally suited for companies that want to create their own data fabric out of a wide collection of their tools. Some may even want to create multicloud collections, using one cloud for one part of a system, another cloud for a different part, and perhaps an on-premises collection of machines for yet another component.

IBM offers a number of software packages for data collection and analysis that can be used to create a large data fabric. They specialize in large enterprises that need the analysis that can help manage often disparate groups. Their tools span multiple clouds and include a number of options that were developed for more particular applications. For example, some data fabrics include data science from IBM’s Cloud Pak for Data or artificial intelligence (AI) models developed with IBM’s Watson.

Amazon Web Services (AWS) offers a number of data collection and analysis tools that can be used to knit together a data fabric. They offer many databases and data storage solutions that can support a data warehouse or data lake. They also offer some raw tools for studying the data, such as QuickSight or DataBrew. A number of their databases, including Redshift, are also optimized for producing many basic insights. AWS also hosts other companies such as Databricks on their servers, offering many options for creating a data fabric out of the tools from many merchants.

Google Cloud also offers a wide range of data storage and analytics services that can be integrated to build a data warehouse or fabric. Their tools range from basic services like Dataflow for organizing data movement to Dataproc for running open-source tools like Apache Spark at scale. Google also offers a collection of AI tools for creating and refining models from the data.

Microsoft’s Azure cloud also offers a similar collection of data storage and analytics tools. Their AI tools like Azure Cognitive Services and Azure Machine Learning can help add AI to the mix. Some of their tools like Azure Purview are also designed to help with practical tasks of governance like tracking provenance or integrating multiple clouds across political and corporate boundaries.

Oracle offers tools that can create a data fabric, or what they sometimes call a data grid. One of them is Coherence, a product they consider middleware. This is a queryable tool that connects multiple databases together, parceling out requests for data and then collecting and aggregating the results.
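The general pattern of parceling out a request and combining the answers is often called scatter-gather. The sketch below illustrates that pattern with Python's built-in sqlite3 module and in-memory databases. It is not Coherence's API, and the table, helper functions and sample rows are invented for the example.

```python
import sqlite3


def make_store(rows):
    """Create a small in-memory database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def scatter_gather(stores, sql):
    """Send the same query to every store, then sum the partial results."""
    totals = {}
    for conn in stores:
        for region, amount in conn.execute(sql):
            totals[region] = totals.get(region, 0) + amount
    return totals


# Two hypothetical databases that each hold part of the sales data.
stores = [
    make_store([("emea", 100.0), ("apac", 50.0)]),
    make_store([("emea", 25.0), ("amer", 75.0)]),
]
totals = scatter_gather(
    stores, "SELECT region, SUM(amount) FROM sales GROUP BY region"
)
```

Each database does its own grouping and returns only one row per region, so the combining step stays small even when the underlying tables are large.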

How are startups and challengers building data fabrics?

A number of startups and smaller companies are building software that can help orchestrate the flow of data through enterprises. They may not create all of the data storage and data transmission packages but they can work with other products that speak common standards. For example, many products rely upon SQL databases and the architects of data fabrics may choose between several good options that can be hosted in many clouds or locally.

Talend, for example, delivers a mechanism for integrating data sources throughout the enterprise. The software can automatically discover data sources and then bring their information into the reporting fabric when they speak the standard data exchange languages. The system also offers the Talend Trust Score, which tracks data quality and integrity by watching for gaps or anomalies that may corrupt the reporting.

Astronomer offers managed versions of the open-source Apache Airflow that simplify many processes. Astronomer calls the foundation of their system “data pipelines-as-code” because the architects create their fabric by specifying any number of data pipelines that link together data science systems, analytics tools and filtering into a unified fabric.

Nexla breaks down the job of building a data fabric into one of linking together their Nexsets, tools that handle the raw chores of organization, validation, analysis, formatting, filtering, etc. Once the data flows are specified by linking them together, Nexla’s main product controls the data flows so that everyone has access to the data they need but not the data that they aren’t authorized to see.

Scikiq offers a product that delivers a holistic layer with a no-code, drag-and-drop user interface for integrating data collection. The analysis tools include a large amount of artificial intelligence to both prepare and classify the data flowing from multiple clouds.

Is there anything that a data fabric can’t do?

The layers of software that build a data fabric rely heavily on storage and analysis tools that are often considered separate entities. When the data storage systems speak standard protocols, as many of them do, the systems can work well. However, if the data is stored in unusual formats or the storage systems aren’t available, the data fabric can’t do much.

Many of the fundamental problems with the data fabric can be traced back to issues with data collection. If the data is noisy, intermittent or broken, the reports and dashboards produced by the data fabric may be empty or just plain wrong. Good data fabrics can detect some issues, filter them out and include warnings with their reporting, but they can’t detect all issues.
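A minimal sketch of this kind of filtering appears below. It assumes readings arrive as a list in which missing values are None, and that valid values fall inside a known range. The function name and the thresholds are illustrative only.

```python
def clean(readings, low, high):
    """Drop missing or out-of-range readings and explain each removal."""
    kept, warnings = [], []
    for i, value in enumerate(readings):
        if value is None:
            warnings.append(f"reading {i}: missing")
        elif not low <= value <= high:
            warnings.append(f"reading {i}: {value} outside {low}..{high}")
        else:
            kept.append(value)
    return kept, warnings


# Hypothetical temperature readings with one gap and one implausible spike.
kept, warnings = clean([21.5, None, 22.0, 999.0], low=-40, high=60)
```

A range check like this catches gaps and wild outliers, but a sensor that is wrong by a plausible amount passes straight through, which is why no filter can catch everything.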

Data fabrics also rely on other libraries and tools for their data analysis. Even if these are provided with accurate data, the analysis is not always magical. The statistical routines and AI algorithms can make mistakes or fail to generate the insights we hope to receive.

In general, data fabric packages have the job of collecting the data and moving it to the different software packages that can analyze it. If the data is not available or the analysis is incorrect, the data fabric is not responsible.
