Reposted from Jay Piscioneri on Eckerson.com
In the marketplace of ideas, data fabric is on a bull run. Many software companies now have data fabric offerings, and it’s the subject of many seminars, whitepapers, and ebooks. This article will delve into what a data fabric is and why it matters. And we’ll look at how one product, Promethium, implements a data fabric to accelerate self-service analytics.
First, let’s define the term. Data fabric is an architectural approach that uses metadata, machine learning, and automation to weave together data of any format from any location. Its objective is to make data easier to find, use, and manage in light of today’s massive data volumes and constant change.
Data fabric automates functions such as onboarding new data sources, managing metadata, and combining data from different sources. It creates an abstraction layer that hides the complexity of working with distributed data and disparate formats. It provides seamless access to enterprise data, which helps data teams keep up with the exploding volume of user requests.
Data Fabric’s Core Components
There is no one way to build a data fabric. But for a data fabric to achieve its objectives of making data easy to find, use, and manage, it should consist of the following core components (see Figure 1):
Metadata. The foundation of data fabric is metadata. This includes technical metadata that describes the properties of data sources such as format, data types, and access protocols; business metadata such as labels and classification; social metadata such as tags and annotations; and operational metadata about how data is used.
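To make these categories concrete, here is a minimal sketch of what a single catalog entry might hold if all four kinds of metadata were modeled together. The field names are illustrative assumptions, not Promethium’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical metadata record combining the four metadata categories."""
    # Technical metadata: where the data lives and how to read it
    source: str       # e.g. "oracle://sales_db/orders"
    data_format: str  # e.g. "table", "parquet", "json"
    columns: dict     # column name -> data type

    # Business metadata: meaning and governance
    business_label: str = ""          # e.g. "Monthly product sales"
    classification: str = "internal"  # e.g. "public", "internal", "restricted"

    # Social metadata: contributed by users
    tags: list = field(default_factory=list)         # e.g. ["sales", "finance"]
    annotations: list = field(default_factory=list)  # free-text notes and ratings

    # Operational metadata: how the data is actually used
    query_count: int = 0    # how often it appears in queries
    last_accessed: str = "" # ISO timestamp of most recent use
```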
Processing layers. The operational work of a data fabric involves providing connections to data sources, managing workflows, abstracting data to create a unified view, and ensuring access to data is appropriately controlled.
Application functionality. The interactive functions of discovery, cataloging, and preparation enable data teams to manage data. They also enable business users to find and work with data.
AI/Machine Learning. Using the fabric’s metadata, AI and machine learning automate many aspects of the processing and application functions, reducing the workload of data teams and users.
Promethium as a Data Fabric
To get a feel for how a data fabric works in practice, it will be helpful to examine a commercial product that implements one. Promethium is a self-service analytics workbench that claims to be “the most comprehensive Data Fabric available today”. Its objective is to make it easy for business users to ask questions of distributed data and get answers quickly.
Since Promethium is optimized to support self-service analytics, we’ll focus on the application functions that data consumers interact with to find and use data: discovery, preparation, and cataloging.
The Role of Discovery. Discovery is the data fabric function that makes data easy to find. Without effective discovery, the data needed for a particular purpose is lost among terabytes or petabytes of other data. Discovery enables users to search for or browse available data sets and preview data from different sources to judge whether it will help their analysis.
Promethium’s discovery function uses natural language processing, a form of artificial intelligence, to make finding data easier. It takes regular sentences or phrases and intelligently parses the words to derive search terms. For example, a sales manager can enter “What are product sales by month?”. Promethium determines the keywords and scans its metadata to find relevant data elements that a data analyst or the data team can use to develop a query that answers the question.
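To illustrate the general idea only (this is not how Promethium implements its NLP), a toy sketch of turning a question into search terms and matching them against catalog metadata might look like this:

```python
# Illustrative only: derive keywords from a question and score catalog
# entries by how much of their metadata those keywords touch.

STOP_WORDS = {"what", "are", "is", "the", "by", "of", "a", "an", "for"}

def extract_search_terms(question: str) -> set[str]:
    """Strip punctuation and stop words to leave candidate keywords."""
    words = question.lower().replace("?", "").split()
    return {w for w in words if w not in STOP_WORDS}

def find_candidate_datasets(question: str, catalog: list[dict]) -> list[str]:
    """Rank catalog entries by how many search terms their metadata mentions."""
    terms = extract_search_terms(question)
    scored = []
    for entry in catalog:
        searchable = " ".join([entry["name"], *entry["tags"], *entry["columns"]]).lower()
        hits = sum(1 for t in terms if t in searchable)
        if hits:
            scored.append((hits, entry["name"]))
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical catalog entries for the sales manager's question
catalog = [
    {"name": "orders", "tags": ["sales"], "columns": ["product_id", "order_month", "amount"]},
    {"name": "employees", "tags": ["hr"], "columns": ["name", "hire_date"]},
]
print(find_candidate_datasets("What are product sales by month?", catalog))
# -> ['orders']
```

A real implementation would also handle synonyms, rank by usage, and draw on the business and social metadata described earlier; the sketch only shows the shape of the workflow.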
Implementing discovery through natural language search is one of Promethium’s differentiators. Another unique feature is how Promethium tracks queries and the questions they answer in its metadata. So if there’s already a query that shows product sales by month, the manager gets their answer immediately.
The Role of Preparation. Preparation is the data fabric function that enables power users to access data from one or more source systems without having to deal with different formats or distributed locations. For example, they can create a dashboard view of aggregate sales data that comes from several systems or prepare a blended data set of ecommerce and social media activity to train a machine learning model.
Promethium’s data preparation function is a no-code, federated query builder. It enables power users to connect to over 200 data sources to create abstracted data sets by querying the data in place. They can also create data pipelines that move data from one place to another and even from one platform to another, such as from Oracle to Snowflake.
Promethium automates data preparation by using its metadata and natural language processing. For example, Promethium parses the question “What are product sales by month?” to find data resources that could provide an answer. The federated query builder takes the discovered resources and automatically generates SQL code that transforms data. For example, it joins tables, sorts, groups, subtotals, and filters data to help answer the question. Power users can edit the generated code or write their own.
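As a hedged illustration of what such generated SQL could look like, here is a sketch for the sales-by-month question. The table and column names are invented; a real query would be shaped by the metadata the discovery step surfaced.

```python
# Hypothetical example of the kind of SQL a federated query builder might
# generate for "What are product sales by month?". Table and column names
# are invented for illustration.
generated_sql = """
SELECT
    p.product_name,
    DATE_TRUNC('month', o.order_date) AS sales_month,
    SUM(o.amount)                     AS total_sales
FROM sales_db.orders o        -- e.g. a table living in one system
JOIN product_db.products p    -- e.g. a table living in another
  ON o.product_id = p.product_id
GROUP BY p.product_name, DATE_TRUNC('month', o.order_date)
ORDER BY sales_month, p.product_name
"""
# A power user could run this as-is through the federated engine,
# edit it, or replace it with hand-written SQL.
```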
The Role of Cataloging. Cataloging is the data fabric function that enables users to create, view, and manage metadata. It’s the interface that displays information about data resources, facilitating discovery. It also automates the generation of technical, business, and operational metadata.
Promethium offers a good example of a data fabric cataloging function. It enables users to interact with metadata and also automates cataloging processes. For instance, when the data team onboards new data, Promethium automatically classifies, tags, and indexes it. As consumers use the new data, Promethium automatically learns how it joins with other data and what queries it appears in. Users can add their input, enriching the metadata with tags, ratings, and annotations.
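A simplified sketch of that kind of usage-driven enrichment, with invented names and none of Promethium’s internals, might look like this:

```python
# Illustrative sketch: each time a query runs, record which datasets it
# touched and which columns it joined on, so later searches and query
# generation can reuse that knowledge.

from collections import defaultdict

class UsageCatalog:
    def __init__(self):
        self.queries = defaultdict(list)    # dataset -> questions it helped answer
        self.join_paths = defaultdict(int)  # (left_col, right_col) -> times joined

    def record_query(self, question, datasets, joins):
        """Called after a query runs; enriches operational metadata."""
        for ds in datasets:
            self.queries[ds].append(question)
        for pair in joins:
            self.join_paths[pair] += 1

    def suggest_joins(self, dataset_prefix):
        """Return join keys seen for a dataset, most frequently used first."""
        seen = [(count, pair) for pair, count in self.join_paths.items()
                if pair[0].startswith(dataset_prefix)]
        return [pair for _, pair in sorted(seen, reverse=True)]

catalog = UsageCatalog()
catalog.record_query(
    "What are product sales by month?",
    datasets=["orders", "products"],
    joins=[("orders.product_id", "products.product_id")],
)
print(catalog.suggest_joins("orders"))
# -> [('orders.product_id', 'products.product_id')]
```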
Conclusion
As we’ve seen through Promethium’s example, a data fabric can make it easier to find, use, and manage today’s diverse and distributed data. Intelligent automation can expedite tasks that a data team would otherwise have to do manually.
But there are some things to consider:
An evolving product market. The data fabric product market is still taking shape. While most products have the core components we discussed here, they approach data fabric differently. Promethium’s founder has a background in data cataloging, which explains the product’s emphasis on discovery with natural language processing, a feature many products don’t yet have.
The effectiveness of intelligent automation. Most data fabric products use AI/machine learning to automate tasks such as cataloging new data and data preparation. This can eliminate time-consuming manual steps, as long as the automation makes good decisions. If manual review and correction are needed often enough, the time savings erode and service bottlenecks persist.
Build or Buy. Building your own data fabric is a significant undertaking that requires substantial engineering and data science resources. For many companies, buying a product is the only viable option. But a data fabric is not just one product; it’s a suite of products. That makes the risk of vendor lock-in especially acute.
In spite of these concerns, the promise of data fabric outweighs its perils. Promethium demonstrates some of that promise. For example, going from a natural language question to an executable query without having to know how to blend data from different platforms is pretty cool. As with any new technology, companies considering data fabric must carefully evaluate their needs, set realistic expectations, and experiment with small implementations before fully committing.