As the variety of data used to make data-derived decisions expands, the scope of and requirements for metadata services expand too, both in their importance and in the range of services required.
Data has taken on new dimensions. We’ve all been hearing for years about the 3 Vs, and are well aware that most organizations are facing a deluge of data. However, this deluge is changing more than the requirements for processing, storage, and retrieval. Analysts are approaching data in novel ways, combining data from different departments that may have been collected for different reasons, and even mining structured and unstructured data together for new insights.
In this environment, we’ve seen an emerging need for progress in data discovery, data management, and data security for organizations. These areas share a critical linking thread: metadata. Traditionally, data catalogs have played a role in both capturing metadata and assigning it to datasets. However, data catalogs were originally created to facilitate governance and compliance, with analytics a secondary concern, so data management services will need further capabilities going forward.
Recently, Twitter software engineer Lohit Vijayarenu riffed on this topic, noting that data engineers should keep several features top of mind when evaluating metadata services, including discoverability, security control, schema management, and an application interface/service guarantee.
Read Lohit Vijayarenu’s article in the O'Reilly eBook 97 Things Every Data Engineer Should Know, available as a free download courtesy of Promethium.
Data Discovery
In today’s organizations, the data you need to answer a business question could be just about anywhere, from a SQL database housed in Nevada to a Hadoop cluster partitioned across multiple continents. Effective metadata services give users a way to search for data across the organization through a single, unified portal. This makes it much quicker to discover data for business questions. It also removes the need to hardcode data locations into applications and to build many of the department-centric custom solutions that siloed data has required.
To do this effectively, metadata services must be delivered through a unified virtual data layer, or data fabric, that sits on top of all your data, regardless of where the repositories reside (Hadoop, Oracle, Snowflake, Teradata, etc.), and allows you to access or add to metadata anywhere in your organization.
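As a rough illustration of the discovery idea above, the following minimal Python sketch keeps a single in-memory catalog that applications query by keyword or logical name instead of hardcoding locations. The class names, dataset names, and location strings are all hypothetical, and a real metadata service would sit behind a network API and a persistent store rather than a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Metadata for one dataset. All field values used below are made up."""
    name: str
    system: str                    # e.g. "oracle", "hadoop", "snowflake"
    location: str                  # hypothetical connection string or path
    description: str = ""
    tags: frozenset = frozenset()


class MetadataCatalog:
    """A single lookup point for datasets spread across many systems."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        """Return names of datasets whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in tag.lower() for tag in e.tags)
        )

    def resolve(self, name):
        """Map a logical dataset name to its current physical location."""
        return self._entries[name].location


catalog = MetadataCatalog()
catalog.register(DatasetEntry(
    name="sales_orders",
    system="oracle",
    location="oracle://nevada-dc/sales/orders",
    description="Order lines from the sales department",
    tags=frozenset({"sales", "finance"}),
))
catalog.register(DatasetEntry(
    name="web_clickstream",
    system="hadoop",
    location="hdfs://global-cluster/raw/clicks",
    description="Raw clickstream events",
    tags=frozenset({"marketing"}),
))

matches = catalog.search("sales")
clicks_location = catalog.resolve("web_clickstream")
```

Because applications ask the catalog for `web_clickstream` by name, the dataset can move to a different cluster with only its catalog entry changing.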
Security
Metadata services support the implementation of organization-wide unified security controls. Specifically, data engineers should look for metadata services that facilitate authorization protocols for the same data stored in a distributed manner in different systems, through unified interfaces that can be configured to ensure organization-wide policy compliance and facilitate auditing.
Here, metadata services, especially those that leverage natural language processing for semantic search, can be useful in distinctive ways, such as identifying sensitive information and enabling quick search and retrieval of specific data based on a wide variety of metadata.
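To make the idea of unified security controls more concrete, here is a small Python sketch in which one policy table, keyed by a classification stored in the metadata, governs every copy of a dataset regardless of which system holds it, and every decision is recorded for auditing. The roles, classifications, and system names are assumptions chosen for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCopy:
    """One physical copy of a logical dataset (all values are hypothetical)."""
    logical_name: str
    system: str
    classification: str   # metadata label such as "public" or "pii"


# One organization-wide policy: classification -> roles allowed to read it.
POLICY = {
    "public": {"analyst", "engineer", "auditor"},
    "pii": {"auditor"},
}

# Every authorization decision is appended here to support auditing.
audit_log = []


def is_allowed(role, copy):
    """Check a role against the shared policy and record the decision."""
    allowed = role in POLICY.get(copy.classification, set())
    audit_log.append((role, copy.logical_name, copy.system, allowed))
    return allowed


# The same logical dataset stored in two different systems.
copies = [
    DatasetCopy("customers", "snowflake", "pii"),
    DatasetCopy("customers", "hadoop", "pii"),
]

# The analyst gets the same answer for both copies because the policy
# follows the metadata, not the storage system.
decisions = [is_allowed("analyst", c) for c in copies]
```

The design choice worth noting is that the policy is attached to the classification rather than to a system, so adding a third copy in another repository requires no new rules.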
Ability to Define and Query Database Schemas
Because a rich schema aids not only in fine-tuning security parameters and discovery but also in building applications, an effective metadata service should allow users to both define and query schemas for datasets, and should provide features for schema validation and versioning.
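The define, query, validate, and version operations described above could look roughly like the following Python sketch. It assumes a deliberately simple schema representation (a mapping from field name to Python type) and a hypothetical `SchemaRegistry` class. Production schema registries typically use richer formats and compatibility rules.

```python
class SchemaRegistry:
    """Toy registry: each dataset has an ordered list of schema versions."""

    def __init__(self):
        self._versions = {}   # dataset name -> list of {field: type} dicts

    def define(self, dataset, schema):
        """Store a new schema version and return its 1-based version number."""
        versions = self._versions.setdefault(dataset, [])
        versions.append(dict(schema))
        return len(versions)

    def get(self, dataset, version=None):
        """Return the requested version, or the latest one if none is given."""
        versions = self._versions[dataset]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{dataset} has no schema version {version}")
        return versions[version - 1]

    def validate(self, dataset, record, version=None):
        """Return a list of problems; an empty list means the record is valid."""
        schema = self.get(dataset, version)
        errors = []
        for field_name, field_type in schema.items():
            if field_name not in record:
                errors.append(f"missing field: {field_name}")
            elif not isinstance(record[field_name], field_type):
                errors.append(f"wrong type for field: {field_name}")
        return errors


registry = SchemaRegistry()
v1 = registry.define("orders", {"order_id": int, "amount": float})
v2 = registry.define("orders", {"order_id": int, "amount": float, "currency": str})

record = {"order_id": 1, "amount": 9.5}
errors_latest = registry.validate("orders", record)          # checked against v2
errors_v1 = registry.validate("orders", record, version=1)   # checked against v1
```

Keeping earlier versions queryable lets older applications continue validating against the schema they were built for while newer ones adopt the latest version.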
A Flexible, Highly Available, and User-Friendly Interface
Finally, data engineers should look for a metadata services interface that is user-friendly enough for even non-technical users to access, highly available, and easy to iterate on. This last point matters because data is becoming increasingly dynamic and data models will keep evolving, so metadata must be able to evolve with them.
To learn how a data fabric can give you everything you need from metadata services, read on.