A number of models are used to manage access to data at the organizational level; the latest and most dynamic is the data fabric. To understand its value, it helps to understand how it fundamentally differs from an approach closely associated with big data: the data lake.
The data lake arose, in part, as a way to overcome the limits of the data warehouse: an attempt to collect all the data relevant to an organization in a single, highly organized, tightly managed repository. The warehouse concept, however, didn’t scale to big data. Its schema-on-write model meant that data had to conform to strict rules before being loaded, which in turn required a lot of oversight, time and work, and often created bottlenecks in data pipelines.
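To make the schema-on-write bottleneck concrete, here is a minimal Python sketch: every record must pass validation against a fixed schema before it is allowed into the warehouse table. The EXPECTED_SCHEMA and validate_and_load names are illustrative, not part of any particular warehouse product.

```python
# Minimal sketch of schema-on-write: every record must match the
# table's predefined schema *before* it is accepted for loading.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def validate_and_load(record: dict, table: list) -> None:
    """Reject any record that does not conform to the fixed schema."""
    if set(record) != set(EXPECTED_SCHEMA):
        raise ValueError(f"unexpected columns: {set(record) ^ set(EXPECTED_SCHEMA)}")
    for column, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be of type {expected_type.__name__}")
    table.append(record)  # only schema-conformant data ever lands

warehouse_table: list = []
validate_and_load({"order_id": 1, "customer_id": 42, "amount": 9.99}, warehouse_table)

# A semi-structured clickstream event such as {"event": "page_view", "props": {...}}
# would be rejected outright -- the kind of friction described above.
```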
With the ‘data lake’ concept (whose earliest incarnation was Hadoop), instead of a tightly controlled, well-organized, library-style repository, you have a vast reservoir that accepts any kind of data in raw form. For companies that ingest large volumes of data at high speed, spanning both structured and unstructured formats, the appeal was enormous.
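By contrast, a lake-style store accepts raw data as-is and defers structure to read time (schema-on-read). Below is a minimal sketch of that model; the /tmp/data_lake path and file names are hypothetical stand-ins for HDFS or object storage.

```python
import json
from pathlib import Path

# Minimal sketch of a data lake's "store raw, interpret later" model.
# The lake path and file names are illustrative, not a real cluster layout.
LAKE = Path("/tmp/data_lake/raw/events")
LAKE.mkdir(parents=True, exist_ok=True)

# Ingest: anything lands as-is -- JSON events, CSV exports, free text.
(LAKE / "clickstream.json").write_text(json.dumps({"event": "page_view", "url": "/home"}))
(LAKE / "support_ticket.txt").write_text("Customer reports login failure on mobile app.")

# Schema-on-read: structure is imposed only at query time, per consumer.
def read_page_views(lake_dir: Path):
    for path in lake_dir.glob("*.json"):
        record = json.loads(path.read_text())
        if record.get("event") == "page_view":
            yield record["url"]

print(list(read_page_views(LAKE)))  # ['/home']
```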
The lake concept, however, has its own limitations, which the subsequent paradigm of the data fabric addresses. Let’s explore why by looking at some of the similarities and differences between the two approaches.
How data lakes and data fabrics are similar
Data lakes and data fabrics have a few things in common, both conceptually and functionally. For example, both of these systems:
How data lakes and data fabrics are different
While both systems create a way to handle massive volumes and varieties of data at high velocity (the three v’s of big data), they also differ in fundamental ways:
One final point: a data fabric may contain a data lake, but a data lake will never contain a data fabric. A data fabric exists at a higher level; it is an abstraction over all kinds of data sources, one of which may be a data lake. So, with a data fabric, you can have an overarching access layer that spans Hadoop along with other repositories, such as those better suited to smaller data assets or relational data.
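As a rough illustration of that layering, the sketch below puts a single access layer in front of two very different sources: an in-memory relational table and a directory of raw lake files. The FabricCatalog class and the dataset names are hypothetical; a real data fabric adds metadata management, governance, and virtualization far beyond this.

```python
import sqlite3
from pathlib import Path

# Hypothetical access layer: logical dataset names map to readers that
# know where and how the data actually lives.
class FabricCatalog:
    def __init__(self):
        self._sources = {}

    def register(self, name, reader):
        """Map a logical dataset name to a callable that reads it."""
        self._sources[name] = reader

    def read(self, name):
        return self._sources[name]()

# One source is a relational store ...
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")

# ... another is a file-based lake directory of raw documents.
lake_dir = Path("/tmp/data_lake/raw")
lake_dir.mkdir(parents=True, exist_ok=True)
(lake_dir / "note.txt").write_text("Raw, unstructured note")

catalog = FabricCatalog()
catalog.register("customers", lambda: db.execute("SELECT * FROM customers").fetchall())
catalog.register("raw_notes", lambda: [p.read_text() for p in lake_dir.glob("*.txt")])

# Consumers go through one access layer, regardless of where the data lives.
print(catalog.read("customers"))   # [(1, 'Acme Corp')]
print(catalog.read("raw_notes"))   # ['Raw, unstructured note']
```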