For data engineers, one of the biggest challenges is that data is usually ‘siloed’ between different systems and teams, making it difficult to get the insights needed to answer important questions. For this reason, ‘data silos’ have become something of a dirty word within the data management, analytics, and storage industries.
As such, the conventional wisdom is that silos need to be eliminated. But should data silos really be considered a mere nuisance that, once removed, will allow companies to finally move forward as data-driven organizations? In this article we’ll discuss why data silos are actually necessary, as well as alternative approaches that let organizations retain the benefits of silos while working around the difficulties they pose.
Data Silos Serve a Purpose
Consider a parallel trend that’s taken hold in Silicon Valley: the open workspace. Companies need to collaborate better; that’s a given. Walls, so the thinking goes, separate people. So, if you get rid of the walls in an office, everyone will be one happy family, collaborating Gen Z style. Of course, it hasn’t really worked out that way. Walls, it turns out, serve many legitimate and critical purposes: they allow confidential matters to be discussed in private, let people focus intensely on work that requires prolonged concentration, and give introverts a break from constant interruptions.
Likewise, data silos are not just a mistake to be rectified. They exist for many legitimate reasons, several of which are outlined by Bin Fan and Amelia Wong in 97 Things Every Data Engineer Should Know.
So it’s clear that a one-size-fits-all approach may not be the best, particularly for large organizations with multiple data types and a wide range of requirements. Knocking down all the silos, at least in the literal sense, would in many cases make no more sense than eliminating the silos on a grain farm. Throwing all the data into a giant data warehouse or Hadoop cluster would not only be impractical from a cost perspective, but could also have some very negative repercussions.
What’s required is a data mesh-style approach that lets people across the organization access all the data within the architecture as it currently exists, without duplicating data or adding more layers of complexity.
Coexisting with Data Silos by Using a Data Fabric
We can learn to peacefully coexist with our data silos by implementing a concept known as the Data Fabric: a layer of abstraction between storage repositories and the users who need to access the data, which lets them view and query it as if it were a single repository. It isn’t really a single repository; data that needs to be physically separated is still held in distinct storage systems. But access is standardized through virtualization.
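To make the indirection concrete, here is a minimal sketch of that abstraction layer in Python. Everything here (the Source protocol, the DataFabric class, the dataset names) is hypothetical for illustration; a real data fabric would also handle authentication, query pushdown, caching, and much more.

```python
# Minimal sketch of the "single logical view" idea behind a data fabric.
# All names here are hypothetical; a real fabric would also handle auth,
# query pushdown, caching, metadata, and so on.
from typing import Dict, Iterable, Protocol


class Source(Protocol):
    """Anything that can hand back rows for a named dataset: Oracle, Hadoop, S3..."""
    def read(self, dataset: str) -> Iterable[dict]: ...


class DataFabric:
    """Routes logical dataset names to whichever silo actually stores them."""

    def __init__(self) -> None:
        self._catalog: Dict[str, Source] = {}

    def register(self, dataset: str, source: Source) -> None:
        self._catalog[dataset] = source

    def read(self, dataset: str) -> Iterable[dict]:
        # Analysts address data by logical name; the physical location stays hidden.
        return self._catalog[dataset].read(dataset)
```

The point is the indirection: analysts code against logical names, so a department can move its data between storage systems without breaking anything downstream.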
In such an environment, data engineers can pull data from multiple sources into the same analytics platform: data in Oracle might be joined with data in Hadoop to feed a single calculation. At the same time, the data engineering team does not need to micromanage every data source; that can be handled at the department level by people with the domain expertise to understand the nuances of how their data needs to be treated. Optimal storage decisions can be made without worrying about how they might affect the work of analysts.
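As a hedged illustration of what such a cross-silo join can look like, the sketch below uses Trino, one example of a query-federation engine (the article doesn’t prescribe a specific tool), and its Python client to join a table living in Oracle with one living in Hadoop/Hive in a single statement. The host, catalog, schema, table, and column names are all hypothetical placeholders.

```python
# Sketch of a federated join, assuming a Trino cluster with an Oracle catalog
# ("oracle_crm") and a Hive catalog ("hive") already configured. All names
# below are hypothetical. Requires: pip install trino
from trino.dbapi import connect

conn = connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One SQL statement spans both silos; the engine resolves each catalog prefix
# to its underlying storage system and pushes work down where it can.
cur.execute("""
    SELECT o.customer_id,
           SUM(o.amount)     AS revenue,
           COUNT(c.click_id) AS clicks
    FROM oracle_crm.sales.orders AS o
    JOIN hive.weblogs.clicks     AS c
      ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""")

for customer_id, revenue, clicks in cur.fetchall():
    print(customer_id, revenue, clicks)
```

Neither the analyst nor the data engineering team needs to know, or care, which silo each table physically lives in; that remains the owning department’s decision.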
If you want to learn more about how a data fabric can accomplish all of this, read on.