It’s well understood at this point that data scientists and data engineers have a wealth of expertise that’s difficult to come by and in high demand. Your data science team, however, has much more than that. They’ve got the ‘lay of the land’--irreplaceable knowledge of your company’s enterprise data landscape, which unfortunately disappears when they move on to another company. Consider that to answer a business question or build a solution to a business challenge, they’ll have to figure out:
What tables hold the data required to answer the question, and where they reside
The overall condition of the tables, and whether the data is usable (and if not, what needs to be done to make it usable)
The locations of similar tables in other databases whose contents overlap or duplicate the desired data
Which tables might be best suited for certain analytic methods or models
Any issues with regulatory compliance or customer data privacy that need to be resolved before the data is analyzed
These are, of course, just a few examples of the learnings that make data scientists and engineers a treasure trove of information--knowledge that can accelerate future analytics projects and even protect your organization from lawsuits or regulatory inquiries. Unfortunately, these ‘treasure troves’ don’t stick around long--a recent poll by Burtch Works suggests that data scientists stay at a company for 2.6 years on average. That means that in less than three years your data science team could be entirely different. Consider it a proverbial ‘delete’ button on the institutional knowledge of your data landscape, one that gets pushed three to four times a decade.
Sidestepping the “Delete” Button on Institutional Knowledge
With the costs and risks of data-related delays growing as the business environment becomes more regulated and competition stiffens, this situation is unacceptable. Not surprisingly, it has become a major stumbling block in companies’ efforts to be data-driven. Randy Bean of NewVantage Partners and Thomas Davenport of Deloitte reported in Harvard Business Review that the share of executives citing Big Data adoption as a major challenge rose from 65 percent the previous year to 75 percent in 2019. Of those respondents, 93 percent identified people and processes as the obstacle.
Fortunately, there are things that can be done to ensure that the institutional knowledge of your data landscape remains intact, regardless of who leaves your organization.
Clearly, the solution to institutional knowledge loss involves capturing information such as the business problem that prompted the question, the question itself, and the tables that held the data.
If the data required to solve a particular problem spans multiple systems such as Oracle DB, Snowflake, or Hadoop, the SQL statements that collected that data can be a nightmare to reconstruct, so those must be captured too.
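To make that concrete, here is a minimal sketch of the kind of federated statement involved, and of what capturing it might look like. It assumes a Trino/Presto-style engine that mounts each source system as a catalog; every catalog, table, and column name is a hypothetical placeholder, not a reference to any particular schema.

```python
# A sketch of a cross-system query worth capturing. Assumes a
# Trino/Presto-style federated engine where each source system is
# mounted as a catalog; all names below are hypothetical placeholders.
FEDERATED_QUERY = """
SELECT c.segment, SUM(o.order_total) AS revenue
FROM oracle.crm.customers c                -- lives in Oracle DB
JOIN snowflake.sales.orders o              -- lives in Snowflake
  ON o.customer_id = c.customer_id
WHERE EXISTS (
  SELECT 1 FROM hive.weblogs.sessions s    -- lives in Hadoop (Hive)
  WHERE s.customer_id = c.customer_id
)
GROUP BY c.segment
"""

# Capturing the statement next to the question that prompted it spares
# successors from reverse-engineering join logic across three systems.
captured = {
    "question": "Which customer segments drive the most revenue?",
    "systems": ["Oracle DB", "Snowflake", "Hadoop"],
    "sql": FEDERATED_QUERY,
}
```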
You could have people enter all of this into some kind of documentation scheme. But ask yourself: how likely are they to do that? As they say in the UK, not bloody likely. Additionally, the high-level business question will in many cases have been asked by a business expert, not by the analyst who did the data munging, making it even tougher to collect all the information in a cohesive format.
This Is a Job for Machine Learning
Clearly, human beings can’t be trusted with this kind of tedious documentation (if you require your data scientists to do it, they’ll probably just find different jobs). This process has to be automated from beginning to end. Data catalogs offer a start by organizing the data, but they don’t give you the context under which it might be useful to your organization--that information resides only in the brains of your data science team. Here’s what companies need to capture and retain it:
The ability for a non-technical business expert to ask a question in plain English through NLP software, and then have potentially relevant data automatically located across multiple systems
Capabilities for visually mapping out the data sets that might be relevant to the business question, with details such as completeness, potential compliance issues, and how the sets may relate to each other for potential joins
Auto-generation of a SQL statement that assembles the necessary tables (which, once again, might live in disparate systems), with the ability to execute it over a distributed SQL query engine (a simplified sketch of this loop follows the list)
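The sketch below compresses that loop into a few lines of Python so the moving parts are visible. It is an illustration under loud assumptions: the metadata catalog, table names, and keywords are invented, and keyword overlap stands in for real NLP models, just as the naive join generator stands in for a metadata-aware query planner.

```python
import re

# Hypothetical metadata catalog: federated table name -> keywords that
# describe it. A real catalog would hold far richer metadata.
CATALOG = {
    "oracle.crm.customers":   {"customer", "segment", "account", "churn"},
    "snowflake.sales.orders": {"customer", "order", "revenue", "purchase"},
    "hive.weblogs.sessions":  {"customer", "session", "visit", "click"},
}

def locate_tables(question: str, min_hits: int = 2) -> list[str]:
    """Return tables whose metadata keywords appear in the question."""
    # Crude singularization: strip trailing 's' so 'segments' matches 'segment'.
    words = {w.rstrip("s") for w in re.findall(r"[a-z]+", question.lower())}
    return [t for t, kws in CATALOG.items() if len(words & kws) >= min_hits]

def generate_sql(tables: list[str], key: str = "customer_id") -> str:
    """Naively join candidate tables on an assumed shared key column."""
    aliases = {t: f"t{i}" for i, t in enumerate(tables)}
    base = tables[0]
    joins = "\n".join(
        f"JOIN {t} {aliases[t]} ON {aliases[t]}.{key} = {aliases[base]}.{key}"
        for t in tables[1:]
    )
    return f"SELECT *\nFROM {base} {aliases[base]}\n{joins}"

question = "Which customer segments generated the most order revenue?"
tables = locate_tables(question)  # -> the customers and orders tables
print(generate_sql(tables))       # statement ready for a federated engine
```

However the matching and generation are implemented, the essential point is the shape of the loop: a plain-English question goes in; located data sets and an executable federated statement come out.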
And to complete the loop, all of this information has to be captured in a repository that makes it searchable by the original question. With this in place, invaluable information on the location and state of data sets can be retained indefinitely. An analyst faced with a business question merely poses that question to the system (much like a Google search). If that question was already posed by a predecessor, the analyst is guided to the full results of that search--all of the predecessor’s knowledge is at their fingertips, so they can build on what was done before rather than rebuilding it from scratch.
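Here is a minimal sketch of such a question-keyed repository, again under stated assumptions: the record fields, the similarity threshold, and the use of simple string similarity in place of real semantic search are all illustrative choices.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class AnalysisRecord:
    question: str      # the original business question
    tables: list       # the data sets that answered it
    sql: str           # the exact statement that was executed
    notes: str = ""    # data-quality and compliance caveats

class KnowledgeRepo:
    """Stores past analyses and retrieves them by question similarity."""

    def __init__(self) -> None:
        self._records: list[AnalysisRecord] = []

    def capture(self, record: AnalysisRecord) -> None:
        self._records.append(record)

    def search(self, question: str, threshold: float = 0.5) -> list[AnalysisRecord]:
        """Return prior analyses whose question resembles the new one."""
        def score(r: AnalysisRecord) -> float:
            return SequenceMatcher(None, question.lower(), r.question.lower()).ratio()
        ranked = sorted(self._records, key=score, reverse=True)
        return [r for r in ranked if score(r) >= threshold]

repo = KnowledgeRepo()
repo.capture(AnalysisRecord(
    question="Which customer segments drive the most revenue?",
    tables=["oracle.crm.customers", "snowflake.sales.orders"],
    sql="SELECT ...",  # the captured federated statement would go here
    notes="orders table excludes EU rows pending a GDPR review",
))

# A successor poses a similar question and inherits the prior work:
for rec in repo.search("What segments generate the most revenue?"):
    print(rec.question, "->", rec.tables, "|", rec.notes)
```

If that sounds like science fiction, read on.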