In data science, a major objective of data preparation is getting data ready to be ingested by machine learning (ML) algorithms. According to a report by analyst firm Cognilytica, this is no small task, consuming more than 80% of an ML project's timeline, a problem that can only be remedied by what is described as a Unified Analytics Warehouse. If you have trouble empathizing with the amount of time and effort data scientists must spend simply to get data into usable form, consider the following.
Machine Learning Algorithms Are a Tough Audience
ML algorithms are like finicky cats--they typically don’t like the data you feed them. To paraphrase Jeff Bridges in Tron, they “don’t dig imperfection”. As we’ll learn, few things are as imperfect as today’s enterprise data management. So, data scientists have to do some serious legwork before their data is ready for even the most basic ML models. The collection and preprocessing steps involved in preparing data for machine learning typically include:
Formatting consistently. For example, you can't have one value stored as "$8.99" and another as "8.99"; an ML function will simply throw an error. For that matter, symbols like '$' have to be stripped out so the values can be parsed as numbers; most ML models don't care much for them either. (The first sketch after this list shows this kind of cleanup.)
Removal of what Cornell calls "messy data," such as missing values, duplicates and outliers. There are often multiple copies of the datasets required for ML, and these versions are often stashed in different locations. They usually have varying degrees of completeness, present the same data in different formats and follow different schemas.
Feature scaling. Depending on the kind of ML model, data teams might need to 'scale' the data to reconcile different units of measurement. For example, if a KNN (k-nearest neighbors) model uses 'age' and 'income' to predict a customer's behavior, income will dominate the distance calculation unless the features are scaled, simply because its values are numerically much larger. (See the scaling sketch after this list.)
High-dimensional data. Some datasets offer an overwhelming number of metrics and measurements, so many that data scientists don't know where to begin deriving insight, and such data may be impossible to visualize. For instance, healthcare data may include hundreds of descriptive features for a single patient, ranging from blood analysis metrics to genetic characteristics to personal history. A data scientist may perform dimensionality reduction, such as principal component analysis (PCA), to pare down the features and create an interpretable model. (See the PCA sketch after this list.)
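To make the first two steps concrete, here is a minimal sketch in Python using pandas. The column names and values are hypothetical, but the pattern is typical: strip symbols so values parse as numbers, then drop duplicates and incomplete rows.

```python
import pandas as pd

# Hypothetical raw data with inconsistent formatting, a duplicate row,
# and missing values.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "price": ["$8.99", "$8.99", "8.99", None, "12.50"],
    "age": [34, 34, 51, 29, None],
})

# Make formatting consistent: strip the currency symbol so every value
# can be parsed as a number.
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Remove "messy data": exact duplicates and rows with missing values.
df = df.drop_duplicates()
df = df.dropna(subset=["price", "age"])

print(df)
```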
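Feature scaling is usually only a line or two with scikit-learn. In this illustrative example (the numbers and the churn labels are made up), age and income are standardized before being fed to a KNN classifier so that income's larger magnitudes do not dominate the distance calculation.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical customer features and churn labels.
X = pd.DataFrame({
    "age":    [25, 47, 33, 61, 38, 52],
    "income": [42_000, 88_000, 51_000, 120_000, 64_000, 97_000],
})
y = [0, 1, 0, 1, 0, 1]  # 1 = churned

# Standardize both features so income's larger values don't dominate
# KNN's distance computation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

print(model.predict(pd.DataFrame({"age": [45], "income": [70_000]})))
```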
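Dimensionality reduction follows the same pattern. The sketch below runs scikit-learn's PCA on synthetic "wide" data standing in for a patient table; the shapes and the 90% variance threshold are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset: 500 patients, 300 features,
# driven by only 10 underlying factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 300))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 300))

# Scale first, then keep enough components to explain ~90% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)  # far fewer columns than the original 300
```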
Today’s Chaotic State of Data Architecture Doesn’t Help
On top of all this, data scientists have to deal with a slew of challenges related to the state of today’s data architecture:
Missing pieces. True, the volume and velocity at which companies collect data are skyrocketing, but that doesn't mean the datasets available to data science teams contain all the information needed to build a coherent model. On the contrary, critical pieces are often buried in different systems dispersed throughout the organization. Data discovery, consequently, may mean dealing with many departmental data gatekeepers, who are not always prompt and accommodating, something even the most tightly curated data catalog won't help with.
No complete 'map' of all the organization's data. Today's enterprise data architecture looks more like a sprawling metropolitan nightmare than a well-organized city. Data sits in departmental silos across warehouses, databases and data lakes, and no one has the complete picture. As a result, data scientists don't know what's available.
Data science teams need help, and the solution must be more than a band-aid. They need to be able to combine data from various sources into a comprehensive view. Furthermore, they require the ability to quickly perform exploratory analysis, a critical first step in determining which ML model is most appropriate for the question at hand. Finally, they need as much of the preprocessing automated as possible.
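As a rough illustration of that exploratory first pass, the sketch below assumes the combined data has already landed in a pandas DataFrame; the columns and values are made up, but these are typical first questions to ask of a new dataset.

```python
import pandas as pd

# Hypothetical combined customer data; in practice this would come from
# the various sources described above.
df = pd.DataFrame({
    "age":     [34, 51, 29, None, 45, 38],
    "income":  [52_000, 97_000, 41_000, 63_000, None, 72_000],
    "segment": ["retail", "retail", "smb", "smb", "enterprise", "retail"],
    "churned": [0, 1, 0, 0, 1, 0],
})

print(df.shape)                    # how much data is there?
print(df.dtypes)                   # which columns are numeric vs. categorical?
print(df.isna().mean())            # share of missing values per column
print(df.describe())               # ranges, means, and obvious outliers
print(df.corr(numeric_only=True))  # rough view of relationships among numeric features
```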
Data Scientists, Now More Than Ever, Need a Unified Analytics Warehouse
Unified Analytics is an attempt to close the gap between data engineering and data analytics/data science, bridging the disciplines to make it possible to operationalize processes like machine learning.
As John Santaferraro of EMA asserts, a UAW is "unified" because it handles multi-structured data in a single platform, and a "warehouse" because it stores that data in an organized and accessible manner. He further notes that a UAW needs to support the full range of analytics approaches: not just BI tools like Tableau, Looker and Power BI, but also integrated development environments (IDEs) such as RStudio and Spyder for Python, and notebooks like Jupyter and Google Colab. It also needs to offer ready access to multi-structured data using SQL. In a nutshell, a UAW must unify all interactions between data scientists and the data architecture through a "single pane of glass."
This can be accomplished through data virtualization: using software to create a virtual access layer that lets data scientists query data where it lives, sidestepping the underlying complexity of the IT architecture.
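To make the idea tangible, here is a sketch of what querying through such a virtual layer can look like. It uses the open-source Trino Python client purely as an illustration (Starburst is built on Trino); the host, catalogs, schemas and table names are hypothetical placeholders, and Promethium's own interface may look different.

```python
from trino.dbapi import connect

# Hypothetical connection to a federated query engine acting as the
# virtual layer over the underlying systems.
conn = connect(
    host="starburst.example.com",
    port=8080,
    user="data_scientist",
    catalog="hive",
    schema="default",
)

# One SQL statement joins a table in the data lake with a table in an
# operational database, with no copying or pipeline-building required.
cur = conn.cursor()
cur.execute("""
    SELECT c.customer_id, c.segment, SUM(o.amount) AS total_spend
    FROM postgresql.crm.customers AS c
    JOIN hive.sales.orders AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")
rows = cur.fetchall()
```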
If you’d like to learn more about how Promethium has partnered with Starburst to use virtualization to build a Unified Analytics Warehouse, read on.