July 23, 2021

What is ETL (Extract, Transform, Load)? And What Are the Considerations?

Extract, transform, load (ETL) describes the process of copying data from a single or multiple repositories into a new centralized system.

 Kaycee Lai

Kaycee Lai

Founder

Extract, transform, load (ETL) describes the process of copying data from a single or multiple repositories into a new centralized system–such as a data warehouse, data mart, or operational data store (ODS).

You can’t just take data from one repository and throw it into a new one. You’ve got to follow a set of procedures. Done right, ETL gives you data in a ready-to-use state for analysts, developers and department decision-makers. As you might guess, however, it’s a big job, and we’ll also discuss why it’s not always necessary for analysis, and why you shouldn’t do it anymore than you absolutely need to.

Extraction–the ‘E’ in ETL

The extraction phase involves gathering the data from sources which may use different data schemas and formats, and may also be semi structured or unstructured. These sources may be owned and guarded by different departments within your organization.

Transformation–the ‘T’ in ETL

Once the data has been extracted, the Transformation phase involves applying rules to prepare it for its new home. This may involve:

Load–the ‘L’ in ETL

During the ‘Load’ phase, as you might suspect, the data is loaded into its new repository. In some cases the data is entirely new, while in other cases it overwrites existing data. Loading may happen in hourly, daily, weekly, or monthly intervals, depending on resources and needs of the business (such as the need to analyze very new data).

The ‘Load’ phase follows strict guidelines. At this time whatever rules or constraints have been defined by the schema–such as distinctness, non-null fields, referential integrity–are applied. Depending on the rules you’ve set up some data might be rejected outright, some partially rejected, and some (hopefully most) fully accepted.

The Challenges

Needless to say, ETL is fraught with challenges, and many problems can occur. For instance, the range of data values that the system needs to handle may wind up being wider than originally anticipated, and may require that the validation rules be updated. Additionally, scalability may be an issue, and systems that start out only being required to process gigabytes of data may need to eventually handle terabytes or even petabytes.

ETL can be slowed down for many reasons. For instance, if there are multiple input sources, and those sources must be reconciled, a source that is quickly extracted and transformed may be delayed during the load process by a slower source until they can be properly synchronized.

Obviously, the faster an ETL system is able to work, so performance boosting techniques, such as the following, may be beneficial:

Selecting an ETL Tool

Of course it really depends on the specifics of your organization’s data needs and data architecture, but some things to think about when selecting an ETL tool include:

Don’t ETL any more than you need to

At the end of the day it should be acknowledged that ETL is a big process. Do it when you need to, but if it’s not absolutely necessary, don’t. As the number of data sources and volume of data grows, when is ETL no longer a viable option? As one of the major purposes of ETL is to have data ready for analytics, when it becomes so cumbersome that it actually slows the analytics process it may be time to consider another direction.

For instance, conventional wisdom says that to have all your data ready for analysis, you’ve got to ETL it into a massive warehouse. But this isn’t always the case. If you want to find out why creating a virtual layer that allows you to access all your data without copying or moving it, read on.

Related Blog Posts

December 2, 2024

Future-Proofing Your Enterprise: Navigating Security and Governance

Generative AI (GenAI) has the power to revolutionize how enterprises operate by enabling new levels of automation, efficiency, & innovation.

Continue Reading »
September 23, 2024

Promethium Recognized in Gartner’s Market Guide for Metadata Management Solutions

Businesses increasingly rely on metadata management solutions to enhance interoperability, streamline workflows, and improve governance.

Continue Reading »
August 20, 2024

Gartner Hype Cycle Review: Data Fabric Continues to Mature

In the dynamic world of data management, knowing technological trends is essential to maintaining competitive advantage.

Continue Reading »