In data engineering, an extract-load (EL) job refers to the two-stage process of extracting data from sources and loading it into a data warehouse. In many situations, we only need to run an EL job a few times a day, which makes a serverless setup very appealing. In what follows, we discuss how to best structure and organize a data integration app with multiple sources.
The anatomy of an EL job
Clear staging
We work with cloud-based analytics databases such as Snowflake, Redshift, or BigQuery. These databases are optimized for analytical workloads and use columnar storage. They load data most efficiently in large batches from medium-sized files in a cloud store, which allows the database to ingest the files in parallel. The first step in an EL job is therefore to remove old files from the staging area in the cloud store.
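A minimal sketch of this cleanup step, assuming the staging area is a prefix in an S3 bucket (the bucket and prefix names are placeholders):

```python
import boto3

def clear_staging(bucket: str, prefix: str) -> None:
    """Delete all files under the staging prefix before a new EL run."""
    s3 = boto3.resource("s3")
    # Deletes every object whose key starts with the given prefix.
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()

# Hypothetical names for illustration only.
clear_staging("my-analytics-staging", "el/orders/")
```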
Extract
We extract data from the sources, which are typically REST APIs, file servers, or databases. I recommend designing extract tasks with the following properties:
- limit: The extract process should allow limiting the number of records returned. This is useful for testing and debugging.
- start/end: The extract process should have a time field that allows limiting the returned records to a time range. This is useful for incremental loads. The time field might be a business field like a transaction date or a technical field like a record update timestamp. If no such field is available, we cannot do incremental updates and need to copy the entire source. For sources with a few thousand records, this is perfectly acceptable and an incremental load would be unnecessary overhead.
- primary key: For large sources, it is paramount that we are able to chunk the source data into smaller batches. The start/end time range is one efficient way to achieve this. If the source records have a primary key, we can create exactly the desired number of batches by hashing the primary key, as sketched below.
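As a minimal sketch, an extract task can be described by a small parameter object, and hash-based chunking boils down to a stable hash of the primary key modulo the number of chunks (all names below are illustrative, not a fixed API):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ExtractTask:
    """Parameters of a single extract task; all fields are illustrative."""
    source: str                       # e.g. an API endpoint or table name
    start: Optional[datetime] = None  # lower bound of the time range
    end: Optional[datetime] = None    # upper bound of the time range
    limit: Optional[int] = None       # cap on returned records, for testing
    chunk_id: int = 0                 # which hash bucket this task covers
    n_chunks: int = 1                 # total number of hash buckets

def belongs_to_chunk(primary_key: str, chunk_id: int, n_chunks: int) -> bool:
    """Assign a record to a chunk by hashing its primary key.

    Using a stable hash (not Python's built-in hash()) keeps the assignment
    reproducible across processes and runs.
    """
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_chunks == chunk_id
```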
Now that we have designed our extract tasks to produce reproducible chunks of records from the data source, we can not only perform incremental extract-load jobs, but also run multiple extract tasks on the same source in parallel. Serverless functions scale effortlessly, so the data ingestion throughput is only limited by the bandwidth of the data source provider.
However, we need a controller that creates suitable bundles of extract tasks for execution. In many situations, we want to prepare incremental loads and need to set the start/end fields automatically relative to the current timestamp. For such data sources, we might define a lookback in days that is fed to the controller, which then returns a tuple of extract tasks with defined start and end timestamps.
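A controller for incremental loads could then look roughly like this, building on the ExtractTask sketch above (again, the names and the fan-out strategy are assumptions, not a prescribed design):

```python
from datetime import datetime, timedelta, timezone

def plan_incremental_tasks(source: str, lookback_days: int,
                           n_chunks: int = 1) -> tuple[ExtractTask, ...]:
    """Build extract tasks covering the last `lookback_days` days.

    The controller fixes start/end relative to the current timestamp and
    fans the time range out over `n_chunks` hash buckets so the tasks can
    run in parallel, e.g. as individual serverless function invocations.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    return tuple(
        ExtractTask(source=source, start=start, end=end,
                    chunk_id=i, n_chunks=n_chunks)
        for i in range(n_chunks)
    )

# Example: re-extract the last 3 days of a hypothetical "orders" source in 4 chunks.
tasks = plan_incremental_tasks("orders", lookback_days=3, n_chunks=4)
```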
Load
The load function simply triggers the copy jobs in the database product. We have two major cases:
- full update: The data source has no primary key, so we replace the entire table with the new data. In practice, we simply truncate the target table and copy the new files into it.
- incremental update: The data source has a primary key, which allows the load process to update existing records with their latest version. In practice, we load the new data into a staging table, delete all records from the target table whose primary keys appear in the staging table, and finally insert the staging table into the target table. A sketch of both variants follows below.
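A rough sketch of the two load paths, issuing plain SQL through whatever execute function your database client provides. The table names and the stage reference are placeholders, and the COPY syntax differs per warehouse; the snippet loosely follows Snowflake-style COPY INTO and is an illustration, not product documentation:

```python
from typing import Callable

def full_update(run: Callable[[str], None], target: str, stage: str) -> None:
    """Replace the entire target table with freshly staged files."""
    run(f"TRUNCATE TABLE {target}")
    # COPY syntax differs per warehouse; this loosely follows Snowflake's COPY INTO.
    run(f"COPY INTO {target} FROM {stage}")

def incremental_update(run: Callable[[str], None], target: str,
                       staging_table: str, stage: str, pk: str) -> None:
    """Upsert staged records into the target table via a staging table."""
    run(f"TRUNCATE TABLE {staging_table}")
    run(f"COPY INTO {staging_table} FROM {stage}")
    # Remove outdated versions of the records that arrive in this batch ...
    run(f"DELETE FROM {target} WHERE {pk} IN (SELECT {pk} FROM {staging_table})")
    # ... and insert the latest versions.
    run(f"INSERT INTO {target} SELECT * FROM {staging_table}")

# Usage with a hypothetical DB-API client:
# full_update(cursor.execute, "orders", "@orders_stage")
```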