Mark Van de Wiel

As organizations seek to gain business efficiencies through broader analytics, the need to move data between systems is greater now than ever before. To do this, organizations look to build an optimized analytical data processing environment, one that combines data from multiple systems and, more often than not, lives in the cloud. Additionally, with ever-growing data volumes, and with mounting pressure to keep transaction processing systems available 24/7, there is a need for scalable data processing environments that receive updates in near real-time.

Given these needs, Change Data Capture (CDC) and Extract, Transform, Load (ETL) are two data delivery styles that can help meet the growing demands for both real-time data processing and accessibility. Many think of these data delivery styles as either-or, but the two can be used in a complementary deployment. In this blog, we discuss the two methods independently and how they can be used together to optimize the delivery of data.

Approaches to Change Data Capture

The concept of CDC as a replication method has been understood for many years in the context of database technologies, with multiple ways to identify changes. The following are common approaches to CDC; when researching solutions, it is important to understand the different types of CDC methods available:

Last modified date: relying on an extra field/column in the table to indicate the timestamp when the most recent change was made. To identify changes, extract data from the table and filter on this column, based on the last time the extract was run. Disadvantages of this approach include:

  1. how to deal with deleted rows that are no longer visible,
  2. how to process the change if the row identifier was modified, and
  3. getting the changes out from one table is not necessarily transactionally consistent with getting changes out from another table.

Depending on the type of data and the types of changes you are processing, the last-modified-date method may or may not work for you.
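As an illustration, here is a minimal sketch of this style of extract, assuming a hypothetical orders table with a last_modified column and a generic Python DB-API connection:

# Minimal sketch of last-modified-date extraction (hypothetical table and column names).
# conn is a Python DB-API connection; last_run_ts is the timestamp saved by the previous run.
def extract_changes(conn, last_run_ts):
    cur = conn.cursor()
    # Only rows modified since the previous extract are returned; deleted rows and
    # rows whose identifier changed are not captured by this query.
    cur.execute(
        "SELECT order_id, status, amount, last_modified "
        "FROM orders WHERE last_modified > %s",
        (last_run_ts,),
    )
    return cur.fetchall()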

Diff: computing the difference between the current data set on the target and the current data set on the source. This approach identifies every difference, but it is very resource-intensive, and transactional consistency across multiple tables cannot be guaranteed when changes come in during the extraction of data from the source.

The diff method realistically only works for small to medium data set sizes.
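For illustration, a minimal sketch of the diff approach, assuming both data sets fit in memory as lists of dictionaries sharing a primary key column named id (hypothetical names):

# Minimal diff sketch (hypothetical helper; only viable for data sets that fit in memory).
def diff(source_rows, target_rows, key="id"):
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}
    inserts = [src[k] for k in src.keys() - tgt.keys()]
    deletes = [tgt[k] for k in tgt.keys() - src.keys()]
    updates = [src[k] for k in src.keys() & tgt.keys() if src[k] != tgt[k]]
    return inserts, updates, deletes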

Trigger-based CDC: This type of Change Data Capture uses triggers to identify every change and stores relevant information in a separate change table. Then data is incrementally extracted either by getting data out of the change table, or by joining the change table back with the source table to get the current values for relevant rows.

This approach does identify every change; however, the triggers slow down the original transactions, extracting changes still puts a spike of load on the system, and changes are still not retrieved in a transactionally consistent manner.
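For illustration, a sketch of the incremental-extract side of this approach, assuming a hypothetical orders__ct change table that the triggers populate with a change sequence number and an operation code, and a generic Python DB-API connection:

# Minimal sketch of extracting from a trigger-maintained change table
# (hypothetical table and column names; conn is a Python DB-API connection).
def extract_from_change_table(conn, last_seq):
    cur = conn.cursor()
    # Join the change table back to the source table for current column values;
    # deleted rows exist only in the change table, hence the outer join.
    cur.execute(
        "SELECT c.change_seq, c.op, o.* "
        "FROM orders__ct c LEFT JOIN orders o ON o.order_id = c.order_id "
        "WHERE c.change_seq > %s ORDER BY c.change_seq",
        (last_seq,),
    )
    return cur.fetchall()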

Log-based CDC: This method of CDC can asynchronously read and parse the database transaction log to obtain the changes in commit order. Log-based CDC does not slow down the originating transactions, yet captures changes as they are committed to the log, to be propagated in near real-time. Log-based CDC also tracks transaction boundaries irrespective of the load on the source database.
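HVR implements log-based CDC natively for the databases listed below. Purely to illustrate the principle, the sketch here uses PostgreSQL logical decoding through psycopg2; the connection string is a placeholder and it assumes a logical replication slot named cdc_slot already exists (created, for example, with the test_decoding plugin):

# Illustration of the principle only: log-based CDC against PostgreSQL via psycopg2
# logical replication. Assumes a logical replication slot "cdc_slot" already exists.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=shop user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    # Each message is a decoded change read from the transaction log, in commit order.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)  # blocks, streaming changes as they are committed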

HVR, the leading independent provider of data replication using log-based CDC, supports log-based CDC from numerous technologies including Oracle, SQL Server, IBM DB2, SAP HANA, PostgreSQL, MySQL/MariaDB and more. Data can be replicated heterogeneously into a multitude of technologies including:

– All supported source database types

– Commonly-used data lake technologies like S3, ADLS, GCS or HDFS

– Cloud-native data platforms or data warehouses like Snowflake, Redshift or Synapse

– Traditional analytical database technologies like Teradata or Greenplum

– Streaming data technologies like Kafka

CDC is commonly used when data availability and reporting on data are required in real-time, and little to no transformation is necessary before analytics can be performed on the target.

Extract, Transform, Load (ETL) 

To facilitate analytical access to data and to improve query performance, complex transformations are often desired, and this is when ETL is the preferred data delivery style. For example, a highly normalized source application schema may be transformed into star schemas with dimension tables and facts, joining multiple tables into one. Aggregations may be created for quick access to frequently requested results.
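As a simple illustration of this kind of transformation, the sketch below joins hypothetical normalized orders, customers, and products tables into a denormalized fact table with pandas and builds one aggregate on top of it:

# Simple illustration of an ETL-style transformation (hypothetical tables and columns).
import pandas as pd

def build_sales_star(orders, customers, products):
    # Join normalized source tables into one denormalized fact table.
    fact = (orders
            .merge(customers, on="customer_id")
            .merge(products, on="product_id"))
    # Pre-aggregate a frequently requested result for quick access.
    daily_revenue = (fact.groupby(["order_date", "region"], as_index=False)
                         .agg(revenue=("amount", "sum")))
    return fact, daily_revenue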

The concept of Extract, Transform, and Load has been around for decades and is still very relevant today. In the 1990s and early 2000s, ETL was by far the most commonly used method to populate a reporting system or data warehouse. Data would be extracted once a month, once a week, or nightly during a quiet period on the source system, and users had access to the same data set until the next ETL run. Of course, in today’s 24/7 economy a quiet period for the source system may not exist, and many organizations recognize the competitive advantage of having access to analytical data in near real-time.

Talend, a Leader in Gartner’s 2019 Magic Quadrant for Data Integration Tools, provides one of the most popular ETL and data integration solutions in the market with its Talend Data Integration offering, as well as the Talend Data Fabric suite of data integration applications. Organizations can start easily with a download of the open-source Talend Open Studio, or in the cloud using Talend Cloud Data Integration.

ETL starts with Extract, and with the extract running against the transaction processing database, there may be concerns about the resource consumption required to feed the transformations. This is where CDC comes in: it provides ETL with a read-consistent copy of the data to extract from.

CDC feeds ETL

Performing extensive transformations during data replication is very inefficient, because data replication sees data at a change-by-change, row-by-row level, and resolving operations like (multi-)table joins and aggregations for every change requires a lot of repetitive computation.

As a result, data replication deployments typically perform transformations that can be efficient at the change/row level, like introducing extra column values, performing soft deletes (marking a row as deleted instead of actually deleting it), or keeping an audit trail of every table change as a new row.
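For example, here is a minimal sketch of such a row-level transformation, assuming a hypothetical change record that carries the operation type, a commit sequence number, and the row values:

# Minimal sketch of row-level transformations applied per change (hypothetical record shape).
from datetime import datetime, timezone

def transform_change(change):
    row = dict(change["row"])
    # Soft delete: mark the row as deleted on the target instead of physically deleting it.
    row["is_deleted"] = 1 if change["op"] == "delete" else 0
    # Extra columns that facilitate downstream ETL filtering and auditing.
    row["source_commit_seq"] = change["commit_seq"]
    row["replicated_at"] = datetime.now(timezone.utc).isoformat()
    return row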

The replicated data set is the ideal starting point for an ETL job, because:

  1. The extract runs against the replicated copy rather than the source, so it does not reach out to the transaction processing database.
  2. Changes arrive in (near) real-time.
  3. Data replication is the only process updating the data, and log-based data replication can maintain transactional consistency on the target.

Replication through HVR with ETL from Talend Cloud, End-to-End

During data replication, HVR can, at a transactionally consistent point, make a call-out to an external program, and Talend Cloud Data Integration provides the flexibility to start jobs through REST calls. This allows for an end-to-end setup with HVR instructing Talend which jobs to run.

Conceptually, the steps are:

– HVR performs log-based CDC on the source database to identify data changes.

– Changes are replicated to the target where they are integrated.

– To facilitate propagating deletes, the replication setup includes the use of soft-deletes, marking rows as deleted rather than actually deleting them.

– Also, to facilitate downstream processing, an extra column is maintained with either a uniquely increasing sequence number from the source or a timestamp, so that Talend Cloud Data Integration can filter for changes since the last ETL run.

– Once a consistent set of changes is integrated, HVR creates a manifest of the tables that were changed and makes a callout to an external program to call Talend Cloud Data Integration.

– The external program parses the manifest to determine which jobs have to be started and runs them (a conceptual sketch follows this list).

– Once the ETL jobs finish, the external program gives control back to HVR, and the next set of changes can be integrated.
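As a conceptual sketch of such an external program (the manifest layout, the table-to-job mapping, and the start_talend_job command are illustrative assumptions; in the how-to below this role is played by the tcli utility):

# Conceptual sketch of the external program called by HVR (illustrative assumptions only).
import json
import subprocess
import sys

# Hypothetical mapping from changed tables to the Talend jobs that consume them.
TABLE_TO_JOB = {"orders": "load_sales_star", "customers": "load_customer_dim"}

def run_jobs_for_manifest(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    changed_tables = manifest.get("tables", [])
    jobs = {TABLE_TO_JOB[t] for t in changed_tables if t in TABLE_TO_JOB}
    for job in sorted(jobs):
        # Start the Talend job and wait for it to finish; HVR only integrates the next
        # set of changes after this script returns.
        subprocess.run(["start_talend_job", job], check=True)

if __name__ == "__main__":
    run_jobs_for_manifest(sys.argv[1])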

A detailed how-to is below.

How-To Steps

The steps below show how to integrate HVR with Talend Cloud Data Integration. The steps assume that you already created a channel definition in HVR to move data from source to target, with any transformations you may want to perform in HVR (like performing a soft-delete, and/or adding one or more extra columns on the target to facilitate ETL). Then to make a callout to Talend Cloud Data Integration from inside HVR:

1. Go to: https://github.com/Talend/hvr-tcli

– Download the right executable for your environment and the log4j.properties
– Copy the downloaded files to $HVR_HOME/lib/agent on the HVR Server

2. In HVR, on the group that represents the target, create an Action of type AgentPlugin, and use the following arguments:

– Command = hvrmanifestagent.py
– Order = 1
– UserArgument = -m {hvr_integ_tstamp}.json -s <directory to write HVR Manifest File(s)>

3. In HVR, again on the group that represents the target, create an Action of type AgentPlugin with the following arguments:

– Command = callTalendCloud.bat | .sh
– Order = 2

4. In Talend Cloud, under your User Profile, obtain your Access Token      

5. Create a script under the HVR agent directory $HVR_HOME/lib/agent/callTalendCloud.bat|.sh

The script contents should be as follows (the example below is for a .bat file):

REM Only call Talend when HVR invokes this agent script with the integ_end mode (first argument).
IF "%1" == "integ_end" (
  tcli -t <Talend Access Token> -r <Talend Region> -w -hm <Directory of HVR Manifest Files> -tm <Path to Talend Manifest file>/talend_manifest.json
)

6. Save the script.

7. Run HVR Initialize.

8. Start the job that was created.

To see these steps in action, watch https://www.youtube.com/watch?v=3dI_Jv7aZaA.

To get started with HVR, sign up for the Test Drive at https://www.hvr-software.com/test-drive-trial/, and to get started with Talend Cloud Data Integration, sign up at https://info.talend.com/request-talend-cloud.html.

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.
