Four Methods of Change Data Capture

As its name suggests, Change Data Capture (CDC) techniques are used to identify changes. CDC can be the basis to synchronize another system with the same incremental changes, or to store an audit trail of changes. The audit trail may subsequently be used for other uses e.g. to update a data warehouse or to run analyses across the changes e.g. to identify patterns of changes. In this blog post, I will describe the four common methods to perform CDC, and some of their challenges.

1. DATE_MODIFIED

Many transactional applications keep track of metadata in every row including who created and/or most-recently modified the row, as well as when the row was created and last modified. The approach to CDC in such an environment is to keep track of when changes are extracted, and in a subsequent run filter on the DATE_MODIFIED column to only retrieve rows that were modified since the most recent time data was extracted. This approach has a few challenges that may or may not be a concern, depending on the application:

  • Data deletes are a challenge because there is no DATE_MODIFIED for a deleted row (unless deletes are logical and update a flag in the row indicates the row was deleted). The extreme case of delete is truncate table which is uncommon in transactional applications but does occur sometimes.
  • DATE_MODIFIED must be available on all tables and must be reliably set. Database triggers may be a good way to set the values but these may introduce overhead on the transactional application.
  • Extracting the changes uses a lot of resources. Of course DATE_MODIFIED may be indexed to lower the impact of the select statement at the cost of storing (and continuously updating) the additional index.

Using DATE_MODIFIED for CDC works well for traditional data warehouse applications that are populated using Extract, Transform and Load (ETL) jobs, when the source tables don’t process deletes.

2. Diff

The diff method for change data capture compares the current state of the data with previous state of the data to identify what changed. Challenges with this approach include:

  • To perform the diff requires a lot of resources to compute the differences between the data, and resource consumption grows at least linearly with the growth in data volume.
  • CDC cannot be performed in real-time because the diff realistically takes too many resources to perform all the time.

Compared to the DATE_MODIFIED CDC method the diff method does not have the challenge with deleted rows. The diff method works well for low data volumes.

3. Triggers

Database triggers can be used to perform CDC in shadow tables. The shadow tables may store the entire row to keep track of every single column change, or only the primary key is stored as well as the operation type (insert, update or delete). The use of database triggers to perform CDC also has a few challenges:

  • Firing the trigger, and storing the row changes in a shadow table, introduces overhead. In an extreme case CDC may introduce 100% overhead on the transaction i.e. instead of .1 second it may take .2 seconds to complete a transaction.
  • The lower-overhead alternative to only store the primary key of the table requires a join back to the source table to retrieve the changes which (1) increases the load to retrieve the changes, and (2) loses intermediate changes if multiple changes took place on the same row.
  • Should the source application perform a truncate then chances are the trigger won’t fire and changes are not recorded. Also, if changes are made to tables then triggers and shadow tables may also have to be modified, recreated and/or recompiled which introduces extra overhead to manage and maintain the database.

CDC using database triggers lowers the overhead to extract the changes but increases the overhead to record the changes.

4. Log-based Change Data Capture

Transactional databases store all changes in a transaction log in order to recover the committed state of the database should the database crash for whatever reason. Log-based CDC takes advantage of this aspect of the transactional database to read the changes from the log. The challenges with log-based CDC are:

  • Interpreting the changes in the transaction log is difficult because there are no documented standards on how the changes are stored (i.e. transaction logs from different database vendors are completely different), and there are many scenarios that must all be considered and tested (e.g. consider clustered databases, rollbacks and savepoints, many different ways to perform inserts, updates and deletes, etc.).
  • Database vendors may not provide an interface to the transaction logs – documented or not – and even if there is one it may be relatively slow and/or resource intensive.
  • Most databases have been optimized to only use internal identifiers to recover database row changes which is insufficient to perform CDC and record the changes on a different system. Supplemental logging of primary key columns is required to retrieve the context of the updates. The introduction of supplemental logging will increase the volume of data written to the transaction logs but generally only by a small percentage, and generally there is very little if any measurable performance impact on the transactional application.

Benefits of Log-Based Change Data Capture

log based change data captureThe biggest benefit of log-based change data capture is the asynchronous nature of CDC: changes are captured independent of the source application performing the changes. Dedication and smart software engineers can take care of the biggest challenges. Log-based CDC is generally considered the superior approach to change data capture that can be applied to all possible scenarios including systems with extremely high transaction volumes.

HVR supports log-based CDC for all supported relational database sources. Trigger-based capture is still supported on most source databases for legacy reasons, or in scenarios when the source database does not provide the required functionality to perform log-based CDC (e.g. SQL Server Express Edition does not support supplemental logging of additional columns). Set-up of real-time data integration in HVR is simplified by a comprehensive GUI that takes care of all necessary steps to perform log-based CDC including the setup of supplemental logging as needed.

To learn more about log-based CDC contact us! We love to discuss all things data replication.

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

Discussion

4 Comments

  • Lancelea Lorrennzo

    Very good, actually helped me to clear out my confusion to figure the other types of change data capture.

  • Melody Just

    Interesting! Who are the best vendors for CDC please?

    • Mark

      This is a great question. Let me give you a very high level overview of what is in the market, and encourage you to do your own research to identify what is most important to you and make your own qualified decision.

      • First of all there is of course HVR. With HVR you get an end-to-end data integration solution that provides table creation, initial data load, log-based CDC on a wide variety of platforms (see https://www.hvr-software.com/product/platform-support/), compare/repair capabilities, and a Graphical User Interface with the ability to set up automated alerts and get insights into the continuous data flow.
      • A second CDC vendor with probably the widest variety of supported CDC platforms is Attunity Replicate. Like HVR Attunity supports the ability to create tables and perform an initial data load that is aligned with CDC. In contrast to HVR Attunity heavily promotes an agent-less architecture with the benefit to avoid agent installations, but compromising performance and arguably introducing a bottleneck, especially in environments with multiple sources (a discussion on benefits versus disadvantages for the use of agents is on our blog at https://www.hvr-software.com/blog/data-integration-architecture/)
      • Oracle is a player in the heterogeneous data integration market with the Oracle GoldenGate offering. Prior to Oracle’s acquisition in 2009 GoldenGate was an independent company and technology before that used to focus on heterogeneous platform support. Since the Oracle acquisition Oracle has naturally focused on improving support for the Oracle Database and less so on supporting heterogeneous platforms. Oracle provides many data integration technologies that combined provide all functionalities mentioned for HVR, but setting up such end-to-end solution requires multiple downloads, and separate licenses for the individual components.
      • Another CDC vendor is Quest with a technology called SharePlex. SharePlex was originally developed to support Oracle to Oracle replication, but more recently the technology branched out in supporting more heterogeneous environments also. SharePlex was part of Dell’s technology portfolio for a few years between 2012 and 2016. Architecturally a big difference for Oracle to Oracle replication between SharePlex and most other CDC technologies is that SharePlex will propagate and start applying uncommitted transactions to the target system to achieve lowest possible latency, even when there are long-running transactions. Other technologies, including HVR, wait for changes to be committed before applying them on the target system (in source system commit order but with optimizations to achieve maximum throughput).
      • DbVisit is another CDC player. Their technology is particularly strong in Oracle to Oracle replication with a focus on high availability scenarios for the Oracle Standard Edition Database (for which Oracle does not support the native Oracle physical standby (Data Guard) solution).
      • Striim provides CDC capabilities for some database technologies, typically built on top of the native database CDC technology (which in some cases has license implications).
      • Informatica is a well-known data integration player that has been focused on data integration in the cloud. Their technology Informatica Data Replication (IDR) is another CDC technology for heterogeneous environments.
      • Amazon Web Services (AWS) provide a Data Migration Service (DMS) that – based on its name – focuses on data migrations but provides more and more general data replication technologies and with that CDC capabilities for heterogeneous scenarios.
      • StreamSets primarily builds on top of native database CDC technologies to perform CDC.
      • In 2010 BackOffice Associates acquired HiT Software that specialized in CDC and continuous data integration with an original focus on DB2 systems. The DBMoto technology is still available for CDC in heterogeneous environments.
      • SAP acquired Sybase, which supported CDC through the Sybase Replicator. Of course SAP has been focused on the HANA platform, and its bread-and-butter revenue generating ERP offering with its move to the cloud, but SAP Replication Server (formerly Sybase Replicator) is still available CDC technology.
      • In 2007 IBM acquired DataMirror with a focus on heterogeneous CDC. The technology is still available today as IBM InfoSphere Data Replication.
      • Most Database technologies provide a native solution for CDC-based data replication in dominantly homogeneous environments. For example SQL Server, PostgreSQL and MySQL provide such technologies. Oracle also still supports a Logical Standby.
      • There are other, less well-known CDC players in the market such as Gamma Soft that focuses on the French market.

      At HVR we continue to focus solely on CDC and continuous data integration. Across our team of technical experts we have hands-on experience with most of the commonly-used CDC technologies.

© 2018 HVR

Live Demo Contact Us