By Mark Van de Wiel

Change Data Capture Impact on Database Processing

Clients often ask: “What is HVR’s impact on database processing?” The question is simple; the answer is not. There are, however, a number of considerations.

Supplemental Logging

Most databases focused on transactional processing write changes to a transaction log to ensure they can meet the availability and consistency requirements prescribed by the principles of ACID (Atomicity, Consistency, Isolation, and Durability) database processing, which date back to the early 1980s. Over time, database technologies have been optimized to write only the minimum amount of information to the transaction log needed to meet these requirements.

In order to perform logical log-based replication, in which changes are read directly from the transaction log and delivered to another system where they can be replayed, this minimum amount of information is generally insufficient. Specifically, updates require key information (to identify the row on the target system) that is often not included in the transaction logs. So-called supplemental logging instructs the database to include this extra information in the logs for updates. Different databases provide different means of enabling supplemental logging: Oracle, for example, supports an explicit ALTER TABLE command, while in SQL Server the addition of a CDC instance on a base table implicitly activates supplemental logging.
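As a concrete illustration, the table-level commands differ per database. The helper below is a hypothetical sketch of ours that merely generates the statements; the Oracle DDL and the SQL Server procedure call themselves are standard syntax:

```python
# Illustrative only: generate the per-table supplemental-logging command
# for two databases. The helper function is our invention; the generated
# Oracle DDL and SQL Server procedure call are standard syntax.

def supplemental_logging_sql(db: str, schema: str, table: str) -> str:
    if db == "oracle":
        # Log primary-key columns with every update so the row can be
        # identified on the target system.
        return (f"ALTER TABLE {schema}.{table} "
                f"ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS")
    if db == "sqlserver":
        # Enabling CDC on a base table implicitly activates the extra logging.
        return (f"EXEC sys.sp_cdc_enable_table "
                f"@source_schema = N'{schema}', "
                f"@source_name = N'{table}', @role_name = NULL")
    raise ValueError(f"unsupported database: {db}")

print(supplemental_logging_sql("oracle", "app", "orders"))
```

The statement would then be executed once per replicated table as part of channel setup.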

SAP HANA is an exception: it supports no supplemental logging at all, so HVR implemented a separate approach to perform CDC from this database.

The addition of supplemental logging impacts primary database processing because it results in more transaction log data being generated. The extra impact on the database depends on:

  • The number of tables that will be replicated, since only these tables need supplemental logging.
  • The ratio of updates to the tables relative to inserts and deletes (which are fully logged anyway).
  • The byte length of the key columns relative to the byte length of the typical change to the tables.
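These factors can be combined into a rough back-of-the-envelope estimate. The workload numbers below are purely illustrative assumptions, not measurements:

```python
# Back-of-the-envelope estimate of the extra transaction log generated by
# supplemental logging. All workload figures are illustrative assumptions.

def extra_log_bytes(updates: int, key_bytes: int) -> int:
    # Only updates need the extra key image; inserts and deletes are
    # fully logged anyway.
    return updates * key_bytes

# Assumed workload: 1M changes/hour, 40% updates, 30-byte keys,
# 200 bytes of regular log generated per change.
changes, update_ratio, key_len, avg_log = 1_000_000, 0.4, 30, 200
extra = extra_log_bytes(int(changes * update_ratio), key_len)
baseline = changes * avg_log
print(f"extra log: {extra / baseline:.1%} of baseline")  # → extra log: 6.0% of baseline
```

Under these assumptions the overhead is modest; workloads with wide keys and an update-heavy change mix sit at the higher end.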

Initial Load/Refresh

The initial data load runs full-table-scan SELECT statements against the source tables to refresh the data in the target tables. The impact on the source system for this operation generally depends on the volume of data moved, combined with the number of tables refreshed in parallel. HVR automatically aligns the refresh with Change Data Capture (CDC).

Impact on the source for the refresh operation can be eliminated altogether by running the initial load from a copy of the primary database or a standby database. Take extra precautions to avoid loss of data when performing the refresh against anything other than the database CDC runs against (i.e. the primary or a standby database).

Change Data Capture (CDC)

To ensure uninterrupted, low-latency CDC, capture must run faster than the database writes the logs. HVR’s direct read and pipelined execution ensure optimum efficiency to keep up with the database log writers. As a result, when capture runs continuously, it captures from the tail end of the log where the log writer(s) are writing. While clients occasionally raise concerns about contention on the log, in practice there is none, because the busy tail of the log is generally cached, either in the file system or, in the case of Oracle ASM, in ASM instance buffers.

HVR CDC processing takes up CPU, memory, and IO resources. HVR always retrieves the changes in system commit order, irrespective of whether the database is a single instance or a clustered database, with one reader per log thread; for a clustered database, a coordinator ensures changes are put in order. Each reader uses at most one CPU core, even when running behind, which in practice, on a modern, adequately sized database server, means there is very limited if any noticeable impact on database processing due to this extra resource consumption.
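The reader/coordinator arrangement can be pictured as a k-way merge on commit sequence. This is a simplified sketch of ours, not HVR's actual implementation; the tuple layout and names are assumptions:

```python
import heapq

# Simplified sketch: each reader yields (commit_seq, change) tuples in
# ascending commit order from its own log thread; the coordinator merges
# them so downstream delivery sees one stream in system commit order.

def coordinator(*readers):
    # heapq.merge performs a lazy k-way merge over already-sorted inputs.
    yield from heapq.merge(*readers)

thread1 = [(1, "ins t1"), (4, "upd t1")]
thread2 = [(2, "del t2"), (3, "ins t2")]
merged = [change for _, change in coordinator(thread1, thread2)]
print(merged)  # → ['ins t1', 'del t2', 'ins t2', 'upd t1']
```

Because each input stream is already sorted, the merge is cheap: it only ever compares the current head of each stream.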

Memory utilization is also limited, with very modest (configurable) thresholds that, if exceeded, result in large transactions being temporarily written to a local file system. With default settings capture rarely uses more than low single-digit GBs of memory, although, depending on the workload, configuration changes can trade higher memory consumption for lower IO usage.

To limit recovery time, HVR capture writes state information to disk at a configurable interval, either on the source capture server or on the hub. The amount of data written depends on the load on the system, dominated by the volume of outstanding changes in long-running transactions that must be replicated. In practice, we have not seen this checkpoint slow down database processing.
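Conceptually, such a checkpoint persists the capture position plus any still-open transactions so a restart can resume without re-reading the whole log. The file layout below is entirely our illustration:

```python
import json, os, tempfile, time

# Illustrative checkpoint: persist the log position and the buffered
# changes of still-open (long-running) transactions. The JSON layout is
# our invention, not HVR's on-disk format.

def write_checkpoint(path: str, log_position: int, open_txns: dict) -> None:
    state = {"written_at": time.time(),
             "log_position": log_position,
             "open_transactions": open_txns}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a torn file

path = os.path.join(tempfile.gettempdir(), "capture.ckpt")
write_checkpoint(path, log_position=123456,
                 open_txns={"txn-42": ["upd orders", "ins items"]})
with open(path) as f:
    print(json.load(f)["log_position"])  # → 123456
```

The write-then-rename pattern matters: recovery must never observe a half-written checkpoint, or it could resume from a corrupt position.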

The impact of CDC processing on the database can be eliminated completely by capturing from a standby database (available for multiple databases, including Oracle and SQL Server), or by capturing in a separate environment from backed-up transaction logs or archived logs only. Note that this so-called archive-log-only mode introduces extra latency relative to capturing from the live or standby database.

Interested in learning more?

Related Resources:
Change Data Capture for Real-time BI and Data Warehousing
ETL + Log-Based CDC = Maximum Value
The Major Types of Data Integration

Join one of our live demo webinars.

Register for a Live Demo

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.
