Mark Van de Wiel Technology Background

“Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.”

Use cases for database replication include high availability, the geographical distribution of data, migration between systems, technologies and/or the cloud, and data consolidation for analytics in reporting systems, data warehouses, and data lakes. Your use case can help determine which database replication is best.

Physical Database Replication

Physical database replication, also sometimes referred to as block-based replication, uses a binary format to keep an exact database copy in sync with the primary database. Mature databases typically provide physical database replication, supported and maintained by the vendor/provider because it is an extension to the technology’s core database recovery capabilities.

Using the binary format for database replication provides completeness: the standby database is an exact copy, including all tables, indexes, stored procedures, users, etc. Due to the binary form of replication, there are limitations:

  • The database type and version between source and target must be identical, so physical database replication provides only homogeneous replication.
  • Only the entire database can be replicated, and it is not possible to replicate only a subset of a database. Likewise, it is not possible to consolidate data from multiple databases into a single data store.
  • It is not at all possible to transform data.
  • At best, and in some cases only at an extra license cost, the standby database is available for reads; performing changes in the replicated database is not possible.

Based on its strengths and limitations consider physical database replication for disaster recovery and high availability scenarios, and to be a read-only reporting system of a single database.

Logical Database Replication

Logical database replication uses a logical approach to replicate data, in most cases through the use of SQL statements. Logical database replication provides benefits where physical database replication has its limitations.

With logical database replication:

  • Heterogeneous database replication, across operating systems, database versions, and database (or even data store) types, is possible.
  • Replicating a subset of a database, or even a subset of a table is possible.
  • Transformations between source and target can be performed.
  • The destination database for data replication is open for reads and writes, providing for a flexible topology including multi-master, active/active data replication scenarios.

The flip side of this flexibility is the area where physical data replication shines. Generally, logical replication has limitations in types of database objects that can be replicated (e.g. triggers, stored procedures, database users) and in some cases, certain data types cannot easily be replicated.

Consider logical database replication:

  • To replicate data into consolidated reporting systems, data warehouses, and data lakes.
  • In some high availability scenarios where physical database replication falls short, for example, due to the heavier bandwidth utilization to send all change data (including indexes, etc.) to the standby system.
  • If there is a need for active/active, multi-master replication, for example for geographical distribution or in a high availability/disaster recovery scenario where you cannot afford the time needed to (re)start a database.

There are several types of logical database replication that you can choose from, depending on your need. Below are trade-offs and considerations on which method to use when.

Full Reload/Refresh

The first method of logical database replication is to perform a full reload or a refresh of the data at an interval. This approach is resource-intensive in multiple dimensions: (burst) load on the source, network utilization, and (burst) load on the target.

Consider this method for reporting and analytical use cases if the volume of data is limited, data does not change frequently, and if there is a convenient time window to perform the refresh (e.g. nightly, when the load on the systems is low).

Partial Reload/Refresh

The second method of logical data replication is to use a column in the table that is (reliably) modified for every change to the row (e.g. a last_updated timestamp). Use a filter when retrieving the data from the source instead of always selecting all data.

This partial reload/refresh approach is also applicable to reporting and analytical use cases, and can be used for larger data sets than a full refresh method. However, this approach is only reliable when data does not get physically deleted. Similar to the full refresh method, consider a partial refresh only if there is a convenient window to perform the partial refresh.

Change Data Capture (CDC)

Icon use case cobalt - change data capture + real-timeThe third method of logical database replication is to perform Change Data Capture (CDC). Log-based CDC, in which changes are captured asynchronously from a database’s transaction log, is widely considered the preferred method.

An alternative is a trigger-based capture, recording changes in separate tables by triggers that fire as row changes are introduced. A separate extraction process subsequently retrieves the changes. Disadvantages of trigger-based capture include the direct impact on the primary transaction making the changes and the fact that certain operations on most databases such as direct path inserts or truncates, bypass the triggers, and as a result changes could be missed.

Log-based CDC, on the other hand, uses the same transaction log that the transaction processing database uses for its own recovery. Because of this, all changes are recorded, and by asynchronously reading the log, committed changes can be propagated within moments of transaction completion, resulting in very low latency between source and target.

CDC, in particular log-based CDC, is the flagship of logical data replication approaches. Consider the use of CDC in for database replication if any one of the following conditions applies to your use case:

  • Your database processes high volumes of changes, 24×7.
  • Data must be replicated with (very) low latency.
  • You want to avoid overhead on the transaction processing database as much as possible to protect database performance.
  • There is a need for multiple active systems, for example for geographical distribution, or because in a high availability scenario you cannot afford any downtime.
  • Transactional data is needed in a data streaming use case, leveraging technologies like Kafka, Amazon Kinesis, or Azure Event Hub.

At HVR we help customers replicate data in heterogeneous environments. Log-based CDC is supported on commonly-used relational database technologies, running multitudes of different applications, including off-the-shelf ERPs from the likes of SAP, Oracle, and others.

Organizations use our technology often to synchronize cloud-based reporting and analytical environments or to consolidate data from multiple sources in data warehouses and data lakes, with updates flowing continuously in near real-time. To see the set up of a replication scenario and experience our log-based CDC capability, I invite you to check out the test drive.

Test Drive HVR

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

Test drive
Contact us