Data Replication Performance: Capture and Integration
Achieving maximum data replication performance is not easy. You run into many dependencies: some are environmental, related to the source or target technology; others are related to what you want to replicate; and finally, how you replicate the data can itself impact performance.
HVR’s technology provides built-in defaults and best practices to help you achieve the best possible data replication performance. But sometimes the technology needs help, and this is where you come in.
This blog post goes into detail on aspects that affect maximum data replication performance. The two related but decoupled aspects to this are capture performance and integration performance.
In all cases, the goal of data replication is to keep up with the transaction volume on the source.
HVR customers keep up with transaction processing rates on systems that generate well beyond a terabyte of transaction log changes on a single database, with peak transaction loads into the hundreds of gigabytes per hour, per database thread (node), in an Oracle RAC (Real Application Cluster) configuration.
Performance, Efficiency, Impact
Related to performance and efficiency is impact. Customers often wonder what is the impact of HVR processing, especially on the source transactional system.
Many factors influence resource consumption, including:
- What percentage of the transaction log is relevant to replication?
- What is the transaction mix between long/short-running transactions?
- How many tables and columns are replicated, and what are the data types?
Performing log-based Change Data Capture (CDC) from a database server with x amount of resources (CPU, memory, IO) generally takes well below 10% of x resources, with an average around 5% of CPU resources. For example, a system with twelve cores would require less than one core of processing resources to perform log-based CDC.
Note: For a clustered database like Oracle RAC, you must add resources from all nodes.
Fundamentally, HVR’s replication maintains consistency. The source database’s integrity state is carried into the target technology, provided the destination supports the concept of a controllable cross-table transaction. Most database targets do.
Technologies that do not support such a transaction concept, like file systems and Kafka, can use a data publication mechanism. When orchestrated through HVR, a solution using data publication can still maintain transactional consistency.
The initial goal for achieving maximum performance is to maintain transactional consistency. Explicitly introducing data replication parallelism breaks consistency unless an aggregate data publication mechanism is introduced.
Why maximum performance?
Keeping up with transaction log generation is the first obvious goal for real-time replication. Beyond that, why should you aim to achieve maximum performance?
The two main reasons are:
- If replication is behind for whatever reason, then you can recover quicker.
- Many of the changes you deploy to get maximum performance improve efficiency, lowering the impact on the systems involved.
Maximum Change Data Capture performance can be measured by running replication on a backlog of generated changes. Likewise, maximum data integration performance can be measured on a backlog of captured changes.
HVR runs separate capture and integration processes. Optimizing them is done independently. Compressed (and optionally encrypted) transaction files are queued on the hub. Over time integration cannot be faster than capture. Integration never slows down capture.
Change Data Capture (CDC)
HVR supports log-based CDC across all supported source database technologies.
Many organizations run the database on a (set of) server(s) they control, either on-premises or in the cloud. Others utilize database services. For Software as a Service (SaaS) applications, you probably don’t know how or where data is stored.
HVR technology focuses on achieving maximum log-based CDC performance through direct database access, on-premises, or in the cloud. At present, HVR provides limited support for SaaS applications.
Depending on the database technology, HVR provides multiple methods to access the database transaction logs.
Direct, file system level, transaction log access to:
- Online logs, for real-time CDC.
- Up-to-date logs on a standby database, still enabling real-time CDC.
- Backups/archives of the online logs only, with latency depending on the backup/archive frequency.
SQL-based access to transaction log fragments for real-time CDC.
Please check the HVR documentation to determine the options available for your database type.
For most databases, direct access to the transaction logs on a file system is faster and more efficient than access through SQL. Access through SQL introduces extra validations in the SQL layers, and with that, latency. Access through SQL also often limits the amount of data HVR can consume in a single IO request.
For some databases like DB2, the API call to read the log returns an already parsed result, whereas other databases like SQL Server return raw log increments. HVR’s capture technology translates the input it gets from the transaction logs, irrespective of its format, into changes that apply to tables.
Identifying changes in the transaction log stream is a CPU-intensive process. HVR achieves maximum performance when capture processes reach maximum CPU utilization. Not all transaction log input is relevant to replication. Discarding irrelevant transaction log fragments takes fewer resources than pulling out relevant changes.
Transactional Change Data Capture
Database transaction logs enable a database to recover to its most recent committed – consistent – state. HVR relies on this principle as it performs log-based CDC.
By default, HVR maintains transactional consistency during replication. As a result, the replication destination maintains the same level of data integrity as the source.
Transaction processing databases use the concept of a commit sequence number to uniquely determine the order of transactions. For example, there is the System Change Number (SCN) in Oracle and the Log Sequence Number (LSN) in numerous other databases, including DB2, SQL Server, and PostgreSQL. This commit sequence number is used to implement the industry-accepted ACID properties of database transactions.
HVR maintains transactions in commit sequence order throughout the replication flow, including the order of changes in the transactions. Like database technologies that process the changes, HVR requires a point of serialization to maintain this order that can limit performance. However, log parsing is less complex than database processing, and in practice, HVR’s capture can generally outperform transaction log generation.
For a clustered database like Oracle RAC, HVR parses the log per database thread (node), with a coordinator sequencing transactions in commit order. In such a configuration, either an individual thread parser or the coordinator could limit performance.
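The coordinator's merge step can be pictured with a minimal Python sketch. The per-thread streams, commit sequence numbers, and transaction names below are all hypothetical; the point is that each stream is already in commit order, so a single ordered merge is the serialization point:

```python
import heapq

# Hypothetical per-thread (node) streams of committed transactions,
# each already ordered by commit sequence number (SCN).
thread_1 = [(101, "txA"), (205, "txC"), (310, "txE")]
thread_2 = [(150, "txB"), (260, "txD")]

# The coordinator merges the per-thread streams into one
# SCN-ordered sequence: this merge is the serialization point.
merged = list(heapq.merge(thread_1, thread_2, key=lambda t: t[0]))
```

Because each input stream is pre-sorted, the merge itself is cheap compared to parsing the raw log, which is why the coordinator is rarely the bottleneck in practice.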
Maximum CDC Performance
Maximum CDC performance is achieved when HVR is processing transaction log changes at maximum CPU utilization. This can be observed at the operating system level as an HVR capture process using 100% of a single core.
Note: Different operating systems visualize maximum CPU utilization differently.
Specifically, on a Linux-based system, a busy core shows as 100% CPU utilization in top. In the Windows Task Manager, 100% means all cores are fully busy, so a single process can reach at most (100/x)% CPU utilization, where x is the total number of cores available to the system.
HVR’s processing is optimized for processing high volumes of data. Efficient access to the transaction logs is critical to keep busy processing logs.
Due to the different methods of access to the transaction logs per database, HVR uses various techniques to optimize access to transaction log increments. These techniques include asynchronous access to log fragments and (large) buffered reads.
Accessing logs through the file system cache is generally faster than performing direct IOs to disk. However, depending on the database configuration, the file system cache may be bypassed by the database processes, forcing HVR to perform direct reads from disk to maintain consistency.
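The benefit of large buffered reads can be sketched as follows. The 8 MB chunk size and the helper name `read_log_chunks` are illustrative choices, not HVR settings; the idea is simply that one large read replaces many small IO requests:

```python
import os
import tempfile

CHUNK_SIZE = 8 * 1024 * 1024  # one large read replaces many small ones

def read_log_chunks(path, chunk_size=CHUNK_SIZE):
    """Stream a transaction log file in large buffered chunks."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            yield buf

# Demo against a throwaway file standing in for an archived log.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"redo-record" * 1000)
    log_path = tmp.name
total = sum(len(chunk) for chunk in read_log_chunks(log_path))
os.unlink(log_path)
```

A real capture process would additionally issue reads asynchronously so parsing and IO overlap, but the chunking principle is the same.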
HVR supports many different data types natively, and by default, all data will be replicated. Refer to HVR’s documentation to review the details of data types HVR supports for different technologies.
In some cases, HVR has to fall back to fetching data, for example when the database performs partial LOB operations, or to replicate data types that are not natively supported (see Extended Data Type Support).
Fetching data lowers throughput, with the impact of fetching proportional to the number of records that must be fetched. An index on the table, for example, on the primary key if there is one, can reduce the slowdown due to fetching.
As HVR processes the transaction log, it keeps track of all relevant table changes in memory, with memory allocated per transaction. Only if the (configurable) memory allocation is exceeded will a transaction spill to disk. By default, the spilled data is compressed to limit storage utilization due to disk spilling.
HVR only propagates committed transactions.
With transaction information kept in memory, the state of in-flight changes is lost in case the process restarts for whatever reason. To limit the amount of transaction log to be rescanned upon restart, HVR will periodically – by default every five minutes – write checkpoints to disk. If deemed valid, the checkpoint will be the starting point instead of the beginning of the longest-running open transaction tracked when the process stopped.
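The interplay of in-memory transaction tracking, compressed spilling, and commit-only propagation can be sketched roughly as follows. The class, the tiny spill threshold, and the dict standing in for spill files are all simplifications for illustration, not HVR internals:

```python
import zlib

SPILL_THRESHOLD = 64  # bytes per transaction before spilling (tiny, for demo)

class CaptureBuffer:
    """Track in-flight transactions; propagate only committed ones."""

    def __init__(self):
        self.in_flight = {}  # txid -> list of change payloads (in memory)
        self.spilled = {}    # txid -> compressed blobs (stand-in for spill files)

    def add_change(self, txid, payload):
        changes = self.in_flight.setdefault(txid, [])
        changes.append(payload)
        if sum(len(c) for c in changes) > SPILL_THRESHOLD:
            # Oversized transaction: spill compressed, freeing memory.
            self.spilled.setdefault(txid, []).append(zlib.compress(b"".join(changes)))
            changes.clear()

    def commit(self, txid):
        # Reassemble spilled and in-memory parts; this is what propagates.
        parts = [zlib.decompress(b) for b in self.spilled.pop(txid, [])]
        parts += self.in_flight.pop(txid, [])
        return b"".join(parts)

    def rollback(self, txid):
        # Rolled-back work is simply discarded, never sent downstream.
        self.in_flight.pop(txid, None)
        self.spilled.pop(txid, None)

buf = CaptureBuffer()
buf.add_change("tx1", b"insert row 1;")
buf.add_change("tx2", b"x" * 100)  # exceeds threshold, spills compressed
committed = buf.commit("tx1")
buf.rollback("tx2")                # rolled back: nothing propagated
```

Checkpointing would additionally persist the `in_flight` and `spilled` state periodically, so a restart can resume from the checkpoint instead of rescanning from the start of the longest-running open transaction.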
HVR uses transaction files for replication. A set of captured changes is streamed by the source capture process, compressed and optionally encrypted, and stored as a transaction file on the hub for downstream processing. A transaction file contains one or more transactions. A single transaction, processed by a single capture process, will never span multiple transaction files. To maximize efficiency, HVR will bundle transactions if possible.
Transaction files are stored compressed and optionally encrypted on the hub for recovery reasons, and to decouple capture and integration.
The HVR capture log references capture cycles. The cycle number increments by one every time HVR reaches the end of the database transaction log, and it has sent committed changes to the hub.
If HVR capture runs behind, changes are sent to the hub in ten-second increments, provided a transaction boundary can be identified, and another sub-cycle starts. The (configurable) ten-second increment ensures changes continue to flow through the replication stream.
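The sub-cycle behavior can be modeled as grouping committed transactions by elapsed time, flushing only at transaction boundaries. The function and the sample commit times below are illustrative, with the interval expressed in the same (hypothetical) time units as the commit timestamps:

```python
FLUSH_INTERVAL = 10.0  # seconds; the configurable sub-cycle increment

def split_into_subcycles(committed, interval=FLUSH_INTERVAL):
    """Group committed transactions into sub-cycles: flush once the
    interval has elapsed, but only ever at a transaction boundary so
    consistency is preserved."""
    cycles, current, last_flush = [], [], None
    for commit_time, tx in committed:
        if last_flush is None:
            last_flush = commit_time
        current.append(tx)
        if commit_time - last_flush >= interval:
            cycles.append(current)
            current, last_flush = [], commit_time
    if current:
        cycles.append(current)
    return cycles

# Transactions captured while running behind, tagged with commit times.
backlog = [(0, "t1"), (4, "t2"), (11, "t3"), (12, "t4"), (25, "t5")]
subcycles = split_into_subcycles(backlog)
```

Each sub-cycle corresponds to a flush of changes to the hub, so downstream integration can start working on them while capture continues through the backlog.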
Transformations and Filters
HVR relies largely on database SQL and function calls to process transformations and filters. During replication, it is typically more efficient and recommended to process transformations or filters on the target integration side.
Processing transformations and filters on the target may not always be possible:
- If the target is not a database but a file system or streaming data pipeline.
- If data must be filtered/transformed on the source e.g. because the data cannot leave source systems (unmasked).
The impact of transformations and filters is proportional to the number of rows that must be transformed/filtered.
Parallel capture can be considered as a last resort to improve performance. Capturing in parallel provides only limited performance improvement because:
- To ensure no changes are missed, every capture process must read the entire transaction log stream. Of course, fewer changes are relevant to each process, and discarding irrelevant transaction log fragments is faster than parsing out changes.
- All capture processes read the same redo stream, which can lead to contention. File system cache largely mitigates this concern.
Integration
An integration process on the hub picks up compressed transaction files for downstream processing. Depending on configuration settings, including the integration method and the maximum transaction file size to be processed at once (cycle byte limit), the integration will process all outstanding transaction files in a single cycle.
Integration can put a heavy load on the destination technology. Optimizing performance for change delivery on the target technology will generally benefit HVR.
Continuous and Burst Modes
Change integration into a relational database can be performed in two dominant modes:
- Continuous mode, replaying every change from the source in the same order. This mode is useful between well-indexed transaction processing databases that may enforce integrity constraints. Continuous mode is always used in active/active scenarios. Continuous integration is very sensitive to high network latency between the HVR integration process and the target database.
- Burst mode, staging changes in burst tables, and processing changes as set-based operations against the target tables. To process changes in burst mode, HVR first computes a net change per row, as identified by the row identifier. This process is called coalescing and requires data to be sorted. Coalesced changes are then loaded in bulk, using technology-specific best practices, into the burst tables. Burst tables are emptied at the end of the burst cycle.
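The coalescing step of burst mode can be sketched as follows. The single-character operation codes and the rule that a delete followed by a re-insert nets to an update are simplifying assumptions for illustration:

```python
def coalesce(changes):
    """Collapse an ordered change stream to one net operation per row key.
    Each change is (key, op, row) with op in {"I", "U", "D"}."""
    net = {}
    for key, op, row in changes:
        if key not in net:
            net[key] = (op, row)
            continue
        prev_op, _ = net[key]
        if prev_op == "I" and op == "D":
            del net[key]            # inserted and deleted within the cycle
        elif prev_op == "I":
            net[key] = ("I", row)   # insert then update: still a net insert
        elif prev_op == "D" and op == "I":
            net[key] = ("U", row)   # delete then re-insert: net update
        else:
            net[key] = (op, row)    # latest state wins
    return dict(sorted(net.items()))  # sorted by row identifier

changes = [
    (1, "I", "a"), (1, "U", "b"),   # net: insert row 1 with final value "b"
    (2, "U", "x"), (2, "D", None),  # net: delete row 2
    (3, "I", "n"), (3, "D", None),  # net: no change for row 3
]
net = coalesce(changes)
```

Note that the final sort by row identifier is the memory-intensive step discussed below: in a real cycle it runs over millions of changes, not six.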
Integration into other technologies, including file systems like AWS S3, ADLS or GCS, and streaming data pipelines like Kafka, follows similar considerations, although changes are always new entries.
Burst mode is typically used for analytical database targets. Per-row efficiency increases significantly as more rows are processed in a single integration cycle. As a result, burst mode can typically maintain change data throughput, as long as cycles include enough changes. This process balances itself, with longer cycles leading to more outstanding changes, resulting in more efficient data integration.
However, larger cycles result in higher latency. Also, to maintain transactional consistency, HVR will process the final integration of changes to the target tables in a single (serial) transaction, which takes longer as more tables are active in the integration.
The coalesce step for burst integration requires memory-intensive sorting. Depending on change volumes processed, you may see a warning in HVR’s log indicating you can improve performance by allowing higher memory allocation for sorts. Only consider extra memory allocation if such memory is available. HVR writes intermediate sort files to disk when the memory threshold is otherwise exceeded. It is much more efficient for HVR to spill to disk than to swap memory pages (to disk), should memory be overallocated.
To perform replication, HVR must have a way to identify a row. If the table definition has a primary key or a unique key with at least one mandatory column, then the column(s) from the database key are used. Without a reliable database key, HVR will use all non-LOB columns to identify a row. It is possible to define a surrogate key and capture, for example, a rowid, although depending on the database technology, this still may lead to inconsistencies.
The row identifier is important for integration performance:
- In continuous mode, a unique index on the row identifier can dramatically speed up integration performance.
- In burst mode, the row identifier becomes the sort key for coalescing.
HVR supports replicating tables with no database keys, even if – based on the HVR row identifier – there can be duplicate rows in the tables.
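The fallback order for choosing a row identifier can be sketched as follows. The table descriptions and the dict-based schema representation are hypothetical; the priority order is the one described above:

```python
def derive_row_identifier(table):
    """Pick the columns used to identify a row: primary key if present,
    else a unique key with at least one mandatory (NOT NULL) column,
    else all non-LOB columns."""
    if table.get("primary_key"):
        return table["primary_key"]
    for unique_key in table.get("unique_keys", []):
        if any(col in table["not_null"] for col in unique_key):
            return unique_key
    return [c for c, t in table["columns"].items() if t not in ("BLOB", "CLOB")]

orders = {  # hypothetical table with a primary key
    "columns": {"id": "NUMBER", "note": "CLOB", "amount": "NUMBER"},
    "primary_key": ["id"], "unique_keys": [], "not_null": {"id"},
}
log = {  # hypothetical keyless table: all non-LOB columns identify a row
    "columns": {"ts": "TIMESTAMP", "msg": "VARCHAR", "doc": "BLOB"},
    "primary_key": [], "unique_keys": [], "not_null": set(),
}
```

Falling through to the last branch is what makes keyless, many-column tables expensive to replicate: every non-LOB column participates in row identification.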
Some databases like Teradata, Greenplum, and Redshift support a distribution key. The distribution key determines how data is spread across parallel database processes. You must choose a good distribution key to ensure roughly equal distribution across parallel processing units and achieve good scalability.
By default, HVR will record the primary key column(s) as the distribution key. In some cases, it is not useful to use every primary key column in the distribution key, and the database may have a limit on the number of columns in a distribution key. For instance, Redshift supports only a single column for the distribution key.
Review the distribution key settings in HVR, modify them as needed, and use action TableProperties to adjust the algorithm to derive the distribution key.
HVR will use the distribution key settings to create burst tables to achieve optimum parallelism while processing inserts, updates, and deletes against the target tables (that use the same distribution).
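The role of the distribution key can be illustrated with a simple hash-partitioning sketch. CRC32 stands in for whatever hash function the target database actually uses, and the slice count is arbitrary:

```python
import zlib

def slice_for_row(dist_key_values, n_slices):
    """Map a row to a processing slice by hashing its distribution key."""
    return zlib.crc32(repr(dist_key_values).encode()) % n_slices

# A well-chosen, high-cardinality key spreads rows roughly evenly
# across parallel processing units.
counts = [0] * 4
for customer_id in range(10_000):
    counts[slice_for_row((customer_id,), 4)] += 1
```

A poorly chosen key, for example a low-cardinality status column, would concentrate most rows on a few slices, and the busiest slice would bound throughput.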
HVR’s replication maintains transactional consistency throughout the replication. If there is a replication error, then by default, HVR will stop replication. This is to prevent subsequent failures as a result of the earlier issue cascading.
Specifically, if replication failed to insert a record into the CUSTOMER table, then recording an ORDER for this CUSTOMER record would/should also fail. By default, HVR expects replication issues to be resolved before allowing integration to be resumed.
Resilient processing allows replication to continue despite (perceived) inconsistencies in the data between source and target. The resilient mode is meant to deal with an overlap between the state of the data and the changes coming in. To illustrate, suppose you used a utility external to HVR for the initial load. In that case, you would expect to use resilient processing during the overlap period between the initial load and change data capture.
Some regular replication cases also require resilient processing. For example:
- A target system with enforced cascading constraints must have resilient deletes enabled. Deletes on the parent table will result in the child row getting deleted. HVR’s subsequent explicit delete from the child table would fail.
- When using a soft delete on the target – to mark deleted rows instead of deleting them – and the source application reuses the key value.
In resilient mode:
- If HVR expects to update a row, but the row does not exist, it will insert the row instead.
- If – based on the row key – the row already exists on the target and an insert must be processed, then instead the existing row is updated.
- If a delete must be processed, but the row does not exist, then the delete is ignored.
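These three rules can be expressed as a small sketch, using a dict keyed by the row identifier as a stand-in for the target table; the function name and return values are illustrative:

```python
def apply_resilient(target, op, key, row=None):
    """Apply one change in resilient mode; return what actually ran."""
    exists = key in target
    if op == "U":
        target[key] = row
        return "update" if exists else "insert"      # missing row: insert instead
    if op == "I":
        target[key] = row
        return "update" if exists else "insert"      # existing row: update instead
    if op == "D":
        if exists:
            del target[key]
            return "delete"
        return "ignored"                             # missing row: delete ignored

table = {1: "alice"}  # hypothetical target state, keyed by row identifier
actions = [
    apply_resilient(table, "U", 2, "bob"),     # update of missing row
    apply_resilient(table, "I", 1, "alice2"),  # insert of existing row
    apply_resilient(table, "D", 9),            # delete of missing row
]
```

In a database target, the `exists` check is normally implicit: the update's row count or the insert's duplicate-key error signals the mismatch, which is why the keyless continuous-mode case described below is the expensive exception.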
Resilient processing results in minimal overhead. The exception is a table being replicated with all of the following conditions:
- The table has no primary key.
- The table has a lot of rows.
- Integration runs in continuous mode.
- Inserts (lots of inserts) are processed.
- Resilient mode is enabled.
For this case, HVR cannot rely on the database to report a duplicate value in an index. Instead, a pessimistic approach to resilient processing results in first querying whether the row already exists, likely resulting in a full table scan, followed by an update or insert.
Changing any one of the attributes will resolve the performance issue. Realistically the options are:
- Find and record a primary key, and create the index. HVR creates the index if the key is recorded in the table definitions and the table is (re)created by HVR.
- Find a (combination of) mostly unique column(s) to create a non-unique index.
- Switch from continuous mode to processing in burst mode.
- Disable resilient processing.
Beware that following an Online Refresh operation, HVR will temporarily enable resilient processing through a control.
HVR data integration processes a large volume of changes in the destination technology. To maximize data integration performance, always review the load profile on the target and follow technology-specific best practices.
Experts in the specific technology can be very helpful in identifying opportunities to tune performance.
Integration performance can often be improved through parallelization. At present, you must manually configure a parallel setup, and changes processed in different integration processes are not transactionally consistent.
You can process changes for different tables across separate integration groups. Data from a single table can be parallelized across integration groups using action Restrict with options AddressTo and AddressSubscribe.
Ensure the target technology can handle parallel processes the way you configure them, and make sure there is no overlap in data processed between the processes to avoid data clashes. Also, consider technology attributes like database locks that can interfere with parallel processing.
Maximize Performance / Minimize Latency
Achieving maximum replication performance helps minimize latency should a backlog be incurred. However, strategies to minimize latency may lower throughput due to inefficiencies.