This post is the first in a series of five to discuss data replication performance.
This blog post focuses on performance objectives and general aspects of data performance. In the second post, we discuss HVR’s performance architecture. The remaining three posts provide detailed considerations for one-time load, replication, and data validation.
The IT industry tried to address questions like this in the Transaction Performance Council (TPC). Based on a strict benchmark specification, hardware and software vendors tried to showcase their technologies for different workloads. There is now a TPC-DI – Data Integration – Benchmark focused on ETL/ELT, but no results have been submitted.
A benchmark is, of course, generic and may or may not reflect your environment or workload mix. There is no industry standard data replication benchmark.
Data Replication Performance
Data replication performance is often measured in throughput: how many GBs of transaction log volume, or rows, or transactions can be processed in a time window. For an initial load or data validation run: how many rows per second can we load or compare?
Data replication performance has many dependencies:
- How many tables are you replicating?
- What are the table definitions (number of columns, data types)?
- What type of load runs against the tables? e.g., does the transactions mix consist of many small transactions in a multi-user workload versus fewer, more batch-like, larger transactions? What is the ratio of inserts versus updates versus deletes?
- What part of the load on your system are you not replicating?
- What is your system configuration in terms of CPU power, memory configuration, and storage speed?
Bottom line: a data replication performance measurement is unique to your situation.
For ongoing data replication, maximum latency is a common business goal organizations strive to achieve. Of course, the related performance goal in MB/s is different for a system generating 10GB of changes in an hour than a system generating 100GB of changes. For most systems, transaction activity fluctuates over time, and to achieve a business goal of maximum latency, the performance goal fluctuates as well.
You perform the initial load only once and may need to balance resource utilization with ongoing transaction processing. If so, the initial load’s overall duration is arguably less important than the limited impact on the ongoing workload, or efficiency. However, if you plan to perform the initial load from a standby database and stop recovery for its duration, you may aim for the minimum duration and maximum performance.
If you plan to compare your data regularly then you may have a time window in mind for the data validation to finish. For example, data validation should be completed between Saturday 8 PM and Sunday at 4 AM.
System impact should also be a factor in your replication performance. What if you can achieve a maximum five-second latency business goal, but the cost is 25% overhead in system processing? Maybe up to 10 seconds of latency at less than 10% overhead is a more attractive goal. Similar considerations apply to the initial load and data validation.
Efficient software, running at maximum performance, uses 100% of a CPU core per worker thread. i.e., a single-threaded program/operation should use 100% of a CPU core to achieve maximum performance.
Maximum processing performance can only be achieved if there is sufficient work to be done. At 10GB of changes in an hour, the data replication software may not run at maximum performance to achieve a business goal of a maximum of five seconds latency.
Maximum performance is achieved for a period of time if there is lots of work to be done. For example, if replication has to catch up from a backlog. Or one or more large tables must be loaded or validated. In these kinds of scenarios, to achieve maximum performance, CPU cores should be 100% busy.
Different CPUs deliver different performance. Clock speed (both base and turbo frequency), and other performance attributes such as cache size and bus speed vary from CPU to CPU. Generally, CPU performance is proportional to its maximum Thermal Design Power (Wattage). A CPU – from the same generation – designed for 165W TDP will perform faster than a CPU designed for 95W TDP (by a factor of about 1.7).
Find the Bottleneck – Tuning
In a data replication or data integration scenario, multiple processes work hand-in-hand in a pipeline. For example, consider an initial load between two databases by a data replication solution such as HVR. One or more database processes serve the data to the data replication product that delivers the data (possibly straight) into the destination database using one or more target database processes.
Any of the processes in a pipeline may reach 100% CPU utilization, in that case, maximum performance is reached for the configuration. You may still be able to improve performance by changing the configuration. For example, whether you compute a derived column value on the source or the target may result in different performance, even though in either case you may achieve 100% CPU utilization on one of the processes in the pipeline.
You may also see that none of the processes in the pipeline reach 100% CPU. In that case, there is a bottleneck preventing faster performance. For example, during the initial load, if the source database system files have low storage bandwidth (and due to IO waits), maximum performance cannot be reached. Or perhaps insufficient network bandwidth between source and target is limiting performance.
If you observe less than maximum performance then you should look for the bottleneck, and see if it can be eliminated as the performance-limiting aspect.
Optimizing your data pipeline for maximum performance is called tuning.
To parallelize processing is a common tuning technique. Irrespective of whether you achieved maximum performance on a single pipeline, you may be able to improve overall performance by running in parallel.
Running in parallel may result in a shifting bottleneck. For example, you may have achieved maximum performance moving data between on-premises and the cloud with a single table load. However, once you started running with multiple parallel streams, you reached the upper limit of network bandwidth.
Increasing parallelism once you have reached an aggregate bottleneck is not useful. In the best case, there is no slowdown due to incremental parallel processes beyond the aggregate bottleneck. Look for the bottleneck and eliminate it before increasing parallelism again.
Parallelism is a brute force approach. Unless you run in a dedicated environment, you may have to settle for less than maximum performance to limit resource utilization.
HVR Performance Architecture
HVR is built to achieve maximum data replication throughput. In the next blog in this series, we’ll discuss HVR’s performance architecture in detail.