Data Replication: Performance Architecture
In the first blog of this series, I shared general concepts about data replication performance. This second post is about recommended performance architecture when deploying HVR. I will use real-world analogies to illustrate why the architecture is ideal for optimal performance.
The remaining three blogs in this series will go into more detail on initial load, Change Data Capture (CDC), ongoing replication, and data validation performance.
HVR is designed to run in a distributed architecture, with multiple HVR agents and an HVR hub.
An HVR agent is an installation of HVR that typically resides close to the source or destination data store. The HVR hub is also an installation of HVR. The hub stores data replication metadata as well as runtime state, and instructs the agent installations. The HVR administrator manages all data replication flows through a GUI connected to the hub.
HVR’s software is designed to be modular. Any installation of the software can play the role of agent, hub, or both.
In addition to improved security, the distributed architecture delivers important performance and scalability benefits.
The source HVR agent filters data before sending it down the data pipeline. Any filter or projection on source tables during the initial load is pushed into the source database. Log-based CDC identifies net changes from the database transaction log, of which only a subset is relevant for replication.
Consider the following real-world analogy of the distributed approach: let’s say you want to buy all of the blue balloons a store has to offer. Would you order all of the balloons and discard the ones that are not blue? Or would you ask the store (the agent) to only send you the blue balloons? Naturally, the second approach is ideal because it results in fewer balloons being sent to you.
Filtering data close to the source improves efficiency.
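The effect of pushing a filter into the source database can be sketched generically. The snippet below is an illustration of the pushdown idea, not HVR's actual API; it uses an in-memory SQLite table as a stand-in for a source data store.

```python
import sqlite3

# Illustrative sketch (not HVR's API): push the filter into the source
# database so only relevant rows travel down the pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balloons (id INTEGER, color TEXT)")
conn.executemany(
    "INSERT INTO balloons VALUES (?, ?)",
    [(1, "blue"), (2, "red"), (3, "blue"), (4, "green")],
)

# Inefficient: fetch everything, then filter on the client side.
all_rows = conn.execute("SELECT id, color FROM balloons").fetchall()
client_filtered = [row for row in all_rows if row[1] == "blue"]

# Efficient: let the source database do the filtering (pushdown).
pushed_down = conn.execute(
    "SELECT id, color FROM balloons WHERE color = 'blue'"
).fetchall()

assert client_filtered == pushed_down  # same result, fewer rows shipped
print(len(all_rows), "rows fetched vs", len(pushed_down), "with pushdown")
```

Both approaches yield the same rows; the difference is how many rows cross the wire before the filter is applied.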
HVR agents perform some of the more resource-intensive processing to avoid creating a bottleneck at the hub.
Going back to our real-world analogy: since the store sent you only the blue balloons, you don’t have to do the work of discarding the balloons that aren’t blue. Imagine you ordered blue balloons from 10 stores. Every store has to do some of the resource-intensive filtering, but this is not an extreme amount of work for any of them. On the other hand, if you received all of the balloons from all 10 stores, you would have a great amount of work cut out for you.

Distributing work through agents results in a scalable setup.
The HVR agent compresses data before sending it. Sending compressed data across the wire requires less bandwidth and/or fewer data packets. HVR commonly achieves 10x or higher compression ratios. Data is only decompressed when it reaches the target agent.
Back to our real-world analogy: what if the store sent deflated balloons instead of inflated ones? Shipping inflated balloons would require multiple deliveries and take far more time.
Compressing data before sending it magnifies the available bandwidth. Always use an agent for communication over a Wide Area Network (WAN), for example, between on-premises and the cloud, to take advantage of compression.
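How much compression helps depends on the data, but change streams tend to be highly repetitive (column names, timestamps, similar values) and therefore compress well. The sketch below uses Python's standard `zlib` to illustrate the effect on repetitive change records; the record format is made up for illustration, and real ratios vary with the content.

```python
import zlib

# Illustrative sketch: compress change data before it crosses the WAN.
# The JSON-like record below is a made-up example of a repetitive
# change stream; such data typically compresses very well.
changes = b'{"op":"insert","table":"orders","status":"shipped"}\n' * 1000

compressed = zlib.compress(changes, level=6)
ratio = len(changes) / len(compressed)

assert len(compressed) < len(changes)
print(f"{len(changes)} bytes -> {len(compressed)} bytes (~{ratio:.0f}x)")

# Only the receiving side pays the cost of decompression:
assert zlib.decompress(compressed) == changes
```

Fewer bytes on the wire means fewer packets and better use of the available bandwidth, which is exactly why compression matters most on WAN links.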
Large Data Blocks
Communication between systems is subject to latency. Communication protocols require an acknowledgment to ensure data was received correctly. The amount of latency is proportional to the distance between systems. The latency between an on-premises data center and the cloud is much higher than it is between two servers in the same data center.
The HVR agents and hub communicate using large data blocks. With large block transfers, there is less back-and-forth communication than there would be with smaller blocks.
In our real-world example, let’s say the store sends us blue balloons in bags of 100, not as individual balloons.
Large data blocks limit back-and-forth communication and are less sensitive to high communication latency. A remote database connection can be an alternative approach to communication, but it may or may not be equally efficient.
Two key metrics determine network performance:
1. Bandwidth: how many bytes can be sent per second?
2. Latency: how much time does a network roundtrip take?
On relatively high-latency Wide Area Networks (WANs), latency may limit bandwidth utilization for an individual process. The roundtrip for the acknowledgment, combined with the limited amount of data sent per communication, results in below-maximum bandwidth utilization.
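The interaction between the two metrics can be shown with a back-of-the-envelope model. The model below is an assumption for illustration, not a description of HVR's internals: it assumes the sender transmits one block, then waits one round trip for the acknowledgment before sending the next.

```python
# Back-of-the-envelope model (illustrative assumption, not HVR's
# internals): each block costs its transmission time plus one round
# trip waiting for the acknowledgment.

def effective_throughput(block_bytes, bandwidth_bps, rtt_s):
    """Achievable bytes/second when each block costs one round trip."""
    time_per_block = rtt_s + block_bytes / bandwidth_bps
    return block_bytes / time_per_block

BANDWIDTH = 100e6 / 8          # 100 Mbit/s link, in bytes per second
RTT = 0.050                    # 50 ms round trip, a typical WAN value

small = effective_throughput(8 * 1024, BANDWIDTH, RTT)         # 8 KiB
large = effective_throughput(4 * 1024 * 1024, BANDWIDTH, RTT)  # 4 MiB

# Small blocks leave the link mostly idle waiting for acknowledgments;
# large blocks amortize the round trip and approach full bandwidth.
print(f"8 KiB blocks: {small/1e6:.2f} MB/s; 4 MiB blocks: {large/1e6:.2f} MB/s")
```

Under these example numbers, small blocks achieve only a fraction of the link's capacity, while large blocks get close to it. Bundling acknowledgments, as described next, pushes utilization even higher by removing round trips entirely.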
The HVR agents and hub buffer communication and bundle acknowledgments. This approach results in fewer network round trips. Only in the rare case that the network regularly drops packets that must be re-sent does this approach result in less efficient communication.
Our real-world equivalent is that the store sends multiple bags of balloons in a box, rather than every bag on its own.
Buffering allows for improved bandwidth utilization, especially on high latency networks.
HVR agents store minimal state, with the hub in control of the replication. This setup simplifies a workload split across multiple agents.
Consider the common example of a data warehouse or data lake consolidating data from multiple sources. A single destination agent may struggle to process changes from all sources. Multiple agents help distribute the load.
In our real-world example of balloons, imagine our final farewell to the balloons is to inflate them with helium and let them go. We could complete this task alone, we could ask one friend to help, or we could ask multiple friends to help.
Flexible parallelism with agents improves performance.
HVR’s Distributed Architecture
Leveraging HVR’s distributed architecture is optional. HVR fully supports agent-less operations. However, compared to an agent-less setup, the distributed architecture provides security benefits and multiple performance and scalability advantages.
Agents are still effective even if they don’t reside on the server processing the data. Nowadays, customers commonly use a cloud-based database service that does not allow installation on the database server. An HVR agent in the database server’s availability zone still takes advantage of the distributed architecture’s performance and scalability benefits, because the database connection, which may be sensitive to higher latency, remains local.
All HVR operations that move data, including initial load, ongoing replication, and data validation, benefit from these performance features. Always use agents when communication is sent across a Wide Area Network (WAN), for example, between on-premises and the cloud, or between clouds.
Subsequent performance blog posts go into more detail on performance optimizations and aspects of initial load, replication, and data validation.