Data Integration Architecture: Understanding Agents
To use agents, or to not use agents, that is the question
The question of whether or not to use an agent when performing data integration is common, especially for use cases involving log-based Change Data Capture (CDC) and continuous, near real-time delivery. For the purpose of this discussion, I define an agent as an installation of software local to the data endpoint—e.g. the database or the file system—that performs the task of either capturing or delivering data. This blog post examines the benefits and challenges of data integration architectures with and without agents, and provides tips on which approach to consider for your use case.
Data Integration Architecture: The Case for Agent-less
Organizations initially like the idea of agent-less setups for their data integration architecture because:
- Lower complexity: for the initial setup, during configuration, and for long-term management.
- Lower impact on the database server(s).*
- Concerns about installing third-party software on the source or target environments, or simply not being permitted to do so.
*Note that lower impact on the database server(s) is a marketed perception that is often not true. Consider log-based CDC on a popular transaction processing database like Oracle. To perform agent-less CDC, the database has to serve the changes through SQL or procedural calls, which require database processing resources that would not be used if there were no CDC requirement. I.e. while there is no external program using resources on the server, the introduction of CDC does, in fact, increase resource utilization (for some databases more significantly than for others).
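To make the note concrete, here is a schematic sketch of agent-less, SQL-based log mining on Oracle (LogMiner). The bind variable names and column list are illustrative, and this is not HVR's actual implementation; the point is simply that every statement below executes on the database server itself, which is where the extra resource utilization comes from.

```python
# Illustrative only: the kind of SQL an agent-less CDC reader must ask
# an Oracle database to run when changes are served through SQL rather
# than read from the transaction log by a local agent.
START_LOGMINER = """
BEGIN
  DBMS_LOGMNR.START_LOGMNR(
    STARTSCN => :start_scn,
    OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
"""

# Polling this view repeatedly forces the server to mine its own redo
# logs and materialize the changes as query results, consuming database
# CPU and I/O on every poll.
FETCH_CHANGES = """
SELECT scn, operation, table_name, sql_redo
FROM   v$logmnr_contents
WHERE  scn > :last_scn
"""
```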
HVR supports agent-less setups for most combinations of sources and targets.
Data Integration Architecture: The Case for Agents
The use of agents can have many benefits. At HVR we recommend the use of agents because:
- When HVR executables talk to each other, the communication is optimized for high-latency, low-bandwidth network connections. Data is compressed by default, and communication uses large blocks to make the most of available bandwidth and minimize sensitivity to latency. In HVR’s case the initial data load (refresh), change data movement, and any data transfer during data compare all benefit from the optimized communication.
- Standardized communication between remote environments enables consistent and secure connections using SSL and key-based authentication as needed.
- HVR’s architecture is flexible and modular. Communication is always initiated from the central installation that controls the data movement (the so-called hub) to the agents. A firewall only has to be opened in the direction from the hub to the agent, and use of proxies is supported out of the box. As a result, the setup can be hardened by limiting the need to expose data stores directly through open firewalls.
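The first point above—batching into large blocks plus compression—can be illustrated with a small sketch. Plain Python `zlib` stands in for HVR's wire protocol here, which is an assumption for illustration only:

```python
import zlib

def frame_for_wan(rows, block_size=256 * 1024):
    """Batch many small rows into large blocks, then compress each block.
    Fewer, larger frames mean fewer round trips (less latency sensitivity);
    compression trades some CPU for lower bandwidth usage."""
    raw = b"".join(rows)
    blocks = [raw[i:i + block_size] for i in range(0, len(raw), block_size)]
    return [zlib.compress(block, 6) for block in blocks]

# 10,000 change records of ~40 bytes each (hypothetical row format)
rows = [b"id=%06d,op=UPDATE,col=constant-value\n" % i for i in range(10_000)]
frames = frame_for_wan(rows)

raw_bytes = sum(len(r) for r in rows)
wire_bytes = sum(len(f) for f in frames)
round_trips_naive = len(rows)    # one send per row
round_trips_framed = len(frames) # one send per compressed block
```

On a high-latency WAN link, the reduction in round trips typically matters as much as the reduction in bytes.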
Note that these motivations to recommend agents are particularly relevant when Wide Area Network (WAN) connections are involved, e.g. in the common use case of data integration between on-premises and cloud, or cloud-to-cloud.
It should also be noted that database connectivity varies from one type of database to another, so remote connections may perform better or worse across a network depending on the scenario.
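The hub-initiated, secured connection described above can be sketched in generic terms with Python's standard `ssl` module. This is not HVR's API—just an illustration of the connection direction and the TLS settings involved:

```python
import ssl

def hub_to_agent_context() -> ssl.SSLContext:
    """TLS context used by the hub when dialing out to an agent.
    Because the hub always initiates, only a hub -> agent firewall
    rule is required; the agent never dials into the data center."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse old protocols
    return ctx

ctx = hub_to_agent_context()
# ctx would then wrap an outbound socket, e.g.:
#   ssl_sock = ctx.wrap_socket(sock, server_hostname="agent.example")
```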
Going with agents, agent-less, or… hybrid data integration architecture
How do you decide between agents and agent-less when determining the data integration architecture that is right for your business? Consider the following.
1. Complexity
HVR lowers the complexity of managing agents by:
- Distributing only one download per OS. Users download and install one product for a platform irrespective of whether the software is used as an agent, or as the hub, or both.
- Ensuring different versions are compatible, both upwards and downwards. As a result, software installations don’t all have to be upgraded at the same time. HVR’s release notes document whether a fix or feature requires an update on the capture, the integrate, or the hub machine.
- Abstracting the connection information in the Location Configuration. i.e. the connection to the system is defined once, and its use is independent of how it was defined.
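The abstraction in the last point can be pictured as follows. The field names and the registry are hypothetical—HVR's actual Location Configuration schema differs—but the principle is the same: define the connection once, reference it by name everywhere else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    """Connection details defined once; replication channels refer to
    the location only by name, independent of how it was defined."""
    name: str
    db_kind: str
    host: str
    port: int
    database: str

def connect_string(loc: Location) -> str:
    # Hypothetical rendering of the stored details into a DSN.
    return f"{loc.db_kind}://{loc.host}:{loc.port}/{loc.database}"

# Defined once, centrally:
registry = {"src_oltp": Location("src_oltp", "oracle", "db01.internal", 1521, "ORCL")}

# Any channel just references "src_oltp"; the details stay in one place.
dsn = connect_string(registry["src_oltp"])
```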
2. Impact on the database server(s)
Impact on the database server is related to the solution’s performance and scalability. HVR’s agents obviously introduce additional system resource utilization, but with compressed data transfer the additional CPU utilization is traded off against lower network utilization.
Also, when HVR executables can access the transaction logs directly, there is far less overhead than when log fragments are served through a database that is also handling many concurrent requests, managing security, etc. As a result, HVR can achieve higher throughput using direct capture than remote capture. Finally, to scale a setup with only agent-less capture, the only real option is to add resources, i.e. use a larger machine where the software runs.
Scalability in a distributed environment is generally achieved by using more resources across the distributed systems.
3. Ability to install third-party software
There are scenarios in which the software cannot be installed on the database server(s), e.g. because the database runs on a managed service like Amazon Relational Database Service (RDS). To support cases like this, HVR supports SQL-based access to perform log-based CDC, and remote access to deliver changes. Depending on the physical location of the systems, we still recommend the use of agents: e.g. for an on-premises source and an AWS RDS-based target, communication across the WAN (into the right availability zone) is preferably performed using the HVR protocol rather than a database connectivity protocol.
Note also that, to mitigate the requirement to install HVR on the production database server(s), HVR supports direct capture from a standby database (Oracle and SQL Server) or from a system that only has access to the transaction log backups (Oracle, SQL Server, or SAP HANA).
Common HVR data integration architectures
With cloud-based data stores included in almost every discussion, there are two common architectures.
- Using an HVR hub on-premises with one or more agents in the cloud, in the availability zone of the target, and encrypted communication between on-premises and cloud. The benefit of this architecture is that only the firewall into the cloud has to be opened. On-premises agents may or may not be used, either on primary database servers or on standby systems, to scale the environment or maximize throughput.
- Using the HVR hub in the cloud with a proxy in a DMZ (De-Militarized Zone) and agents close to, if not on, the on-premises systems. The hub in the cloud requires the on-premises firewall to be opened for communication, but only into the DMZ, with the proxy controlling connectivity to the actual data endpoint(s). In the cloud, agents may still be used for scalability and to ensure optimum communication into multiple availability zones (e.g. for high availability).