Data Replication Performance: Refresh
The first blog in this series discussed a general approach to achieving maximum performance. In the second blog, I highlighted HVR’s distributed architecture and its performance and scalability benefits.
This post focuses on the initial (or one-time) load: Refresh. Subsequent posts in this series detail data replication performance and data validation strategies.
Initial Load Strategy
HVR’s feature set is flexible and supports different approaches to achieve the initial data load.
You may use HVR for the initial load and benefit from automatic alignment between the initial load and ongoing data replication. You may also use non-HVR technology to populate the target, or you can choose a hybrid approach. For example, you provide the DDL for the target tables and HVR moves the data, or you load the data from a system that is not the actual source but a point-in-time backup of it.
Consider your data load strategy when choosing what option to use:
- Do you want to minimize the duration of the initial load?
- How much load on the data source can you afford to support your initial load?
- Can you extend your replication setup over time? I.e., can you start small and grow from there?
You may not want to achieve maximum data load performance (the highest number of rows per second) because of the load this would cause on the source.
HVR’s Refresh capability is used to perform the initial load. Refresh can optionally create or alter table definitions to match the table definition in HVR’s repository and refresh the data.
Refresh can be used for the initial load, subsequent data refreshes, or to propagate table definition changes. If you configure action AdaptDDL, then HVR will automatically use its Refresh capability to propagate table definition changes and refresh data as needed.
Refresh supports all technologies supported by HVR, including:
- Data loads into relational databases, clustered or not.
- Delivery of data into Kafka.
- Writing data onto a file system in a configurable file format.
In all cases, data is streamed into the target. There are no queues in HVR’s infrastructure to land the data during the refresh.
HVR refresh supports two modes:
- Bulk: perform a bulk load into the target. For a database target, by default, the target table will be truncated first if it already exists.
- Row-wise: compare data between source and target, and resolve the differences using conventional DML. Row-wise refresh is only available for database targets.
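The row-wise mode can be pictured as a keyed diff between source and target. A minimal Python sketch, assuming rows are keyed by primary key (the function and row layout are illustrative, not HVR internals):

```python
# Sketch of a row-wise refresh: compare source and target rows by primary
# key and derive the conventional DML needed to make the target match.
def rowwise_diff(source, target):
    """source/target: dict mapping primary key -> row values.
    Returns (inserts, updates, deletes) to apply on the target."""
    inserts = {k: v for k, v in source.items() if k not in target}
    deletes = [k for k in target if k not in source]
    updates = {k: v for k, v in source.items()
               if k in target and target[k] != v}
    return inserts, updates, deletes

src = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
tgt = {1: ("alice", 10), 2: ("bob", 99), 4: ("dave", 40)}

ins, upd, dele = rowwise_diff(src, tgt)
print(ins)   # {3: ('carol', 30)}
print(upd)   # {2: ('bob', 20)}
print(dele)  # [4]
```

Every compared row costs a source read, a target read, and potentially a DML statement, which is why this mode only pays off for limited data volumes.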
In its metadata repository, HVR maintains minimal information about the table definition, including:
- Column names, data types, relevant character encoding, and optionality.
- Row identifier (primary key), required for replication to function correctly.
- Distribution key information, relevant for databases where data distribution matters, for example, Teradata, Greenplum, and Redshift.
Recording the distribution key is important if HVR creates the tables and for ongoing replication to perform well. Review the distribution key(s): Redshift, for example, supports only a single column per table as the distribution key. Use actions to adjust the key as needed.
Avoid using a (combination of) column(s) with few distinct values or significant skew as the distribution key. Without a good distribution of the data, the initial load and ongoing replication will likely be limited by a lack of parallelism.
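A quick way to sanity-check a candidate distribution key is to look at its distinct-value count and the share of rows that land in its largest bucket. A small Python sketch (the column values are illustrative):

```python
from collections import Counter

def distribution_key_stats(values):
    """Profile a candidate distribution key column: how many distinct
    values it has, and what fraction of rows fall in the biggest bucket.
    Few distinct values or a large max share both indicate poor spread."""
    counts = Counter(values)
    n = len(values)
    distinct = len(counts)
    max_share = max(counts.values()) / n
    return distinct, max_share

# A skewed column: one value dominates, so most rows would land on one node.
skewed = ["US"] * 90 + ["DE"] * 5 + ["FR"] * 5
distinct, share = distribution_key_stats(skewed)
print(distinct, share)  # 3 0.9
```

With 90% of rows in a single bucket, parallel load and query work would concentrate on one node regardless of cluster size.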
Action DbObjectGeneration may be used to augment the create table statement.
HVR Refresh creates tables and/or primary key indexes based on the repository table definition. The data types in the create table statement may be adjusted to avoid loss of precision.
For all commonly used target platforms, HVR supports bulk load following technology-specific best practices. For example, in Snowflake, HVR supports data staging on cloud storage, followed by a copy into Snowflake. On Teradata, HVR leverages TPT bulk load. For Oracle, direct-path loads are performed through Oracle's Call Interface (OCI).
Maximum data load performance, i.e., the highest number of rows per second, is only achieved with a bulk load. Only consider row-wise refresh if data volumes are limited and the target database can efficiently process conventional DML operations. Row-wise refreshes can work well on traditional OLTP databases like Oracle, SQL Server, PostgreSQL, and others.
HVR supports different kinds of transformations:
- Mapping source schema to target schema.
- Mapping table and/or column names on the source to table/column names on the target.
- Deriving column values through formulas.
Bulk load performance is not affected by mapping of schema, table, or column names. However, bulk load performance may degrade severely when populating derived column values, unless:
- The value is static, or
- The derived value is presented as part of the data set that is bulk loaded. I.e., the value is computed as part of the data extract rather than as part of the data load.
Use the /Context option on the action to perform the formula as part of the select, and explicitly use that context during the refresh. Create an equivalent action that performs the formula on the target for when the context is not in place. During data replication, derived columns are more efficiently computed on the target.
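The idea of computing a derived value during the extract, so the bulk load receives a plain column rather than evaluating a formula per row, can be sketched as follows (the table layout and the formula are hypothetical):

```python
def extract_with_derived(rows):
    """Yield extracted rows with the derived column already computed,
    so the bulk load sees an ordinary column instead of a formula.
    Row layout (order_id, qty, unit_price) is a made-up example."""
    for order_id, qty, unit_price in rows:
        total = qty * unit_price  # derived value, computed at extract time
        yield (order_id, qty, unit_price, total)

rows = [(1, 2, 5.0), (2, 3, 4.0)]
print(list(extract_with_derived(rows)))  # [(1, 2, 5.0, 10.0), (2, 3, 4.0, 12.0)]
```

The bulk loader then streams four plain columns; nothing per-row happens on the load path.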
Filters – Restrictions
Common uses of filters include:
- Limiting the data set from the source. For example, you only want data for certain categories or a single company code (in SAP). Or maybe you have historical data in the source that you do not need in the target.
- Splitting a large table into multiple parts. Depending on your environment and configuration, refreshing the table at once may take too long, or result in errors like ORA-01555 “Snapshot too old” in Oracle. Individual parts of the table become a unit of recovery should a load fail for whatever reason.
- Enabling parallelism.
HVR uses the term slicing to split a large table load into multiple parts.
HVR provides the action Restrict with the argument /RefreshCondition to define a data refresh filter. When applied against the source, the SQL fragment becomes part of the where clause in the select statement.
When applied against the target, only rows that match the condition are affected by the refresh. I.e., on the target HVR will delete rows that match the refresh condition before pulling data from the source (with or without a filter).
Note: removing data from the target is not available through HVR for Kafka or file systems. You must take care of purging any existing data outside of HVR on these platforms.
A common approach to prevent truncating the data in the target database during the refresh is to define action Restrict /RefreshCondition="0=1" on the target. This condition causes HVR to issue "delete from <table> where 0=1" on the target; no rows are deleted, and most databases recognize the contradictory predicate immediately and avoid a full table scan.
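The effect of a contradictory predicate can be verified against any SQL database; a small sketch using SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER)")
cur.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])

# A predicate that can never be true deletes nothing, so the existing
# target rows survive the refresh untouched.
cur.execute("DELETE FROM orders WHERE 0=1")
remaining = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 3
```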
When slicing a large table (for example, to enable parallelism), make sure the filter condition uses a (combination of) column(s) whose values do not change. Unless you load all slices from a static source, e.g., a point-in-time backup, rows could move from one slice to another. This can cause inconsistencies: some data may be loaded more than once in different slices, and some rows may not get loaded at all.
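Building non-overlapping slice conditions over an immutable numeric column can be sketched as follows (the column name and value range are hypothetical; half-open ranges guarantee each row falls into exactly one slice):

```python
def slice_conditions(column, lower, upper, slices):
    """Build non-overlapping WHERE fragments over an immutable numeric
    column. All slices but the last use a half-open range [lo, hi);
    the last slice closes the range so the upper bound is included."""
    step = (upper - lower) // slices
    conds = []
    for i in range(slices):
        lo = lower + i * step
        if i == slices - 1:
            conds.append(f"{column} >= {lo} AND {column} <= {upper}")
        else:
            conds.append(f"{column} >= {lo} AND {column} < {lo + step}")
    return conds

for cond in slice_conditions("order_id", 0, 1000, 4):
    print(cond)
# order_id >= 0 AND order_id < 250
# order_id >= 250 AND order_id < 500
# order_id >= 500 AND order_id < 750
# order_id >= 750 AND order_id <= 1000
```

Each fragment would become the /RefreshCondition of one slice; because the ranges partition the key space, every row belongs to exactly one slice.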
HVR refresh parallelism can be achieved in three different ways:
- Combine multiple tables in a single job, and process multiple tables in the job in parallel. Parallelism is controlled by setting the number of tables you want to load in parallel when you create or submit the job.
- Create multiple jobs, each with one or more tables, and run multiple jobs in parallel. Parallelism within the job can be controlled upon job creation. Parallelism across jobs can be controlled in the context of the scheduler using the quota_run attribute. This attribute can be set as part of the creation of jobs. The setting defines how many jobs can run concurrently within its tree of child jobs.
- Slice a large table into multiple parts as separate jobs. Parallelism is controlled using the quota_run attribute within the context of the scheduler.
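Capping concurrency across a set of loads, analogous to what a scheduler quota such as quota_run does, can be sketched with a thread pool (the table names and the load function are placeholders, not HVR job internals):

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name):
    # Placeholder for a per-table refresh job; a real job would select
    # from the source and bulk load the target.
    return f"{name}: loaded"

tables = ["orders", "customers", "line_items", "payments"]

# max_workers caps how many loads run concurrently, the way a scheduler
# quota limits concurrent jobs within its tree of child jobs.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(load_table, tables))

print(results)
```

Raising the cap increases parallelism until some resource (source reads, network, target load slots) becomes the bottleneck.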
Note: in case of a failure, the job, by default, will restart and redo all of its operations.
Whether you can slice a large table and load in parallel depends on the target. For example, on file systems like S3 or ADLS, you can run in parallel as long as you include the slice number in the file name (through Integrate /RenameExpression) to ensure unique file names across slices.
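One way to embed the slice number in generated file names, so files written by parallel slices never collide, might look like this (the naming pattern is an illustrative assumption, not HVR's actual /RenameExpression syntax):

```python
def slice_file_name(table, slice_no, part_no):
    """Build a file name that is unique per slice and per part within
    the slice. The pattern here is a made-up example."""
    return f"{table}/slice={slice_no}/part-{part_no:05d}.csv"

print(slice_file_name("orders", 3, 1))  # orders/slice=3/part-00001.csv
```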
Note: slicing a table may decrease the source table select’s efficiency because, in many cases, data from a full table scan is discarded.
A slice condition matching an existing partitioning scheme on the source can avoid this efficiency loss.
If data retrieval from the source database is limiting refresh performance, you may be able to introduce parallelism on the select from the database.
Most databases do not allow more than one concurrent direct path load into a single table due to an exclusive DML lock. Of course, despite this limitation, you may still slice the table to split the load into multiple separately recoverable parts.
Depending on your needs to load a table quickly, and the bottleneck preventing faster loads, you may choose to use multiple staging tables to load data in parallel. Once loaded, combine the data in the target using insert as select.
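The staging-table pattern, loading slices into separate tables and then combining them with an insert-as-select, can be sketched with SQLite (in a real setup the per-slice loads would run in parallel, each holding the lock on its own staging table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER, val TEXT)")

# Load each slice into its own staging table; because each load touches
# a different table, exclusive per-table load locks do not conflict.
slices = {"stage_1": [(1, "a"), (2, "b")], "stage_2": [(3, "c"), (4, "d")]}
for name, rows in slices.items():
    cur.execute(f"CREATE TABLE {name} (id INTEGER, val TEXT)")
    cur.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)

# Combine the staged slices into the target with one insert-as-select.
cur.execute("INSERT INTO target "
            "SELECT * FROM stage_1 UNION ALL SELECT * FROM stage_2")
conn.commit()

loaded = cur.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(loaded)  # 4
```

The final insert-as-select runs inside the database, so the combine step moves no data through the replication infrastructure.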
We continue to enhance HVR’s refresh capabilities to allow for faster loads over time. Refer to the documentation, or contact us for up to date information and strategies on achieving maximum refresh performance.
In the first blog post in this performance series, we discussed strategies to achieve maximum performance. You may be running HVR in a distributed setup, with multiple aspects of the architecture potentially limiting performance, including (but not limited to):
- Source database select
- Any network connection between the database and where HVR – possibly an agent – runs the select
- HVR source agent compressing and optionally encrypting the data
- The network between source and hub, or between the hub and target agent
- Processing the data by the target agent (e.g. formatting and encrypting data for staging, or to write to a file system)
- Any network connection between the target agent and the target, which may be a staging area
- Actual data load performance into the target, for example, a busy database session
Maximum refresh performance is achieved when your load is most efficient, and you reach a performance-limiting ceiling in the infrastructure.
From working with customers, we have seen net data load rates at an aggregate of more than 1 GB/s onto S3 using multiple sliced refreshes in parallel (balancing the number of slices to achieve optimum select efficiency). In this case, the ultimate limiting factor was CPU utilization on a powerful server with Intel Xeon E7-8894 v4 processors, formatting data, and performing client-side KMS encryption. If we used more than five slices per table the bottleneck shifted to Oracle Database disk IOs that maxed out at 10 GB/s on this configuration.
Initial Load Performance
It is impossible to know upfront what data load performance you will achieve. Multiple factors may limit performance. Very high load speeds can be achieved through HVR on a well-balanced configuration.
The next blog post will discuss data replication performance.