Business users access replicated data to offload complex queries from a transactional system or to consolidate data from multiple systems. Data warehouses and data lakes succeed or fail based on the ability to provide data that the business can trust.
In a high-volume data replication environment, you must be able to trust the replicated data. All changes processed on the source must correctly be processed on the target, applying any defined transformations. Building trust takes time.
HVR’s Compare function is a data validation function that can validate whether data on the target matches data on the source in a heterogeneous environment. This blog post covers best practices around data validation performance.
HVR validates data by selecting data from the source and target, applying any transformations to map source to target data. Supported endpoints include natively supported databases and file systems if the files can be accessed through Hive or with data stored in a CSV format. Data can be validated for different kinds of integration, including soft delete and so-called TimeKey replication representing every change as a new row.
Following a minutious comparison, HVR returns the result of whether or not a table is identical. HVR Compare can run in two modes providing a different level of detail when data is out of sync.
- In bulk mode, in which a checksum is computed on all data returned by the query. HVR compares the checksum values on either side of the comparison to report whether data is in sync. Checksum computation is performed in a distributed setup.
- Row-wise mode, bringing all data, sorted, together on the right side of the comparison. HVR then will validate row by row, column by column if data is identical. For any out-of-sync condition, HVR reports whether it requires an insert, update, or delete to resolve the deltas. When run in a verbose mode, actual differences are written to a file.
HVR can consider in-flight changes when operating in row-wise mode, using the option “live compare.” Bulk compare reports in-flight changes as differences.
(For SAP cluster and pool tables, HVR only supports row-wise comparison. Live compare is not currently available for such tables.)
Classic and Event-Based Compare
The Compare dialog in HVR provides two ways to schedule the compare:
Future versions of HVR will standardize event-based comparison. For now, you must choose between an event-based compare to run live compare or compare a database with CSV-based files in a data lake. Classic compare is still available to create a recurring schedule of compare jobs.
In case of an error, the classic compare job will automatically restart and rerun in its entirety. An event-based job will resume where it left off, even in the middle of a sliced compare of a large table.
Data validation is not easy between the same type of systems. In a heterogeneous environment, it is even harder. Some of the challenges include differences in data types and character encoding. Also, the data type float is considered lossy, and sometimes data is corrupted. Some of these factors also influence sort order in a heterogeneous setup.
HVR’s Compare function standardizes data types internally, coercing data to fit the internal representation. In the channel definition, you can control how HVR deals with coercion errors. By default, a coercion error results in a fatal error.
Data in transaction processing systems often follows a pattern. Some data is always subject to change. After its first insertion, other data may continue to see changes then become dormant over time. Some data is inserted and never changes, like ledger data or machine-generated data points. Use your knowledge about the change patterns of data to be smart about data comparisons. If data does not change, then consider a low comparison frequency.
Data Validation Performance Recommendations
Validating data can be very resource-intensive. Focus on the business requirement as you implement a data compare strategy. For example, you may perform longer, more resource-intensive comparisons over the weekend and limit runs during the week to the most critical tables.
Achieving Maximum Performance
HVR runs at maximum compare performance if CPU cores for the compare processes are 100% busy during the run. Depending on your architecture and the compare mode, the processes doing the work are on the agent server rather than the hub. Not being able to consume 100% CPU core utilization may indicate an IO bottleneck.
A row-wise compare operation performs a memory-intensive sort. HVR will push the sort operation to the database, if possible, depending on the data types in the table’s key and the database’s code page. For HVR to sort, a large data volume may spill to disk, limiting performance. In this case, HVR will provide a recommendation to increase memory allocation to avoid spilling to disk. Only allow incremental memory allocation if such memory is available. Avoid memory swapping to disk, also if this results in HVR spilling to disk.
To limit network utilization, the row-wise comparison takes advantage of HVR’s compression as data is sent to the right side of the comparison.
Use Bulk Compare
HVR’s bulk compare computes the checksum for the comparison as close to the replication endpoint as possible. A bulk comparison does not require a sorted data set. Only the checksum value and minimal logging get sent to the hub to determine the data validation result.
To leverage the bulk compare mode, you must get to a static data set.
- Review data replication statistics to determine whether tables are active during your planned data validation window. For example, you may find that many tables are not changed during the weekend. This may be a good time to run bulk comparisons.
- Filter data down to a static data set. For example, every week, compare data between the day before and the week prior to this day. Some knowledge of the application is required to determine if such a filter may result in a static data set.
Use row-wise comparisons, if:
- Bulk compare finds unexpected differences, and you want to look at the different data.
- It is impossible for you to identify a static data set, so you must use run a live comparison.
Leverage Point-in-Time Comparison – If Possible
Bulk compare will be an option with in-flight changes if a point-in-time comparison is possible.
Some scenarios enable an explicit point-in-time comparison. An Oracle Database allows a flashback query through HVR’s “As of Moment” option. A data lake target, using hvr_integ_seq as the TimeKey includes the SCN on the target. Because the target using a TimeKey always records new values, a point-in-time query can be executed on either side of the comparison, irrespective of in-flight changes.
Another scenario for a point-in-time comparison includes a filter on an application column resulting in a static data set. Consider an insert-only scenario like a ledger. With data replicating every day, you can validate yesterday’s data (e.g., based on a creation date or range of document IDs) that you would expect to be identical.
Be careful when using a “last_modified” column as the basis for a filter. Presumably, the column value changes for every update, so the data set for the comparison may be subject to in-flight changes.
Any data transformations as part of the channel definition should be performed as part of the data select. These would be the same transformations used when getting the data from the source during the refresh operation. For example, to set a field to uppercase, you may have used an SQL function UPPER.
Using the action Restrict, you can introduce a filter on the query that forms the basis for the comparison. You can use a dynamic filter using a lookup from another table or take advantage of runtime variables such as the current time. You can also use user-defined variables that you can provide at runtime.
Slicing a table, i.e., to divide a large table into smaller portions, is a special kind of filter. To define your slicing condition outside of the ones HVR provides for most platforms, use the special attribute in the Restrict action.
Leverage Projection – If Needed
By default, HVR compares all column data in its entirety, including any LOB values. This can result in a lengthy and resource-intensive process. Use action ColumnProperties to eliminate less important columns from the comparison to help speed up the runs.
You may have a surplus capacity to perform data validation. HVR provides the ability to parallelize the compare run. Note HVR’s parallelism runs on top of any filters or projection you may perform to minimize the data volume to be compared.
- Compare multiple tables in parallel.
- Slice a large data set from a single table in multiple smaller parts.
Optimize the Query at the Endpoint
Always review query execution at the comparison endpoint to avoid unnecessary bottlenecks.
For example, the slice condition you use may end up running a full table scan. In this case, significant portions of the query result get discarded by the database. However, all data is retrieved off the disk, possibly introducing a database IO bottleneck.
In some cases, the database optimizer may have chosen an undesirable execution plan. For example, an index may be used, resulting in scattered database IO, when a table scan would be more effective. Tune the database query in such a case to get a more desirable plan. Depending on the database, you may be able to use hints.
Leverage database partition pruning if possible, either during slicing or for filters you created that may use application values for manual restrictions.
Poor data quality in a data warehouse or data lake will likely limit its success. To replicate data accurately from source to target is the start of sound data quality. HVR provides a data validation function to help you keep an eye on accuracy.
Read more from the Data Replication Performance series
Data Replication Performance
Data Replication: Performance Architecture
Data Replication Performance: Refresh
Data Replication Performance: Capture and Integration