What to Look for in a Data Validation Solution for Replication
Many of us have experience moving to a new home or city. Personally, I have moved across continents and within the contiguous United States. In any moving process, it’s common to end up with missing valuables or broken items that leave you wondering if you should have packed better, picked a different moving company, or just thought through the integrity of your valuables before, during and after the move. In my experience working as a product manager for a data replication solution, many customers shared similar concerns when moving their data. How can you trust the integrity of the data being moved and delivered from it’s “home” to your business users?
When your cloud data movement projects involve hundreds of gigabytes of data per day, latency and data integrity can appear to be at odds. However, most enterprises fueled by data-driven decisions cannot accept this perceived trade-off. For example, a use case of moving financial data into a cloud-based data lake like Amazon S3, low latency, and high fidelity are equally critical and cannot compete for priority.
Considering that the success of data movement projects depends on the integrity of the data being delivered, let’s examine the reasons data quality could be compromised within a data pipeline, data validation solutions, and things to look out for when evaluating data validation options.
The most common root cause of databases—on either end of data movement—to be out of sync (let’s call them source and target from here on) is how data is handled by the endpoint databases. Take trailing white spaces for example and how each database type handles it.
When the source is an Oracle database with column type as VARCHAR2 and the target is a SQL Server with column datatype as a TEXT, trailing white spaces on the source are trimmed on target by default. If your downstream applications rely on a number of characters or bytes of storage of these values, this trailing white space or lack thereof will show a differential output when pointing to different databases.
Another example is the scale and precision differences supported by database drivers that can result in different extents of rounding between source and target column values. Customers/applications sensitive to rounding differences will find this an important requirement for accuracy in reporting.
When replicating changes from source to target databases, uniquely identifying the row(s) that change with each operation or a set of operations within a transaction requires a primary key (PK) or unique index (UI) in order to consistently apply changes to the destination database.
While some replication solutions can handle replication of tables without PK/UI, they clearly state the possibility of inconsistent updates due to the lack of a unique identifier to the row being transacted on. That means there is less certainty on the row changes accurately propagated to the destination database.
Table structure differences:
Table structure differences arise when schema changes are not automatically applied to the destination databases. For example, when a column is added by an application user but the new column isn’t automatically propagated to the destination database, subsequent changes involving the newly added column are not captured as part of ongoing replication. Thereby leaving your source and target databases out of sync. Same with a dropped column, if your data replication solution cannot automatically adapt to table structure changes, your source and target will almost always be out of sync from the moment these schema changes occur.
Mature data replication solutions have guardrails in place to prevent missing transactions and missing data. Once in awhile user error causes incorrect points of recovery after a failure, incorrect resetting of the checkpoints from which changes are expected to be replicated, or an inconsistent reconciliation point between initial data loads and replication. Also, certain databases, specifically the SQL Server database, truncate database logs periodically and if your replication solution isn’t current at the point of truncation, there is a high likelihood of missed transactions for log-based CDC solutions.
As you dig through various data movement use cases, there are many reasons why source and target databases could fall out of sync. Regardless of the cause of the sync issue, your replication solution needs to be able to give you confidence in the quality of your data.
In the next section, I will explore data validation options available in popular replication solutions. Keep in mind, as a customer, your unique needs require unique solutions. I am simply giving you food for thought when interrogating your replication vendors for their capabilities around data validation.
It’s imperative that your replication solution provider offers a data validation solution that’s built into core replication. Replication solutions that require you to make a separate purchase simply to validate data—that their solution should have moved with zero issues in the first place—defies realistic expectations.
Some popular data replication solutions require you to purchase a separate data validation solution that is difficult to set up, more expensive than replication itself, and most importantly, requires separate administration and maintenance. Such solutions incur a high total cost of ownership, a long time to market, and friction at the time of adoption.
Application databases transacting with end-users directly impact your bottom line. When a database under critical transaction loads is being replicated into your enterprise data lakes for real-time data delivery, your replication solution should be able to perform data validation without taking the database/replication offline.
This capability requires innovative features that can account for changing data that has not yet been applied at the target. Also, required are features that can clearly delineate data that is out of sync versus what is being transacted upon.
Your data validation solution should be able to support ‘online’ data validation.
At the heart of any data validation query is a data retrieval executed on the source and target databases, with some level of filtering, stitching, and sorting to be able to check for any data differences across endpoints.
The inherent nature of such activities is to consume resources on the source and target databases. As the user, you have to find an ample opportunity to schedule such activities in order to minimize the impact on the database resource consumption.
Recovery and Restartability:
When replicating tables in the order of terabytes and verifying data validity across wide area networks, it’s imperative that your data validation solution is able to recover and restart with the least amount of overhead.
For example, when large tables are being compared across WAN for hours at a time and an unexpected network outage occurs, your replication vendor should be able to support stateful data validation jobs so that recovery and restartability are much more efficient when compared to stateless data validation routines.
Such a capability would be quite efficient when network bandwidths are premium and somewhat intermittent.
Finally, options and details matter. While data replication solutions sometimes offer data validation capabilities, they typically have just a bulk option. In such cases, all you get is a checksum on both source and targets and a binary output of ‘yes in sync’ or ‘no not in sync’. When databases are in sync this bulk option is quite handy for a quick check but when your source and target databases are out of sync, how do you determine which rows are out of sync, why they are out of sync, and how to fix them?
In your data validation solution, look for both bulk/brute force quick options along with row by row/surgical slow options. When you are able to pipe one row at a time and clearly identify out of sync rows and mismatched columns, your solution provider should have the capability to surgically repair the rows across heterogeneous databases.
There are also some solutions that approach data validation as more of a metadata comparison – for example, a summary report on row counts, last transaction timestamps, and the last user accessed for source and targets. While such reports are interesting and informative they don’t save you time. When many rows are out of sync, you still have to find the proverbial needle in many haystacks.
In conclusion, no matter how robust your data replication solution is, there will be cases where data integrity during data movement can be compromised. If data fidelity is important, evaluate replication solutions that offer a built-in, robust, comprehensive, and in-depth feature set that can help your business trust the data you provision for them.