Mark Van de Wiel

Data Validation – The Value of Live Compare

For years organizations have built data warehouses, analytical systems, and, more recently, data lakes. Due to the nature of the query workload – “can we get fast query response times please?” – and the fact that more often than not data from multiple systems is consolidated in a single environment, it is safe to say that in all these efforts data integration plays an important role. Data is copied, often between heterogeneous environments, and sometimes transformed. A key question business users will want to know the answer to is: am I looking at the right data?

And if your organization performs external (financial) reporting out of your analytical systems and data lakes then, more and more, regulators will have the same question.

* DILBERT © Scott Adams. Used By permission of ANDREWS MCMEEL SYNDICATION. All rights reserved.

The challenge

There are multiple generic terms – highlighted below – that relate to the challenge at hand:

  • The business user has to trust the data. Without trust, systems will not be used, or users use them extremely inefficiently. As recently as 2017, per Gartner, 85% of the data lake initiatives failed, many due to lack of trust. Of course, as KPMG points out, there are many aspects to data trust, but surely one way to breach the trust is with incorrect data. As data is moved out of the confines of the packaged application, and in some (data lake) cases, even a relational database, will the business user still trust the data?
  • Data quality, or as a subset thereof, data integrity. Is the data complete? Does data arrive timely? Is it valid (e.g. is the data in a date field really a valid date value?) Is the data accurate? Is the data in line with integrity constraints with regards to uniqueness and cross references? Muhammad Raza’s blog post is a good read about these topics.

Bit-wise correct

Different people will have different expectations when they use terms like data quality, data integrity, and even accuracy. For example Jim Harris on his Obsessive-Compulsive Data Quality blog talks about two characteristics of data accuracy:

  1. form: e.g. agreements on how dates are presented
  2. content: e.gJim Harris and James Harris may be the same person.

At HVR, we consider data to be accurate when it is bit-wise correct. That is, when data values are bit-for-bit identical between source and target, taking into consideration any transformations that may have taken place on the data during the transfer.

Achieving bit-wise correctness is particularly challenging in a heterogeneous environment with different data types on the source than on the target. Adding to the challenge, as is becoming more and more common, is a relational database on one side (e.g. an ERP database), and a data lake on a file system like S3, Azure Data Lake Store, Azure Blob Storage or HDFS, on the other side.

Data compare

For years the HVR data replication solution has included a compare capability to enable customers to validate the bit-wise correctness of their data. Traditionally there are two modes to run compare:

Bulk mode computes a checksum for every row based on all data values, with a final checksum computed across all the checksums for every row. The compare result is determined by whether or not the checksum from one side of the comparison matches the checksum from the other side. Rest assured that a single different bit will cause this result to come out as different.

  • The bulk mode is very efficient in HVR’s recommended, distributed environment with the agents close to the source and target computing the checksum and hardly any data passing across the network. However the bulk mode provides limited detail on the difference. Only the row count is reported, and whether or not the checksum values came out identical.

Row-wise mode compares every column value between source and target. This mode requires data to be sorted – introducing its own set of challenges if character sets between source and target don’t match – and passed across the network in order to compute the differences.

  • Of course, in HVR’s recommended architecture with compressed (and encrypted) communication across the wire, the data transfer is still efficient. A major benefit of the row-wise mode is the ability to report detailed differences, including insight in the exact data values that are different.

HVR has always provided the ability to filter the data retrieval for the comparison in order to get to a manageable data set that can be compared but now we have taken it to another level.

Introducing: live compare

HVR 5.6 introduced live compare with the ability to indicate whether systems are in sync, taking into account in-flight changes. Live compare is only available in row-wise mode because it requires a second phase of the comparison, which can be:

Merge in-flight changes with the observed differences in the first pass, and indicate exactly whether systems are in sync or not. Depending on the volume of data in the system, and the transaction volume passing between source and target, this approach is generally quick and efficient.

  • A two-pass compare run may indicate that for certain rows the compare run is unable to tell whether the rows are in sync because twice differences were found but the observed differences were not identical between the two runs. The two-pass compare capability is likely slower than the approach to merge in-flight changes, but of course, two-pass compare can be performed on data sets not synchronized by HVR.
  • Perform a second pass of the data to identify whether identical differences are observed. If HVR is used to replicate the changes then data replication can be “flushed” in between the two passes, or with another technology in charge of synchronization a duration can be set for a second pass of the data.

At HVR we believe that bit-wise correctness is important for all data replication scenarios, whether your use case is high availability, migration (cloud or non-cloud), geographical distribution, cloud integration, real-time analytics, or data lake. Peace of mind is important, and that’s why we include the compare capability at no extra cost with our solution.

To learn more about HVR and see Live Compare in action, register for one of our weekly demos.

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

Test drive
Contact us