Making Sure Your Data is In Sync
Comparing your Data Across Multiple Source and Targets
When you have data stored in multiple applications and want to consolidate it into a central data warehouse or data lake, you need to make sure the data in the source and the target remain consistent. But while consolidating data seems like it should be easy, the process is deceptively difficult. Successfully keeping data synced requires you to consider a number of factors, including data types, character set types, and protection from human error.
Syncing up data types
Data type conversions are difficult to manage. For example, converting dates from one format to another can get complicated because some technologies include a time component in a date data type, and others don’t. In addition, binary data cannot always be represented using identical data types. In order to ensure that you don’t lose any data and/or data value precision, make sure data types in your source and target are compatible.
Character set conversions
Character set conversions can be very complex. A dirty little secret is that some applications allow you to store incongruent character data. For example, if your database is in a Western European language using single-byte characters, it may allow you to store Chinese or Japanese double-byte characters. Although the application can convert between these character types, when you move the data to a data lake, the new tools processing the data may be unaware that the data is stored in a different format. Standardizing on an all-encompassing character set like Unicode for all data sets is a way to ensure that you can store all data irrespective of origin and character set.
Accounting for human error
Another issue to consider is the human factor. While the press likes to talk about the risks of hackers getting into data, the challenge is actually more mundane. Humans make mistakes. And while production databases have numerous safeguards to prevent end users from deleting data or making mistakes that can corrupt data, rarely do these safeguards entirely eliminate issues. Users can delete data without realizing it. You need to put in place protections to avoid loss of data for these accidental cases.
You have two ways to safeguard your data from human error. You can maintain copies of your data through traditional replication and through Big Data replication.
With traditional replication, you replicate all the data on the source (Point A) to the target (Point B). Whenever you do an insert or update, add a column, drop a column, or use data definition language (DDL) commands (e.g. create, modify, and remove database objects), the data is replicated over to the target. When you use HVR tools for data replication, you can take advantage of HVR’s Compare capability to validate the data, fix it if the data is out of sync and account for accidental data loss on the destination system.
In the traditional data replication use case, what happens if we perform a delete? We don’t necessarily want to replicate the deletes, but we do want to mark the row in the table in the data warehouse that was deleted. This notification informs the downstream consumers that data was deleted in the source as well as what data was deleted should they need it back. HVR offers a “soft delete” that flags a column on the target and indicates if it was deleted on the source. HVR’s Compare feature is smart enough not to compare rows that the soft delete function marked as deleted in the source system.
In some data warehouses and even more data lakes in file systems like S3 or HDFS where there are no updates or deletes into the files, we turn data into a time series. With a time series, inserts, updates and deletes on the source become the equivalent of inserts (e.g. new rows) on the target. To perform a Compare in this case, HVR enriches the data by adding metadata to the target system. For example, we can add an operational type (such as insert, update, delete on the source), note when the operation occurred (by adding a unique timestamp and maintaining in a table) and specify the source user and source system.
How do you compare the time series data on the target with an ever-changing window of data on the source? HVR’s Compare feature can take the time stamp rows, coalesce them down to a single row on the target and compare that to the source system. Deletes aren’t included in the compare result because there won’t be anything on the source. We take the updates and integrate them with all the inserts we did before so we have a nice clean row where we can compare the source and the target.
If you have any questions on how you can more effectively synchronize your data across sources and data warehouse/data lake targets, we’re happy to answer them. For more information, please contact us.