How to Perform an Initial Load During Real-Time Data Integration
Imagine your employer decided to gradually migrate away from your current primary database technology to an alternative technology. And you are the one who has to make this happen. Initially the alternative technology will only be used for real-time reporting, but over time it is expected that the alternative technology becomes the primary database, with the option to keep the current technology around for legacy systems, or as a fall-back option. Given the real-time nature of the requirement you came up with a short list of heterogeneous replication technologies that you can use to implement the requirement.
Now what? Where do you start? How will you get the tables created? And what about the initial data load? None of this is easy if there without schema changes, but what if there are some?
Here is a traditional approach:
- Thanks to ANSI SQL the DDL for the new target database is roughly identical to the DDL for the current database, so you use the current database DDL as your starting point. Maybe you go look for another tool to help you with this translation.
- This is a heterogeneous scenario so you cannot rely on database vendor tools for the initial load. You want a consistent snapshot of the data and start Change Data Capture following this… What will you do? Use the database vendor tools anyway to create a duplicate (inactive) database that you can use as the source for the initial load? That takes time, requires an extra server, and still hasn’t moved any data. But it is a starting point. Then? Use a separate ETL tool (maybe an open source tool) to extract the data and load it? Any transformations can certainly be introduced in this approach, and whilst the efficiencies (or lack thereof) of remote database connectivity will determine the throughput you get at least you get going.
- You start a real-time data integration after a (possibly lengthy) initial data load. Of course you implemented any schema changes in the real-time flow as well. Replication technologies generally support a mode in which errors related to the overlap between initial load and real-time CDC are ignored. Use these options until CDC catches up to current.
After all this effort, are you in sync? Are any schema changes you implemented for the initial load identical to the changes you put into the real-time flow? Maybe you can run some row-counts to do a sanity check of the data…
You may think of initial load as a one-time step. In reality it often isn’t. For example, requirements change and now you need an additional transformation, or you must add tables that initially were outside of the scope of the project. Or for whatever reason you uncover that some tables are out of sync. What are you going to do?
I would not write this if there isn’t a better way to deal with initial load, which involves HVR. With HVR you build real-time integration between databases in a heterogeneous environment in a channel. The channel contains any transformations you want to implement. The channel can then be used to:
- Generate DDL on the target database, including primary keys (often enough to get started with real-time reporting).
- Initially load the data, automatically transforming source database data types into destination data types.
- Perform real-time change data capture and data integration.
- Provide the basis for database validation and repair.
I.e. all these steps include any transformations you implemented. Data is automatically compressed (5x – 10x compression is common) and optionally encrypted over the wire.
Two Ways to Perform Initial Load
HVR provides two modes to perform the initial load (“Refresh” in HVR terminology):
- Bulk mode: create, or truncate the target table and load the data using a bulk load method the database provides. Indexes are built at the end of the load. This mode is ideal for the initial load and for a fast synchronization if the target application can deal with the truncate followed by the bulk data load.
- Row-by-row: perform regular DML to apply data changes based on the difference between the current state of the data and the desired state the channel implements. This mode is ideal if there is some but relatively little difference between the source and the destination and rather than truncate followed by insert you want to perform deletes, updates and inserts whilst indexes are updated and data is available.
Initial load is typically thought of as going from source to target. HVR makes it really easy to reverse the roles and go from target back to source for a one-time operation. For example, the DBA thought he/she was connected to the development environment when a table was dropped, but it was the production system… Of course you can restore a backup and perform a point-in-time-recovery, but wouldn’t it be easier if you could quickly take the replicated state of the table and put it back in the production database? With HVR you can.
Can this be easy? Decide for yourself: invest roughly 10 minutes to watch Chris perform a real-time integration scenario from Oracle to SQL Server.
Want to learn more? Contact us. We’re here to help!