Mark Van de Wiel

Data Warehouse Replication

Most readers will be familiar with the proverb “If you have a hammer, then everything looks like a nail” to indicate that one may be inclined to use the wrong tool to accomplish a task simply due to familiarity with the tool. Even though there may be a better way to achieve the same result, in the world of real-time data replication, one may be inclined to rely on log-based change data capture and replication to implement a solution. As an example let’s look at multiple identical copies of a data warehouse environment, continuously accessible to end-users based on their geographical location. The data warehouse is populated in real-time with information out of the CRM application as well as frequent bulk loads of pre-processed weblogs.

The traditional thought may be to set up replication from the CRM system into one of the data warehouses and then cascade the replication to the other sites. Any weblogs loaded into the primary site will automatically be replicated the other sites, and everything works just fine. But what if there was an alternative, arguably better way?

Before talking about the alternative approach, let me make some assumptions about this environment:

  • Target database is Oracle. Of course, the target database could be another platform but I am simply more familiar with Oracle and its options than I am with some of the other databases.
  • Data volumes are very high, so load performance is critical.
  • High Availability for the Data Warehouse is based on (1) the geographical distribution of the system (users can connect to another system if one is down) and (2) backup/restore of a surviving system rather than the traditional OLTP approach to restore the most recent full backup and replay transaction logs until current.

With these assumptions in mind, the original suggestion to use one date warehouse as the master looks a little less attractive, particularly because of the load performance requirement. Logging simply trades availability for performance, and real-time data replication tools require all loads to be fully logged in the transaction log.

Oracle Database provides NOLOGGING operations (note: performing NOLOGGING operations impacts the recoverability of a database). Oracle Database also provides table partitioning that is commonly licensed for a data warehouse environment to benefit query performance. Table partitions can also be used effectively in conjunction with (limited) bulk loading strategies by performing a so-called Partition Exchange Load (PEL). What if instead of loading the data with full logging into a heavily indexed fact table we load the data with the NOLOGGING option into an empty table, build the indexes, and perform a Partition Exchange Load operation? Chances are the second option is a lot faster to perform the load, and chances are there is a lot less impact on the query workload on the system when performing these operations.

However, the challenge then is: how to keep the multiple copies of the data warehouse in sync?

With HVR you could keep the multiple data warehouses in sync by using extensions the product provides out-of-the-box, namely the Agent action. The Agent action can do anything you program it to do in pretty much any programming language that runs on your system (e.g. shell scripting, perl, etc.). So the Agent action can perform a bulk load, build indexes, and perform a Partition Exchange Load operation, provided the data file is available on every server (and HVR could actually help with that too, if that were a challenge). When does the Agent action perform the load? The Agent action is triggered as part of the data flow so a row in a table may be the trigger to kick off the bulk load at a particular point in the data flow to ensure any data-related pre-requisites such as referential integrity references are taken care of.

And if this Agent triggering table was (in the example) part of the CRM system, then a different architecture can make the implementation not only better performing but arguably also more resilient to outages, given that the original setup relied on one of the data warehouse systems to be the master that is replicated to the other sites. A revised architecture could send data directly from the CRM system to all of the data warehouse systems, making the bulk load an integral part of the processing on the individual data warehouse server.

Think about how the ability to trigger functionality as part of a data flow could make your application perform better or be more available in real-time.

And as always, if you’d like to try your ideas using our software then please contact us for a free trial!

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

Test drive
Contact us