Helping customers architect their data management infrastructure is one of the most exciting aspects of my job. When organizations first engage with HVR, they are often in the early stages of their digital transformation journey: the realization that data assets can be leveraged to improve customer service, generate more revenue, lower risk, save costs, and more.
HVR’s real-time data replication enables organizations to consolidate data from multiple transaction processing systems into a central operational data store, data warehouse, or data lake, with little to no performance impact on the data sources. But the digital transformation journey has other aspects, including cloud adoption, a shift toward leveraging cloud services, and, with that, an evolving data infrastructure.
In this blog post, I’ll take you through a customer’s journey that started with an exploration of HVR’s Change Data Capture (CDC) capabilities in the fall of 2011, with an initial production deployment in 2012. The use case is for a near real-time data warehouse with data from SAP ECC, running on Oracle RAC, as the most important data source. Additional data sources include a couple of other relational databases and a file store.
Phase 1 – On-Premises Analytical Data Processing
In 2012, the organization deployed an analytical database Actian Vectorwise (now Actian Vector) to replace its then dated data warehouse implementation that was running on old hardware using legacy relational database technology that was arguably better suited to run OLTP than a data warehouse workload. The old data warehouse was populated through daily ETL routines, using a “last updated” timestamp on the database tables wherever possible, in an effort to avoid full reloads of the data.
The organization chose HVR to populate the Vectorwise analytical database in near real-time. As the data engineers started querying the high-performance analytical database, they realized one of the immediate benefits of HVR: log-based CDC. With HVR’s unique log-based CDC technology the data engineers could see every database change, and they discovered that the assumptions about backdated records built into the old batch ETL jobs had been flawed. Data in the old data warehouse had not been accurate all along!
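The flaw can be illustrated with a toy sketch (all names and values are hypothetical, not taken from the customer’s system): a timestamp-based incremental extract silently skips any row whose “last updated” value is backdated or not maintained, whereas log-based CDC sees the change because it reads the transaction log directly.

```python
# Hypothetical sketch of a "last updated" incremental extract and its blind spot.
rows = [
    {"id": 1, "amount": 100, "last_updated": "2012-01-05"},
    {"id": 2, "amount": 200, "last_updated": "2012-01-02"},  # edited, but backdated
]

def incremental_extract(rows, since):
    """Daily ETL: pick up only rows whose timestamp moved past the last run."""
    return [r for r in rows if r["last_updated"] > since]

# The backdated row is silently skipped, so the warehouse drifts out of sync.
print(incremental_extract(rows, since="2012-01-04"))  # only id 1 is extracted
```

Log-based CDC avoids this entirely: every committed change appears in the transaction log regardless of what any application-maintained timestamp column says.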
Phase 2 – From On-Premises Analytical Processing to a Cloud-based Data Warehouse
The on-premises Vectorwise database was extremely powerful but limited to a single server, so more data or more processing capacity required a larger server. Also, ensuring system availability turned out to be more labor-intensive than originally expected.
As a result, in 2015, the organization started exploring the cloud. Pay for what you use. Scale up and down based on what you need. No database management required. Who doesn’t want that? Amazon Redshift seemed like a natural choice. During the evaluation period, the on-premises analytical environment continued to run on the Vectorwise database. Since the customer had a positive experience feeding changes in near real-time into Vectorwise, HVR was also used to feed data into Redshift.
With the shift to the cloud, and given the ease of allocating compute resources there, the HVR hub (the environment controlling the replication) ended up on an EC2 environment. To mitigate the network administrators’ concerns about opening a firewall into the on-premises data center, an HVR proxy was deployed in a DMZ (De-Militarized Zone). The proxy acted both as the gateway to the HVR agent running on the source database server and as the gatekeeper for any traffic not allowed through. Data transfer between on-premises and AWS was also encrypted.
In 2016, the cloud-based data warehouse on Redshift went live. As part of this move, ELT routines were implemented on Redshift, taking advantage of massively parallel processing in the database to process transformations. To facilitate ELT, HVR was configured to populate extra information in the target Redshift database: commit timestamp on the source, integration timestamp into the target, and an extra column to indicate when a row was deleted in the source instead of physically deleting the row. In HVR, this function is called soft delete.
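The effect of those extra columns can be pictured with a minimal sketch. The column names (`src_commit_ts`, `tgt_integrate_ts`, `is_deleted`) and the apply logic below are illustrative assumptions, not HVR’s actual defaults; the point is that a source-side delete becomes a flagged update on the target, so downstream ELT can still see the row.

```python
from datetime import datetime, timezone

def apply_change(target, change):
    """Apply one captured change to an in-memory 'target table' (a dict keyed
    by primary key), adding hypothetical metadata columns for ELT."""
    row = dict(change["row"])
    row["src_commit_ts"] = change["commit_ts"]            # commit timestamp on the source
    row["tgt_integrate_ts"] = datetime.now(timezone.utc)  # integration timestamp on the target
    if change["op"] == "delete":
        # Soft delete: flag the existing row instead of physically removing it.
        existing = target.get(row["id"], row)
        existing["is_deleted"] = True
        existing["src_commit_ts"] = change["commit_ts"]
        target[row["id"]] = existing
    else:  # insert or update
        row["is_deleted"] = False
        target[row["id"]] = row
    return target

# An insert followed by a source-side delete leaves a flagged row behind.
target = {}
apply_change(target, {"op": "insert", "commit_ts": "2016-01-01T00:00:00Z",
                      "row": {"id": 1, "amount": 100}})
apply_change(target, {"op": "delete", "commit_ts": "2016-01-02T00:00:00Z",
                      "row": {"id": 1}})
print(target[1]["is_deleted"])  # True: the row survives as a soft delete
```

Keeping deleted rows around this way lets the ELT routines distinguish “gone from the source” from “never existed,” which a hard delete would erase.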
Soon after the go-live on Redshift, an internal group needed access to some of the same data, but with a specific dependency on an Oracle Database. With CDC from SAP ECC already in place, another endpoint was added to the existing data flow to replicate a subset of the tables into an Oracle RDS (Relational Database Service) database in AWS (capture once, deliver twice). Since an Oracle database handles small row-level transactions better than Redshift, latency between the SAP ECC source and the Oracle RDS database came down to single-digit seconds, while data continued to be delivered into Redshift hourly using scheduled incremental micro-batches.
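Conceptually, capture once, deliver twice means the source logs are read a single time and the resulting change stream is fanned out to each target on its own cadence. The sketch below is a deliberately simplified model of that idea (the table names and batching are illustrative assumptions, not HVR configuration).

```python
# Hypothetical fan-out: one captured change stream, two delivery channels.
captured_changes = [
    {"table": "VBAK", "op": "update", "id": 42},
    {"table": "VBAP", "op": "insert", "id": 7},
]

rds_channel, redshift_channel = [], []

for change in captured_changes:      # read the source transaction logs once
    rds_channel.append(change)       # delivered continuously, row by row
    redshift_channel.append(change)  # accumulated for the next micro-batch

def flush_micro_batch(channel):
    """Drain the accumulated changes as one batch (e.g. an hourly bulk load)."""
    batch = list(channel)
    channel.clear()
    return batch

hourly_batch = flush_micro_batch(redshift_channel)
print(len(rds_channel), len(hourly_batch))  # 2 2: same changes, two targets
```

The design choice here is that the capture cost on the source is paid once, while each target gets a delivery style suited to its strengths: row-level applies for Oracle, bulk micro-batches for Redshift.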
As the organization became more and more data-driven, there was an increasing focus on data quality. HVR’s compare function was leveraged to validate data accuracy on all tables on a weekly basis, with compare results made available in a database table for end-users to view. Using months of HVR data replication statistics, it was determined that, of the roughly five hundred replicated tables, over three hundred had no changes during the weekend. For these tables, a bulk (checksum-based) compare was scheduled to run during the weekend. Other tables were compared in a more detailed, row-wise manner, with the largest tables filtered on a create or last-modified date to limit the amount of data compared.
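The two compare modes, and the statistics-driven choice between them, can be sketched as follows. The checksum function and the thresholds are hypothetical stand-ins for illustration; HVR’s actual bulk compare is checksum-based as described above, but its internals are not shown here.

```python
import hashlib

def table_checksum(rows):
    """Bulk compare: reduce a table to one digest; matching digests mean the
    source and target copies agree, without shipping the rows themselves."""
    h = hashlib.sha256()
    for r in sorted(map(repr, rows)):   # sort so the result is order-independent
        h.update(r.encode())
    return h.hexdigest()

def compare_strategy(weekly_changes, row_count, big_table_rows=10_000_000):
    """Pick a compare mode per table. Thresholds are hypothetical; the real
    choice was driven by months of HVR replication statistics."""
    if weekly_changes == 0:
        return "bulk_checksum"        # cheap weekend job for unchanged tables
    if row_count > big_table_rows:
        return "row_wise_filtered"    # restrict by create / last-modified date
    return "row_wise"                 # detailed row-by-row comparison

# Two copies of the same table, rows in different order, still match.
source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}]
target = [{"id": 2, "qty": 9}, {"id": 1, "qty": 5}]
assert table_checksum(source) == table_checksum(target)

print(compare_strategy(0, 50_000))            # bulk_checksum
print(compare_strategy(12_000, 80_000_000))   # row_wise_filtered
```

The trade-off mirrors the one in the text: checksums are cheap but only say whether tables differ, while row-wise compares pinpoint which rows differ at a much higher cost, so they are reserved for tables that actually change.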
Phase 3 – Shifting Data Sources to the Cloud
In 2018, as part of an effort to move out of an on-premises data center, the source SAP ECC, running on Oracle RAC, was also moved to the AWS cloud. This phase was relatively uneventful from a data infrastructure perspective: the connection to the data source was changed, and with the source now living in the cloud there was no longer a need for a proxy. The migration took place over a long weekend with scheduled downtime for the SAP ECC application, and data replication simply resumed where it left off.
Phase 4 – Optimizing the Investment
Recently, due to COVID-19, the organization has accelerated its efforts to look for cost optimization opportunities. The setup with data replicating in near real-time into Redshift works well, ELT in Redshift runs efficiently and fast, and access to any data is easy through SQL. However, some of the convenience in managing the environment comes at a relatively high cost, and by shifting data flows into S3, with ETL running through AWS Glue, there is an opportunity for not-insignificant cost savings in the long run.
Where to from Here?
Cloud services open up many data-driven opportunities. No longer does an organization have to decide whether to invest in a technology with a high up-front cost, significant hardware requirements, and potentially high implementation costs. Cloud services enable access to powerful technology in a scalable environment, using a pay-as-you-go model, with security integrated across multiple technologies and services. As long as the data is in the cloud, organizations can leverage technologies such as text search, streaming data, or machine learning, available with a few clicks.
Similar to this use case, we often see customers evolve their data management architecture over the course of several years. In 2015, only a few of our customers were using cloud services. Nowadays, almost every organization leverages the cloud for at least part of its data management infrastructure, with more and more data and applications moving there. I am looking forward to working with more customers on a similar, exciting digital transformation journey.
To learn more about how organizations are accelerating their digital transformation strategies by leveraging real-time data and cloud technologies, register for a Digital Transformation Storytelling session.