1TB of Data Changes Per Day? No problem.
How HVR enables high volume data replication from systems such as SAP into Snowflake
Snowflake is the first data warehouse built for the cloud. With storage separated from compute resources, it provides virtually unlimited, on-demand scalability with no need to tune the system. Its ability to easily ingest data of almost any kind makes Snowflake a popular database for organizations worldwide.
Snowflake is one of many target technologies that HVR supports, and recently it has been one of the most popular destinations. Given the platform's rising popularity, why do customers use HVR's heterogeneous data replication to ingest data into it?
In this post, I share some of the common reasons customers such as 1-800 Flowers and Pitney Bowes choose HVR for data replication into Snowflake.
Change Data Capture
Large organizations (still) run many of their primary business processes on traditional transaction processing databases. These databases, critical to efficient business operations, have grown large over time and now process high volumes of changes. Organizations need to include data from these systems in their Snowflake data warehouse or data lake, but cannot afford to place significant additional load on the source database. Log-based Change Data Capture (CDC), in which changes are identified by parsing the database's transaction logs, is generally considered the approach with the lowest overhead, both to individual transactions and to the system as a whole.
HVR performs log-based CDC on a number of database platforms, and recommends deploying agents for optimum performance and efficiency, and to distribute load. Because many organizations prefer to avoid any extra load on mission-critical transaction processing systems, HVR also supports alternative log-based CDC approaches, depending on the database technology: capturing changes on a standby system, or parsing backups of the transaction log on a separate system altogether.
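Conceptually, log-based CDC turns transaction-log entries into an ordered stream of change records that can be replayed against a target. A minimal Python sketch of the idea follows; the record layout and names are illustrative only, not HVR's actual format:

```python
# Illustrative sketch of log-based CDC concepts; the record layout and
# field names are assumptions for this example, not HVR's actual format.
from dataclasses import dataclass, field

@dataclass
class ChangeRecord:
    table: str
    op: str                     # "insert", "update", or "delete"
    key: dict                   # primary-key values
    row: dict = field(default_factory=dict)  # row image after the change
    commit_seq: int = 0         # commit order taken from the transaction log

def apply_change(target: dict, rec: ChangeRecord) -> None:
    """Replay one captured change against an in-memory stand-in for a target table."""
    k = tuple(sorted(rec.key.items()))
    if rec.op == "delete":
        target.pop(k, None)
    else:  # insert or update both result in the latest row image
        target[k] = rec.row

# Replaying the log in commit order keeps the target consistent with the source.
target_table = {}
log = [
    ChangeRecord("ORDERS", "insert", {"id": 1}, {"id": 1, "amount": 100}, 10),
    ChangeRecord("ORDERS", "update", {"id": 1}, {"id": 1, "amount": 120}, 11),
    ChangeRecord("ORDERS", "delete", {"id": 1}, {}, 12),
]
for rec in log:
    apply_change(target_table, rec)
```

Because the changes come from the transaction log rather than from queries against the source tables, the capture process adds little load to the transaction processing system itself.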
SAP to Snowflake
Large organizations often use the SAP ERP suite. For years SAP has run the campaign "The best-run businesses run SAP", and its blog adopted this phrase as its title. At present, SAP ECC (ERP Central Component), released in 2004 but still supported today, is the most commonly deployed SAP version. SAP ECC is supported on a number of database technologies including Oracle, SQL Server, DB2, and Sybase (now owned by SAP). ECC's successor, S/4HANA, is currently supported only on the SAP HANA database. SAP is working with its customers to migrate ECC deployments to S/4HANA by the end of 2025.
At a high level the SAP ECC suite features three different kinds of tables:
1) Transparent tables, for which an application table maps one-to-one to a database table, and a database-level query retrieves the actual data visible to the application.
2) Pool tables, mapping multiple application tables to a single database table. In the database, the actual data for a pool table is stored in a compressed and encoded format, so a SQL query retrieves unusable binary data.
3) Cluster tables, mapping one or more application tables to a single database table. In the database, the actual data for a cluster table is likewise stored in a compressed and encoded format.
With ECC, some of the most important business data is stored in these so-called cluster tables. S/4HANA has largely moved away from the use of cluster and pool tables.
Since its inception SAP has developed its own programming language, ABAP (Advanced Business Application Programming), to interface with its application tables, and provides BAPIs (Business Application Programming Interfaces) to expose functionality through remote function calls. A challenge with these interfaces is that they offer no CDC capability, and a BAPI is often limited in the information it exposes, with no access to custom Z-columns. On top of that, ABAP and BAPI calls run through the SAP application servers, typically the most heavily loaded part of the SAP infrastructure. ABAP and BAPIs are available on both ECC and S/4HANA.
Data Replication from SAP
HVR can replicate data from SAP. Typically HVR discovers table definitions from the source database's dictionary, but for SAP it integrates with the SAP dictionary used by the application. As a result, HVR obtains current pool and cluster table definitions, including any custom Z-columns organizations may have added. As with transparent tables, the raw data for pool and cluster tables is captured using log-based CDC, with decompression and decoding of the data performed on the integration side (by default by the integration agent, if one is in use), away from the source SAP system.
Sooner or later most organizations will move the application suite to S/4HANA, and with that, the underlying database technology becomes HANA. HVR's unique log-based CDC support for SAP HANA enables customers to continue replicating data out of the application suite into Snowflake.
Data replication and integration from most common sources to Snowflake using HVR: The details
Optimal change delivery
Identifying what changed on the source is only part of the problem. To apply change data to Snowflake with optimum performance and efficiency, HVR uses data staging: in AWS the staging area is S3, and in Microsoft Azure it is Blob Storage. Data is then bulk loaded into Snowflake through the COPY command. The final step of applying the changes to the target tables uses set-based SQL MERGE operations, maintaining transactional consistency with the source.
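The staged-load pattern can be sketched by generating the two statements involved. The stage, table, and column names below are hypothetical, and the generated SQL is a simplified version of what a replication tool would issue, not HVR's actual statements:

```python
# Sketch of the staged-load pattern: bulk-load change data from cloud
# storage with COPY, then apply it with a set-based MERGE.
# All object names here are hypothetical.

def copy_sql(staging_table, stage, path):
    """COPY bulk-loads staged files into a staging table in Snowflake."""
    return f"COPY INTO {staging_table} FROM @{stage}/{path} FILE_FORMAT = (TYPE = CSV)"

def merge_sql(target, staging, key, cols):
    """One set-based MERGE applies inserts, updates, and deletes together.
    An 'op' column (assumed here) carries the change type, e.g. 'D' for delete."""
    on = f"t.{key} = s.{key}"
    updates = ", ".join(f"t.{c} = s.{c}" for c in cols)
    col_list = ", ".join([key] + cols)
    src_list = ", ".join(f"s.{c}" for c in [key] + cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on} "
        f"WHEN MATCHED AND s.op = 'D' THEN DELETE "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list})"
    )

print(copy_sql("ORDERS_STAGE", "hvr_stage", "orders/batch_001.csv"))
print(merge_sql("ORDERS", "ORDERS_STAGE", "ID", ["AMOUNT", "STATUS"]))
```

Applying changes as one set-based MERGE per batch, rather than row by row, is what makes the staged approach efficient at high change volumes.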
Organizations often post-process data delivered by HVR for easy access through business applications, or to optimize data retrieval for analytics. ELT (Extract, Load and Transform), or ETL (Extract, Transform, Load), is used to transform the data into a different schema, possibly joining tables to denormalize the data, computing new data fields, and in some cases building aggregates. With the use of downstream ELT/ETL, HVR delivers the data into what the data warehouse literature refers to as an Operational Data Store (ODS).
Out-of-the-box, HVR provides three key capabilities to facilitate downstream ELT/ETL:
- “Soft delete” is a transformation that marks a row as deleted when it is physically deleted on the source. The soft-deleted row can then easily be identified in the target ODS as a delete to be processed by downstream ELT/ETL. Without soft delete, the alternative is a resource-intensive query to identify data that is present in the target but no longer exists in the source.
- The ability to include source metadata, such as the commit sequence number or the commit timestamp, based upon which downstream ELT/ETL can determine which data set to process next.
- Support for the so-called agent plugin: a call-out to a script or program at certain points during data delivery, such as at the end of an integration cycle, when the target database is transactionally consistent. Customers often use an agent plugin to invoke their ELT/ETL routines.
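The soft-delete and metadata conventions above can be illustrated with a small sketch against an in-memory stand-in for the ODS. The column names `is_deleted` and `hvr_commit_seq` are assumptions for this example, not HVR's actual defaults:

```python
# Illustrative sketch of soft delete plus commit-sequence metadata in an ODS.
# Column names (is_deleted, hvr_commit_seq) are assumptions, not HVR defaults.

def apply_with_soft_delete(ods, op, key, row, commit_seq):
    if op == "delete":
        # Keep the row, but flag it so downstream ELT/ETL can see the delete
        # without comparing the full target against the source.
        existing = ods.get(key, {})
        ods[key] = {**existing, "is_deleted": True, "hvr_commit_seq": commit_seq}
    else:
        ods[key] = {**row, "is_deleted": False, "hvr_commit_seq": commit_seq}

def rows_to_process(ods, last_seq):
    """Downstream ELT/ETL picks up only rows changed since its last run."""
    return [k for k, r in ods.items() if r["hvr_commit_seq"] > last_seq]

ods = {}
apply_with_soft_delete(ods, "insert", 1, {"amount": 100}, 10)
apply_with_soft_delete(ods, "insert", 2, {"amount": 200}, 11)
apply_with_soft_delete(ods, "delete", 1, {}, 12)
# Row 1 survives as a flagged row; an ELT run that last saw sequence 11
# would pick up only row 1.
```

The commit sequence gives the downstream job a simple high-water mark, so each ELT/ETL cycle processes only the delta since the previous cycle.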
End-to-end data replication
Organizations adopting Snowflake need a solution to set up end-to-end data replication between their transaction processing data source(s), and Snowflake. HVR provides such a solution with support for:
- Discovery of table definitions on the source, including the ability to capture DDL changes from several source database technologies.
- Automatic mapping of data types to compatible, lossless data types, with the ability to create the tables in Snowflake.
- One-time load through efficient copy commands, including the ability to parallelize load across tables, and to split large tables into multiple slices. The one-time load is integrated with CDC and incremental data integration.
- Log-based CDC from several database technologies, and continuous data replication.
- Data validation – compare – to validate whether data in Snowflake matches the data in the source.
- Monitoring through HVR Insights, a browser-based environment providing a historical view into metrics aggregated by time, as well as a single topology overview.
- Automatic alerting to enable the system to operate without continuous monitoring.
- A Graphical User Interface (GUI) to configure data replication.
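The table slicing used for the parallel one-time load can be sketched as follows; the boundary logic and names are illustrative only, showing how a large table might be split into key ranges that separate workers can load concurrently:

```python
# Sketch of splitting a large table into key-range slices for a parallel
# one-time load. The slicing scheme is an illustration, not HVR's algorithm.

def slice_ranges(min_key, max_key, slices):
    """Return half-open (lo, hi) ranges that together cover [min_key, max_key]."""
    span = max_key - min_key + 1
    step = -(-span // slices)  # ceiling division so the last slice is not oversized
    ranges = []
    lo = min_key
    while lo <= max_key:
        hi = min(lo + step, max_key + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges

def slice_predicates(column, ranges):
    """Turn each range into a WHERE predicate a load worker could use."""
    return [f"{column} >= {lo} AND {column} < {hi}" for lo, hi in ranges]

# Split a table with keys 1..1,000,000 into four slices for four workers.
ranges = slice_ranges(1, 1000000, 4)
preds = slice_predicates("ORDER_ID", ranges)
```

Because the ranges are disjoint and cover the whole key space, each worker loads its slice independently, and the union of the slices equals the full table.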
Cloud-based data integration requires strong security. Snowflake provides encryption at rest, but data must also be protected on its way in. HVR provides enterprise-grade security features to prevent data breaches during replication into Snowflake:
- AES256 encryption of data crossing the wire using unique encryption certificates.
- Support for OS-level and LDAP authentication, as well as custom plugin-based authentication.
- Two-step verification support with unique certificates, preventing man-in-the-middle attacks.
- Pre-defined authorizations with full rights to make modifications, execute-only privileges, and read-only access.
- Auditing/logging of all operations that may affect data replication.
Time for a Test Drive
Snowflake is a next-generation data warehouse technology built for the cloud. How can you get data into Snowflake most efficiently, and with the lowest possible latency?
That is where HVR can help. Supporting customers generating well over 1 TB of changes per day on a single system, HVR is well-suited to help large organizations populate Snowflake from operational systems with log-based CDC support on various commonly-used relational database technologies.
Interested in seeing what data replication using HVR is like? I invite you to take HVR for a Test Drive.
Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.