Mark Van de Wiel

Several TB of Data Changes Per Day? No problem.

How HVR enables high volume data replication from systems such as SAP into Snowflake

Snowflake is a modern data platform built for any cloud. With storage separated from compute resources, it provides virtually unlimited scalability on demand, with no need to tune the system. Its flexibility to easily ingest data of virtually any kind makes it a very popular database for organizations worldwide. And its unique Secure Data Sharing technology revolutionizes how organizations leverage and share data.

Snowflake is one of many target technologies that HVR supports. With the platform's growing popularity, why do customers use HVR's heterogeneous data replication to ingest data into it?

In this post, I share some of the common reasons customers such as 1-800-Flowers and Pitney Bowes chose HVR for data replication from SAP HANA® to Snowflake.

Change Data Capture

Large organizations (still) perform a lot of their primary business processes through traditional transaction processing databases. These databases, critical to business operations, have grown large over time and now process large volumes of changes. Organizations need to include data from these systems in their cloud data warehouse or data lake in Snowflake but cannot afford a significant incremental load on the database. Log-based Change Data Capture (CDC), in which changes are identified by parsing the database's transaction logs, is generally considered the approach with the lowest overhead, both to individual transactions and to the system overall.

HVR performs log-based CDC on a number of different database platforms, with a recommendation to deploy agents for optimum performance and efficiency, and to distribute the load. Many organizations prefer to avoid any extra load on their mission-critical transaction processing systems, so, depending on the database technology, HVR supports alternative approaches to log-based CDC, including capture on a standby system, or parsing backups of the transaction log on a separate system altogether.
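Conceptually, log-based CDC turns each committed row change mined from the transaction log into a self-describing change record that can be shipped to the target. The Python sketch below shows one illustrative way such a record could be represented; the field names and the example table are assumptions made for the example, not HVR's internal format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRecord:
    """Illustrative shape of one row change mined from a transaction log."""
    table: str                     # source table name
    operation: str                 # 'INSERT', 'UPDATE', or 'DELETE'
    key: dict                      # primary-key column values
    before: Optional[dict] = None  # row image before the change (UPDATE/DELETE)
    after: Optional[dict] = None   # row image after the change (INSERT/UPDATE)
    commit_seq: int = 0            # commit sequence number from the log
    commit_ts: str = ""            # commit timestamp of the source transaction

# Example: an UPDATE to a sales order header, captured and queued for delivery.
change = ChangeRecord(
    table="VBAK",
    operation="UPDATE",
    key={"MANDT": "100", "VBELN": "0000012345"},
    before={"NETWR": 100.00},
    after={"NETWR": 120.00},
    commit_seq=987654,
    commit_ts="2020-06-01T12:00:00Z",
)
```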

SAP to Snowflake

Large organizations often use the SAP ERP suite. For years SAP has been running the campaign “The best-run businesses run SAP”, and its blog adopted this phrase as its title. At present, SAP ECC (ERP Central Component), released in 2004 but still supported today, is the most commonly deployed SAP version. SAP ECC is supported on a number of different database technologies including Oracle, SQL Server, DB2, Sybase (now owned by SAP), and SAP HANA. SAP's release after ECC is called S/4HANA and is supported only on the SAP HANA database. SAP is working with its customers to migrate ECC deployments to S/4HANA by the end of 2027, when regular support for ECC ends.

At a high level the SAP ECC suite features three different kinds of tables:

1) Transparent tables, for which an application table maps one-to-one to a database table and a database-level query retrieves the actual data visible to the application.

2) Pool tables, mapping multiple application tables to a single database table. In the database, the actual data for a pool table is stored in a compressed and encoded format, so a SQL query retrieves unusable binary data.

3) Cluster tables, mapping one or more application tables to a single database table. In the database, the actual data for a cluster table is likewise stored in a compressed and encoded format.

With ECC some of the most important business data is stored in so-called cluster tables. S/4HANA has largely moved away from the use of cluster and pool tables.

Since its inception, SAP has developed its own programming language, ABAP (Advanced Business Application Programming), to interface with its application tables, along with BAPIs (Business Application Programming Interfaces), code fragments that expose remote function calls. A challenge with these interfaces is that there is no CDC interface, and a BAPI is often limited in the amount of information it provides, with no access to custom Z-columns. On top of that, ABAP and BAPIs run through the SAP Application Servers, typically the heaviest-loaded part of the SAP infrastructure. ABAP and BAPIs are available on both ECC and S/4HANA.

Data Replication from SAP

HVR can replicate data from SAP. Typically HVR discovers table definitions from the source database's dictionary, but for SAP it integrates with the SAP dictionary used by the application. As a result, HVR obtains current pool and cluster table definitions, including any custom Z-columns that organizations may have added. As with transparent tables, the raw data for pool and cluster tables is captured using log-based CDC, with decompression and decoding of the data taking place on the integration side, away from the source SAP system.

Sooner or later most organizations will move the application suite to S/4HANA, and with that, the underlying database technology becomes HANA. HVR’s unique log-based CDC support for SAP HANA enables customers to continue to replicate SAP data into Snowflake.

HVR provides data replication and integration into Snowflake from most common source technologies.

Optimal change delivery

Identifying what changed on the source is only part of the problem. To apply change data into Snowflake with optimum performance and efficiency, HVR uses either Snowflake's internal staging or explicit external data staging. In AWS the staging area is S3, in Microsoft Azure it is Blob Storage, and in Google Cloud Platform it is Google Cloud Storage (GCS). Data is subsequently bulk loaded through the COPY command. The final step of applying the changes to the target tables is performed using set-based SQL MERGE operations, maintaining transactional consistency with the source.
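As a minimal sketch of this staged-delivery pattern, the Python snippet below uses the Snowflake connector to bulk load a batch of change files from an external stage into a staging table with COPY INTO, and then folds those changes into the target table with a single set-based MERGE. The connection parameters, stage, table, and column names are hypothetical; this illustrates the pattern, not HVR's implementation.

```python
import snowflake.connector  # pip install snowflake-connector-python

# All connection parameters and object names below are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="replication_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="ODS",
)
cur = conn.cursor()

# 1) Bulk load the latest batch of change files from the external stage
#    (S3 / Blob Storage / GCS) into a staging table.
cur.execute("""
    COPY INTO ODS.ORDERS_STAGE
    FROM @ODS.CHANGE_STAGE/orders/
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")

# 2) Apply the staged changes to the target table with one set-based MERGE,
#    handling inserts, updates, and deletes in a single statement.
cur.execute("""
    MERGE INTO ODS.ORDERS AS tgt
    USING ODS.ORDERS_STAGE AS src
      ON tgt.ORDER_ID = src.ORDER_ID
    WHEN MATCHED AND src.OP = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET tgt.AMOUNT = src.AMOUNT, tgt.STATUS = src.STATUS
    WHEN NOT MATCHED AND src.OP <> 'D' THEN
      INSERT (ORDER_ID, AMOUNT, STATUS) VALUES (src.ORDER_ID, src.AMOUNT, src.STATUS)
""")

conn.close()
```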

 

Transformations

Organizations often post-process data delivered by HVR for easy access through business applications, or to optimize data retrieval for analytics. ELT (Extract, Load, and Transform) or ETL (Extract, Transform, Load) is used to transform the data into a different schema, possibly joining tables to denormalize the data, computing new data fields, and in some cases building aggregates. With downstream ELT/ETL in use, HVR delivers the data into what the data warehouse literature refers to as an Operational Data Store (ODS).

Out-of-the-box, HVR provides three key capabilities to facilitate downstream ELT/ETL:

  • “Soft delete”, a transformation that marks a row as deleted on the target when it was physically deleted on the source. The soft-deleted row can then easily be identified in the target ODS as a delete to be processed by downstream ELT/ETL (see the sketch after this list). Without soft delete, the alternative is a resource-intensive query to identify data that is still present in the target but no longer in the source.
  • The ability to include source metadata, such as the commit sequence number or the commit timestamp, based upon which downstream ELT/ETL can determine what data set to process next.
  • Support for the so-called agent plugin: a call-out to a database stored procedure, script, or program at defined moments during data delivery, such as at the end of an integration cycle, when the target database is transactionally consistent. Customers often use an agent plugin to invoke their ELT/ETL routines.
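To show how these hooks can fit together, here is a hypothetical downstream ELT step that uses a commit-timestamp column as a watermark and the soft-delete flag to decide whether to merge or remove rows in a reporting table. The object and column names (ODS.ORDERS, RPT.ORDERS_FLAT, COMMIT_TS, IS_DELETED) are assumptions made for the example, not names HVR creates.

```python
import snowflake.connector  # pip install snowflake-connector-python

def run_elt_step(conn, last_processed_ts: str) -> None:
    """Hypothetical ELT step over an ODS table maintained by replication.

    Assumes the ODS table carries two metadata columns added during delivery:
    COMMIT_TS (source commit timestamp) and IS_DELETED (boolean soft-delete flag).
    """
    cur = conn.cursor()
    cur.execute(
        """
        MERGE INTO RPT.ORDERS_FLAT AS tgt
        USING (
            SELECT ORDER_ID, CUSTOMER_ID, AMOUNT, IS_DELETED
            FROM ODS.ORDERS
            WHERE COMMIT_TS > %s  -- only rows newer than the watermark
        ) AS src
          ON tgt.ORDER_ID = src.ORDER_ID
        WHEN MATCHED AND src.IS_DELETED THEN DELETE
        WHEN MATCHED THEN UPDATE SET tgt.AMOUNT = src.AMOUNT
        WHEN NOT MATCHED AND NOT src.IS_DELETED THEN
          INSERT (ORDER_ID, CUSTOMER_ID, AMOUNT)
          VALUES (src.ORDER_ID, src.CUSTOMER_ID, src.AMOUNT)
        """,
        (last_processed_ts,),
    )
```

An agent plugin configured to fire at the end of each integration cycle could invoke a routine like this, so the transformation only ever sees a transactionally consistent view of the ODS.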

 

End-to-end data replication


Organizations adopting Snowflake need a solution to set up end-to-end data replication between their transaction processing data source(s), and Snowflake.

For Snowflake integration, HVR supports:

  • Discovery of table definitions on the source, including the ability to capture DDL changes from several source database technologies.
  • Automatic mapping of data types to compatible, loss-less data types, with the ability to create the tables.
  • One-time load through efficient COPY commands, including the ability to parallelize the load across tables and to split large tables into multiple slices (see the slicing sketch after this list). The one-time load is integrated with CDC and incremental data integration.
  • Log-based CDC from several database technologies, and continuous data replication into cloud-based data warehouses.
  • Data validation – compare – to ensure data in Snowflake matches the data in the source.
  • Monitoring through HVR Insights, a browser-based environment providing a historical view into metrics aggregated by time, as well as a single topology overview.
  • Automatic alerting to enable the system to operate without continuous monitoring.
  • A Graphical User Interface (GUI) to configure data replication.
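To illustrate the table-slicing idea mentioned above: a one-time load of a very large table is commonly parallelized by splitting it into key-range slices and loading the slices concurrently. The sketch below shows the general technique with made-up table and key names; it is a simplified illustration, not HVR's slicing implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def slice_bounds(min_key: int, max_key: int, slices: int):
    """Split a numeric key range into roughly equal, non-overlapping slices."""
    step = (max_key - min_key) // slices + 1
    return [(min_key + i * step, min(min_key + (i + 1) * step, max_key + 1))
            for i in range(slices)]

def load_slice(lo: int, hi: int) -> None:
    """Placeholder for loading one slice of a hypothetical ORDERS table.

    In practice this would select the key range from the source, write it to
    the staging area, and COPY it into Snowflake."""
    print(f"loading ORDER_ID range [{lo}, {hi})")

if __name__ == "__main__":
    # Load four slices of the key range 1..1,000,000 in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for lo, hi in slice_bounds(1, 1_000_000, 4):
            pool.submit(load_slice, lo, hi)
```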

 

Security

Cloud-based data integration requires strong security. Snowflake provides this for data at rest, but data must also be secured during transfer and ingestion. HVR provides enterprise-grade security features to avoid data breaches:

  • AES256 encryption throughout the entire data replication pipeline, using industry best practices including a wallet and unique encryption certificates.
  • Support for OS-level and LDAP authentication, as well as custom plugin-based authentication.
  • 2-Step verification support with unique certificates, avoiding a so-called man-in-the-middle attack.
  • Pre-defined authorizations with full rights to make modifications, execute-only privileges, and read-only access.
  • Auditing/logging of all operations that may affect data replication.

 

Time for a Test Drive

Snowflake is the next-generation data platform, built for any cloud. How can you most efficiently, and with the lowest possible latency, get data into Snowflake?

That is where HVR can help. Supporting customers generating multiple TBs of changes per day on a single system, HVR is well-suited to help large organizations populate Snowflake from operational systems with log-based CDC support on various commonly-used relational database technologies.

Interested in seeing what data replication using HVR is like? I invite you to take HVR for a Test Drive.

Related Resources:
Transforming SAP Data into Insights with HVR and Snowflake (Part 1)
Modeling SAP Data in Snowflake (Part 2)

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.
