Learn how HVR retrieves up-to-date definitions for pool and cluster tables using log-based change data capture

Ubiquity of SAP across the Enterprise

What do 3M, BMW, The Coca-Cola Company, DHL, Ford Motor Company, Airbus, The Dow Chemical Company, and Samsung Electronics have in common?

All run SAP’s Enterprise Resource Planning (ERP) applications. In fact, the University of Michigan maintains a much longer list of SAP customers. Many well-known, well-respected and successful organizations rely on SAP applications for their primary business processes.

SAP ERP Origins

SAP ERP Central Component (ECC), first released in 2004, is today's most commonly deployed version of SAP and will be supported until (at least) 2025. SAP ECC runs on several database technologies including Oracle, IBM DB2, Microsoft SQL Server, SAP's Sybase ASE, and, starting with Enhancement Package 6 for ERP 6.0 (released in 2012), SAP HANA.

S/4HANA, first released in 2015, is SAP's latest edition of the ERP applications and runs only on SAP's own HANA database technology.

Once organizations commit to SAP applications, they start following SAP's release cycles, given how ingrained the ERP system is in their primary business processes. Organizations using SAP ECC today will sooner or later move to S/4HANA (and beyond).

Analytics on SAP data

Due to its central role in organizations' primary business processes, it is safe to say that key analytical environments, including data lakes, data warehouses, and streaming data applications, include data that resides in SAP. Off the shelf, SAP provides its Business Warehouse (BW) for SAP data. Organizations, however, often need to include non-SAP data in their analytical environments, and they generally don't extend BW with large amounts of external data.

With lots of data in the data warehouse and the need for scalable data processing resources, the cost of running the entire data warehouse in BW, nowadays often on the HANA in-memory database, is an important consideration. Cost-effective, flexible data lake, data warehouse, and streaming data solutions are readily available in the cloud, including:

  • File-based solutions like Amazon S3, Azure Data Lake Store (ADLS) and Google Cloud Storage (GCS)
  • Database technologies like Snowflake, Amazon Redshift, Azure Synapse Analytics (formerly Azure SQL Data Warehouse) and Google BigQuery
  • Streaming data solutions like Amazon MSK (a managed service for open-source Apache Kafka), Amazon Kinesis, Azure Event Hubs and Google Dataflow.

With SAP applications running on top of a relational database, there are many tools and techniques to integrate data from relational databases into data lakes, data warehouses and streaming data applications. But what kinds of tables do you find in an SAP ERP database?

SAP ERP Tables

With nearly 50 years of incremental innovation, parts of the core of the SAP ERP applications date back to a time when relational transaction processing databases offered limited capabilities and imposed many restrictions: a relatively small (workable) number of tables per schema, limits on the length of table names, on the number of columns per table, and on the supported data types. Built on top of this legacy, SAP provides its own data dictionary as part of the application suite to map application tables, columns and data types to database tables and columns with often abbreviated (German) four-to-five-character names. Entire websites exist to help data engineers make sense of the SAP table structures.

The SAP ECC suite features three different kinds of tables:

  1. Transparent tables, for which an application table maps one-to-one to a database table and a database-level query retrieves the actual data visible to the application.
  2. Pool tables, mapping multiple application tables to a single database table. In the database, the actual data for a pool table is stored in a compressed and encoded format, so a plain SQL query retrieves unusable binary data.
  3. Cluster tables, mapping one or more application tables to a single database table. As with pool tables, the actual data for a cluster table is stored in the database in a compressed and encoded format; see the sketch after this list. Some of the most important data in the applications is stored in cluster tables.
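
To make the difference concrete, here is a minimal Python sketch that queries an ECC database directly over ODBC. The connection string and the SAPSR3 schema are placeholder assumptions, and while MARA (a transparent table) and BSEG (a cluster table physically stored in RFBLG) are standard SAP names, the exact physical layout and row-limit SQL syntax vary by release and database.

    # Illustrative only: DSN, credentials and the SAPSR3 schema are placeholders,
    # and the row-limit syntax (FETCH FIRST) differs between databases.
    import pyodbc

    conn = pyodbc.connect("DSN=SAPECC;UID=readonly;PWD=secret")
    cur = conn.cursor()

    # Transparent table: the database row is the application row.
    # MARA (material master) can be read with plain SQL.
    cur.execute("SELECT MANDT, MATNR, MTART FROM SAPSR3.MARA FETCH FIRST 5 ROWS ONLY")
    for row in cur.fetchall():
        print(row)                          # readable business data

    # Cluster table: BSEG (accounting document segments) has no database table
    # of its own; its rows are stored compressed and encoded in the VARDATA
    # column of the physical cluster RFBLG.
    cur.execute("SELECT MANDT, BUKRS, BELNR, GJAHR, VARDATA "
                "FROM SAPSR3.RFBLG FETCH FIRST 1 ROWS ONLY")
    mandt, bukrs, belnr, gjahr, vardata = cur.fetchone()
    print(type(vardata), vardata[:16])      # raw bytes, unusable without the SAP dictionary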

As it moves its applications to HANA, SAP aims to eliminate all cluster and pool tables.

Getting data out of SAP

How do you extract data from SAP applications so that you can build your cloud-based data lake, data warehouse or streaming data application? SAP recommends using its proprietary Advanced Business Application Programming (ABAP) language, or calling ABAP code fragments known as Business APIs (BAPIs) through Remote Function Calls (RFCs). ABAP and BAPIs run through the SAP application servers, the second and typically most heavily loaded tier in SAP's three-tier architecture. ABAP extractions are batch interactions, and out of the box BAPIs retrieve limited information and none of the additional columns that may have been added to your SAP environment as customizations.
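
The following sketch illustrates that RFC-based extraction route using the open-source pyrfc bindings (which require the SAP NetWeaver RFC SDK). All connection parameters are placeholders, and RFC_READ_TABLE is used only as a widely known example of a batch-style call through the application server tier; it is not an officially released API and it limits the width of the rows it returns.

    # Illustrative only: connection parameters are placeholders; pyrfc requires
    # the SAP NetWeaver RFC SDK to be installed.
    from pyrfc import Connection

    conn = Connection(
        ashost="sap-app.example.com",   # hypothetical application server
        sysnr="00",
        client="100",
        user="EXTRACT_USER",
        passwd="secret",
    )

    # RFC_READ_TABLE runs on the application server tier and returns rows as
    # batches of delimited text (limited row width, no change capture).
    result = conn.call(
        "RFC_READ_TABLE",
        QUERY_TABLE="MARA",                      # material master
        DELIMITER="|",
        FIELDS=[{"FIELDNAME": "MATNR"}, {"FIELDNAME": "MTART"}],
        OPTIONS=[{"TEXT": "MTART = 'FERT'"}],    # WHERE clause, passed as text lines
        ROWCOUNT=100,                            # batch paging, not streaming
    )

    for row in result["DATA"]:
        matnr, mtart = row["WA"].split("|")
        print(matnr.strip(), mtart.strip())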

For real-time replication from non-HANA sources, SAP provides (Sybase) Replication Server, but it does not support pool tables. SAP Replication Server has also seen very limited enhancements since the Sybase acquisition in 2010: it predominantly supports transaction processing databases like Oracle, DB2, SQL Server and Sybase ASE as sources, plus HANA and Sybase IQ as targets. Commonly used modern analytical platforms like cloud-based file systems (S3, ADLS or GCS), analytical databases (Snowflake, Redshift), or streaming data solutions like Kafka, are not supported.

For replication out of SAP HANA as a source (as well as from non-HANA sources), there is the SAP Landscape Transformation (SLT) Replication Server, which uses a trigger-based approach to capture changes and therefore adds overhead to transactions on the source system. Log-based Change Data Capture (CDC) is generally considered the superior approach for capturing changes, as illustrated below.
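
To see why trigger-based capture taxes the source, consider this self-contained sketch (using SQLite purely for illustration, with made-up table names): every insert into the business table also writes a row to a change table within the same transaction, whereas log-based CDC reads committed changes from the database's own transaction log without touching the application's transactions.

    # Self-contained illustration with SQLite: trigger-based capture writes a
    # change record inside every business transaction; log-based CDC instead
    # reads the database's transaction log after commit.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_changes (
        change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        op         TEXT,
        id         INTEGER,
        amount     REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER orders_ins AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_changes (op, id, amount) VALUES ('I', NEW.id, NEW.amount);
    END;
    """)

    # Every transaction against orders now performs an extra write.
    db.execute("INSERT INTO orders (id, amount) VALUES (1, 99.50)")
    db.commit()

    print(db.execute("SELECT op, id, amount FROM orders_changes").fetchall())
    # -> [('I', 1, 99.5)]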

Data Replication from SAP

With support for log-based CDC from many commonly used transaction processing databases, including SAP HANA, HVR provides a strong alternative for data replication from SAP applications. The technology integrates with the SAP dictionaries to retrieve up-to-date definitions for pool and cluster tables, including any custom Z-columns that may have been added to the tables. Cluster and pool tables are subsequently decoded downstream in the replication flow, away from the SAP applications, without relying on ABAP or BAPIs.
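
HVR's dictionary integration is proprietary, but the kind of metadata involved can be illustrated with the standard SAP dictionary tables DD02L (table headers, including the table class and the physical pool or cluster name) and DD03L (field definitions, including customer-added Z-fields). The sketch below is purely illustrative and is not HVR's implementation; the DSN and the SAPSR3 schema are placeholder assumptions.

    # Illustrative only: the DSN and SAPSR3 schema are placeholders; this merely
    # shows the dictionary metadata involved in resolving pool/cluster tables.
    import pyodbc

    conn = pyodbc.connect("DSN=SAPECC;UID=readonly;PWD=secret")
    cur = conn.cursor()

    def describe(table, schema="SAPSR3"):
        """Return (table class, physical table, ordered fields) for an SAP table."""
        cur.execute(
            f"SELECT TABCLASS, SQLTAB FROM {schema}.DD02L "
            "WHERE TABNAME = ? AND AS4LOCAL = 'A'",
            table,
        )
        tabclass, sqltab = cur.fetchone()

        cur.execute(
            f"SELECT FIELDNAME, DATATYPE, LENG FROM {schema}.DD03L "
            "WHERE TABNAME = ? AND AS4LOCAL = 'A' ORDER BY POSITION",
            table,
        )
        fields = cur.fetchall()
        return tabclass, (sqltab or table), fields

    tabclass, physical, fields = describe("BSEG")
    print(tabclass, "is stored in", physical)    # e.g. CLUSTER is stored in RFBLG
    for name, dtype, length in fields:
        if name.startswith("Z"):                 # customer-added Z-columns
            print("custom column:", name, dtype, length)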

In addition, HVR supports many modern cloud-based platforms commonly-used for data lakes, data warehouses and streaming data applications, including:

  • cloud file systems S3, ADLS, Azure Blob Storage and GCS
  • analytical databases Snowflake, Redshift, Synapse, BigQuery and more
  • Kafka

HVR's rich data replication capabilities enable SAP customers to build analytical solutions using SAP ECC data today, and to keep doing so as they adopt SAP HANA and S/4HANA going forward.

Data Replication beyond SAP

Not only is HVR a strong technology for log-based CDC and continuous data movement from SAP applications in heterogeneous environments, it also provides generic end-to-end support for:

  • Discovery of table definitions from supported databases and technologies
  • Automatic mapping of data types between sources and targets in a heterogeneous configuration, delivering lossless data transfer
  • One-time load, aligned with CDC and continuous data integration
  • Log-based CDC from many different database technologies, and optimized data delivery into all supported targets
  • Data validation between source and target, including cluster and pool tables
  • Graphical User Interface (GUI) to configure and manage data replication
  • Browser-based Insights into the replication topology, as well as time series charts showing data replication statistics
  • Automated alerts for lights-out management of data replication

Many HVR customers take advantage of our support for SAP applications, both on SAP ECC (using the ability to decode data in cluster and pool tables) and on SAP HANA. Dominant use cases are data lakes, data warehouses, and reporting environments.

To get a feel for replicating data using HVR, we invite you to take a test drive.

 

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

© 2020 HVR
