ETL + Log-Based CDC = Maximum Value
The benefits of combining Talend’s ETL solution with HVR’s CDC solution
It’s been several years since the phrase “Data is the new oil” was coined. In today’s digital enterprise, this statement is arguably more applicable than ever, with data enabling so many business models and service opportunities.
However, in contrast to oil:
- Data can be consumed more than once.
- Data is not running out; we are generating more of it by the day. Oil, by contrast, is finite: at current consumption rates and based on known reserves, realistically recoverable oil is expected to run out within the next century.
- Data is more valuable when it is current and often loses value as it ages. Oil, of course, takes a very long time to develop and retains its value as it ages.
Similar to oil, though, data can be “refined” (transformed) and become more valuable than it is in its raw, original state.
How can you get data in near real-time, transformed, to get maximum value? That is where collaboration between HVR and Talend comes in. HVR is a leader in real-time data replication between databases in heterogeneous environments. Talend is a leader in transforming data as part of Extract, Transform, Load (ETL) for data integration. When you combine the technologies from HVR and Talend, you get transactionally consistent data, in near real-time, that is transformed and ready for consumption by business users, analytical applications, and machine learning algorithms.
How does it work?
The most important aspect of data replication is the ability to identify and capture changes. This concept is known as Change Data Capture (CDC). CDC can be achieved in several ways. Log-based CDC, which extracts changes from the database’s transaction log, is generally considered the preferred approach for CDC from relational transaction processing databases.
Log-based CDC provides the following benefits:
- Minimal impact on the database.
- The ability to identify transaction boundaries.
- Lossless data extraction, since the database technology uses its transaction log for its own recovery in case of a crash.
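To make the point about transaction boundaries concrete, here is a minimal sketch in Python. It assumes a simplified, hypothetical log format of `(transaction id, operation, payload)` tuples; real transaction logs are binary and database-specific, and this is not how HVR reads them. The sketch only illustrates the idea that a log reader can hold changes until it sees a commit, so that downstream consumers never see partial or rolled-back transactions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical, simplified log record: (transaction id, operation, payload).
LogRecord = Tuple[int, str, dict]


def committed_transactions(log: Iterable[LogRecord]) -> List[List[LogRecord]]:
    """Group changes by transaction and return whole transactions in commit order."""
    pending: Dict[int, List[LogRecord]] = defaultdict(list)
    committed: List[List[LogRecord]] = []
    for txid, op, payload in log:
        if op == "COMMIT":
            # Release the buffered changes only once the commit is seen.
            committed.append(pending.pop(txid, []))
        elif op == "ROLLBACK":
            # Rolled-back changes are discarded and never delivered.
            pending.pop(txid, None)
        else:
            pending[txid].append((txid, op, payload))
    # Transactions still in `pending` have not committed yet and are withheld.
    return committed


sample_log = [
    (1, "INSERT", {"id": 10, "amount": 50}),
    (2, "UPDATE", {"id": 7, "amount": 20}),
    (1, "UPDATE", {"id": 10, "amount": 75}),
    (2, "ROLLBACK", {}),
    (1, "COMMIT", {}),
    (3, "DELETE", {"id": 10}),  # transaction 3 has not committed
]

result = committed_transactions(sample_log)
for tx in result:
    print([op for _, op, _ in tx])
```

In this sample, only transaction 1 is delivered: transaction 2 was rolled back and transaction 3 is still open when the log ends.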
HVR supports log-based CDC from most commonly used relational database technologies including Oracle, SQL Server, SAP HANA, all flavors of DB2, PostgreSQL, MySQL/MariaDB, and more.
HVR delivers (integrates) data into a variety of data management technologies including:
- Any of the source database technologies supported by HVR (for example SQL Server, Oracle, and SAP HANA).
- Analytical relational databases like Snowflake, Teradata, Redshift, and Greenplum.
- Distributed file systems commonly used to build data lakes like AWS S3, Microsoft Azure Data Lake Store (ADLS) and Blob Storage, the Hadoop Distributed File System (HDFS), and Google Cloud Storage (GCS).
- Kafka for data streaming use cases.
With capture and integration de-coupled, data can be captured once and delivered multiple times. Limited transformations are possible, including:
- Mapping between source and target schemas or table names.
- Filtering rows or eliminating columns.
- Processing deletes as updates (so-called soft-delete).
- Keeping an audit trail of changes (so-called TimeKey integration).
- Adding metadata columns such as the operation type (insert, update or delete), the order of the change, the commit number of the transaction, the change or integration timestamp, etc.
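The soft-delete, audit-trail, and metadata-column ideas in the list above can be sketched as follows. This is an illustration under stated assumptions, not HVR configuration or HVR’s implementation: the change-record shape and the column names `is_deleted`, `op_type`, and `change_seq` are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical change record: {"op": ..., "key": ..., "row": {...}, "seq": ...}.
Change = Dict[str, Any]


def apply_soft_delete(target: Dict[int, dict], change: Change) -> None:
    """Apply a change to a keyed target table, turning deletes into updates."""
    key = change["key"]
    if change["op"] == "delete":
        row = target.get(key)
        if row is not None:
            # Keep the row and flag it instead of physically removing it.
            row.update(is_deleted=1, op_type="delete", change_seq=change["seq"])
    else:
        target[key] = {
            **change["row"],
            "is_deleted": 0,
            "op_type": change["op"],
            "change_seq": change["seq"],
        }


def append_audit(audit: List[dict], change: Change) -> None:
    """Append every change as a new row, so the full history is retained."""
    audit.append({
        "key": change["key"],
        **change.get("row", {}),
        "op_type": change["op"],
        "change_seq": change["seq"],
    })


changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ann"}, "seq": 1},
    {"op": "update", "key": 1, "row": {"name": "Anna"}, "seq": 2},
    {"op": "delete", "key": 1, "seq": 3},
]

target: Dict[int, dict] = {}
audit: List[dict] = []
for change in changes:
    apply_soft_delete(target, change)
    append_audit(audit, change)

print(target)
print(len(audit))
```

The soft-delete target ends up with a single flagged row that still carries the last known values, while the audit list holds one row per change in order.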
More extensive transformations involving table joins or aggregations are possible but very compute-intensive in an environment with a lot of data changes.
ETL has been around for as long as relational databases. Traditionally (and still today) a lot of ETL is performed through scripts written by the DBA, the data analyst, or a data architect. Talend was founded in 2005 when its founders identified a clear need for a better data integration solution. Today Talend has one of the largest numbers of end-point data connectors available to consolidate data from multiple sources into one or more destinations.
ETL used to be the way to get data from the transactional database into a data warehouse or data mart. Data was extracted overnight or even less frequently, when operational databases weren’t otherwise used, so there was no concern about the relatively heavy load on the source systems. Before data became the new oil, this was good enough to satisfy business needs. With organizations growing in size and operating 24×7 on the internet, there isn’t necessarily any idle time on most transactional systems anymore. Additionally, growing data volumes have increased the volume of extracted data, and organizations can no longer afford the impact of bulk extraction on their busiest data sources. With systems active all the time, it has also become more difficult to extract a transactionally consistent view of a system, which is a big problem in certain industries like finance.
HVR and Talend
Traditional data warehouse methodologies prescribed the use of:
- Operational Data Store (ODS) as a copy of the operational database.
- Data Warehouse, transforming data and also adding history to the ODS.
- One or more data marts, to facilitate end-user access and applications.
In such a scenario, HVR would populate the ODS as a copy of the operational database (except that deletes may be applied as soft deletes to facilitate downstream processing of delete operations), and Talend would perform the downstream data transformations into the data warehouse and beyond. HVR supports the concept of a so-called agent plugin, which is the ability to make a callout (e.g. to process Talend transformations) at a transactionally consistent point in time.
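The value of a callout at a transactionally consistent point can be illustrated with a small sketch. Everything here is hypothetical: `integrate_cycle`, `downstream_job`, and the record shapes are invented for illustration and do not represent HVR’s agent plugin interface or a Talend job API. The sketch only shows the pattern of applying whole transactions first and triggering downstream work afterwards.

```python
from typing import Callable, Dict, List

# Hypothetical: a transaction is a list of {"key": ..., "row": {...}} changes.
Transaction = List[dict]


def integrate_cycle(
    ods: Dict[int, dict],
    transactions: List[Transaction],
    on_consistent: Callable[[Dict[int, dict]], None],
) -> None:
    """Apply whole transactions to the ODS, then make one callout."""
    for tx in transactions:
        for change in tx:
            ods[change["key"]] = change["row"]
    # Only complete transactions have been applied at this point, so the
    # downstream job never observes a half-applied transaction.
    on_consistent(ods)


ods = {1: {"balance": 100}, 2: {"balance": 0}}

# A transfer between two accounts, captured as a single transaction.
transfer = [
    {"key": 1, "row": {"balance": 60}},
    {"key": 2, "row": {"balance": 40}},
]

totals: List[int] = []


def downstream_job(snapshot: Dict[int, dict]) -> None:
    """Stand-in for a downstream transformation; here it sums all balances."""
    totals.append(sum(row["balance"] for row in snapshot.values()))


integrate_cycle(ods, [transfer], downstream_job)
print(totals)
```

Because the callout runs after both sides of the transfer are applied, the total balance seen downstream matches the total before the transfer.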
More recently a lot of data warehouses don’t necessarily start from an ODS anymore but rather from a data lake. Several joint customers use HVR to populate the data lake, with Talend performing downstream transformations into a data warehouse, data mart or analytical applications.
Beyond data integration, Talend provides data profiling capabilities, a data quality and governance solution, and data preparation and stewardship features. These powerful capabilities are complementary to CDC from HVR and data integration through Talend, as you prepare data from multiple disparate sources into analytical environments.
Getting started with HVR and Talend is easy. Talend offers the open-source Talend Open Studio as a free download. To experience HVR, we invite you to take a test drive.
Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.