Home Change Data Capture for Real-time BI and Data Warehousing
Change Data Capture for Real-time BI and Data Warehousing
In this post, I talk about the benefits of using log-based CDC for a real-time data warehouse…But first, let’s start with a short history of Data Warehouses
Data warehouses have been around for decades. Teradata, the grandfather of the data warehouse data technologies, built its database on the design principle to parallelize everything, with no single bottleneck to limit performance and scalability. Early data warehouses in the 1980s and 1990s were expensive to deploy and maintain. However, with the right focus and implementation, the value obtained from the data warehouse could be tremendous, and case studies documented the successes that the likes of Walmart, AT&T and others achieved with their data warehouse efforts. The data warehouse grew in popularity in the early 2000s, with the advent of so-called data warehouse appliances from vendors like Netezza (since acquired by IBM), Greenplum (acquired by Pivotal), Vertica (acquired by HP Enterprise and sold off again to Micro Focus), and others.
Data Warehouse Design Theories
Industry experts developed design methodologies to optimally implement the data warehouse, with three dominant theories:
2) Bill Inmon advocated a normalized, third normal form (also referred to as 3NF) approach.
3) Dan Linstedt introduced the Data Vault modeling method aiming to combine the best aspects of Star Schemas and 3NF.
The methodologies discuss the need for metadata management, data cleansing as well as transformations to get the data from often highly normalized schemas in transaction processing systems, into an access structure that better fits the data consumers. Also, and not less importantly, proposed designs perform well on traditional data warehouse technologies given the expected query patterns.
Commonly-used terms to describe the data warehouse design layers include:
- Operational Data Store (ODS), storing a copy of the operational system,
- Data Warehouse (DW), or Enterprise Data Warehouse (EDW), consolidating data from multiple sources and maintaining history of the data,
- Data Marts (DM) provides a more user-friendly access method but also performance layer on top of the data warehouse. Data warehouses often include multiple data marts. Depending on the underlying technology data marts can be implemented as virtual definitions or separate physical structures.
The methods to develop a data warehouse are so well-understood and proven that companies like WhereScape have built a very successful business automating most of the steps involved in building one.
Why a real-time data warehouse?
The data warehouse and downstream BI solutions continue to be very successful analytical solutions for organizations looking to optimize core business processes, save costs, and minimize risks. Data warehouses have traditionally focused on consolidating data from a variety of transactional systems. Packaged Enterprise Resource Planning (ERP) applications, Supply Chain Management (SCM) solutions, Customer Relationship Management (CRM) data, but also many industry-specific and home-grown data sources feed the data warehouse. The need to transform the data into the data warehouse continues to be central to many data warehouse initiatives.
Not too long ago ETL jobs used to run once a day to populate analytical systems. The once-a-day approach worked well because systems typically had a period during the day (or rather, night) when the system was not very active, and data extract jobs – that would put a lot of load on the source transactional systems – could run. More recently, with systems more and more active 24×7 in a global, connected world, it is becoming less and less acceptable to initiate heavy data extraction jobs. Also, organizations see value in quicker access to analytical data to outsmart the competition, lower fraud, etc. Data closer to real-time is highly beneficial to such objectives, but of course, transformations still have to be performed.
Why log-based change data capture?
This is where heterogeneous data replication technology like HVR’s is extremely useful. With ongoing real-time data replication using log-based change data capture, there is no need for a regular bulk load between the source database and the ODS, and changes can be processed much closer to real-time.
HVR performs log-based change data capture (CDC) on several different database source technologies, including Oracle, SQL Server, all flavors of DB2, PostgreSQL, SAP HANA, MySQL and more. Log-based CDC, in contrast to trigger-based CDC, is asynchronous and has absolute minimal impact on the transaction processing applications. I have written several blog posts about log-based CDC, which you can find here:
Aside from offering log-based CDC for your data warehouse, the following are some additional benefits of using HVR for data replication over many alternative data replication options:
- HVR supports a multitude of technologies as a destination, on-premises and in the cloud, with full support to mix source and target technologies heterogeneously. Go here for a list of technologies HVR supports.
- Integration into the ODS is optimized for the destination technology, so that columnar and/or Massively Parallel Processing (MPP) databases can still keep up with high transaction volumes coming in from multiple sources, in near real-time.
- Based on the source table definitions, HVR will automatically map source table data types to compatible data types on the target with the ability to create the target table definitions.
- Initial load through HVR is aligned with ongoing CDC so that no source transactions are missed and there is no overlap between initial load and continuous integration.
- To validate the data – also in a heterogeneous configuration – HVR provides a powerful and flexible compare and repair capability.
- HVR can track actual transactions on the source systems, HVR can (very frequently) kick off ETL/ELT at a transactionally consistent point in time. This avoids users potentially seeing an updated view of some tables but stale data in related, other tables.
- Data delivery into the ODS can be configured to mark rows as deleted instead of actually processing delete operations as deletes. Such built-in transformation simplifies and optimizes downstream ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform).
- Also included as part of HVR, is an out-of-the-box transformation that creates a full audit trail of changes on source database records, including metadata about the change such as the system commit number, the type of operation, and the commit time on the source. This is particularly useful if you use the Data Vault design methodology.
To learn more about data replication in heterogeneous environments to support your real-time BI and data warehousing initiatives please sign up for one of our regular demos.
Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.