Near Real-Time Data Warehouse
Near Real-Time Data Warehouses: A Primer
Business leaders are constantly on the lookout for ways to improve responsiveness to customers. And new technologies are helping them meet that objective. Take data warehousing technology. Because data warehouses once only operated in batch mode, typically processing updates at night, business managers could only use their information to address events after the fact. Today, near real-time data warehouse technology is available that updates the data warehouse far more frequently– in close to real time—so that users can respond to issues as they occur.
Here’s what you need to know to get started with this new technology.
A data warehouse (DW) or data mart is a database optimized for query retrieval. Extensive transformations are typically applied to get from application schema(s), which are typically normalized, to the more commonly denormalized data warehouse or data mart schema.
Despite the hype surrounding Hadoop, data warehouses and data marts are commonly implemented using a relational database (columnar or not) like Oracle, SQL Server, Greenplum or Teradata, or in a multi-dimensional database like SQL Server Analysis Services or Oracle Hyperion Essbase. Increasingly, data warehouses are also available in the public cloud, which makes them easier to administer and relatively inexpensive. Some vendors, like Teradata, offer hybrid solutions that run on a managed cloud, virtualized infrastructure or private cloud, and on-premises while others, such as Snowflake are cloud pure plays.
With the proliferation of low-cost cloud-based solutions, we expect to see more companies who are willing to re-evaluate their data warehouse architecture, and consider near real-time solutions.
From batch to near real-time
Traditionally data warehouses have been updated during a batch window, often daily (nightly), and in some cases even less frequently. The near real-time data warehouse eliminates the large batch window and updates the DW much closer to real-time. However, real-time updates are often impractical due to the nature of the transformations between the source transaction schema and the destination DW or data mart. Transformations may require (extensive) table joins and in some cases aggregations. Business users need to specify the real-time requirement, which can be different for different organizations.
As more data sources are hosted in the cloud, organizations will need to ensure that their near real-time solution can accommodate both cloud and on-premises based data sources.
The near real-time data warehouse often starts with an Operational Data Store (ODS) that represents an almost exact copy of the source database schema. The operational data store is fed by a real-time data replication technology (e.g. HVR) that uses log-based change data capture on the source, and real-time data integration into the ODS. One area where the ODS may deviate from the source schema is the introduction of a logical delete column (“soft delete” in HVR), which indicates whether a row was deleted on the source database. The downstream transformation process can use this logical delete information to quickly identify deleted rows instead of performing an often expensive outer join between the table in the ODS and the data warehouse table to find out which rows are no longer in the source.
Learn about the differences between a data warehouse and a data lake.
Controlling the batch load
Most real-time data replication technologies maintain transactional consistency between source and target with a committed state on the target representing a committed state of the source. The data transformation process should take advantage of this knowledge to process a consistent set of changes into the DW or data mart. You can achieve this consistent transformation in a couple of different ways such as by:
- Introducing an extra column on every table to represent the data source’s ever-increasing commit sequence number and using this as a filter to identify what rows should be processed.
- Using a plugin during the data replication to execute the transformation process at a consistent point in time.
You can find cloud-based services that will help you perform these processes. In traditional, on-premises implementations, you’ll need a separate software installation. HVR makes it straightforward to implement both options.