An Agile Approach to Cloud Data Warehouse Adoption

With the volume of data across every enterprise growing at hyper speed, organizations are seeking to modernize their existing data warehouse infrastructure in order to quickly scale and deliver value to business users.

Today, organizations are increasingly driving new and expanding data warehouse deployments to the cloud. Cloud-native data warehouses and data lake services such as Snowflake, Amazon Redshift, and Azure Data Lake Storage Gen2 outshine their on-premises counterparts in elasticity, ease of deployment, scalability, and security. With a cloud-native data warehouse, organizations no longer need to purchase, install, configure, or manage the infrastructure and personnel required for data warehouse administration, scaling, optimization, availability, and data protection.

But as organizations that depend on data as a lifeline for their business move to cloud-based data warehouses en masse, they may want to reconsider the tools they use to manage their data warehouse stack.

ETL for Cloud-Based Data Warehousing: A Square Peg in a Round Hole?

For the past two decades, organizations have used the extract, transform, load (ETL) approach to provide business users with clean and ready-to-analyze data. These ETL solutions move data from application databases to staging areas, where it is profiled, cleansed, cataloged, transformed, aggregated, governed, and curated before being loaded into data warehouses for access by end users.

But in the age of cloud-based data warehousing, this approach falls short of delivering the necessary agility across the pipeline, from data ingest to data consumption.

Take the legacy ETL example in which a highly optimized select query is executed against the source database to return the transactions committed within a specified batch window. Such an operation typically relies on a last-update timestamp or a where clause that the database must process before returning the query output. The query output must then be moved across the network (a WAN, in the case of cloud data warehousing) and loaded into staging tables. If the transaction-processing application is very busy, the database can return several terabytes of data within the batch window. Depending on the optimizations in the query, the presence (or absence) of primary keys and unique indexes, joins, sorts, included data types, and so on, the batch query can significantly degrade database and therefore application performance.
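
To make the pattern concrete, here is a minimal sketch of the kind of timestamp-bounded batch extract described above. The orders table, its columns, and the driver are hypothetical stand-ins for whatever the source application actually uses; any DB-API 2.0 driver follows the same shape.

```python
# Minimal sketch of a timestamp-bounded batch extract (all names hypothetical).
# The source database must evaluate this predicate for every candidate row
# before anything is returned, which is where the OLTP impact comes from.
import sqlite3  # stand-in for any DB-API 2.0 driver (Oracle, SQL Server, ...)

BATCH_QUERY = """
    SELECT order_id, customer_id, amount, status, last_updated
    FROM orders
    WHERE last_updated >= ?   -- high-water mark from the previous batch
      AND last_updated <  ?   -- end of the current batch window
    ORDER BY last_updated
"""

def extract_batch(conn, window_start, window_end):
    """Pull every row changed inside the batch window, bound for staging."""
    cur = conn.cursor()
    cur.execute(BATCH_QUERY, (window_start, window_end))
    # On a busy system this can be terabytes of rows, all of which must
    # then cross the network (a WAN for a cloud warehouse) to staging.
    return cur.fetchall()
```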

When an application is customer-facing and revenue-generating, most DBAs kill queries that impact OLTP database performance. Additionally, when the deep, wide result sets these queries generate are transferred over wide area networks, the available bandwidth can easily become a bottleneck, and the transfer cost-prohibitive. And when such large datasets are bulk-inserted into a cloud data warehouse, compute costs quickly add up across batch windows.

Because data is transformed, curated, or embellished before business users consume it from the data warehouse, these solutions cannot easily surface transient row updates, soft deletes, hard deletes, or slowly changing dimensions.
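
To see why, consider a tiny, self-contained sketch under the assumption of a timestamp-based incremental extract; every table, row, and timestamp here is invented for illustration:

```python
# Why a timestamp-based batch extract loses history (all data hypothetical).

def batch_extract(table, high_water_mark):
    """Incremental extract: returns only rows that still exist and changed."""
    return [r for r in table.values() if r["last_updated"] >= high_water_mark]

# State of the source table at the end of a batch window. During the window,
# order 42 was briefly 'PAID' (t=101) before becoming 'SHIPPED' (t=105),
# and order 7 was hard-deleted at t=103 -- so its row is simply gone.
orders = {42: {"order_id": 42, "status": "SHIPPED", "last_updated": 105}}

print(batch_extract(orders, high_water_mark=100))
# [{'order_id': 42, 'status': 'SHIPPED', 'last_updated': 105}]
# The transient 'PAID' state and the delete of order 7 never reach
# the warehouse, which still believes order 7 is live.
```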

Additionally, with the legacy ETL approach, data is often denormalized and aggregated for read optimization, storage, and infrastructure reasons, based on assumptions about the types of queries business users will run against the data warehouse. Today's business users expect requirements to continuously evolve, demanding schema changes, new tables, new slicing-and-dicing options, and so on. As a result, the traditional approach is not nimble enough to support continuous integration and continuous delivery for data warehouse developers.

Meet ELT for Cloud Data Warehousing

Today, a new approach is available for data migration and transformation that extracts data, loads it into the data warehouse and then performs the necessary transformations.

These solutions take advantage of new capabilities found in cloud-based data warehouses and data lakes such as Snowflake. These modern data warehouses deliver massive processing capability with elastic compute to handle today's data volumes, eliminating friction in the data delivery pipeline. Data is ingested straight from the busy application database into the cloud data warehouse or data lake. There is no need to stage, denormalize, or pre-aggregate data before it lands in its final destination. With raw data available in the data lake, the solution applies transformation, schema filtering, sorting, and curation routines at run time. This preserves the raw nature of the data, and curation is performed on demand as business users' needs evolve.
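
As an illustration of the pattern, an ELT flow reduces to two steps: land the raw data, then transform it inside the warehouse. The SQL below is Snowflake-flavored, but the stage, schemas, and view (raw.orders, @landing_stage, analytics.daily_revenue) are hypothetical, and connection setup for your warehouse's driver is assumed.

```python
# Illustrative ELT flow. Object names are hypothetical; a DB-API-style
# cursor connected to the warehouse is assumed.

LOAD_RAW = """
    COPY INTO raw.orders                      -- E + L: land rows untouched,
    FROM @landing_stage/orders/               -- no staging-area modeling
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
"""

TRANSFORM_AT_READ = """
    CREATE OR REPLACE VIEW analytics.daily_revenue AS
    SELECT CAST(order_ts AS DATE) AS order_date,
           SUM(amount)            AS revenue
    FROM raw.orders                           -- T: curation runs inside the
    WHERE status <> 'CANCELLED'               -- warehouse, on demand, so new
    GROUP BY CAST(order_ts AS DATE)           -- questions only need new views
"""

def run_elt(cursor):
    cursor.execute(LOAD_RAW)
    cursor.execute(TRANSFORM_AT_READ)
```

Because the transformation is just a view over raw data, a new business question typically means a new view rather than a rebuilt pipeline.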

Log-based CDC and ELT: Kicking Data Ingest Up a Notch

Change Data Capture (CDC) enhances ELT by transferring data from the source to a cloud data lake in real time, as transactions are committed on the source systems, giving users near-instant access to information in the data warehouse. Because only changed data moves over the WAN, CDC eliminates the need to push a large, resource-clogging batch of data across the network at preset intervals.
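
A minimal sketch of what applying a CDC stream looks like on the target side follows; the event shape is invented purely for illustration (real products such as HVR define their own formats).

```python
# Applying a log-based CDC stream to a target table (event shape hypothetical).

def apply_change(target: dict, event: dict) -> None:
    """Apply one committed change event, in commit order, to the target."""
    if event["op"] in ("insert", "update"):
        target[event["key"]] = event["row"]   # upsert the new row image
    elif event["op"] == "delete":
        target.pop(event["key"], None)        # deletes arrive as explicit
                                              # events, never inferred

# Only these three small events cross the WAN -- not a terabyte batch.
events = [
    {"op": "insert", "key": 7, "row": {"order_id": 7, "status": "NEW"}},
    {"op": "update", "key": 7, "row": {"order_id": 7, "status": "PAID"}},
    {"op": "delete", "key": 7},
]

warehouse_orders: dict = {}
for e in events:
    apply_change(warehouse_orders, e)
print(warehouse_orders)  # {} -- the delete reached the target immediately
```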

Because log-based CDC reads directly from the source system's I/O cache or recovery logs, not from the database tables or APIs, it does not compete with the source system for disk or compute resources, leaving the source database's performance essentially untouched. Data is compressed on the source before it is sent across the network, further reducing bandwidth requirements and latency over long distances.

[Diagram: WhereScape, Snowflake, and HVR working together]

HVR partners with WhereScape to provide a complete ELT solution

HVR partners with WhereScape to provide a complete ELT solution for moving data to powerful, modern cloud-based data warehouses in real time. HVR's log-based CDC moves data from your operational systems to the cloud, data lake, or data warehouse (providing all the benefits described earlier). WhereScape then performs data transformations and other downstream curation, automated workflows, and documentation to push the data to the business layer, where it can be accessed by business intelligence and analytics solutions.

Together, HVR and WhereScape allow organizations to update the way they transfer data to keep up with advances in cloud-based data warehousing technology. Now organizations can meet customer demands for more timely information while minimizing the impact of data transfer on network bandwidth and performance.

For more information about how to modernize and optimize your data warehouse approach, contact us.

About Sreevani Abbaraju

Sreevani is the Director of Product Management for HVR.

