Data Lake or Data Warehouse?
Abandoning one for the other may not be the answer
Data warehouses have been around for a long time. The data warehouse concept dates back to the 1970s and 1980s with initial references to Decision Support Systems. Teradata was an early player in the database market for Data Warehousing, with powerful yet expensive technology.
Over the last few decades, we’ve seen an explosion in the use of data warehousing within the IT industry; it was all the rage for analytics and reporting. Data warehouse appliances like Netezza, Greenplum, Aster Data, and Vertica developed hardware and software optimized for analytical workloads. The data warehouse was a popular concept as companies could perform cross-analysis of their large amounts of data quickly and efficiently to support management’s strategic decision-making process.
However, Hadoop and related Apache technologies disrupted the technology landscape. Cheaper storage and scalability for specific reporting types enabled support for new sources like sensor data, social media, geolocation, video, and weblogs, which are unstructured or semi-structured in format. The major cloud vendors, including AWS, Azure, and GCP, have embraced the concepts behind Hadoop and related Apache technologies. Storage and compute services with capabilities equivalent to the Hadoop ecosystem are now available on demand, with pricing based on usage.
These developments expose some of the complexities surrounding data warehousing. Data entering the data warehouse must be organized and structured to fit the data model, going through a transformation process to ensure it is cleansed and fit for purpose. In addition to the challenges posed by the variety of data, there is also the volume of data we are generating, which by some estimates doubles every 18 months. For companies, it is a question of cost, not only of hardware but also of the workforce supporting the implementation.
With this in mind, it is no wonder that some may think of replacing their data warehouse with an enterprise data lake built around Hadoop and related technologies. But with all these positives, it raises the question, “Can I replace my data warehouse?”
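To put the 18-month doubling figure in perspective, here is a minimal sketch of the arithmetic. The function name and the 100 TB starting volume are hypothetical, chosen only for illustration, and the growth rate itself is the estimate quoted above rather than a measured value.

```python
def projected_volume(current_tb: float, months: float, doubling_months: float = 18.0) -> float:
    """Project data volume, assuming it doubles every `doubling_months` months."""
    return current_tb * 2 ** (months / doubling_months)


# Hypothetical starting point: 100 TB of data today.
for years in (1, 3, 5):
    print(f"after {years} year(s): {projected_volume(100, years * 12):,.0f} TB")
```

Under this assumption, storage needs quadruple every three years, which is why cost per terabyte weighs so heavily in the decision.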
Can I replace my data warehouse?
From what we have observed in the market and in talking to customers about their logical reference architecture, there is still a need for data warehousing. For all of its hype, Hadoop lacks the features that made data warehousing a must for delivering performance on complex queries and mixed workloads, e.g., optimal indexing strategies, efficient complex table joins across terabytes of data, and an optimizer for determining the best execution path for queries.
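As a small illustration of what an index and a query optimizer contribute, the sketch below uses SQLite from the Python standard library as a stand-in for a relational engine. SQLite is not a data warehouse, and the `sales` table, its columns, and the index name are hypothetical. The point is only that the optimizer picks a different execution path once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [(f"region_{i % 50}", float(i)) for i in range(5000)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE region = 'region_7'"


def query_plan(connection: sqlite3.Connection, sql: str) -> str:
    """Return the optimizer's chosen plan as a single line of text."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn, QUERY)   # a full scan of the table
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_with_index = query_plan(conn, QUERY)      # a search that uses the index

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than assuming a specific string.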
Advantages and disadvantages of the data lake and data warehouse:
|Data Warehouse|vs.|Data Lake|
|Rigid, structured, and needs to be processed|DATA|Structured, semi-structured, and unstructured in its raw format|
|Schema on write|DATA ANALYSIS STRATEGY|Schema on read|
|Enterprise, strategic reporting|REPORTING|Discovery, operational reporting|
|Expensive as data volumes grow|STORAGE|Uses commodity hardware, which is typically cheaper|
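The difference between schema on write and schema on read can be sketched in a few lines. This is a simplified illustration, not how any particular product implements it. The schema, the function names, and the sample records are all hypothetical, and plain Python lists stand in for the warehouse table and the lake storage.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
WAREHOUSE_SCHEMA = {"order_id": int, "customer": str, "amount": float}


def write_to_warehouse(table: list, record: dict) -> None:
    """Schema on write: validate the record before it is stored."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def write_to_lake(lake: list, raw_line: str) -> None:
    """Schema on read: store the raw text untouched."""
    lake.append(raw_line)


def read_amounts_from_lake(lake: list) -> list:
    """Apply structure only at query time, skipping lines that do not fit."""
    amounts = []
    for raw_line in lake:
        try:
            amounts.append(float(json.loads(raw_line)["amount"]))
        except (ValueError, KeyError, TypeError):
            continue
    return amounts


warehouse, lake = [], []
write_to_warehouse(warehouse, {"order_id": 1, "customer": "acme", "amount": 19.5})
write_to_lake(lake, '{"order_id": 2, "amount": "7.25", "clickstream": ["home", "cart"]}')
write_to_lake(lake, "free-text sensor log line")  # accepted as-is, interpreted later
print(read_amounts_from_lake(lake))
```

The warehouse rejects anything that does not match its model at load time, while the lake accepts everything and leaves interpretation to whoever reads the data.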
Data Lake + Data Warehouse = Biggest Impact
Combining these technologies lets you offset the disadvantages of each and reap the benefits of both. This is what some of the relatively new entrants to the market have done. Snowflake created a cloud-based data platform that follows traditional data warehouse concepts while also making it suitable for storing and analyzing semi-structured and unstructured data.
Databricks, founded by the original creators of Apache Spark, started with support for any type of data in cloud storage and with scalability built on technologies from the Hadoop ecosystem. However, with the introduction of ACID transactions in its Delta Lake in 2019, Databricks is making its technology look and feel more like an analytical, relational database.
Does blending the data warehouse and data lake eliminate the need to curate and cleanse data warehouse data? No, but with all data in a single location, at least you can leverage the power of the underlying scalable platform and get the data into your data warehouse with lower latency.
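The curation step that remains can be sketched as follows. This is a minimal illustration under stated assumptions: the raw events, the `cleanse` and `load_curated` helpers, and the `readings` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake, one JSON document per line.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2021-03-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": null, "ts": "2021-03-01T10:00:05"}',
    "corrupted line",
    '{"device": " Sensor-1 ", "temp_c": 22.0, "ts": "2021-03-01T10:01:00"}',
]


def cleanse(raw_line: str):
    """Return a (device, temp_c, ts) tuple, or None if the record is unusable."""
    try:
        event = json.loads(raw_line)
        return (event["device"].strip().lower(), float(event["temp_c"]), event["ts"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return None


def load_curated(connection: sqlite3.Connection, raw_lines: list) -> int:
    """Cleanse raw lines and insert the usable ones into the curated table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(device TEXT NOT NULL, temp_c REAL NOT NULL, ts TEXT NOT NULL)"
    )
    rows = [row for row in map(cleanse, raw_lines) if row is not None]
    connection.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_curated(conn, RAW_EVENTS)
print(loaded, "of", len(RAW_EVENTS), "raw events loaded")
```

The raw events stay in the lake untouched, so a record rejected here can still be reprocessed later if the cleansing rules change.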
Which technology you deploy will depend on your end-user requirements and the data you are pulling in. But what is fundamental in these architectures is the combination of a data lake and data warehouse working in a unified manner.
If you are exploring options for data integration, contact us. We’d love to chat.