Data Lake or Data Warehouse?
Abandoning one for the other may not be the answer
Over the last few decades we’ve seen an explosion in the use of Data Warehousing within the IT industry; it was all the rage for doing analytics and reporting. This was a popular concept as companies could perform cross analysis of their large amounts of data quickly and efficiently to support management’s strategic decision-making process.
However, the last few years have disrupted the technology landscape with the emergence and adoption of new technologies such as Hadoop, NoSQL and analytical appliances. The reasons are sound: cheaper storage, better performance for specific types of reporting, and support for new sources such as sensor data, social media, geolocation, video and web logs, which are unstructured or semi-structured in format.
This exposes some of the complexities surrounding data warehousing. Data entering a data warehouse must be organized and structured to fit the data model, and it goes through a transformation process to ensure it is cleansed and fit for purpose. Beyond the variety of data, there is also the volume we are now generating, which is doubling every 18 months. For companies, it is a question of cost, not only for hardware but also for the workforce supporting the implementation.
With this in mind, it is no wonder that some consider replacing their data warehouse with an enterprise data lake built around these new technologies. All these positives raise the question, “Can I replace my data warehouse?”
From what we have observed in the marketplace and heard from customers about their logical reference architectures, there is still a need for data warehousing. For all of Hadoop’s hype, it is not yet mature enough to deliver the performance needed for complex queries and mixed workloads, and it lacks the features that made data warehousing a must, e.g. optimal indexing strategies, efficient complex table joins across terabytes of data, and an optimizer that determines the best path for queries.
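To make the volume claim concrete, here is a minimal arithmetic sketch of what "doubling every 18 months" implies. The function name `projected_volume_tb` and the 100 TB starting point are hypothetical and chosen only for illustration.

```python
def projected_volume_tb(current_tb, months, doubling_months=18):
    """Project data volume assuming it doubles every `doubling_months` months."""
    return current_tb * 2 ** (months / doubling_months)


# Hypothetical starting point: 100 TB today.
after_three_years = projected_volume_tb(100, 36)  # two doublings
after_six_years = projected_volume_tb(100, 72)    # four doublings
```

Under that assumption, storage that is adequate today is a quarter of what is needed three years out, which is why cost per terabyte weighs so heavily in the comparison.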
Advantages and disadvantages of the Data Lake and Data Warehouse:
|Data Warehouse|Vs.|Data Lake|
|---|---|---|
|Rigid, structured and needs to be processed|DATA|Structured, semi-structured and unstructured in its raw format|
|Schema on write|DATA ANALYSIS STRATEGY|Schema on read|
|Enterprise, strategic reporting|REPORTING|Discovery, operational reporting|
|Expensive as data volumes grow|STORAGE|Uses commodity hardware, which is typically cheaper|
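The "schema on write" versus "schema on read" row in the table above can be illustrated with a small sketch. It uses only the Python standard library; `SALES_SCHEMA`, the function names and the sample records are hypothetical stand-ins, not the behavior of any particular warehouse or lake product.

```python
import json

# Hypothetical warehouse table schema: column name -> required type.
SALES_SCHEMA = {"order_id": int, "amount": float, "region": str}


def write_to_warehouse(table, record):
    """Schema on write: reject the record unless it fits the model."""
    if set(record) != set(SALES_SCHEMA):
        raise ValueError(f"columns do not match schema: {sorted(record)}")
    for column, expected in SALES_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def write_to_lake(lake, raw_line):
    """Schema on read: store the raw text untouched, whatever its shape."""
    lake.append(raw_line)


def read_amounts_from_lake(lake):
    """Apply structure only at query time, skipping records without the field."""
    amounts = []
    for raw_line in lake:
        record = json.loads(raw_line)
        if "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts


warehouse_table = []
lake = []

write_to_warehouse(warehouse_table, {"order_id": 1, "amount": 19.99, "region": "EMEA"})

# The lake accepts both a loosely typed order and an unrelated sensor reading.
write_to_lake(lake, '{"order_id": 2, "amount": "5.50", "region": "APAC"}')
write_to_lake(lake, '{"sensor": "t-17", "reading": 21.4}')

lake_amounts = read_amounts_from_lake(lake)
```

The trade-off is visible in where the work happens: the warehouse pays the validation cost once at load time, while the lake defers it to every reader.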
Data Lake + Data Warehouse = Biggest Impact
But when you combine these technologies, you offset most of the disadvantages of each and keep the benefits of both. Granted, not every company will need all of these technologies at once; what you deploy will depend on your end-user requirements and the data you are pulling in. What is fundamental in these architectures is the combination of a data lake and a data warehouse working in a unified manner.
The old concept of having a staging area within a data warehouse is replaced by the data lake, allowing all forms of data to be ingested in their original format and stored on commodity hardware to lower the cost of storage. This gives the business operational use of its raw data to perform discovery and ad hoc analytics, looking for patterns and finding the questions that the business doesn’t yet know to ask.
The raw data can then be massaged and refined, loaded into the data warehouse and blended with data from other functions so that analysts can produce more strategic reports, answering the questions the business already knows to ask, such as “What will my sales figures look like next year?”.
This has opened up more use cases for the data lake, such as customer 360, predictive maintenance and risk/fraud detection, as we ingest more and more data. This in itself presents new challenges for IT in delivering the data in a timely manner to meet the needs of the business. We have already tackled two of the three Vs of Big Data: volume and variety. Velocity is all about speed and making the data available in real time. Current methods of ingesting data into the data lake, such as ETL tools or Sqoop, are batch oriented by nature.
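The flow described above, raw data landing in the lake and a refined subset being loaded into the warehouse, might look like the following sketch. An in-memory SQLite database stands in for the warehouse and a Python list stands in for the lake; the `refine` function, the `sales` table and the sample records are all hypothetical.

```python
import json
import sqlite3

# Raw events as they might land in the lake (hypothetical sample data).
raw_lake = [
    '{"order_id": 1, "amount": "19.99", "region": " emea "}',
    '{"order_id": 2, "amount": "5.50", "region": "APAC"}',
    '{"order_id": 3, "amount": null, "region": "EMEA"}',  # incomplete record
    '{"sensor": "t-17", "reading": 21.4}',                 # not a sales event
]


def refine(raw_line):
    """Cleanse one raw record; return a warehouse row, or None if unusable."""
    record = json.loads(raw_line)
    if record.get("order_id") is None or record.get("amount") is None:
        return None
    region = str(record.get("region") or "").strip().upper()
    return (int(record["order_id"]), float(record["amount"]), region)


# SQLite stands in for the warehouse: a fixed, typed model.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, region TEXT NOT NULL)"
)

rows = [row for row in map(refine, raw_lake) if row is not None]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
warehouse.commit()

# A strategic-style query runs against the cleansed, structured copy.
totals = dict(
    warehouse.execute(
        "SELECT region, ROUND(SUM(amount), 2) FROM sales GROUP BY region"
    )
)
```

Note that `raw_lake` is left untouched: the unrefined records, including the sensor reading the warehouse model has no place for, remain available for discovery work.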
Typically, these methods are not great at real-time or on-demand data access, where quick response to the data is required. Companies need real-time data integration, whether that’s feeding the data lake, or the data warehouse.
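The difference between batch-oriented loading and continuous delivery can be sketched with a toy model. The classes below are hypothetical and do not represent how ETL tools, Sqoop or any real-time integration product work internally; they only show why data written after a batch run stays invisible until the next run.

```python
class SourceSystem:
    """Stand-in for an operational system that records changes in order."""

    def __init__(self):
        self.change_log = []

    def write(self, record):
        self.change_log.append(record)


def batch_load(source, target):
    """Batch style: replace the target with a full copy at each scheduled run."""
    target.clear()
    target.extend(source.change_log)


class IncrementalFeed:
    """Incremental style: forward only the changes not yet delivered."""

    def __init__(self, source):
        self.source = source
        self.offset = 0  # position of the next undelivered change

    def deliver(self, target):
        new_changes = self.source.change_log[self.offset:]
        target.extend(new_changes)
        self.offset += len(new_changes)
        return len(new_changes)


source = SourceSystem()
batch_target, stream_target = [], []
feed = IncrementalFeed(source)

source.write({"order_id": 1})
batch_load(source, batch_target)   # the scheduled batch run
feed.deliver(stream_target)

source.write({"order_id": 2})      # arrives after the batch window closed
delivered = feed.deliver(stream_target)

# Number of source changes each target has not yet seen.
batch_lag = len(source.change_log) - len(batch_target)
stream_lag = len(source.change_log) - len(stream_target)
```

In this sketch the batch target is one record behind until its next scheduled run, while the incrementally fed target is current as soon as `deliver` is called.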