Diving into the Data Lake
The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.
Source: Gartner.com, April 2015
The concept of the enterprise data lake is one of the most talked-about ideas in the modern data-warehousing world. (If you’re not already familiar with how these two differ, review our data lake vs data warehouse post.) It is also one of the most divisive concepts, with analysts, vendors and users split on whether this approach is an analytics breakthrough or an enterprise-level garbage bin for data that will never be looked at again.
And as usual in these controversies, both sides have a point. The key is using this approach where it fits the need. When an organization needs access to a large pool of diverse data for which the schema and data requirements cannot be defined until the data is queried, a data lake can be an excellent solution.
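To make "schema defined at query time" concrete, here is a minimal sketch of the idea, often called schema-on-read. The record layout, field names (`engine_id`, `egt_c`, and so on) and helper functions are hypothetical and chosen only for illustration; a real lake would hold files in object storage rather than an in-memory list.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"engine_id": "E1", "ts": "2015-04-01T10:00:00", "egt_c": 612.5}',
    '{"engine_id": "E2", "ts": "2015-04-01T10:00:00", "vibration_mm_s": 3.1}',
    '{"engine_id": "E1", "ts": "2015-04-01T10:05:00", "egt_c": 640.0, "note": "climb"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the fields this particular query needs.

    Fields that a record does not carry come back as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


def max_egt_by_engine(lines):
    """Highest exhaust gas temperature per engine, skipping records without one."""
    result = {}
    for row in read_with_schema(lines, ["engine_id", "egt_c"]):
        if row["egt_c"] is None:
            continue
        engine = row["engine_id"]
        result[engine] = max(result.get(engine, row["egt_c"]), row["egt_c"])
    return result


print(max_egt_by_engine(RAW_LINES))
```

The point of the sketch is that the stored records never change: a different question would simply pass a different field list to `read_with_schema`.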
- Let’s consider the example of a modern aircraft engine that generates terabytes of data on every flight. The aircraft manufacturer uses the collected data to work out how to run the engine as efficiently as possible and to identify patterns of failure. Given the cost of operating an aircraft engine, and even more so the cost of downtime, it is no surprise that customers (read: airlines) are willing to pay for this kind of data.
- But the story does not end there. The manufacturer learns a great deal about the engines it has produced and sold to its many customers. Think about upcoming maintenance: if the manufacturer knows that 1,000 turbine blades will have to be replaced in the next three months, it had better make sure those blades are produced in time for the maintenance, or face costly outages. In other words, analysis of sensor-generated data affects operational processes and touches on “boring” ERP data such as parts inventory and supplies. A massive analysis of Internet of Things (IoT) data thus arguably becomes a query that should be related to data from the ERP and other traditional systems.
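The link between sensor analytics and ERP data described above can be sketched as a small join. Everything here is an assumption made for illustration: the forecast rows stand in for the output of some sensor-data analysis, the stock dictionary stands in for an ERP inventory extract, and the part numbers and the 90-day horizon are invented.

```python
from collections import Counter

# Hypothetical result of sensor analytics: estimated days until each
# engine's turbine blades need replacing.
BLADE_FORECAST = [
    {"engine_id": "E1", "part_no": "BLADE-A", "days_to_replacement": 30},
    {"engine_id": "E2", "part_no": "BLADE-A", "days_to_replacement": 75},
    {"engine_id": "E3", "part_no": "BLADE-B", "days_to_replacement": 120},
    {"engine_id": "E4", "part_no": "BLADE-A", "days_to_replacement": 10},
]

# Hypothetical ERP extract: units currently in stock per part number.
ERP_STOCK = {"BLADE-A": 2, "BLADE-B": 5}


def parts_shortfall(forecast, stock, horizon_days=90):
    """Return parts whose forecast demand within the horizon exceeds stock."""
    demand = Counter(
        row["part_no"]
        for row in forecast
        if row["days_to_replacement"] <= horizon_days
    )
    return {
        part: needed - stock.get(part, 0)
        for part, needed in demand.items()
        if needed > stock.get(part, 0)
    }


print(parts_shortfall(BLADE_FORECAST, ERP_STOCK))
```

In this toy data, three engines need a `BLADE-A` within 90 days but only two are in stock, so the function reports a shortfall of one. Neither data set answers the question alone; the result only appears once the IoT-derived forecast is related to the ERP inventory.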