From ETL to Pipeline: The Evolution of Integration
Back in the day, there were only a couple of ways to move data around an organization. With Change Data Capture (CDC), instead of re-uploading an entire file of, say, quarterly or weekly sales, you would transport and load only the data that had changed. But by and large, the most common form of data movement was Extract, Transform, Load (ETL). With ETL, data was extracted from an enterprise planning system, transformed into a shape that would fit a data warehouse (a relational model, or table-type form), then loaded for analysis.
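The contrast between the two patterns can be shown in a minimal sketch. The table layout and the change log below are illustrative only, not any vendor's format:

```python
# Full reload (classic ETL era) vs. CDC-style delta loading.
# The rows and change log here are hypothetical examples.

def full_reload(source_rows):
    """ETL-era pattern: re-extract and re-load every row each cycle."""
    return list(source_rows)

def cdc_load(target, change_log):
    """CDC pattern: apply only inserts/updates/deletes since the last sync."""
    for op, key, row in change_log:
        if op in ("insert", "update"):
            target[key] = row
        elif op == "delete":
            target.pop(key, None)
    return target

target = {1: {"region": "West", "sales": 100}}
changes = [
    ("update", 1, {"region": "West", "sales": 120}),  # one changed row
    ("insert", 2, {"region": "East", "sales": 75}),   # one new row
]
cdc_load(target, changes)
print(target)  # only two operations moved, not the whole table
```

With CDC, the volume transported scales with the number of changes rather than the size of the table.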
Over time, the need for more datasets grew, and business users were clamoring to get their data into a warehouse. This led to batch windows. A batch window is the specific timeframe in which a batch of data must be pulled, transformed, and loaded into the data warehouse or another system. As businesses became more data-driven, they were left with hundreds, if not thousands, of batch windows that had to be met. If one job broke, the failure dominoed through everything scheduled after it, leaving end users very unhappy.
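The cascading-failure problem is easy to see in a sketch. The job names, durations, and window below are hypothetical:

```python
# Hedged sketch: do the nightly batch jobs fit inside their window?
# Jobs run sequentially, so one slow job pushes everything after it late.
from datetime import datetime, timedelta

window_start = datetime(2024, 1, 1, 1, 0)  # 1:00 AM
window_end = datetime(2024, 1, 1, 5, 0)    # 5:00 AM
jobs = [("sales_extract", 90), ("crm_extract", 60), ("load_warehouse", 120)]  # minutes

clock = window_start
missed = []
for name, minutes in jobs:
    clock += timedelta(minutes=minutes)
    if clock > window_end:
        missed.append(name)  # this job, and everything downstream, is late

print(missed)  # ['load_warehouse']
```

Here 270 minutes of work is squeezed into a four-hour window, so the last job misses it; multiply this by hundreds of interdependent windows and the fragility of the model becomes clear.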
The world of big data fundamentally changed how this is done. The systems that we had in place could not handle the size, velocity, and variety of big data.
Now that we’re in the cloud-enabled world, we’re seeing mind-blowing innovations spinning out of Google, Amazon with Amazon Web Services, and Microsoft. During this episode of the Query This podcast, host Eric Kavanagh spoke with Mike Pickett, SVP of Business Development at Talend, and Mark Van de Wiel, CTO at HVR, about how modern “data pipelines” are leading to new business models, the reinvention of old business processes, and cost savings.
Here are a few key takeaways from the episode.
The Data Pipeline
A data pipeline is the fluid, seamless movement of multiple data streams from one system to another. Pipelining is about putting the pieces together. As businesses integrate hybrid and multi-cloud solutions with their on-premises platforms, they have to find a way to orchestrate the movement of data among these systems. It’s not just a slight evolution of the ETL model; it’s an order of magnitude more efficient at moving and integrating data at scale.
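One way to picture a pipeline is as composable stages streaming records from source to sink. The stage names and sample records below are made up for illustration:

```python
# Illustrative pipeline: extract -> transform -> load, composed as generators
# so records stream through without materializing the whole dataset.

def extract():
    """Hypothetical source: raw records with string-typed fields."""
    yield from [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]

def transform(records):
    """Reshape each record to fit the target system."""
    for rec in records:
        rec["amount"] = float(rec["amount"])
        yield rec

def load(records, sink):
    """Deliver transformed records to the destination."""
    for rec in records:
        sink.append(rec)

warehouse = []
load(transform(extract()), warehouse)
total = round(sum(r["amount"] for r in warehouse), 2)
print(total)  # 24.99
```

Because each stage is a separate, swappable piece, sources, transformations, and destinations can be recombined as systems are added, which is the orchestration problem pipelining solves.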
When businesses move from an existing legacy pipeline to a new pipeline built on newer technologies, it creates an opportunity to dramatically increase data quality, reduce development time, and extend the pipeline for the future, as well as to change the cadence at which data is processed.
On the flip side, businesses with established pipelines will often push their data straight through the pipeline as quickly as possible into the hands of business users. Or they’ll move the data straight into programs for analysis or transformation, or apply a data mining model or machine learning for more in-depth analysis.
The ability to leverage new datasets for additional context at scale opens the door to new business processes and new ways of solving old problems. If you look at the financial services industry, nimble and data-intensive startups like Fundbox are coming in and lending money to corporations in a much more agile way than a Chase Bank or a Wells Fargo.
Data is front and center for a lot of businesses. Cloud technologies enable enterprises to collect, store, and process data at unprecedented volumes. With that comes an appetite, and an extreme one, for getting more data faster and in real time.
As ships were coming into port with liquid petroleum, one client had to answer crucial questions. Which ship is coming? How long will it take the vessel to offload? How long will it take the petroleum to get where it needs to go? The client used cloud technologies and pipelines not only to answer these questions but also to make real-time trade decisions, which optimized the business and improved its efficiency.
Businesses in almost every industry are finding value and establishing a competitive advantage by sharing data across their partner ecosystems. By unlocking the data that resides in internal systems and combining it with publicly available datasets, businesses are monetizing new products and building better offerings.
Businesses are taking it to the next level by combining data with artificial intelligence and machine learning. If you’re not ready to build your own AI or machine learning, try a library of algorithms like TensorFlow, developed by Google. The real questions are: How do you use them? When do you use them? How do you fit them into your existing environment? If you understand your data and find ways to leverage it using data tools, you’re going to be way ahead of the game.
In the past, trying a new project was a dangerous thing. It was expensive, it was risky, and people didn’t want to lose their jobs by attempting a big new project and failing. That’s all changing, and there are a couple of critical components to that transformation: the cloud and the ability to move and scale data. Now it’s exciting to be a business innovator, because you can easily rent space in the cloud, add some data, and see if the business process works. If it does, you can promote it to production across the organization at a relatively low cost.
One of the most significant benefits is improving the customer service experience by feeding the right contextual data either to the proper process or to the right person at the right time. For consumers, data is coming off their phones in incredible quantities. Pipelining is about putting the pieces together and processing the real-time data, so business users can make decisions like, “What’s the next best offer I can give this person?”
A real estate database company funnels leads through its real-time data pipeline. If you call the business and leave a message or send a message via their website, within 10 minutes, you will receive a response. The ability to act fast with real-time analytics and customer data creates a holistic view of how businesses can provide the customer with the best possible experience.
HVR offers everything you need for real-time data replication in one tool: initial load and table creation, log-based change data capture, data validation, and visual statistics on how your data is moving so that you can optimize your data flows.
Are you interested in learning more?