What is Streaming Data?
“Streaming data” is data that is continuously produced by a source and is processed as it arrives, without access to the complete set of data produced since streaming started. Radio and television are classic examples of streaming data. Nowadays it is fair to say that almost everyone who uses a computer or smartphone comes across streaming data in the form of digital media sent over the internet.
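To illustrate what processing without access to all of the data looks like in practice, here is a minimal Python sketch that keeps a running average over a stream of values. The function name `running_mean` and the sample readings are made up for illustration; the point is only that each value is handled as it arrives and nothing but a count and the current mean is retained.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per incoming value.

    Only the count and the current mean are kept in memory, so the full
    history of the stream is never needed.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the mean toward the new value.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings arriving one at a time.
readings = iter([10.0, 20.0, 30.0])
means = list(running_mean(readings))
```

The same pattern (small state, updated per event) underlies most stream processing, whether the state is an average, a counter, or a window of recent events.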
What is streaming data in today’s business world?
In the business world, there are many examples of streaming data, especially with the advent of connected devices and the Internet of Things (IoT). Consider the streams of data a modern airplane produces and the functions that can be automated based on them, such as the autopilot. But also consider the impact of incomplete assumptions in the design of a streaming data solution in an airplane, as with the Maneuvering Characteristics Augmentation System (MCAS) in the Boeing 737 MAX. And as we purchase more goods online and leave behind a trail of clicks and app usage, we open up a treasure trove for companies selling ads, like Google and Facebook, as well as for retailers who try to combine knowledge of our buying behavior with a stream of location data from mobile devices in order to maximize the likelihood that we buy from them again.
Streaming data technologies like Kafka, Amazon Kinesis Data Streams, and Microsoft Azure Event Hubs have made the implementation of powerful use cases easier by adding a query language or analytics service directly on top of the data streams, blending the worlds of data streaming and Complex Event Processing (CEP).
Consider these examples of streaming data:
- Based on trading patterns on the market, investment specialists want to adjust the portfolio they manage as soon as possible to minimize risk whilst maximizing the likelihood of profitability.
- Customer interactions, for example on a support call, can be streamlined based on a genuine 360-degree view of the customer that is continuously changing.
- Complex, highly specialized machinery like a locomotive, for which every bit of improved productivity can generate millions of dollars of extra revenue, requires carefully planned maintenance with unique parts and specific expertise to maintain optimum efficiency.
- Fraud occurs across industries. What relevant information can help you decide whether or not to approve a transaction that just hit the system?
- When do you decide that the amount of traffic on your website is no longer caused by regular users, but that you are instead the target of a DDoS (Distributed Denial of Service) attack?
- How do you optimize automotive traffic flow by adjusting traffic light timing, and switching flexible lanes to opposite directions?
As you think about the many use cases for streaming data in the business world, you may wonder, “where does the data come from?”
A lot of business data, in many cases some of the most important data in a business process, still goes through a relational database, for example the database behind an ERP system. How can incremental database changes become part of a data stream? This is where Change Data Capture (CDC) comes in. In particular log-based CDC, a method that reads changes asynchronously from the database’s transaction log, provides a good balance between minimal impact on the application making database changes and minimal latency between application commit and change data capture.
Transactions in a database can contain inserts, updates or deletes. To be processed as part of a data stream, each of these must be represented as a new row and enriched with metadata such as the operation type (insert, update, delete) and, depending on the technology, an indication of the exact order of the changes. If transactional consistency in the data stream matters, the data must also be published at a transactionally consistent point in time. With its knowledge of transaction boundaries, a log-based CDC technology can frequently indicate when a transactionally consistent state is reached, along with key attributes of that consistency point such as the commit timestamp on the source or the exact source system commit number.
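The row representation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not HVR's format or API: `LogEntry`, `to_stream_events`, and all field names are hypothetical, the sample log is made up, and the input is assumed to contain only committed changes, already grouped by transaction in commit order.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class LogEntry:
    """A simplified, hypothetical record read from a transaction log."""
    commit_number: int      # source commit number of the owning transaction
    commit_timestamp: str   # commit time on the source system
    operation: str          # "insert", "update" or "delete"
    table: str
    row: Dict[str, Any]


def _consistency_marker(entry: LogEntry) -> Dict[str, Any]:
    # Signals that everything up to this commit has been published.
    return {
        "type": "consistent",
        "commit_number": entry.commit_number,
        "commit_timestamp": entry.commit_timestamp,
    }


def to_stream_events(entries: Iterable[LogEntry]) -> Iterator[Dict[str, Any]]:
    """Turn log entries into append-only events with ordering metadata.

    Every insert, update and delete becomes a new event. A consistency
    marker is emitted after the last change of each transaction.
    """
    previous: Optional[LogEntry] = None
    sequence = 0
    for entry in entries:
        if previous is not None and entry.commit_number != previous.commit_number:
            yield _consistency_marker(previous)
        sequence += 1
        yield {
            "type": "change",
            "op": entry.operation,
            "table": entry.table,
            "sequence": sequence,  # exact order of changes in the stream
            "commit_number": entry.commit_number,
            "row": entry.row,
        }
        previous = entry
    if previous is not None:
        yield _consistency_marker(previous)


# Made-up sample: two transactions against a hypothetical "orders" table.
log = [
    LogEntry(100, "2020-01-01T10:00:00Z", "insert", "orders", {"id": 1, "qty": 2}),
    LogEntry(100, "2020-01-01T10:00:00Z", "update", "orders", {"id": 1, "qty": 3}),
    LogEntry(101, "2020-01-01T10:00:05Z", "delete", "orders", {"id": 1}),
]
events = list(to_stream_events(log))
```

A consumer that needs transactional consistency can buffer change events and apply them only when it sees the next marker; a consumer that only needs low latency can ignore the markers and process each change as it arrives.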
Log-Based CDC enables Real-Time Data Streams
HVR provides log-based CDC from a number of relational database technologies including Oracle, SQL Server, PostgreSQL, MySQL and others, supporting data delivery to a multitude of target technologies in a variety of formats, including JSON, Avro and Parquet. Kafka is an example of a streaming data technology HVR can write into directly, and HVR is a Confluent Hub verified source. HVR’s distributed architecture enables a scalable setup with optimized and secure network communication. Customers use HVR’s modular architecture predominantly for data replication into the cloud and for building data lakes, analytical databases and streaming data solutions, in some cases delivering data that was captured once into multiple destinations for different use cases.
CDC and Streaming Data in Action: How one airline leveraged both for real-time improvements
Air France HOP operates regional flights for Air France. A couple of years ago the organization faced a challenge: growth in the number of daily flights increased the likelihood of schedule changes, whilst also putting a heavier load on IT systems. In order to maintain IT systems’ performance for customer-facing applications without investing in major system upgrades, flight crew schedule updates were performed less and less frequently, down to only three times per day. This led to more uncertainty and frustration for crew members. To solve this challenge with a scalable and affordable architecture, Air France HOP implemented a solution using HVR to perform log-based CDC on the operational databases, push the changes into Kafka, and consume the events from the data stream. Since this system went live in 2018, Air France HOP has operated more efficiently: it books or changes any necessary hotel stays in a timely manner, adjusts training plans as needed, and keeps the flight crew up to date with their schedule changes in near real-time.
Want to learn more about streaming data from your traditional database systems? We invite you to check out our Kafka Data Integration page or contact us; a member of our team of data integration experts would be happy to share their insights on data streaming.
Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.