What is Data Streaming?
A Data Architect’s Guide
In the past decade, there has been an unprecedented proliferation of big data and analytics. The term big data has been used so loosely, in so many different scenarios, that it’s fair to say big data is really whatever you want it to be; it’s just…big. It is typically described as structured and unstructured data that originates from multiple applications and consists of both historical and real-time information. The three V’s (volume, velocity, and variety) came into the vernacular as a way to define it. I’d like to add another V for ‘value’. Data has to be valuable to the business, and to realize that value, data needs to be integrated, cleansed, analyzed, and queried.
Inexpensive storage, public cloud adoption, and innovative data integration technologies together can be the perfect fire triangle when it comes to deploying data lakes, data ponds, and data dumps, each supporting a specific use case. Over the past five years, innovation in streaming technologies has become the oxidizer of the big data forest fire.
Data streaming is one of the key technologies deployed in the quest to yield the potential value from big data. This blog post provides an overview of data streaming, its benefits, uses, and challenges, as well as the basics of data streaming architecture and tools.
Big Data is defined by the three V’s – Volume, Velocity, and Variety:
Volume – Data is being generated in larger quantities by an ever-growing array of sources including social media and e-commerce sites, mobile apps, as well as IoT connected sensors and devices. Businesses and organizations are finding new ways to leverage big data to their advantage, but also face the challenge of processing this vast amount of new data to extract precisely the information they need.
Velocity – Thanks to advanced WAN and wireless network technology, large volumes of data can now be moved from source to destination at unprecedented speed. Organizations with the technology to rapidly process and analyze this data as it arrives can gain a competitive advantage in their ability to make informed decisions quickly.
Variety – Big Data comes in many different formats, including structured financial transaction data, unstructured text strings, simple numeric sensor readings, as well as audio and video streams. While organizations have hardly scratched the surface of the potential value that this data presents, they face the challenge of parsing and integrating these varied formats to produce a coherent stream of data.
Extracting the potential value from big data requires technology that is capable of capturing large, fast-moving streams of diverse data and processing that data into a format that can be rapidly digested and analyzed.
Value – As noted above, HVR would add a fourth V for ‘value’. Data has to be valuable to the business, and to realize that value, data needs to be integrated, cleansed, analyzed, and queried.
What Is Data Streaming?
Data streaming is the process of transmitting, ingesting, and processing data continuously rather than in batches. Data streaming is a key capability for organizations that want to generate analytic results in real-time. The value in streamed data lies in the ability to process and analyze it as it arrives.
Streaming vs Batch Processing
To better understand data streaming, it is useful to compare it to traditional batch processing. In batch processing, data is collected over time and stored, often in a persistent repository such as a database or data warehouse. The data can then be accessed and analyzed at any time.
As an example of batch processing, consider a retail store that captures transaction data from its point-of-sale terminals throughout each day. This data is stored in a relational database. The data is gathered during a limited period of time, the store’s business hours. The data is cumulatively gathered so that varied and complex analysis can be performed over daily, weekly, monthly, quarterly, and yearly timeframes to determine store sales performance, calculate sales commissions, or analyze the movement of inventory.
While batch processing is an efficient way to handle large volumes of data where the value of analysis is not immediately time-sensitive, it is not suited to processing data that has a very brief window of value – minutes or even seconds from the instant it is generated.
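The difference between the two models can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the function names and the sample sales figures are assumptions made up for this example. The batch function produces one answer after all data has been collected, while the streaming function produces an updated answer after every event.

```python
from typing import Iterable, Iterator, List

def batch_total(sales: List[float]) -> float:
    """Batch style: wait until all records are collected, then compute once."""
    return sum(sales)

def streaming_totals(sales: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each record arrives."""
    running = 0.0
    for amount in sales:
        running += amount
        yield running  # an up-to-date total after every event

# Hypothetical point-of-sale amounts for one day.
sales = [20.0, 5.0, 12.5]
print(batch_total(sales))             # 37.5
print(list(streaming_totals(sales)))  # [20.0, 25.0, 37.5]
```

In the streaming version the input could just as well be an unbounded generator, since no step requires the full data set to be present.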
As an example of data streaming, consider a cybersecurity team at a large financial institution that continuously monitors the company’s network to detect potential data breaches and fraudulent transactions. To do this, the team must monitor and analyze multiple streams of data, including internal server and network activity as well as external customer transactions at branch locations, ATMs, point-of-sale terminals, and e-commerce sites.
With millions of customers and thousands of employees at locations around the world, the numerous streams of data generated by this activity are massive, diverse, and fast moving. Data streaming technology is used to continuously process and analyze this data as it is received, to identify suspicious patterns and take immediate action to stop potential threats.
Data Streaming Benefits
Data that is generated in never-ending streams does not lend itself to batch processing where data collection must be stopped to manipulate and analyze the data. The ability to focus on any segment of a data stream at any level is lost when it is broken into batches. In contrast, data streaming is ideally suited to inspecting and identifying patterns over rolling time windows.
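A rolling time window can be sketched as follows. This is one simple way to do it, assuming events arrive in timestamp order and that timestamps are plain numbers in seconds; the class name `SlidingWindowCounter` is hypothetical and not taken from any streaming product.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # oldest event on the left

    def add(self, timestamp: float) -> int:
        """Record an event and return how many events are in the current window."""
        self._timestamps.append(timestamp)
        # Evict events that have slid out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

counter = SlidingWindowCounter(window_seconds=60)
for t in [0, 10, 30, 65, 100]:
    print(t, counter.add(t))
```

Because old events are evicted as new ones arrive, memory use depends on the window size rather than on the total length of the stream.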
Data streaming also allows for the processing of data volumes and types that would be impractical to store in a conventional data repository such as a relational database. Stream processing allows for the handling of data volumes that would overwhelm a typical batch processing system, sorting out and storing only the pieces of data that have longer-term value.
Data that is generated in a continuous flow is typically time-series data. It is generated and transmitted according to the chronological sequence of the activity that it represents. Stream processing is a natural fit for handling and analyzing time-series data.
Data Streaming Applications
The following scenarios illustrate how data streaming can be used to provide value to various organizations:
- An airline monitors data from various sensors installed in its aircraft fleet to identify small but abnormal changes in temperature, pressure, and output of various components. This allows the airline to detect early signs of defects, malfunctions, or wear so that they can provide timely maintenance.
- An investment firm streams stock market data in real-time and combines it with financial data from its various holdings to identify immediate opportunities and adjust its portfolios accordingly.
- A clothing retailer monitors shopping activity on its website and combines it with real-time data from mobile devices to send promotional discount offers to customers in its physical store locations based on each customer’s shopping history.
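As a rough illustration of the first scenario, the sketch below flags sensor readings that deviate sharply from recent history. The rolling z-score rule, the window size, the threshold, and the sample temperatures are all assumptions chosen for the example; a real monitoring system would use whatever detection method suits its sensors.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the mean of the previous `window` readings."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag values more than `threshold` standard deviations from the recent mean.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)  # the reading joins the window either way

# Hypothetical component temperatures with one abnormal spike.
temperatures = [70.0, 70.5, 69.5, 70.2, 69.8, 70.1, 85.0, 70.0]
print(list(detect_anomalies(temperatures)))  # [(6, 85.0)]
```

Note that in this simple version an anomalous reading still enters the window and temporarily widens the tolerance for the readings that follow it.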
Data Streaming Architecture
The fundamental components of a streaming data architecture are:
Data Source – Producer
The most essential requirement of stream processing is one or more sources of data, also known as producers. Producers are applications that communicate with the entities that generate the data and transmit it to the streaming message broker.
Many Web and cloud-based applications have the capability to act as producers, communicating directly with the message broker.
On-premises data required for streaming and real-time analytics is often written to relational databases, which do not have native data streaming capability. This data can be incorporated into a data streaming framework using a log-based Change Data Capture (CDC) solution, which acts as the producer by extracting changes from the source database and transferring them to the message broker.
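The idea of a CDC process acting as a producer can be sketched as below. This is a deliberately simplified model: real database transaction logs are binary and vendor-specific, so the `CHANGE_LOG` entries, the `lsn` position field, the `cdc.<table>` topic naming, and the `publish` callback are all hypothetical stand-ins rather than any product's actual format or API.

```python
import json

# Hypothetical, simplified change-log entries (assumed structure for illustration).
CHANGE_LOG = [
    {"lsn": 101, "op": "INSERT", "table": "orders", "row": {"id": 1, "total": 20.0}},
    {"lsn": 102, "op": "UPDATE", "table": "orders", "row": {"id": 1, "total": 25.0}},
    {"lsn": 103, "op": "DELETE", "table": "orders", "row": {"id": 1}},
]

def capture_changes(log, last_lsn, publish):
    """Publish every change recorded after `last_lsn` and return the new position."""
    for entry in log:
        if entry["lsn"] <= last_lsn:
            continue  # already delivered in an earlier run
        topic = f"cdc.{entry['table']}"
        publish(topic, json.dumps(entry, sort_keys=True))
        last_lsn = entry["lsn"]
    return last_lsn

# A list stands in for the message broker here.
published = []
position = capture_changes(CHANGE_LOG, last_lsn=101,
                           publish=lambda topic, msg: published.append((topic, msg)))
print(position, len(published))  # 103 2
```

Tracking the last delivered log position is what lets the capture process resume after a restart without re-sending changes it has already published.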
Message Broker
The message broker receives data from the producer, converts it into a standard message format, and then publishes the messages in continuous streams called topics. The message broker can also store data for a specified period. Apache Kafka and Amazon Kinesis Data Streams are two of the most commonly used message brokers for data streaming.
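To make the roles of topics and retention concrete, here is a toy in-memory broker. It is only a teaching sketch and does not reflect how Kafka or Kinesis are implemented or called; the class `InMemoryBroker` and its methods are invented for this example, and the `now` parameter exists only so the behavior can be demonstrated without waiting on a real clock.

```python
import time
from collections import defaultdict

class InMemoryBroker:
    """Toy broker: append-only topics, offset-based reads, time-based retention."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._topics = defaultdict(list)      # topic -> [(offset, timestamp, message)]
        self._next_offset = defaultdict(int)  # topic -> next offset to assign

    def publish(self, topic, message, now=None):
        """Append a message to a topic and return its offset."""
        now = time.time() if now is None else now
        offset = self._next_offset[topic]
        self._topics[topic].append((offset, now, message))
        self._next_offset[topic] += 1
        return offset

    def read(self, topic, from_offset=0):
        """Return (offset, message) pairs at or after `from_offset`."""
        return [(o, m) for o, _, m in self._topics[topic] if o >= from_offset]

    def expire(self, now=None):
        """Drop messages older than the retention period."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        for entries in self._topics.values():
            entries[:] = [e for e in entries if e[1] >= cutoff]

broker = InMemoryBroker(retention_seconds=60)
broker.publish("transactions", "a", now=0)
broker.publish("transactions", "b", now=50)
broker.expire(now=100)              # "a" is older than 60 seconds and is dropped
print(broker.read("transactions"))  # [(1, 'b')]
```

Offsets keep increasing even after old messages expire, so a consumer can remember where it stopped and resume from that position.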
Stream Processor
The stream processor receives data streams from one or more message brokers and applies user-defined queries to the data to prepare it for consumption and analysis. For example, a producer might generate log data in a raw, unstructured format that is not ideal for consumption and analysis. The message broker can pass this data to a stream processor, which can perform various operations on it, such as extracting the desired information elements and structuring them into a consumable format. Apache Storm and Spark Streaming are two of the most commonly used stream processors.
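The log example above might look like this in miniature. The log line layout (`ip method path status`), the regular expression, and the function names are assumptions made for illustration; this is plain Python, not the Storm or Spark Streaming API.

```python
import re

# Assumed raw format: "<ip> <METHOD> <path> <status>", e.g. "10.0.0.1 GET /home 200".
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)

def parse_logs(lines):
    """Turn raw log lines into structured records, skipping malformed lines."""
    for line in lines:
        match = LOG_PATTERN.fullmatch(line.strip())
        if match is None:
            continue
        record = match.groupdict()
        record["status"] = int(record["status"])
        yield record

def only_errors(records):
    """A simple user-defined query: keep server errors only."""
    return (r for r in records if r["status"] >= 500)

raw = ["10.0.0.1 GET /home 200", "garbage", "10.0.0.2 POST /pay 503"]
for record in only_errors(parse_logs(raw)):
    print(record)
```

Both stages are generators, so each record flows through the whole pipeline as soon as it arrives instead of waiting for the input to finish.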
Consumer
After the stream processor has prepared the data, it can be streamed to one or more consumer applications. Consumer applications may be automated decision engines that are programmed to take various actions or raise alerts when they identify specific conditions in the data. More commonly, streaming data is consumed by a data analytics engine or application, such as Amazon Kinesis Data Analytics, which allows users to query and analyze the data in real-time.
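An automated decision engine of the kind described above can be sketched as a set of rules applied to each incoming event. The event fields, rule names, and the 5000 threshold below are hypothetical values picked for the example, not recommendations.

```python
def run_consumer(events, rules):
    """Apply each (name, predicate) rule to each event and collect triggered alerts."""
    alerts = []
    for event in events:
        for name, predicate in rules:
            if predicate(event):
                alerts.append((name, event["id"]))
    return alerts

# Hypothetical alerting rules for a stream of customer transactions.
RULES = [
    ("large_withdrawal", lambda e: e["type"] == "withdrawal" and e["amount"] > 5000),
    ("foreign_location", lambda e: e["country"] != e["home_country"]),
]

events = [
    {"id": "t1", "type": "withdrawal", "amount": 9000, "country": "US", "home_country": "US"},
    {"id": "t2", "type": "purchase", "amount": 40, "country": "FR", "home_country": "US"},
    {"id": "t3", "type": "purchase", "amount": 15, "country": "US", "home_country": "US"},
]
print(run_consumer(events, RULES))
# [('large_withdrawal', 't1'), ('foreign_location', 't2')]
```

Keeping the rules as data separate from the loop makes it possible to add or change conditions without touching the code that consumes the stream.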