How important are “transactions”?
Streaming Data: A Definition
Amazon Web Services defines Streaming Data as “data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).” Internet of Things (IoT) data is an obvious example that fits this definition of Streaming Data. Fitness trackers, connected cars, and smartphones are consumer devices that generate streaming data, but large industrial equipment (aircraft engines, turbines, etc.) does as well.
Then at the beginning of the second paragraph, AWS continues the definition of Streaming Data: “This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling.” In other words, Streaming Data only ever produces new data points; it never modifies previously delivered ones. There is also an implicit assumption that Streaming Data has no real notion of an initial data set.
The idea of a stream of changes is in line with the fundamental concept of the relational database log, as Jay Kreps points out in his blog post explaining Kafka. However, most users don’t see their changes to records as an ongoing stream of events, but rather as modifications to the current state of a record.
Transactions as a Source for Streaming Data
At HVR we see many organizations use transactions against a traditional relational database (e.g. Oracle, SQL Server, PostgreSQL) as a source for Streaming Data. The stream consists of many row changes, each of which gets processed as a new data point, irrespective of whether the row change was an insert, update, or delete. An extra field typically holds the operation type to indicate the most recent operation. With that, a consumer of the stream of row changes can reconstruct the current state of the source system (should that be required, depending on the use case). Some use cases will require an initial data load.
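As a minimal sketch of this idea (not HVR’s actual record format), imagine each row change arriving as a record with a hypothetical `op` field holding the operation type; a consumer can fold the stream into the current state of the table:

```python
# Hypothetical change records: "op" is the operation type, "key" the
# primary key, "row" the row image after the change (None for deletes).
changes = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 2, "row": None},
]

state = {}  # primary key -> latest row image
for change in changes:
    if change["op"] == "delete":
        state.pop(change["key"], None)   # drop the deleted row
    else:                                # insert or update: keep latest image
        state[change["key"]] = change["row"]

print(state)  # {1: {'id': 1, 'status': 'shipped'}}
```

Because every change, including a delete, is just another data point in the stream, replaying the stream from the beginning (or from an initial load) always yields the current state.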
However, a major difference between the source transactional database and the data stream as a generic concept is that transaction boundaries exist in a relational database but not in a data stream (such as Kafka or Amazon Kinesis). Transactions may span multiple tables, and even though the existence of a row change indicates that the change was committed, it may be relevant to know which other row changes were part of the same transaction. Related to the transaction boundary is the transactionally consistent view of the data: a typical relational database shows either no changes for an in-flight transaction, or all of its changes after the commit took place. In a data stream, there is not necessarily such consistency.
Of course, some applications and use cases are more sensitive to the concept of transactions than others, and accordingly it is more or less important to include and highlight the concept of transactions in a data stream. Most Change Data Capture (CDC) technologies, including HVR, provide the ability to include a representation of the source transaction (e.g. the commit number) in an extra field, so a consumer knows which row changes were part of the same transaction. With that – and with some additional effort, depending on the streaming technology you use – you can still provide the same level of consistency in the Streaming Data application as in the source transactional database.
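A sketch of how that extra field helps, assuming each change record carries a hypothetical commit number (here called `scn`) and the stream is ordered by commit: a consumer can group consecutive records by commit number and apply each transaction as one unit, so that a downstream reader only ever observes transactionally consistent states.

```python
from itertools import groupby

# Hypothetical change records tagged with the source commit number ("scn").
# The first commit spans two tables; the second touches one.
stream = [
    {"scn": 100, "table": "orders",      "op": "insert", "key": 1},
    {"scn": 100, "table": "order_lines", "op": "insert", "key": 10},
    {"scn": 101, "table": "orders",      "op": "update", "key": 1},
]

# Group consecutive records with the same commit number into one batch.
batches = [list(txn) for _, txn in groupby(stream, key=lambda c: c["scn"])]

# Each batch can now be applied atomically to the target, restoring the
# transaction boundary that the raw stream does not carry by itself.
print(len(batches))       # 2 transactions
print(len(batches[0]))    # first transaction spans 2 row changes
```

This only works if the producer orders the stream by commit; when commits can interleave, the consumer instead has to buffer per commit number until it sees a signal that the transaction is complete.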
But there is more, and this is where we are focusing some of our engineering efforts. If the stream is interrupted or data is skipped, the application needs to know about it. And if a data set is replaced, the consumer should be made aware. Sharing metadata about the data is really what fully enables streaming use cases, and that is what our technology does.
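One simple form of such metadata, as a sketch: assume the producer attaches a hypothetical contiguous sequence number to every record. A consumer can then detect skipped data by looking for jumps in the sequence, rather than silently continuing with an incomplete stream:

```python
def find_gaps(seqs):
    """Return (expected, got) pairs wherever the sequence jumps.

    Assumes the producer emits contiguous, increasing sequence numbers;
    any jump means records were lost or skipped in between.
    """
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur))
    return gaps

print(find_gaps([1, 2, 3, 6, 7]))  # [(4, 6)] -> records 4 and 5 are missing
print(find_gaps([1, 2, 3]))        # [] -> stream is complete
```

On a gap, the application can pause, alert, or request a re-send instead of producing results from incomplete data.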
Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.