Real-Time Analytics Implementations
Alternative Architectural Approaches to Implementing Real-Time Analytics
In other posts, I’ve mentioned several architectural approaches that can be used to access and analyze near real-time operational data. Because this topic comes up frequently with customers who are in the early stages of planning a real-time analytics deployment, I decided to dedicate a post to explaining how each approach works and exploring its advantages and disadvantages.
Before assessing the relative merits of any given architectural approach, we always advise clients to first consider the unique characteristics of their specific environment. Answering the following questions can be helpful in determining which approach best fits your project:
- How many different systems is your operational data stored in?
- What is the structure of the operational data and how is it stored? Is it mainly structured data stored in relational databases? Or does it include unstructured and semi-structured data (e.g., web logs, machine data) stored in distributed file systems such as Hadoop's HDFS?
- How “mission critical” is each system? In other words, how tolerant can the organization afford to be if there is a temporary outage or degradation in transactional response times?
- Who will be querying real-time data, and how often will it be needed? Will a business analyst run a query once or twice a week, or will the data be continuously queried by multiple users at all hours of the day?
- How complex will the queries be? Will they be simple filters against a single database, or will they entail multi-pass SQL statements, extensive joins, and complex calculations?
As you consider the following architectural approaches to implementing real-time analytics, the relevance of these questions will become evident.
Query federation (aka “virtual data warehouse”) – As the name implies, this approach involves federating queries across multiple source systems by means of a logical data model. The concept is appealing because it eliminates the need to physically move and integrate data. Unfortunately, the promise of the virtual data warehouse, a concept that has been around since the 1990s, has not been realized on a widespread basis. Organizations that attempt this approach usually discover that query performance is unacceptably slow. Worse, queries can hang if one of the sources is unavailable or if the joins are too complex. At the same time, the query load placed on transactional systems often leads to unacceptable performance degradation, and even outages. Furthermore, because source systems are often modified or replaced, ongoing maintenance of the logical model can be extremely difficult and expensive.
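To make the mechanics concrete, here is a minimal Python sketch of a federated query. It is an illustration under stated assumptions, not how any particular federation product works: the two source systems (a CRM and an order-entry system) are simulated with in-memory SQLite databases, and the table names and the `revenue_by_customer` function are hypothetical.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one operational system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Two hypothetical, physically separate source systems.
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
orders = make_source(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
)


def revenue_by_customer(crm_conn, orders_conn):
    """Federated query: read from both live sources, then join in the middle tier."""
    # Each call queries the operational systems directly; nothing is copied ahead of time.
    names = dict(crm_conn.execute("SELECT id, name FROM customers"))
    totals = {}
    for customer_id, amount in orders_conn.execute(
        "SELECT customer_id, amount FROM orders"
    ):
        name = names.get(customer_id, "unknown")
        totals[name] = totals.get(name, 0.0) + amount
    return totals


result = revenue_by_customer(crm, orders)
print(result)  # {'Acme': 350.0, 'Globex': 75.0}
```

Even in this toy version, every analytic query runs against both operational systems, the join happens outside either database, and the query cannot complete if one of the connections is down. Those are the same weaknesses described above, only at a smaller scale.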
Event and trigger-based replication – This approach entails continuously updating a data warehouse or operational data store in real time. Triggers are defined in the source system databases and fire on events such as the creation, update, or deletion of records; each firing initiates a replication action. This approach comes with a major drawback. All trigger-based systems, whether they use native database trigger services or third-party tools, impose significant overhead that slows down transactional database processing. Third-party tools, which are needed for heterogeneous trigger-based replication, add a further layer of complexity and performance degradation to transactional systems.
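The following Python sketch shows the general shape of trigger-based replication, using SQLite triggers from the standard library. It is a simplified illustration, not a description of any specific replication tool: the `order_changes` table, the trigger names, and the `replicate` function are all assumptions made for the example.

```python
import sqlite3

# Source (transactional) database with triggers that record every change.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
-- Hypothetical change table that the triggers write into.
CREATE TABLE order_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, amount REAL
);
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO order_changes (op, id, amount) VALUES ('I', NEW.id, NEW.amount);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO order_changes (op, id, amount) VALUES ('U', NEW.id, NEW.amount);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO order_changes (op, id, amount) VALUES ('D', OLD.id, NULL);
END;
""")

# Target store (stand-in for an ODS or data warehouse table).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")


def replicate(src, dst, last_seq=0):
    """Apply change rows newer than last_seq to the target; return the new high-water mark."""
    rows = src.execute(
        "SELECT seq, op, id, amount FROM order_changes WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, row_id, amount in rows:
        if op == "D":
            dst.execute("DELETE FROM orders WHERE id = ?", (row_id,))
        else:
            dst.execute(
                "INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)",
                (row_id, amount),
            )
        last_seq = seq
    dst.commit()
    return last_seq


# Four ordinary transactional writes on the source.
source.execute("INSERT INTO orders VALUES (1, 100.0)")
source.execute("INSERT INTO orders VALUES (2, 40.0)")
source.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")
source.execute("DELETE FROM orders WHERE id = 2")
source.commit()

high_water = replicate(source, target)
replicated = target.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
change_rows = source.execute("SELECT COUNT(*) FROM order_changes").fetchone()[0]
print(replicated, change_rows)  # [(1, 120.0)] 4
```

Note that the four application writes produced four additional writes to `order_changes` inside the source database itself. That extra work on the transactional system is the overhead this section describes.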
Log-based change data capture (CDC) – This real-time data replication approach provides a way of delivering real-time analytics without the performance problems and complexity issues associated with the other approaches. Rather than relying on intrusive trigger services that must run continuously on top of each transactional source system, log-based replication tools read the standard transaction logs that all common relational databases generate natively. The impact on transactional systems is negligible because the changed data is captured from the log files, not from the production tables, and then moved to an operational data store (ODS) or data warehouse (DW) for query and analysis. If needed, these ODS and DW data stores can be modeled and tuned using dimensional schemas, materialized views, and other techniques to ensure the fastest possible response times for even the most demanding queries.
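As a rough illustration of the idea, the Python sketch below models a database that appends every change to a transaction log and a separate capture process that tails that log. It is a toy model only: real log formats are vendor-specific, and the `LogRecord`, `SourceDatabase`, and `LogReader` names are hypothetical, not part of any product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    lsn: int                 # log sequence number
    op: str                  # "insert", "update", or "delete"
    key: int
    value: Optional[float]


class SourceDatabase:
    """Toy database that logs every change, as real engines already do for recovery."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def _append(self, op, key, value):
        self.log.append(LogRecord(len(self.log) + 1, op, key, value))

    def insert(self, key, value):
        self.rows[key] = value
        self._append("insert", key, value)

    def update(self, key, value):
        self.rows[key] = value
        self._append("update", key, value)

    def delete(self, key):
        del self.rows[key]
        self._append("delete", key, None)


class LogReader:
    """Capture process: tails the log and applies new records to a target store."""

    def __init__(self, log):
        self.log = log
        self.position = 0    # number of log records already applied

    def poll(self, target):
        new_records = self.log[self.position:]
        for rec in new_records:
            if rec.op == "delete":
                target.pop(rec.key, None)
            else:
                target[rec.key] = rec.value
        self.position += len(new_records)
        return len(new_records)


db = SourceDatabase()
db.insert(1, 100.0)
db.insert(2, 40.0)
db.update(1, 120.0)
db.delete(2)

ods = {}                     # stand-in for the operational data store
reader = LogReader(db.log)
first_batch = reader.poll(ods)

db.insert(3, 9.5)
second_batch = reader.poll(ods)
print(first_batch, second_batch, ods)  # 4 1 {1: 120.0, 3: 9.5}
```

The point of the sketch is the separation of duties: the source does nothing beyond the logging it already performs, and the reader tracks its own position so each poll picks up only the records it has not yet applied.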
As I noted in another post, we believe log-based CDC is superior for the vast majority of real-time analytic environments. It is easy to implement and maintain, is reliable, minimizes network traffic, and is completely non-invasive to source and target systems. If you’re interested in learning about real-world customer use cases, check out our customer stories.