The Major Types of Data Integration (& When To Use Them)
Data is currency. Clean, prepared, useful data represents value, just like cold cash (if not more so). Poorly sourced, irresponsibly stored, or corrupted data is as worthless as counterfeit bills (and just as risky). Data integration, when properly executed, is a value-adding process that is “integral” to giving an organization the competitive edge today’s business climate demands. The quotes are not there for the pun: the word integral means, first and foremost, “necessary to make whole and complete.” Implementing cohesive data integration techniques is as essential to any enterprise as having its individual components: the data itself, a place to store the data, and a device with which to observe it all.
The importance of proper integration can be thought of in terms of walking into your kitchen after work: there’s a fridge full of groceries (ready to be extracted), a pantry full of spices and a stove (“tools” ready to transform and then “load” dinner onto the table). Without a plan, an executable strategy, nothing happens. This simplified analogy of the ETL process — one of the main ways we integrate raw data to produce useful visualization and insight via BI tools — is nonetheless accurate: if a coordinated, effective method is not in place, optimization is not possible.
Unlike our kitchen example, a successful data integration strategy is not necessarily visible, though its results surely will be. No, Domino’s does not deliver streamlined business data in 30 minutes or less (though they reportedly use an impressive 85,000 data points on those nights when you don’t feel like cooking). The good news is that there is quite a bit of power available at our fingertips. Let’s talk about how we get there.
First, some basic requirements we should all expect from how our data moves from Point A to Point B:
- We want it secure
- We want it moved efficiently, and quickly
- We want our data to be accurate on the other end
In short, we want data platforms in which monitoring, reporting, and validating our data are all prioritized, platforms that embrace a flexible, agile architecture capable of scaling as our company grows.
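As a rough illustration of the "accurate on the other end" requirement, here is a minimal sketch that compares a row count and a checksum between a source table and its copy. The `orders` table, the `table_fingerprint` helper, and the use of in-memory SQLite databases are all assumptions made for the example, not part of any particular platform.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return (row_count, checksum) for a table, reading rows in a stable order."""
    # NOTE: `table` is interpolated directly, so pass only trusted table names.
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()


def make_db(rows):
    """Build a throwaway in-memory database holding a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 19.99), (2, 5.00), (3, 42.50)])
good_target = make_db([(1, 19.99), (2, 5.00), (3, 42.50)])
bad_target = make_db([(1, 19.99), (2, 5.00)])  # one row lost in transit

# A matching fingerprint suggests the copy is complete; a mismatch flags a problem.
print(table_fingerprint(source, "orders") == table_fingerprint(good_target, "orders"))
print(table_fingerprint(source, "orders") == table_fingerprint(bad_target, "orders"))
```

A check like this only says that something differs, not what. In practice it would likely be paired with per-row or per-partition comparisons to locate the discrepancy.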
Let’s then tailor generic questions to the specific needs of an enterprise:
- What data jobs are essential? These are the tasks that are routinely addressed and continuously monitored: the demands you already expect your current data integration techniques to meet.
- What data jobs are desired, but have thus far not been possible? These are business needs you know deserve attention but, for one reason or another, have yet to address.
Fairly straightforward concepts to grasp, as is our unified goal: The ability to access data in a way that provides end users a 360-degree view to derive insights and optimize workflow. Let’s now proceed to method.
Integration Parameters: Accessibility, Applicability, and Speed to Insight
Data integration is nothing new in business, but relative speed and scale continuously improve. If you’re old enough to remember (or have at least heard of) fax machines, you can appreciate how quickly “cutting-edge” tech for business can become sub-optimal, if not obsolete. Likewise, the old-fashioned export/import method (attaching a file, sending it from one entity to another, and configuring fields as needed so it is legible) is worth mentioning because that is data integration, and we all still operate with email attachments. Here, though, we’ll focus on how larger data volumes are moved, prepared, and disseminated.
Below are examples of the major types of data integration, along with advantages and disadvantages to consider when deciding on implementing them. The three we are highlighting here are ETL (Extract, Transform, Load), Data Virtualization, and Data Replication. This is meant to serve as a foundational overview to grasp their basic differences; a more comprehensive understanding of each is advised when choosing the best type(s) depending on your specific data integration needs.
The Extract, Transform, Load process describes the infrastructure built to pull the raw data required for analysis (extract), convert and prepare it in some useful form for a business need (transform), and deliver it to a data warehouse (load). The ETL process a company chooses proves its efficacy by competently tackling business needs while being structured not to take on the unnecessary. After all, an optimized data integration technique (moving data from multiple locales and/or disparate systems to an organized database, rendered clean and delivered ready for insights) reinforces the goal of data democratization, encouraging independent, self-service data visualization and analysis for everyone across an organization’s ecosystem.
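The three ETL stages described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the CSV text, the `orders` table, and the `extract`, `transform`, and `load` function names are all hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, including messy and invalid values.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,not-a-number
"""


def extract(text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, convert amounts, and drop unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: write the prepared rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(warehouse, transform(extract(RAW_CSV)))
print(warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping the three stages as separate functions mirrors the way the process is described: each stage can be tested, monitored, or swapped out on its own.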
A legacy or traditional form of implementing ETL is to process the data in batches (say, once every 24 hours). With batch processing, data is accumulated, and collection stops at some point so the data can be “forklifted” over to your data warehouse. A more modern approach is to have the ETL pipeline perform stream processing of data. This suits cases where real-time analytics are crucial, latency is significant, and extracting and transforming large batches of data in bulk may be counterproductive for business needs. Stream processing moves data at speed, and should be viewed as preferable, perhaps mandatory, when the insight and value available from the data decrease the longer it takes to reach it. That said, stream processing typically means adopting some form of change data capture (CDC), which identifies the data that has changed. Those changes then serve as a basis to synchronize another system with the same incremental changes, or to store an audit trail capable of tracking changes.
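To make the idea of moving only what has changed more concrete, here is a minimal sketch of one simple, query-based flavor of change data capture. It assumes each source row carries a `version` number that increases on every change; the `orders` table and the `sync_changes` helper are hypothetical names for illustration.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(SCHEMA)
target.execute(SCHEMA)


def sync_changes(src, dst, last_version):
    """Copy only rows changed since last_version; return the new high-water mark."""
    changed = src.execute(
        "SELECT id, status, version FROM orders WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    # Upsert so that both new and updated rows land correctly in the target.
    dst.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", changed)
    dst.commit()
    return changed[-1][2] if changed else last_version


source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 1), (2, "new", 2)]
)
mark = sync_changes(source, target, 0)      # first pass moves both rows

source.execute("UPDATE orders SET status = 'shipped', version = 3 WHERE id = 2")
mark = sync_changes(source, target, mark)   # second pass moves only the changed row

print(target.execute("SELECT id, status FROM orders ORDER BY id").fetchall())
```

This version-column approach cannot see deleted rows and still queries the source on every pass, which is one reason log-based CDC is often preferred where it is available.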
The benefits of ETL are very straightforward:
- Ease of use
- Easy to understand
- Easy to move large volumes of data
- There are plenty of ETL tools out there, and a lot of expertise to assist in using them
Disadvantages of ETL-only include:
- Latency (this must be considered in scenarios where the longer we wait for data the less valuable it may become)
- A high, recurring load on the source system
- Potential difficulty in identifying certain operations, which may become highly resource-intensive
- It is non-transactional. If there are records to update (or errors to fix), the entire ETL job must be restarted
An apt example of both in context might be the allocation of data jobs at a hospital. Updating monthly billing records or quarterly reporting is necessary batch processing. Integrating relevant data in real time (patient histories, current meds, any potential allergic reactions, etc.) could literally affect patient well-being, and is a clear-cut case where latency is truly the enemy, potentially compromising optimal care. And in a macro sense, streamlined data integration can only improve cohesion between the often disparate clinical and business sides of many healthcare operations.
Data virtualization effectively combines data from disparate sources, assembling it into digestible formats. It is more of an umbrella term, describing an approach to data management that integrates data virtually, allowing access to data without requiring technical details about where it resides or in what form. This form of data integration has grown from the cornerstone of data federation: technologies that allow two or more databases to appear as one central data store or repository, with the data remaining in its primary source until required for specific downstream needs. With federated data, the consuming application is able to query both the data held in the central store and remotely linked data, all from a single connection. Data virtualization has evolved from this core principle, producing a logical data layer that integrates enterprise data siloed across disparate systems and presents it to business users in real time. It is easy to confuse with virtualized data storage at first, so the distinction is important: data virtualization is virtualized access, not virtualized storage.
Benefits of this form of data integration include the efficient acceleration of data mining, and less expense involved than replicating and transforming data to various locations, perhaps requiring different formats. Think of it as inserting a layer of middleware, allowing data from different models to be integrated virtually, so data sources can connect with data end-users.
For example, picture the virtual oversight of inventory in supply chain management, or the ease with which a call center might increase customer satisfaction by quickly and efficiently accessing relevant, single-view data. By furnishing integrated views of data in-memory, rather than executing actual data movement, data virtualization provides a layer of abstraction above the physical implementation of data, simplifying querying logic.
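The call center case can be sketched as a thin access layer that assembles a single customer view at query time, leaving the data where it lives. Everything here is an assumption for illustration: the `crm` and `billing` databases, their tables, and the `customer_view` function are hypothetical stand-ins for separate production systems.

```python
import sqlite3

# Two independent "systems", each with its own storage.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 12.5)]
)


def customer_view(customer_id):
    """Build a single-customer view at query time; nothing is copied or stored."""
    name_row = crm.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if name_row is None:
        return None  # unknown customer
    total = billing.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]
    return {"id": customer_id, "name": name_row[0], "total_billed": total}


print(customer_view(1))
```

The caller never needs to know that two systems are involved. The trade-off is that every call queries both sources again.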
Traditional disadvantages of data virtualization include poor performance for large-volume data access, and the load put on the data source every time a query is submitted. Efficiency is also compromised when the same transformations on the data are processed over and over again.
Data replication refers to the technology involved in creating copies of database objects and affiliated data. A significant benefit is the ability to provide log-based change data capture to replicate data changes in real-time, keeping the transactional integrity to support big data integration and consolidation. By storing the same data in multiple locations, data availability, accessibility, and reliability all improve. Real-time analytics may be performed with less complexity.
A few obvious, intuitive benefits of data replication:
- Increased, large-scale availability, allowing multiple teams access to useful data streams as needed.
- Decreased latency in accessing important data that is fundamentally tied to keeping a competitive edge in business (read: increased performance)
- Seamless data sharing and synchronization, facilitating not just visualization for team members in various locations, but active problem-solving as well.
- Backup and targeting! Replicating data from a single source to multiple targets or from multiple sources to a single target, can only be a good thing.
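As a rough sketch of replicating from a single source to multiple targets, the example below replays an ordered change log against two replicas. The `CHANGE_LOG` entries and the `Replica` class are hypothetical; they stand in for what a log-based CDC reader might emit and for real target databases.

```python
# Hypothetical change log: ordered (sequence, operation, key, value) entries,
# standing in for what a log-based CDC reader would emit from a database log.
CHANGE_LOG = [
    (1, "insert", "sku-1", 10),
    (2, "insert", "sku-2", 4),
    (3, "update", "sku-1", 7),
    (4, "delete", "sku-2", None),
]


class Replica:
    """A toy replication target that applies changes in sequence order."""

    def __init__(self):
        self.rows = {}
        self.applied_seq = 0  # last change sequence number applied

    def apply(self, log):
        for seq, op, key, value in log:
            if seq <= self.applied_seq:
                continue  # already applied, so replaying the log is safe
            if op == "delete":
                self.rows.pop(key, None)
            else:  # insert and update both set the current value
                self.rows[key] = value
            self.applied_seq = seq


reporting, backup = Replica(), Replica()
reporting.apply(CHANGE_LOG)
backup.apply(CHANGE_LOG[:2])  # the backup target lags behind...
backup.apply(CHANGE_LOG)      # ...then catches up by replaying the log

print(reporting.rows == backup.rows)
```

Tracking the last applied sequence number per target is what lets a lagging or restarted replica resume without reapplying earlier changes.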
Let’s close with a typical business scenario describing why one would gravitate toward using a tool and/or outside vendor for their data integration strategy: Your organization wants the most robust data pipeline possible. You’re determined to create an environment that allows for autonomy and ownership when it comes to data visualization and implementation, and you conclude that getting there means incorporating specialized tools from trusted vendors, tools tailored to your specific business needs. Finally, you have neither the time, the desire, nor the expertise to build your own solution.
At HVR, we understand the best integration practice is related to performance. Our data integration strategies aim to provide just that: a broad solution capable of servicing the needs of hybrid cloud (on-premise to cloud), intra-cloud (multiple databases to the same cloud), or inter-cloud (data transfers between two different clouds). As always, the emphasis is on robust growth and ease of use, and to constantly — and reliably — enable continuous data integration.