The term “hybrid cloud” has no single definition. When first used, it typically meant a combination of a private and a public cloud. Since then, the term has arguably taken on a broader meaning that includes (public) cloud offerings from multiple vendors. For this blog post, we follow Gartner’s definition of “hybrid cloud computing”: “hybrid cloud computing refers to policy-based and coordinated service provisioning, use, and management across a mixture of internal and external cloud services.”
Major cloud providers offer to run their cloud on-premises. Amazon calls this AWS Outposts, Microsoft has Azure Stack, Google offers Anthos, and Oracle, which has a large existing on-premises enterprise presence, continues to enhance its Cloud@Customer offering.
Why Hybrid Cloud?
There are multiple reasons to consider hybrid cloud computing:
— Security and data privacy concerns drive your organization to maintain an on-premises cloud service and store less sensitive information in a public cloud with the flexibility to scale up/down as needed.
— Your organization chooses multiple cloud vendors instead of one to avoid vendor lock-in.
— As a result of using Software as a Service (SaaS), you end up using multiple clouds.
— You need or want access to a technology service that is only available on a specific cloud.
Hybrid cloud computing introduces the need to integrate data between clouds, with two dominant schools of thought on how to achieve this:
— Data virtualization to combine the data upon request.
— Data replication to make a copy of the data.
In this blog post, I share the pros and cons of both data integration methods and discuss the technical considerations for the architecture.
Data Virtualization or Replication?
Data virtualization enables access to data without knowledge of how or where the data is stored. A hybrid cloud architecture lends itself well to serving data from different parts of the cloud, using what used to be referred to as a federated architecture to pull together data sets from different databases to perform analytical operations.
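As a minimal sketch of the federated approach, the snippet below simulates two separate cloud data sources with in-memory SQLite databases (the customers and invoices tables are hypothetical) and joins them in the “virtualization layer.” Note that every call re-reads both sources, which is exactly the load profile the questions below ask about.

```python
import sqlite3

# Hypothetical stand-ins for two remote cloud databases: each sqlite3
# connection plays the role of one data source behind the virtualization layer.
def make_source(ddl, insert_sql, rows):
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

crm = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
billing = make_source(
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL)",
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 120.0), (2, 75.5), (1, 30.0)],
)

def federated_revenue_by_customer():
    """Pull the minimal data from each source and combine it in the
    virtualization layer; both sources are re-queried on every call."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = {}
    for cust_id, amount in billing.execute(
        "SELECT customer_id, amount FROM invoices"
    ):
        totals[cust_id] = totals.get(cust_id, 0.0) + amount
    return {names[c]: t for c, t in totals.items()}

print(federated_revenue_by_customer())  # {'Acme': 150.0, 'Globex': 75.5}
```

The join and aggregation happen outside the databases, so the sources only see simple extract queries; with large tables, that extract cost is paid again for every query.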
Whether data virtualization could work for your organization depends on several factors:
— What are the data volumes involved?
— How intelligent is the data virtualization layer at avoiding the extraction of large data sets for every query?
— How much surplus capacity does your data source system have to allow (ad-hoc) data retrieval?
— To what extent is data from different sources combined in a query, and what are performance requirements for response times?
Related to these considerations: how many applications or users take advantage of the data virtualization at any one time? What load do they generate across the various systems, and what technology and infrastructure are needed to host the data virtualization layer?
Data consolidation is an alternative to data federation. Data consolidation requires data to be replicated. Access to the data no longer results in load on the data source once a copy of it resides in a separate data store. Also, a lot of the (heavy) processing of combining data sets, filtering data, and computing aggregate information can be pushed down into the data platform instead of the virtualization layer. In addition to taking a full copy of the data, there are multiple ways to perform change data capture (CDC), keeping the target in sync in near real-time with more or less impact on the data sources.
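A minimal CDC sketch, assuming the source table carries a monotonically increasing version column (real CDC tools typically read the database transaction log instead): each sync copies only the rows changed since the last high-water mark, rather than re-extracting the full table.

```python
import sqlite3

# Illustrative source and target; the orders schema is assumed for the example.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"
    )

high_water_mark = 0

def sync():
    """Replicate only rows whose version exceeds the last high-water mark."""
    global high_water_mark
    changed = source.execute(
        "SELECT id, status, version FROM orders WHERE version > ?",
        (high_water_mark,),
    ).fetchall()
    for row in changed:
        # Upsert keeps the target in step with both inserts and updates.
        target.execute(
            "INSERT INTO orders VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET status=excluded.status, "
            "version=excluded.version",
            row,
        )
        high_water_mark = max(high_water_mark, row[2])
    return len(changed)

source.execute("INSERT INTO orders VALUES (1, 'new', 1), (2, 'new', 2)")
sync()  # copies 2 rows
source.execute("UPDATE orders SET status='paid', version=3 WHERE id=1")
sync()  # copies only the 1 changed row
```

Deletes are not captured by this polling approach; that is one reason log-based CDC is preferred when the source permits it.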
In our customer base, we see that organizations with high volumes and complex workloads tend to choose data consolidation on a single data platform like Snowflake, S3, ADLS, etc.
A hybrid cloud computing architecture introduces Wide Area Network (WAN) communication between the different clouds. Even though the available bandwidth on this WAN is generally high, speed and responsiveness (latency) don’t quite match Local Area Network (LAN) connectivity. How do you get the most efficiency out of your network?
1. Move only the data you need. For a data virtualization environment, pass the minimal data to satisfy a query, knowing it will be retrieved for every query. For a replicated data set, favor change data capture (CDC) over any approach that repeatedly performs full data extracts. Also, filter rows and project columns (eliminate fields) that are not required.
2. Use data compression. On top of moving only the data you need, you can effectively “magnify” the available bandwidth by compressing the data before it moves: a 10x compression ratio results in 10x less bandwidth required to transfer the same volume of data.
3. Bundle data that goes over the wire to maximize bandwidth despite higher latency. A WAN connection introduces extra latency compared to a LAN connection. Sending larger bundles lowers sensitivity to this higher latency (waiting for the acknowledgment that data was correctly received) while still achieving good throughput.
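Tips 2 and 3 can be sketched together. The snippet below uses Python’s standard json and zlib modules to stand in for a real transport; the row data and batch size are illustrative only. Rows are bundled into larger payloads and each bundle is compressed before it would cross the WAN.

```python
import json
import zlib

# Illustrative change rows; repetitive structured data compresses well.
rows = [{"id": i, "status": "shipped", "region": "eu-west"} for i in range(1000)]

def bundle(rows, batch_size=500):
    """Group rows into larger payloads (tip 3) and compress each one (tip 2).

    Yields (compressed_blob, raw_payload_size) per bundle.
    """
    for i in range(0, len(rows), batch_size):
        payload = json.dumps(rows[i:i + batch_size]).encode()
        yield zlib.compress(payload), len(payload)

raw_bytes = compressed_bytes = 0
for blob, raw_len in bundle(rows):
    raw_bytes += raw_len
    compressed_bytes += len(blob)

print(f"{raw_bytes} raw bytes -> {compressed_bytes} compressed bytes "
      f"({raw_bytes / compressed_bytes:.1f}x less to move over the WAN)")
```

Two bundles of 500 rows mean two round-trip acknowledgments instead of a thousand, and the compressed blobs are a fraction of the raw payload size.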
Data security is top of mind for most organizations, primarily due to the reputational damage of a data breach. Ensure your hybrid cloud architecture implements sound security best practices, especially with data flowing into and out of public cloud infrastructure.
“There are only two types of companies—those that know they’ve been compromised, and those that don’t know. If you have anything that may be valuable to a competitor, you will be targeted, and almost certainly compromised.” – Dmitri Alperovitch, McAfee Vice President of Threat Research
— Use encryption, both in-flight and at rest. Encrypt your data using a robust industry-standard algorithm such as AES256. Use unique (securely stored) keys and certificates so that if your organization is compromised, the perpetrator(s) will have a hard time making sense of the data.
— Lock down firewalls. Make sure to lock down the firewall as much as possible to lower the likelihood of getting compromised, especially on primary data processing systems. Consider using a proxy that is both the gateway to a data endpoint and the gatekeeper to prevent unauthorized access.
— Use secure and strong authentication to reduce the likelihood of data getting compromised.
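As one illustration of strong authentication between clouds, the sketch below signs each message with a shared secret using HMAC-SHA256 (Python standard library only), so the receiving endpoint can reject anything it cannot verify. The secret is generated inline purely for demonstration; in practice it would live in a secrets manager.

```python
import hashlib
import hmac
import os

# Demo-only shared secret; a real deployment would load this from a
# secrets manager, never generate it inline.
SECRET = os.urandom(32)

def sign(message: bytes) -> str:
    """Return the HMAC-SHA256 tag for a message under the shared secret."""
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def verify(message: bytes, signature: str) -> bool:
    """Check a tag in constant time; compare_digest avoids timing leaks."""
    return hmac.compare_digest(sign(message), signature)

msg = b'{"table": "orders", "op": "insert"}'
tag = sign(msg)
assert verify(msg, tag)                 # authentic message passes
assert not verify(b"tampered payload", tag)  # altered message is rejected
```

A tampered payload, or a payload signed with the wrong secret, fails verification, which is the property that keeps unauthenticated writes out of the replication endpoint.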
The last consideration is data accuracy. This should not be a concern in a data virtualization architecture, since queries access the data sources directly. However, data accuracy is an essential consideration for a replicated data set. How can you be confident that the consolidated target data set is a correct representation of the data source? Do you have access to a data validation solution that routinely checks data values (beyond comparing just row counts)? Is there a strategy to validate the data against data sources that are active 24×7?
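One way to check values beyond row counts is to hash every row in a canonical order on both sides and compare the digests. The sketch below uses illustrative SQLite tables; the orders schema and key column are assumptions for the example.

```python
import hashlib
import sqlite3

def table_digest(conn, table, key):
    """Hash all rows of a table in key order, so two copies of the same
    data produce the same digest regardless of physical row order."""
    h = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        h.update(repr(row).encode())
    return h.hexdigest()

# Source and its replicated target start out identical.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
for conn in (src, tgt):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

assert table_digest(src, "orders", "id") == table_digest(tgt, "orders", "id")

# A single drifted value is caught even though row counts still match.
tgt.execute("UPDATE orders SET amount = 19.99 WHERE id = 2")
assert table_digest(src, "orders", "id") != table_digest(tgt, "orders", "id")
```

Against sources that are active 24×7, a practical variant compares digests per key range as of a consistent snapshot or quiesce point, so in-flight changes are not reported as mismatches.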
Will data virtualization or replication work better for you in a hybrid cloud architecture? The workload on your systems will help determine which approach works best. Hopefully, this blog’s considerations can help you decide what technologies you need to support your business requirements.
HVR’s sole focus is to provide heterogeneous data replication technology for complex environments. To learn more about how HVR addresses the considerations for hybrid cloud architecture, please visit our resources page.