AWS Data Integration Tools: Three Best Practices to Consider
Early cloud adopters were often startups attracted to the flexibility of pay-as-you-go and to the on-demand scalability of the cloud. Data security, however, was one of the major concerns in the early days of cloud computing, so IT’s adoption of cloud technology was slow.
Today, IT embraces the cloud. Enterprises such as AT&T, GE, and Capital One have publicly announced their intentions to shift significant percentages of their workloads to the cloud as data security in the cloud has become less of a concern. AWS, the market leader per Gartner's Magic Quadrant for Cloud Infrastructure as a Service, is a major beneficiary of this trend. However, as with any technology platform, organizations cannot simply sign up for an account and reap immediate rewards without a careful approach. Organizations interested in leveraging the power of AWS should consider the following best practices when architecting data integration solutions.
AWS Data Integration Best Practice #1: Implement for Optimal Bandwidth and Latency
The first AWS best practice is related to performance. It assumes that the network is the limiting factor, especially for large data transfers such as a full data refresh (initial load). Two factors limit network data transfer rates: bandwidth and latency (round-trip time). Most people understand bandwidth limitations because internet connections are rated by available bandwidth.
Latency, however, can also limit network performance because of the acknowledgements sent across the wire. The extent to which latency limits data transfer rates depends on how the network protocol (generally TCP/IP) is used. Communication requires round trips to confirm that data was received correctly, and depending on (1) the frequency of those round trips, (2) the amount of data transferred between them, and (3) their duration, latency will sooner or later cap the transfer rate. To maximize performance, implement an architecture that takes advantage of:
- Data compression, so that fewer data blocks have to be transferred and every block contains more data
- Large block transfers, to further reduce the number of round trips
- Communication optimizations, e.g. sending sets of blocks and aggregating delivery confirmations into larger chunks
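The interplay of latency and window size, and the benefit of compression, can be illustrated with a short sketch. The window size and round-trip time below are illustrative assumptions, not measurements from any particular link:

```python
import zlib

# Illustrative numbers (assumptions): a 64 KiB TCP window and a 50 ms
# round trip. The sender can have at most one window in flight per round
# trip, so throughput is capped at window / RTT regardless of bandwidth.
window_bytes = 64 * 1024
rtt_seconds = 0.050
max_throughput = window_bytes / rtt_seconds  # bytes per second, ~1.3 MB/s

# Compression makes every block carry more payload, so fewer round trips
# are needed. Repetitive row data, typical of table extracts, compresses well.
rows = b"2024-01-01,order,42,shipped\n" * 10_000
compressed = zlib.compress(rows, level=6)
```

Raising the effective window (larger blocks, batched acknowledgements) or shrinking the bytes per row (compression) both raise the ceiling that latency imposes.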
AWS Data Integration Best Practice #2: Identify Information of Interest to Improve Efficiency
With data transfer a potential bottleneck, it is important to minimize the amount of data that must be transferred. To do this, use Change Data Capture (CDC) techniques rather than bulk extracts and subsequent data comparisons. Log-based, asynchronous CDC is widely considered superior to alternatives such as trigger-based capture because it does not touch the transaction itself, so the overhead on the transactional application is minimal, if noticeable at all. Log-based CDC can be further optimized by running in a distributed setup. Many use cases don't require every database change, and database transaction logs store extra data beyond table data changes in any case. From an efficiency perspective, it makes sense to identify the subset of information that is of interest close to where the transaction logs are written, and only then send the changes, compressed, across the network.
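A minimal sketch of that filtering step, using hypothetical log entries of the kind a log reader might emit (the entry shapes and table names here are invented for illustration, not any particular CDC product's format):

```python
import json
import zlib

# Hypothetical entries from a transaction log: besides row changes, the log
# also carries checkpoints and changes to tables we don't replicate.
log_entries = [
    {"op": "insert", "table": "orders", "row": {"id": 1, "total": 99.5}},
    {"op": "checkpoint"},
    {"op": "update", "table": "audit_log", "row": {"id": 7}},
    {"op": "delete", "table": "orders", "row": {"id": 1}},
]

TABLES_OF_INTEREST = {"orders"}

def filter_changes(entries):
    """Keep only row changes for tables of interest. Running this close to
    where the log is written means the rest never crosses the network."""
    return [
        e for e in entries
        if e["op"] in ("insert", "update", "delete")
        and e.get("table") in TABLES_OF_INTEREST
    ]

changes = filter_changes(log_entries)
# Compress the already-reduced change set before sending it across the wire.
payload = zlib.compress(json.dumps(changes).encode())
```

Filtering before compressing, and compressing before sending, applies both of the earlier best practices to the change stream.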
AWS Data Integration Best Practice #3: Consider Alternatives to Opening Firewalls for Data Security
One way organizations implement data security is to lock down firewalls as tightly as possible, limiting both the open ports and the network addresses that may get through, to reduce the likelihood that an outsider can gain access to their systems. Consequently, corporate IT is reluctant to open up firewalls into its network, so avoid requiring this if possible.
As an alternative to opening the firewall, consider these three options.
- First, initiate the communication from on-premises. Inside the cloud, use Virtual Private Cloud (VPC) IP addresses rather than external IP addresses to limit exposure.
- A second aspect of security is data encryption. Unless none of your data is sensitive, you cannot afford to leave it unencrypted. Use TLS/SSL (encrypted) communication, or only pass encrypted data around. AWS Key Management Service (KMS) is integrated with many AWS services and is also accessible through APIs to perform client-side encryption.
- A third important aspect of security is authentication. In the AWS cloud, IAM roles attached through instance profiles can automatically manage the rotation of credentials. Consider adopting this capability to simplify password management. External authentication can be hardened by trusting explicitly specified certificates rather than relying solely on whatever certificate the server presents during the handshake, as ordinary HTTPS calls do.
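As a small sketch of the encrypted-communication point, Python's standard `ssl` module shows what a hardened client configuration looks like. The CA file path in the comment is a hypothetical placeholder:

```python
import ssl

# Default client context: certificate validation and hostname checking
# are both enabled out of the box.
ctx = ssl.create_default_context()

# Refuse legacy protocol versions; TLS long ago superseded the original SSL.
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# To trust only an explicitly specified certificate authority (hypothetical
# file path), you could restrict the trust store to it alone:
# ctx.load_verify_locations(cafile="corp-ca.pem")
```

A context built this way would then be passed to the socket or HTTP client that carries the data transfer, so everything in flight is encrypted and the peer is verified.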
AWS Is a Powerful Tool — Follow These Best Practices to Leverage Its Potential
Cloud data integration applies to a variety of use cases: loading data from a variety of sources into an S3 data lake, migrating on-premises systems to the AWS cloud, running real-time analytics in the cloud, or integrating with various cloud systems. Regardless of the use case, these three key best practices will help ensure your cloud initiative is a success:
- Performance: how to make the most of available bandwidth,
- Efficiency: where and how to have the “work” take place so that only what is changing gets processed, and
- Security: how to keep data secure in flight and at rest.
This post was originally posted in Data Center Knowledge and has since been modified.