Redshift Data Warehouse
Best Practices for Integrating into Your Cloud-Based Data Warehouse for Real-Time Analytics
1 – Load data through S3 using the COPY command
The best way to load data into Redshift is to stage the data in S3 and load it with the COPY command. Compress the staged files with gzip to limit the data size.

2 – Process changes in set-based statements
To achieve a high aggregate rate of row changes, process inserts, updates, and deletes in set-based statements that modify multiple rows in a single scan. Single-row operations are relatively slow because of the way Redshift allocates storage.
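The following sketch illustrates both steps; the table names (orders, orders_staging), the S3 path, and the IAM role ARN are illustrative assumptions, not part of any generated workload:

    -- Load a gzip-compressed change batch from S3 into a staging table.
    COPY orders_staging
    FROM 's3://my-bucket/changes/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    DELIMITER ',';

    -- Apply the changes set-based: one delete and one insert,
    -- each touching many rows in a single scan.
    BEGIN;
    DELETE FROM orders
    USING orders_staging
    WHERE orders.order_id = orders_staging.order_id;
    INSERT INTO orders SELECT * FROM orders_staging;
    COMMIT;

    -- Empty the staging table for the next micro-batch. Note that
    -- TRUNCATE commits immediately in Redshift, so it runs on its own.
    TRUNCATE orders_staging;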
3 – Distribute staging tables appropriately
To avoid unnecessary interconnect traffic, the staging tables should either be co-distributed with the base table they load into, or be distributed to all nodes. EVEN distribution is generally not a good choice for a staging table. Be aware that, with storage allocated at 1 MB per column per slice, and depending on the frequency of the micro-batches and the aggregate data volume coming in, it is relatively unlikely that more than one data block per column per slice is ever filled.
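A minimal sketch of the two recommended options, assuming a base table orders distributed on order_id (the column list is illustrative):

    -- Option 1: co-distribute the staging table with the base table,
    -- so the delete/insert joins on order_id stay local to each node.
    CREATE TABLE orders_staging (
        order_id   BIGINT,
        op_type    CHAR(1),
        amount     DECIMAL(18,2)
    )
    DISTSTYLE KEY
    DISTKEY (order_id);

    -- Option 2: replicate a small staging table to every node.
    -- CREATE TABLE orders_staging (...) DISTSTYLE ALL;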
4 – Limit load concurrency
Out of the box, HVR provides job scheduling that can help limit the concurrency of data loads into Redshift.
5 – Vacuum regularly
Redshift allocates storage generously, and new data is always appended to the table. To consolidate the data into fewer data blocks, and to re-sort the data if sort keys are defined, run the VACUUM operation. To limit its impact on the system, VACUUM is best run when no new data is flowing in. On a system that is continuously updated, consider running VACUUM at least weekly.
Pausing a data load is straightforward in HVR, but the VACUUM operation itself must be scheduled outside of HVR.
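As a sketch, the maintenance commands might look as follows; the table name orders and the use of svv_table_info to gauge the unsorted fraction are illustrative:

    -- Gauge how unsorted the table is and how stale its statistics are.
    SELECT "table", unsorted, stats_off
    FROM svv_table_info
    WHERE "table" = 'orders';

    -- Reclaim space and re-sort rows according to the sort key,
    -- then refresh the optimizer statistics.
    VACUUM FULL orders;
    ANALYZE orders;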
6 – Encrypt your data
HVR supports encryption of the data as it moves through S3 into Redshift.
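On the Redshift side, loading client-side encrypted files staged in S3 could look like the following sketch; the key value and paths are placeholders, and this is not necessarily how HVR itself performs the load:

    -- Load gzip-compressed, client-side encrypted files from S3.
    -- The base64-encoded AES-256 key below is a placeholder.
    COPY orders_staging
    FROM 's3://my-bucket/changes/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MASTER_SYMMETRIC_KEY '<base64-encoded-aes-256-key>'
    ENCRYPTED
    GZIP;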
7 – Rotate access credentials
Security is important, and that includes rotating access credentials. AWS provides an automated way to rotate S3 credentials through the concept of an instance profile. Note that instance profiles are only available to servers running in AWS.
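When the process performing the load runs on an EC2 instance with an instance profile, the automatically rotated temporary credentials, including the session token, can be passed to COPY; the values below are placeholders:

    -- Temporary credentials obtained from the instance profile;
    -- all three values rotate automatically and are placeholders here.
    COPY orders_staging
    FROM 's3://my-bucket/changes/orders/'
    ACCESS_KEY_ID '<temporary-access-key>'
    SECRET_ACCESS_KEY '<temporary-secret-key>'
    SESSION_TOKEN '<temporary-session-token>'
    GZIP;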
For more information about integrating with the AWS Cloud successfully, check out the following resources: