Redshift Data Warehouse 

Best Practices for Integrating Data into Your Cloud-Based Data Warehouse for Real-Time Analytics

Amazon Redshift is a managed data warehouse service in the cloud. Redshift originated in 2012 as a fork of the ParAccel database, following Amazon’s investment in ParAccel. Redshift is a columnar, Massively Parallel Processing (MPP), scale-out database built for fast analytical queries on commodity hardware. In this post, learn some best practices for continuously integrating data into your Redshift data warehouse environment.

Background

An important concept in Redshift is the unit of parallelism, the slice. Storage is allocated per slice, per column (it is a columnar database after all), and the block size is large at 1 MB. As a result, wide tables in Redshift take up a lot of space even if they contain few rows: a table with 200 columns allocates at least 200 x 1 MB per slice, even if there is only one row in the table. The number of slices per node varies between 2 and 32 depending on the node type, and is documented in the Cluster Management Guide.
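As a quick sanity check, this allocation can be observed in the svv_table_info system view, where the size column reports the number of 1 MB blocks a table occupies. A minimal query, assuming a hypothetical table named wide_table_example:

    -- Compare the allocated 1 MB blocks against the row count to see the
    -- minimum footprint of a wide table (table name is a placeholder)
    SELECT "schema", "table", size AS mb_allocated, tbl_rows
    FROM   svv_table_info
    WHERE  "table" = 'wide_table_example';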

The choice of data distribution is important to take advantage of scale-out parallelism and to avoid sending data across the interconnect to resolve joins between large tables. Redshift supports at most one column as the distribution key; the other distribution options are diststyle even (round-robin) and diststyle all (store a copy of the data on every node).
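For illustration, the three distribution styles translate into DDL roughly as follows; the table and column names are made up for the example:

    -- Distribute on a single column; rows with the same customer_id land on the same slice
    CREATE TABLE sales_fact (
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12,2)
    ) DISTSTYLE KEY DISTKEY (customer_id);

    -- Round-robin distribution across the slices
    CREATE TABLE event_log (
        event_id BIGINT,
        payload  VARCHAR(256)
    ) DISTSTYLE EVEN;

    -- A full copy of the table on every node, typically for small dimension tables
    CREATE TABLE country_dim (
        country_code CHAR(2),
        country_name VARCHAR(64)
    ) DISTSTYLE ALL;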

Redshift’s DML support is compatible with other databases and of course includes insert, update and delete. Under the hood an update is a delete followed by an insert, and the insert is always an append operation. A delete operation merely marks a row as deleted; the storage allocation is only released by a vacuum operation.

In Redshift every column has an implicit, lightweight index that records the high and low value per block (also referred to as a storage index, or zone map), which lets the query optimizer decide whether or not to scan a block when data is filtered on that column. Regular (B-tree) indexes cannot be created in Redshift, and constraints exist only for the optimizer’s benefit, to generate more-informed SQL execution plans. Constraints are not enforced.
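Because constraints are informational only, Redshift accepts rows that violate them, and deduplication has to be handled by the load process. A small sketch with a made-up table:

    CREATE TABLE customer_dim (
        customer_id BIGINT PRIMARY KEY,   -- informational only, not enforced
        name        VARCHAR(100)
    );

    -- Both inserts succeed even though they duplicate the "primary key"
    INSERT INTO customer_dim VALUES (1, 'first load');
    INSERT INTO customer_dim VALUES (1, 'duplicate load');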

Continuous Data Integration – best practices

1 – Pick your distribution keys

Distribution keys should be selected for optimum query performance. Redshift supports at most one column in the distribution key. Depending on the data set, an additional sort key may be defined to instruct the database to sort data on the sort key column(s). Sort keys can speed up filters and optimize joins. However, sort keys may hurt data load performance, and new data remains unsorted until a vacuum operation is run.
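As an illustration, a single-column distribution key combined with a compound sort key might be declared as follows; the table and column names are placeholders:

    -- One distribution key column, plus a compound sort key to help
    -- range filters on sale_date and joins/filters on customer_id
    CREATE TABLE orders_fact (
        order_id    BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date, customer_id);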

In HVR every table has a key: either the primary key from the source application, or all non-LOB columns in the table. By default HVR uses the first key column as the distribution key when it creates tables. The key column can be selected in the GUI, and a single action can also be set to exclude a column with a specific name on all tables. HVR does not support sort key definition, so to use sort keys, bring your own DDL.

2 – Load data through S3, compressed

The best way to load data into Redshift is through S3 using COPY statements. To achieve a high aggregate rate of row changes, inserts, updates and deletes are best processed in set-based statements that modify multiple rows in a single scan. Single-row operations are relatively slow due to the storage allocation. Use gzip compression to limit data size.
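A typical set-based load of gzip-compressed files from S3 might look like the following sketch, loading into an illustrative staging table; the bucket, prefix and credentials are placeholders:

    -- Load all gzip-compressed files under the prefix, in parallel across the slices
    COPY orders_burst
    FROM 's3://my-bucket/changes/batch_0001/'
    ACCESS_KEY_ID '<access-key-id>'
    SECRET_ACCESS_KEY '<secret-access-key>'
    GZIP
    DELIMITER ','
    REGION 'us-east-1';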

HVR transparently applies changes using such a micro-batch approach through staging (burst) tables when the Burst option is activated as part of the Integrate action. HVR always uses gzip compression for data loads automatically, both for the initial refresh and for incremental changes.

3 – Distribute staging tables appropriately

To avoid unnecessary interconnect traffic, the staging tables should either be co-distributed with the base table they load into, or be distributed to all nodes. Even distribution is generally not a good choice for the staging table: with storage allocated at 1 MB per column per slice, and depending on the frequency of the micro-batches and the aggregate data volume coming in, it is relatively unlikely that more than one data block per column per slice is ever filled.
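One way to guarantee co-location is to create the staging table with the same distribution key as the base table and then apply each micro-batch set-based; the table and column names below are illustrative:

    -- Staging table inherits the base table's distribution key and sort key
    CREATE TABLE orders_burst (LIKE orders_fact);

    -- Apply a micro-batch: delete the old versions of changed rows,
    -- then append the new versions in a single insert
    BEGIN;
    DELETE FROM orders_fact
    USING  orders_burst
    WHERE  orders_fact.order_id = orders_burst.order_id;

    INSERT INTO orders_fact
    SELECT * FROM orders_burst;
    COMMIT;

    -- Empty the staging table for the next micro-batch; TRUNCATE commits
    -- implicitly in Redshift, so it is kept outside the transaction
    TRUNCATE orders_burst;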

By default, to guarantee co-location of the base and burst (staging) tables, HVR distributes the burst table using the base table’s distribution key. Use the database object generation action to switch the burst tables to diststyle all.

4 – Limit concurrency

Redshift performs really well when concurrency is limited, but slows down worse than linearly when concurrency increases. This applies to some extent to query concurrency, but more so to DML concurrency. Database commits have a high overhead due to cluster synchronization and to ensure read consistency. Redshift enables workload control through a workload manager that can automatically queue requests that exceed desirable concurrency, and rules can be defined separately for DML operations.
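At the session level, a heavy load or maintenance statement can also claim multiple slots from its workload management queue, which temporarily lowers the concurrency in that queue. A minimal sketch; the slot count is just an example value:

    -- Let the next statement in this session use 3 slots of its WLM queue,
    -- reducing how many other queries can run concurrently in that queue
    SET wlm_query_slot_count TO 3;

    -- ... run the heavy load or maintenance statement here ...

    -- Return to the default of one slot per statement
    SET wlm_query_slot_count TO 1;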

Out of the box HVR enables job scheduling that can help limit data load concurrency into Redshift.

5 – Vacuum regularly

Redshift allocates storage generously, and new data is always appended to the table. Run the vacuum operation to consolidate the data into fewer data blocks and, if sort keys were defined, to re-sort the data. Vacuum is best run when no new data is flowing into the system. Consider running vacuum at least weekly on a system that is continuously updated.
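A minimal maintenance sketch, using the placeholder table from earlier:

    -- Reclaim space from deleted rows and re-sort the table on its sort key
    VACUUM FULL orders_fact;

    -- Refresh optimizer statistics after large data changes
    ANALYZE orders_fact;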

Pausing a data load is straightforward in HVR. The vacuum operation must be scheduled outside of HVR.

6 – Encrypt your data

Assume all data is valuable, and use encryption to limit exposure in the event of a breach. HVR supports encryption through S3 into Redshift.

7 – Use service roles and instance profiles

Security is important, and so is rotating access credentials. AWS provides an automated way to rotate S3 credentials through the concept of the instance profile. Instance profiles are only available to servers running in AWS.

S3 authentication in HVR is either through an access key/secret access key combination, or by indicating the role that must be part of the instance profile.
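On the Redshift side, the role-based approach shows up in the COPY command, which can reference an IAM role attached to the cluster instead of embedding access keys; the role ARN and object names below are placeholders:

    -- Authorize the load through an IAM role so that no access keys
    -- need to be stored or rotated manually
    COPY orders_burst
    FROM 's3://my-bucket/changes/batch_0002/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    GZIP
    DELIMITER ',';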

For more information about integrating into AWS Cloud successfully, check out the following resources:

Six Best Practices to Hybrid Cloud Integration

Best Practices for Data Integration into AWS (Webinar) 

Best Practice Guide for Data Integration into AWS

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

