Requirements for Azure Data Lake Storage Gen2
Last updated on Feb 26, 2021
Since v5.6.5/2
Capture | Hub | Integrate |
---|---|---|
This section describes the requirements, access privileges, and other features of HVR when using Azure Data Lake Storage (DLS) Gen2 for replication. For information about compatibility and support for Azure DLS Gen2 with HVR platforms, see Platform Compatibility Matrix.
For the capabilities supported by HVR, see Capabilities.
For information about the supported data types and mapping of data types in source DBMS to the corresponding data types in target DBMS or file format, see Data Type Mapping.
For instructions to quickly set up replication using Azure DLS Gen2, see Quick Start for HVR - Azure DLS Gen2.
Location Connection
This section lists and describes the connection details required for creating an Azure DLS Gen2 location in HVR.
Field | Description |
---|---|
Azure DLS Gen2 | |
Secure connection | The type of security to be used for connecting to Azure DLS Gen2. Available options:
|
Account | The Azure DLS Gen2 storage account. |
Container | The name of a container within the storage Account. |
Directory | The directory path in Container to be used for replication. |
Authentication | |
Type | The type of authentication to be used for connecting to Azure DLS Gen2. Available options: Shared Key and OAuth. For more information about these authentication types, see section Authentication. |
Secret Key | The access key of the storage Account. This field is enabled only if authentication Type is Shared Key. |
Mechanism | The authentication mode for connecting HVR to Azure DLS Gen2 server. This field is enabled only if authentication Type is OAuth. The available option is Client Credentials. |
OAuth2 Endpoint | The URL used to obtain a bearer token with the client credentials. |
Client ID | A client ID (or application ID) used to obtain an Azure AD access token. |
Client Secret | A secret key used to validate the Client ID. |
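
As an illustration of how the Account, Container, and Directory fields relate to each other, the sketch below composes them into an ABFS URI, following the convention of the Hadoop ABFS driver for Azure DLS Gen2. The function name `abfs_uri` and the sample values are hypothetical, and the mapping of Secure connection to the `abfss` scheme is an assumption; this section does not state how HVR builds the address internally.

```python
def abfs_uri(account: str, container: str, directory: str = "", secure: bool = True) -> str:
    """Compose an ABFS URI from the location fields (Hadoop ABFS driver convention).

    Assumption: a secure connection corresponds to the TLS scheme 'abfss',
    and a non-secure connection to 'abfs'.
    """
    scheme = "abfss" if secure else "abfs"
    # Normalize the directory so the result has no duplicate or trailing slash.
    path = directory.strip("/")
    base = f"{scheme}://{container}@{account}.dfs.core.windows.net"
    return f"{base}/{path}" if path else f"{base}/"


# Hypothetical sample values for Account, Container, and Directory.
print(abfs_uri("mystorageacct", "replication", "/hvr/target"))
# abfss://replication@mystorageacct.dfs.core.windows.net/hvr/target
```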
Hadoop Client
For Linux (x64) and Windows (x64), since HVR 5.7.0/8 and 5.7.5/4, it is not required to install and configure the Hadoop client. However, if you want to use the Hadoop client, set the environment variable HVR_AZURE_USE_HADOOP=1 and follow the steps mentioned below.
It is mandatory to install and configure the Hadoop client for HVR versions prior to 5.7.0/8 or 5.7.5/4.
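
A minimal sketch of preparing an environment in which HVR_AZURE_USE_HADOOP=1 is set, for example before launching a process from a wrapper script. The helper name `hadoop_client_env` is hypothetical, and it is an assumption that the variable is passed through the process environment; this section does not describe where HVR expects the variable to be defined.

```python
import os


def hadoop_client_env(base_env=None):
    """Return a copy of the environment with HVR_AZURE_USE_HADOOP=1 set.

    The current process environment is copied, not modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HVR_AZURE_USE_HADOOP"] = "1"
    return env


env = hadoop_client_env()
print(env["HVR_AZURE_USE_HADOOP"])  # prints 1
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a child process.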
Authentication
HVR supports the following two authentication modes for connecting to Azure DLS Gen2:
- Shared Key
When this option is selected, hvruser gains full access to all operations on all resources, including setting owner and changing Access Control List (ACL). The connection parameter required in this authentication mode is Secret Key - a shared access key that Azure generates for the storage account. For more information on how to manage access keys for Shared Key authorization, refer to Manage storage account access keys. Note that with this authentication mode, no identity is associated with a user and permission-based authorization cannot be implemented.
- OAuth
This option is used to connect to Azure DLS Gen2 storage account directly with OAuth 2.0 using the service principal. The connection parameters required for this authentication mode are OAuth2 Endpoint, Client ID, and Client Secret. For more information, refer to Azure Data Lake Storage Gen2 documentation.
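
To make the OAuth parameters concrete, the sketch below shows how an OAuth 2.0 client credentials request could be assembled from OAuth2 Endpoint, Client ID, and Client Secret. It is not HVR's implementation. The function names are hypothetical, the endpoint shown is a placeholder with the tenant ID left unfilled, and the `resource` value `https://storage.azure.com/` is an assumption based on the Azure AD v1 token endpoint convention.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(oauth2_endpoint, client_id, client_secret,
                        resource="https://storage.azure.com/"):
    """Build (but do not send) a client credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,  # assumption: Azure AD v1 endpoint style
    }).encode("ascii")
    return urllib.request.Request(
        oauth2_endpoint,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def fetch_access_token(request, timeout=30):
    """Send the request and return the bearer token from the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["access_token"]


# Placeholder values; replace them with the values of the actual service principal.
request = build_token_request(
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    "<client-id>",
    "<client-secret>",
)
print(request.get_method(), request.full_url)
```

Calling `fetch_access_token(request)` performs the network call; it is kept separate so the request can be inspected without contacting Azure AD.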
Encryption
HVR does not support client-side encryption (customer-managed keys) for Azure DLS Gen2. For more information about the encryption of data in Azure DLS Gen2, refer to Data Lake Storage Documentation.
Client Configuration Files for Hadoop
Client configuration files are not required for HVR to perform replication; however, they can be useful for debugging. Client configuration files contain settings for services such as HDFS. If the HVR integrate machine is not part of the cluster, it is recommended to download the configuration files for the cluster so that the Hadoop client knows how to connect to HDFS.
The client configuration files for Cloudera Manager or Ambari for Hortonworks can be downloaded from the respective cluster manager's web interface. For more information about downloading the client configuration files, search for "Client Configuration Files" in the respective documentation for Cloudera and Hortonworks.
Integrate
HVR allows you to perform HVR Refresh or Integrate changes into an Azure DLS Gen2 location. This section describes the configuration requirements for integrating changes (using HVR Refresh or Integrate) into the Azure DLS Gen2 location.
Customize Integrate
Defining action Integrate is sufficient for integrating changes into an Azure DLS Gen2 location. However, the default file format written into a target file location is HVR's own XML format and the changes captured from multiple tables are integrated as files into one directory. The integrated files are named using the integrate timestamp.
You may define other actions to customize the default integration behavior described above. The following are a few examples of actions that can be used to customize integration into the Azure DLS Gen2 location:
Group | Table | Action | Annotation |
---|---|---|---|
Azure DLS Gen2 | * | This action may be defined to:
| |
Azure DLS Gen2 | * | Integrate /RenameExpression | To segregate and name the files integrated into the target location. For example, if /RenameExpression={hvr_tbl_name}/{hvr_integ_tstamp}.csv is defined, then a separate folder (named after the table) is created in the target location for each source table, and the files replicated for each table are saved into its folder. This also enforces unique file names, because each file is named with the timestamp of the moment it was integrated into the target location. |
Azure DLS Gen2 | * | ColumnProperties | This action defines properties for a column being replicated. This action may be defined to:
|
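
The effect of the /RenameExpression example in the table above can be sketched as follows. This is an illustration only, not HVR's substitution engine: the helper `expand_rename_expression` is hypothetical, Python's `str.format` stands in for HVR's own substitution, and the timestamp layout `YYYYMMDDhhmmss` is an assumption.

```python
from datetime import datetime, timezone


def expand_rename_expression(expression, table_name, integ_time):
    """Substitute {hvr_tbl_name} and {hvr_integ_tstamp} in a rename expression."""
    # Assumed timestamp layout; the real format used by HVR may differ.
    tstamp = integ_time.strftime("%Y%m%d%H%M%S")
    return expression.format(hvr_tbl_name=table_name, hvr_integ_tstamp=tstamp)


expr = "{hvr_tbl_name}/{hvr_integ_tstamp}.csv"
when = datetime(2021, 2, 26, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical source tables: each one gets its own folder in the target location.
for table in ("orders", "customers"):
    print(expand_rename_expression(expr, table, when))
# orders/20210226120000.csv
# customers/20210226120000.csv
```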