Requirements for Azure Data Lake Store
This section describes the requirements, access privileges, and other features of HVR when using Azure Data Lake Store (DLS) for replication. For information about compatibility and support for Azure DLS with HVR platforms, see Platform Compatibility Matrix.
- 1 Location Connection
- 2 Hadoop Client
- 3 Authentication
- 4 Client Configuration Files
- 5 Hive External Table
Location Connection
This section lists and describes the connection details required for creating an Azure DLS location in HVR.
|Host||The IP address or hostname of the Azure DLS server. |
|Directory||The directory path in Host where the replicated changes are saved.|
|Mechanism||The authentication mode for connecting HVR to the Azure DLS server. Available options: Service-to-service, Refresh Token, MSI.|
|OAuth2 Endpoint||The URL used for obtaining a bearer token with the credential token. This field is enabled only if the authentication Mechanism is Service-to-service. |
|Client ID||The Client ID (or Application ID) used to obtain an access token with either a credential or a refresh token. This field is enabled only if the authentication Mechanism is either Service-to-service or Refresh Token. |
|Key||The credential used for obtaining the initial and subsequent access tokens. This field is enabled only if the authentication Mechanism is Service-to-service.|
|Token||The directory path to the text file containing the refresh token. This field is enabled only if the authentication Mechanism is Refresh Token.|
|Port||The port number for the REST endpoint of the token service exposed to localhost by the identity extension in the Azure VM (default value: 50342). This field is enabled only if the authentication Mechanism is MSI.|
|Hive External Tables||Enable/Disable Hive ODBC connection configuration for creating Hive external tables above Azure DLS.|
Hive ODBC Connection
HVR allows you to create Hive External Tables above Azure DLS files; these tables are used only during compare. You can enable or disable the Hive configuration for Azure DLS in the location creation screen using the field Hive External Tables. For more information about configuring Hive external tables, refer to Hadoop Azure Data Lake Support documentation.
|Hive ODBC Connection|
|Hive Server Type||The type of Hive server. Available options:
|Service Discovery Mode||The mode for connecting to Hive. This field is enabled only if Hive Server Type is Hive Server 2. Available options:
|Host(s)||The hostname or IP address of the Hive server.|
When Service Discovery Mode is ZooKeeper, specify the list of ZooKeeper servers in the following format: [ZK_Host1]:[ZK_Port1],[ZK_Host2]:[ZK_Port2], where [ZK_Host] is the IP address or hostname of the ZooKeeper server and [ZK_Port] is the TCP port that the ZooKeeper server uses to listen for client connections.
|Port||The TCP port that the Hive server uses to listen for client connections. This field is enabled only if Service Discovery Mode is No Service Discovery. |
|Database||The name of the database schema to use when a schema is not explicitly specified in a query. |
|ZooKeeper Namespace||The namespace on ZooKeeper under which Hive Server 2 nodes are added. This field is enabled only if Service Discovery Mode is ZooKeeper.|
|Mechanism||The authentication mode for connecting HVR to Hive Server 2. This field is enabled only if Hive Server Type is Hive Server 2. Available options: |
|User||The username to connect HVR to Hive server. This field is enabled only if Mechanism is User Name or User Name and Password. |
|Password||The password of the User to connect HVR to Hive server. This field is enabled only if Mechanism is User Name and Password.|
|Service Name||The Kerberos service principal name of the Hive server. This field is enabled only if Mechanism is Kerberos.|
|Host||The Fully Qualified Domain Name (FQDN) of the Hive Server 2 host. The value of Host can be set as _HOST to use the Hive server hostname as the domain name for Kerberos authentication.|
If Service Discovery Mode is disabled, then the driver uses the value specified in the Host connection attribute.
If Service Discovery Mode is enabled, then the driver uses the Hive Server 2 host name returned by ZooKeeper.
This field is enabled only if Mechanism is Kerberos.
|Realm||The realm of the Hive Server 2 host.|
It is not required to specify any value in this field if the realm of the Hive Server 2 host is defined as the default realm in Kerberos configuration. This field is enabled only if Mechanism is Kerberos.
|Linux / Unix|
|Driver Manager Library||The directory path where the Unix ODBC Driver Manager Library is installed. |
|ODBCSYSINI||The directory path where odbc.ini and odbcinst.ini files are located. |
|ODBC Driver||The user defined (installed) ODBC driver to connect HVR to the Hive server.|
|SSL Options||Displays the SSL options.|
|Enable SSL||Enable/disable (one way) SSL. If enabled, HVR authenticates the Hive server by validating the SSL certificate shared by the Hive server.|
|Two-way SSL||Enable/disable two-way SSL. If enabled, HVR and the Hive server authenticate each other by validating each other's SSL certificate. This field is enabled only if Enable SSL is selected.|
|SSL Public Certificate||The directory path where the .pem file containing the client's SSL public certificate is located. This field is enabled only if Two-way SSL is selected.|
|SSL Private Key||The directory path where the .pem file containing the client's SSL private key is located. This field is enabled only if Two-way SSL is selected.|
|Client Private Key Password||The password of the private key file that is specified in SSL Private Key. This field is enabled only if Two-way SSL is selected.|
Hadoop Client
The Hadoop client should be present on the machine from which HVR will access Azure DLS. Internally, HVR uses the WebHDFS REST API to connect to Azure DLS. Azure DLS locations can only be accessed through HVR running on Linux or Windows. HVR does not need to be installed on the Hadoop NameNode, although this is possible. For more information about installing the Hadoop client, refer to Apache Hadoop Releases.
Hadoop Client Configuration
The following are required on the machine from which HVR connects to Azure DLS:
- Hadoop 2.6.x client libraries with Java 7 Runtime Environment or Hadoop 3.x client libraries with Java 8 Runtime Environment. For downloading Hadoop, refer to Apache Hadoop Releases.
- Set the environment variable $JAVA_HOME to the Java installation directory.
- Set the environment variable $HADOOP_COMMON_HOME or $HADOOP_HOME or $HADOOP_PREFIX to the Hadoop installation directory, or the hadoop command line client should be available in the path.
- One of the following configurations is recommended:
  - Add $HADOOP_HOME/share/hadoop/tools/lib to the Hadoop classpath.
  - Create a symbolic link for $HADOOP_HOME/share/hadoop/tools/lib in $HADOOP_HOME/share/hadoop/common or any other directory present in the classpath.
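The settings listed above could be applied in a shell profile roughly as follows. This is a minimal sketch, not an HVR-provided script: the function name and the installation paths are placeholders, and using the Hadoop environment variable HADOOP_CLASSPATH is only one common way to add the tools/lib directory to the Hadoop classpath.

```shell
# Sketch: export the environment a Hadoop client needs before HVR uses it.
# Both arguments are placeholders for your own installation directories.
setup_hadoop_env() {
    java_dir=$1
    hadoop_dir=$2

    export JAVA_HOME="$java_dir"
    export HADOOP_HOME="$hadoop_dir"

    # Add the tools/lib JARs to the Hadoop classpath and keep any value
    # that was already set. The wildcard is left for Hadoop to expand.
    tools_lib="$HADOOP_HOME/share/hadoop/tools/lib/*"
    export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$tools_lib"

    # Make the hadoop command line client available in the path.
    export PATH="$HADOOP_HOME/bin:$PATH"
}

# Example call with hypothetical paths:
# setup_hadoop_env /usr/lib/jvm/java-8-openjdk /opt/hadoop-3.1.0
```

After calling the function, running the hadoop classpath command should list the tools/lib entry if the variable was picked up.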
Verifying Hadoop Client Installation
To verify the Hadoop client installation:
- The $HADOOP_HOME/bin directory in the Hadoop installation location should contain the hadoop executables.
- Execute the following commands to verify the Hadoop client installation:
$JAVA_HOME/bin/java -version
$HADOOP_HOME/bin/hadoop version
$HADOOP_HOME/bin/hadoop classpath
- If the Hadoop client installation is verified successfully, execute the following command to verify the connectivity between HVR and Azure DLS:
$HADOOP_HOME/bin/hadoop fs -ls adl://<cluster>/
Verifying Hadoop Client Compatibility with Azure DLS
To verify the compatibility of Hadoop client with Azure DLS, check if the following JAR files are available in the Hadoop client installation location ($HADOOP_HOME/share/hadoop/tools/lib):
hadoop-azure-<version>.jar
hadoop-azure-datalake-<version>.jar
azure-data-lake-store-sdk-<version>.jar
azure-storage-<version>.jar
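The check for these four JAR files can be scripted. The following is a minimal sketch, assuming that each version string starts with a digit; the function name check_adl_jars is made up for illustration and is not part of HVR or Hadoop.

```shell
# Sketch: report which of the Azure DLS related JARs are present.
# Returns 0 when all four are found, 1 otherwise.
check_adl_jars() {
    lib_dir=$1
    missing=0
    for prefix in hadoop-azure hadoop-azure-datalake \
                  azure-data-lake-store-sdk azure-storage; do
        found=no
        # Match <prefix>-<version>.jar where the version starts with a digit,
        # so hadoop-azure does not also match hadoop-azure-datalake.
        for jar in "$lib_dir/$prefix"-[0-9]*.jar; do
            if [ -f "$jar" ]; then
                found=yes
                break
            fi
        done
        if [ "$found" = yes ]; then
            echo "found:   $prefix"
        else
            echo "missing: $prefix"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_adl_jars "$HADOOP_HOME/share/hadoop/tools/lib"
```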
Authentication
HVR supports the following three authentication modes for connecting to Azure DLS:
- Service-to-service
This option is used if an application needs to directly authenticate itself with Data Lake Store. The connection parameters required in this authentication mode are OAuth2 Token Endpoint, Client ID (application ID), and Key (authentication key). For more information about the connection parameters, search for "Service-to-service authentication" in Data Lake Store Documentation.
- Refresh Token
This option is used if a user's Azure credentials are used to authenticate with Data Lake Store. The connection parameters required in this authentication mode are Client ID (application ID) and Token (refresh token). The refresh token should be saved in a text file, and the directory path to this text file should be specified in the Token field of the location creation screen. For more information about the connection parameters and end-user authentication using REST API, search for "End-user authentication" in Data Lake Store Documentation.
- MSI
This option is preferred when you have HVR running on a VM in Azure. Managed Service Identity (MSI) allows you to authenticate to services that support Azure Active Directory authentication. For this authentication mode to work, the VM should have access to Azure DLS and MSI authentication should be enabled on the VM in Azure. The only connection parameter required in this authentication mode is Port (MSI endpoint port); the default port number is 50342. For more information about providing access to Azure DLS and enabling MSI on the VM, search for "Access Azure Data Lake Store" in Azure Active Directory Managed Service Identity Documentation.
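To illustrate how the three service-to-service parameters fit together, the sketch below requests a bearer token directly with curl. It shows what the fields correspond to, not how HVR makes the call internally. It assumes an Azure Active Directory token endpoint of the form https://login.microsoftonline.com/<tenant-id>/oauth2/token and the Data Lake Store resource URI https://datalake.azure.net/. The function name and the CURL override variable are made up for illustration.

```shell
# Sketch: request a bearer token using service-to-service credentials.
# $1 = OAuth2 token endpoint, $2 = client (application) ID, $3 = key.
# CURL can be overridden with another command to inspect the request
# instead of sending it.
request_adl_token() {
    ${CURL:-curl} -s -X POST "$1" \
        --data-urlencode "grant_type=client_credentials" \
        --data-urlencode "client_id=$2" \
        --data-urlencode "client_secret=$3" \
        --data-urlencode "resource=https://datalake.azure.net/"
}

# Example (placeholders left unfilled on purpose):
# request_adl_token "https://login.microsoftonline.com/<tenant-id>/oauth2/token" \
#     "<client-id>" "<key>"
```

Passing the key on the command line exposes it to other local users through the process list, so this form is suitable only for a quick manual check.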
HVR does not support client-side encryption (customer managed keys) for Azure DLS. For more information about encryption of data in Azure DLS, search for "encryption" in Data Lake Store Documentation.
Client Configuration Files
Client configuration files are not required for HVR to perform replication; however, they can be useful for debugging. Client configuration files contain settings for different services like HDFS or HBASE. If the HVR integrate machine is not part of the cluster, it is recommended to download the configuration files for the cluster so that the Hadoop client knows how to connect to HDFS.
The client configuration files for Cloudera Manager or Ambari for Hortonworks can be downloaded from the respective cluster manager's web interface. For more information about downloading client configuration files, search for "Client Configuration Files" in the respective documentation for Cloudera and Hortonworks.
Hive External Table
HVR allows you to create Hive external tables above Azure DLS files which are only used during compare. The Hive ODBC connection can be enabled for Azure DLS in the location creation screen by selecting the Hive External Tables field (see section Location Connection). For more information about configuring Hive external tables for Azure DLS, refer to Hadoop Azure Data Lake Support documentation.
HVR uses an ODBC connection to the Hadoop cluster, which requires an ODBC driver for Hive (Amazon ODBC 1.1.1 or HortonWorks ODBC 2.1.2 and above) to be installed on the machine (or in the same network).
The Amazon and HortonWorks ODBC drivers are similar, and both are compatible with Hive 2.x releases. However, it is recommended to use the Amazon ODBC driver for Amazon Hive and the Hortonworks ODBC driver for HortonWorks Hive.
By default, HVR uses the Amazon ODBC driver for connecting to Hadoop. To use the Hortonworks ODBC driver, the following action definition is required:
|Azure DLS||*||Environment /Name=HVR_ODBC_CONNECT_STRING_DRIVER /Value=Hortonworks Hive ODBC Driver 64-bit|
|Azure DLS||*||Environment /Name=HVR_ODBC_CONNECT_STRING_DRIVER /Value=Hortonworks Hive ODBC Driver|
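On Linux or Unix, it can help to confirm that the driver name given in /Value is actually registered before defining the action. The sketch below assumes the unixODBC convention that installed drivers appear as section headers in the odbcinst.ini file located in the ODBCSYSINI directory. The function name is made up for illustration.

```shell
# Sketch: check that a driver name is registered in odbcinst.ini.
# $1 = directory holding odbcinst.ini (the ODBCSYSINI value)
# $2 = driver name exactly as used in the action definition
odbc_driver_registered() {
    # Driver names appear as section headers, for example [My Driver].
    # -F treats the brackets literally, -x requires a whole-line match.
    grep -Fxq "[$2]" "$1/odbcinst.ini"
}

# Example:
# odbc_driver_registered "$ODBCSYSINI" "Hortonworks Hive ODBC Driver 64-bit" \
#     && echo "driver is registered"
```

Because the match is on the whole line, a name without the 64-bit suffix does not match a section that has it.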
For the file formats (CSV, JSON, and AVRO) the following action definitions are required to handle certain limitations of the Hive deserialization implementation during Bulk or Row-wise Compare:
- For CSV,
|Azure DLS||*||FileFormat /NullRepresentation=\\N|
|Azure DLS||*||TableProperties /CharacterMapping="\x00>\\0;\n>\\n;\r>\\r;">\""|
|Azure DLS||*||TableProperties /MapBinary=BASE64|
- For JSON,
|Azure DLS||*||TableProperties /MapBinary=BASE64|
|Azure DLS||*||FileFormat /JsonMode=ROW_FRAGMENTS|
- For Avro,
|Azure DLS||*||FileFormat /AvroVersion=v1_8|
v1_8 is the default value for FileFormat /AvroVersion, so it is not mandatory to define this action.