This section describes the requirements, access privileges, and other features of HVR when using Azure Blob FS for replication. For information about compatibility and support for Azure Blob FS with various HVR platforms, see Platform Compatibility Matrix.
For the capabilities supported by HVR, see Capabilities.
This section lists and describes the connection details required for creating an Azure Blob FS location in HVR.
Azure Blob FS
The type of security to be used for connecting to Azure Blob Server. Available options:
The Azure Blob storage account.
The name of the container within the storage Account.
The directory path in the Container to be used for replication.
The access key of the storage Account.
Hive External Tables
Enable/Disable Hive ODBC connection configuration for creating Hive external tables above Azure Blob FS.
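As an illustration of how these connection details fit together, the sketch below assembles a Hadoop-style URL from the storage account, container, and directory. It assumes the wasb:// and wasbs:// URL form used by the Hadoop Azure module; the function name and the sample values are hypothetical and not part of HVR.

```python
def blob_fs_url(account, container, directory="", secure=True):
    """Build a Hadoop-style URL for a directory in an Azure Blob container.

    Assumption: the wasb (http) / wasbs (https) scheme of the Hadoop Azure
    module. This is an illustrative helper, not an HVR API.
    """
    scheme = "wasbs" if secure else "wasb"
    trimmed = directory.strip("/")
    # Always emit exactly one slash between the host and the path.
    path = f"/{trimmed}" if trimmed else "/"
    return f"{scheme}://{container}@{account}.blob.core.windows.net{path}"


if __name__ == "__main__":
    # Hypothetical account, container, and directory names.
    print(blob_fs_url("mystorageaccount", "mycontainer", "/replication"))
```

The same URL form is what the `hadoop fs` command line client accepts when checking connectivity to the container.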
Hive ODBC Connection
HVR allows you to create Hive external tables above Azure Blob FS files; these tables are used only during compare. You can enable or disable the Hive configuration for Azure Blob FS in the location creation screen using the field Hive External Tables. For more information about configuring Hive external tables, refer to Hadoop Azure Blob FS Support documentation.
Hive ODBC Connection
Hive Server Type
The type of Hive server. Available options:
Service Discovery Mode
The mode for connecting to Hive. This field is enabled only if Hive Server Type is Hive Server 2. Available options:
The hostname or IP address of the Hive server.
The TCP port that the Hive server uses to listen for client connections. This field is enabled only if Service Discovery Mode is No Service Discovery.
The name of the database schema to use when a schema is not explicitly specified in a query.
The namespace on ZooKeeper under which Hive Server 2 nodes are added. This field is enabled only if Service Discovery Mode is ZooKeeper.
The authentication mode for connecting HVR to Hive Server 2. This field is enabled only if Hive Server Type is Hive Server 2. Available options:
The username to connect HVR to the Hive server. This field is enabled only if Mechanism is User Name or User Name and Password.
The password of the User to connect HVR to the Hive server. This field is enabled only if Mechanism is User Name and Password.
The Kerberos service principal name of the Hive server. This field is enabled only if Mechanism is Kerberos.
The Fully Qualified Domain Name (FQDN) of the Hive Server 2 host. The value of Host can be set as _HOST to use the Hive server hostname as the domain name for Kerberos authentication.
The realm of the Hive Server 2 host.
The transport protocol to use in the Thrift layer. This field is enabled only if Hive Server Type is Hive Server 2. Available options:
The partial URL corresponding to the Hive server. This field is enabled only if Thrift Transport is HTTP.
Linux / Unix
Driver Manager Library
The directory path where the Unix ODBC Driver Manager Library is installed.
The directory path where odbc.ini and odbcinst.ini files are located.
The user defined (installed) ODBC driver to connect HVR to the Hive server.
Show SSL Options.
Enable/disable (one way) SSL. If enabled, HVR authenticates the Hive server by validating the SSL certificate shared by the Hive server.
Enable/disable two way SSL. If enabled, HVR and the Hive server authenticate each other by validating each other's SSL certificate. This field is enabled only if Enable SSL is selected.
Trusted CA Certificates
The directory path where the .pem file containing the server's public SSL certificate signed by a trusted CA is located. This field is enabled only if Enable SSL is selected.
SSL Public Certificate
The directory path where the .pem file containing the client's SSL public certificate is located. This field is enabled only if Two-way SSL is selected.
SSL Private Key
The directory path where the .pem file containing the client's SSL private key is located. This field is enabled only if Two-way SSL is selected.
Client Private Key Password
The password of the private key file that is specified in SSL Private Key. This field is enabled only if Two-way SSL is selected.
The Hadoop client must be installed on the machine from which HVR will access the Azure Blob FS. Internally, HVR uses the C API libhdfs to connect to Azure Blob FS and to read and write data during capture, integrate (continuous), refresh (bulk) and compare (direct file compare).
Azure Blob FS locations can only be accessed through HVR running on Linux or Windows. HVR does not need to be installed on the Hadoop NameNode, although this is possible. For more information about installing the Hadoop client, refer to Apache Hadoop Releases.
Hadoop Client Configuration
The following are required on the machine from which HVR connects to Azure Blob FS:
- Hadoop 2.6.x client libraries with Java 7 Runtime Environment or Hadoop 3.x client libraries with Java 8 Runtime Environment. For downloading Hadoop, refer to Apache Hadoop Releases.
- Set the environment variable $JAVA_HOME to the Java installation directory.
- Set one of the environment variables $HADOOP_COMMON_HOME, $HADOOP_HOME, or $HADOOP_PREFIX to the Hadoop installation directory, or make the hadoop command line client available in the path.
- One of the following configurations is recommended:
  - Set $HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*
  - Create a symbolic link for $HADOOP_HOME/share/hadoop/tools/lib/ in $HADOOP_HOME/share/hadoop/common or any other directory present in the classpath.
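The required environment settings listed above can be checked with a short script. The following is a minimal sketch, assuming only the variable names given in this section; the function name and the message texts are hypothetical.

```python
import os
import shutil

# Any one of these may point at the Hadoop installation directory.
HADOOP_HOME_VARS = ("HADOOP_COMMON_HOME", "HADOOP_HOME", "HADOOP_PREFIX")


def missing_prerequisites(env, hadoop_on_path):
    """Return a list of human-readable problems; an empty list means OK.

    env is a mapping of environment variables (for example os.environ);
    hadoop_on_path says whether the hadoop client was found in the path.
    """
    problems = []
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")
    has_home = any(env.get(name) for name in HADOOP_HOME_VARS)
    if not has_home and not hadoop_on_path:
        problems.append(
            "no Hadoop home variable is set and hadoop is not in the path"
        )
    return problems


if __name__ == "__main__":
    found = shutil.which("hadoop") is not None
    for problem in missing_prerequisites(os.environ, found):
        print(problem)
```

Passing the environment in as a mapping keeps the check easy to run against a saved copy of another machine's settings.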
Because the binary distribution available on the Hadoop website lacks Windows-specific executables, a warning about being unable to locate winutils.exe is displayed. This warning can be ignored when the Hadoop library is used only for client operations to connect to an HDFS server using HVR. However, performance on the integrate location will be poor when this warning occurs, so it is recommended to use a Windows-specific Hadoop distribution to avoid it. For more information about this warning, refer to Hadoop issue HADOOP-10051.
Verifying Hadoop Client Installation
To verify the Hadoop client installation:
- The $HADOOP_HOME/bin directory in the Hadoop installation location should contain the hadoop executables.
Execute the following commands to verify Hadoop client installation:
If the Hadoop client installation is verified successfully then execute the following command to check the connectivity between HVR and Azure Blob FS:
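As a rough sketch of such checks, the helper below assembles the command lines one would typically run: the Java version, the Hadoop version and classpath, and a directory listing of the container. The `version`, `classpath`, and `fs -ls` subcommands are standard Hadoop client commands; the wasbs:// URL form, the function name, and the sample paths are assumptions for illustration and are not taken from HVR.

```python
import os
import shlex


def verification_commands(hadoop_home, java_home, blob_url):
    """Return the client checks as argument lists (nothing is executed)."""
    hadoop = os.path.join(hadoop_home, "bin", "hadoop")
    java = os.path.join(java_home, "bin", "java")
    return [
        [java, "-version"],               # Java runtime is usable
        [hadoop, "version"],              # Hadoop client starts
        [hadoop, "classpath"],            # tools/lib should appear here
        [hadoop, "fs", "-ls", blob_url],  # connectivity to the container
    ]


if __name__ == "__main__":
    # Hypothetical installation paths and storage names.
    url = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/"
    for cmd in verification_commands("/opt/hadoop", "/usr/lib/jvm/java-8", url):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or each argument list can be passed to `subprocess.run` on a machine where the client is installed.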
To execute this command successfully and avoid the error "ls: Password fs.adl.oauth2.client.id not found", a few properties need to be defined in the file core-site.xml available in the Hadoop configuration folder (for example, <path>/hadoop-2.8.3/etc/hadoop). The properties to be defined differ based on the Mechanism (authentication mode). For more information, refer to section 'Configuring Credentials' in Hadoop Azure Blob FS Support documentation.
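For access-key authentication, the Hadoop Azure module reads the key from a property named after the storage account. The sketch below generates such a core-site.xml fragment; it assumes the `fs.azure.account.key.<account>.blob.core.windows.net` property described in the Hadoop Azure documentation, and the function name and sample values are hypothetical. Other authentication modes need different properties.

```python
import xml.etree.ElementTree as ET


def core_site_xml(account, access_key):
    """Return a core-site.xml document holding one storage account key.

    Assumption: access-key authentication through the Hadoop Azure
    property fs.azure.account.key.<account>.blob.core.windows.net.
    """
    root = ET.Element("configuration")
    prop = ET.SubElement(root, "property")
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = access_key
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical account name and placeholder key.
    print(core_site_xml("mystorageaccount", "<access key>"))
```

Because the file then contains the key in plain text, restrict its permissions to the operating system user that runs HVR.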
Verifying Hadoop Client Compatibility with Azure Blob FS
To verify the compatibility of Hadoop client with Azure Blob FS, check if the following JAR files are available in the Hadoop client installation location ($HADOOP_HOME/share/hadoop/tools/lib):
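A check of this kind can be scripted. The sketch below reports which expected JAR families have no versioned file in a directory listing. The default prefixes `hadoop-azure` and `azure-storage` are an assumption based on the usual artifact names of the Hadoop Azure module, and the function name is hypothetical; adjust the prefixes to the JAR files required for your installation.

```python
import os
import re


def missing_jars(filenames, prefixes=("hadoop-azure", "azure-storage")):
    """Return the prefixes that have no '<prefix>-<version>.jar' file.

    The default prefixes are an assumption, not a list taken from HVR.
    """
    missing = []
    for prefix in prefixes:
        # Require a digit after the prefix so that, for example,
        # hadoop-azure-datalake-*.jar does not count as hadoop-azure.
        pattern = re.compile(rf"^{re.escape(prefix)}-\d.*\.jar$")
        if not any(pattern.match(name) for name in filenames):
            missing.append(prefix)
    return missing


if __name__ == "__main__":
    home = os.environ.get("HADOOP_HOME", "")
    lib_dir = os.path.join(home, "share", "hadoop", "tools", "lib")
    if home and os.path.isdir(lib_dir):
        print(missing_jars(os.listdir(lib_dir)))
```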
HVR does not support client-side encryption (customer managed keys) for Azure Blob FS. For more information about encryption of data in Azure Blob FS, search for "encryption" in Azure Blob storage documentation.
Client Configuration Files
Client configuration files are not required for HVR to perform replication; however, they can be useful for debugging. Client configuration files contain settings for different services such as HDFS. If the HVR integrate machine is not part of the cluster, it is recommended to download the configuration files for the cluster so that the Hadoop client knows how to connect to HDFS.
The client configuration files for Cloudera Manager or Ambari for Hortonworks can be downloaded from the respective cluster manager's web interface. For more information about downloading client configuration files, search for "Client Configuration Files" in the respective documentation for Cloudera and Hortonworks.
Hive External Table
HVR allows you to create Hive external tables above Azure Blob FS files; these tables are used only during compare. The Hive ODBC connection can be enabled for Azure Blob FS in the location creation screen by selecting the Hive External Tables field.
For more information about configuring Hive external tables for Azure Blob FS, refer to Hadoop Azure Support: Azure Blob Storage documentation.
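To make the idea of an external table above Azure Blob FS files concrete, the sketch below builds a generic Hive `CREATE EXTERNAL TABLE` statement whose LOCATION points at a blob directory. This is not the statement HVR generates; the function name, the delimited text format, and the sample table and URL are assumptions chosen for illustration.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Return Hive DDL for an external table over delimited text files.

    columns is a list of (name, hive_type) pairs. Illustrative only:
    the statements HVR issues during compare may differ.
    """
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}` ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table definition and blob directory.
    url = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/orders"
    print(external_table_ddl("orders", [("id", "INT"), ("name", "STRING")], url))
```

Dropping an external table removes only the Hive metadata, so the replicated files in the container are left in place.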
HVR connects to the Hadoop cluster over ODBC, which requires an ODBC driver for Hive (Amazon ODBC 1.1.1 or Hortonworks ODBC 2.1.2 and above) installed on the machine (or in the same network). The Amazon and Hortonworks ODBC drivers are similar and compatible with Hive 2.x. However, it is recommended to use the Amazon ODBC driver for Amazon Hive and the Hortonworks ODBC driver for Hortonworks Hive.
HVR uses the Amazon ODBC driver or Hortonworks ODBC driver to connect to Hive for creating Hive external tables to perform hvrcompare of files that reside on Azure Blob FS.
By default, HVR uses the Amazon ODBC driver for connecting to Hadoop. To use the Hortonworks ODBC driver, the following action definition is required:
Azure Blob FS
Environment /Name = HVR_ODBC_CONNECT_STRING_DRIVER /Value = Hortonworks Hive ODBC Driver 64-bit
Azure Blob FS
Environment /Name = HVR_ODBC_CONNECT_STRING_DRIVER /Value = Hortonworks Hive ODBC Driver
For the file formats (CSV, JSON, and AVRO) the following action definitions are required to handle certain limitations of the Hive deserialization implementation during Bulk or Row-wise Compare: