Requirements for S3

From HVR

This section describes the requirements, access privileges, and other features of HVR when using Amazon S3 (Simple Storage Service) for replication. For information about compatibility and support for S3 with HVR platforms, see Platform Compatibility Matrix.

S3
Capture    Hub    Integrate
Yes        No     Yes

Location Connection

This section lists and describes the connection details required for creating an S3 location in HVR.

Field Description
S3
Secure Connection The type of connection security. Available options:
  • Yes (https) (default): HVR will connect to S3 Server using HTTPS.
  • No (http): HVR will connect to S3 Server using HTTP.
S3 Bucket The name of the S3 bucket.
  Example: rs-bulk-load
Directory The directory path in S3 Bucket which is to be used for replication.
  Example: /myserver/hvr/s3
Credentials The authentication mode for connecting HVR to S3 by using IAM user access keys (Key ID and Secret Key). For more information about access keys, refer to Access Keys (Access Key ID and Secret Access Key).
Key ID The access key ID of the IAM user used to connect HVR to S3. This field is enabled only if Credentials is selected.
  Example: AKIAIMFNIQMZ2LBKMQUA
Secret Key The secret access key of the IAM user used to connect HVR to S3. This field is enabled only if Credentials is selected.
Instance Profile Role The authentication mode for connecting HVR to S3 by using an AWS Identity and Access Management (IAM) role. This option can be used only if the HVR remote agent or the hub is running inside the AWS network. When a role is used, HVR obtains a temporary access key pair. For more information about IAM roles, refer to Roles.
Hive External Tables Enable/Disable the Hive ODBC connection configuration for creating Hive external tables on top of S3 files.


Note: If an HVR agent is running on an Amazon EC2 node that is in the same AWS network as the S3 bucket, the communication between the hub and the AWS network is done via the HVR protocol, which is more efficient than direct S3 transfer. Alternatively, this bottleneck can be avoided by configuring the hub itself on an EC2 node.


Hive ODBC Connection

HVR allows you to create Hive external tables on top of S3 files; these are used only during compare. You can enable/disable the Hive configuration for S3 in the location creation screen using the field Hive External Tables.

Field Description
Hive ODBC Connection
Hive Server Type The type of Hive server. Available options:
  • Hive Server 1 (default): The driver connects to a Hive Server 1 instance.
  • Hive Server 2: The driver connects to a Hive Server 2 instance.
Service Discovery Mode The mode for connecting to Hive. This field is enabled only if Hive Server Type is Hive Server 2.  Available options:
  • No Service Discovery (default): The driver connects to Hive server without using the ZooKeeper service.
  • ZooKeeper: The driver discovers Hive Server 2 services using the ZooKeeper service.
Host(s) The hostname or IP address of the Hive server.
When Service Discovery Mode is ZooKeeper, specify the list of ZooKeeper servers in the following format: [ZK_Host1]:[ZK_Port1],[ZK_Host2]:[ZK_Port2], where [ZK_Host] is the IP address or hostname of the ZooKeeper server and [ZK_Port] is the TCP port on which the ZooKeeper server listens for client connections.
  Example: hive-host
Port The TCP port that the Hive server uses to listen for client connections. This field is enabled only if Service Discovery Mode is No Service Discovery.
  Example: 10000
Database The name of the database schema to use when a schema is not explicitly specified in a query.
  Example: mytestdb
ZooKeeper Namespace The namespace on ZooKeeper under which Hive Server 2 nodes are added. This field is enabled only if Service Discovery Mode is ZooKeeper.
Authentication
Mechanism The authentication mode for connecting HVR to Hive Server 2.  This field is enabled only if Hive Server Type is Hive Server 2. Available options:
  • No Authentication (default)
  • User Name
  • User Name and Password
  • Kerberos
User The username used to connect HVR to the Hive server. This field is enabled only if Mechanism is User Name or User Name and Password.
  Example: dbuser
Password The password of the User used to connect HVR to the Hive server. This field is enabled only if Mechanism is User Name and Password.
Service Name The Kerberos service principal name of the Hive server. This field is enabled only if Mechanism is Kerberos.
Host The Fully Qualified Domain Name (FQDN) of the Hive Server 2 host. The value of Host can be set as _HOST to use the Hive server hostname as the domain name for Kerberos authentication.
If Service Discovery Mode is disabled, then the driver uses the value specified in the Host connection attribute.
If Service Discovery Mode is enabled, then the driver uses the Hive Server 2 host name returned by ZooKeeper.
This field is enabled only if Mechanism is Kerberos.
Realm The realm of the Hive Server 2 host.
It is not required to specify any value in this field if the realm of the Hive Server 2 host is defined as the default realm in Kerberos configuration. This field is enabled only if Mechanism is Kerberos.
Linux / Unix
Driver Manager Library The directory path where the Unix ODBC Driver Manager Library is installed. For a default installation, the ODBC Driver Manager Library is available at /usr/lib64 and does not need to be specified. When unixODBC is installed elsewhere, for example in /opt/unixodbc-2.3.2, this would be /opt/unixodbc-2.3.2/lib.
ODBCSYSINI The directory path where the odbc.ini and odbcinst.ini files are located. For a default installation, these files are available at /etc and do not need to be specified. When unixODBC is installed elsewhere, for example in /opt/unixodbc-2.3.2, this would be /opt/unixodbc-2.3.2/etc.
ODBC Driver The user-defined (installed) ODBC driver used to connect HVR to the Hive server.
SSL Options Displays the SSL options.
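On Linux, the value entered in the ODBC Driver field must match a driver entry registered in odbcinst.ini (in the directory given by ODBCSYSINI). A minimal sketch of such an entry for a Hortonworks Hive ODBC driver is shown below; the driver name matches the action definitions later in this document, but the .so path is an assumption and depends on where the driver package was installed:

```ini
[Hortonworks Hive ODBC Driver 64-bit]
Description=Hortonworks Hive ODBC Driver (64-bit)
; Illustrative path; adjust to the actual install location of the driver library
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
```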

SSL Options

Field Description
Enable SSL Enable/disable (one-way) SSL. If enabled, HVR authenticates the Hive server by validating the SSL certificate shared by the Hive server.
Two-way SSL Enable/disable two-way SSL. If enabled, HVR and the Hive server authenticate each other by validating each other's SSL certificate. This field is enabled only if Enable SSL is selected.
SSL Public Certificate The directory path where the .pem file containing the client's SSL public certificate is located. This field is enabled only if Two-way SSL is selected.
SSL Private Key The directory path where the .pem file containing the client's SSL private key is located. This field is enabled only if Two-way SSL is selected.
Client Private Key Password The password of the private key file that is specified in SSL Private Key. This field is enabled only if Two-way SSL is selected.


Permissions

To run capture or integration with an Amazon S3 location, it is recommended that the AWS user has the AmazonS3FullAccess permission policy. The AmazonS3ReadOnlyAccess policy is sufficient only for capture locations that have a LocationProperties /StateDirectory defined. The minimal permission set for integrate locations consists of: s3:GetBucketLocation, s3:ListBucket, s3:ListBucketMultipartUploads, s3:AbortMultipartUpload, s3:GetObject, s3:PutObject, and s3:DeleteObject.
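The minimal permission set for integrate locations could be expressed as an IAM policy along the following lines. This is a sketch, not an official HVR policy: the bucket name rs-bulk-load is taken from the example earlier in this section, and the statement IDs are illustrative. Note that the first three actions apply to the bucket itself, while the object actions apply to the objects inside it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HvrIntegrateBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::rs-bulk-load"
    },
    {
      "Sid": "HvrIntegrateObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::rs-bulk-load/*"
    }
  ]
}
```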

S3 Encryption

HVR supports client or server side encryption for uploading files into S3 locations. To enable client or server side encryption for S3, see action LocationProperties /S3Encryption.

AWS China

To enable HVR to interact with the AWS China cloud, define the environment variable HVR_AWS_CLOUD with value CHINA on the hub and the remote machine.
Note that S3 encryption with Key Management Service (KMS) is not supported in AWS China cloud.
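One way to define this variable is with an Environment action on the S3 location, following the same action-definition convention used elsewhere in this document (the group name S3 is illustrative and should match your channel's group for the S3 location):

```text
Group Table Action
S3    *     Environment /Name=HVR_AWS_CLOUD /Value=CHINA
```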


Hive External Table

HVR allows you to create Hive external tables on top of S3 files; these are used only during compare. The Hive ODBC connection can be enabled for S3 in the location creation screen by selecting the Hive External Tables field (see section Location Connection).

ODBC Connection

HVR uses an ODBC connection to the Amazon EMR cluster, for which it requires the ODBC driver for Hive (Amazon ODBC 1.1.1 or HortonWorks ODBC 2.1.2 and above) installed on the machine (or on another machine in the same network). On Linux, HVR additionally requires unixODBC 2.3.0 or later.

The Amazon and Hortonworks ODBC drivers are similar, and both are compatible with the Hive 2.x release. However, it is recommended to use the Amazon ODBC driver for Amazon Hive and the Hortonworks ODBC driver for Hortonworks Hive.

By default, HVR uses the Amazon ODBC driver for connecting to Hadoop. To use the Hortonworks ODBC driver, the following action definition is required:
For Linux,

Group Table Action
S3 * Environment /Name=HVR_ODBC_CONNECT_STRING_DRIVER /Value=Hortonworks Hive ODBC Driver 64-bit

For Windows,

Group Table Action
S3 * Environment /Name=HVR_ODBC_CONNECT_STRING_DRIVER /Value=Hortonworks Hive ODBC Driver

Amazon does not recommend changing the security policy of the EMR cluster. For this reason, an SSH tunnel must be created between the machine where the ODBC driver is installed and the EMR cluster. On Linux, Unix, and macOS you can create the tunnel with the following command:

ssh -i ~/mykeypair.pem -N -L 8157:ec2-###-##-##-###.compute-1.amazonaws.com:8088 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

Channel Configuration

For the file formats CSV, JSON, and Avro, the following action definitions are required to handle certain limitations of the Hive deserialization implementation during Bulk or Row-wise Compare:

  • For CSV:
    Group Table Action
    S3 * FileFormat /NullRepresentation=\\N
    S3 * TableProperties /CharacterMapping="\x00>\\0;\n>\\n;\r>\\r;">\""
    S3 * TableProperties /MapBinary=BASE64
  • For JSON:
    Group Table Action
    S3 * TableProperties /MapBinary=BASE64
    S3 * FileFormat /JsonMode=ROW_FRAGMENTS
  • For Avro:
    Group Table Action
    S3 * FileFormat /AvroVersion=v1_8

    v1_8 is the default value for FileFormat /AvroVersion, so it is not mandatory to define this action.