Quick Start for HVR into Hadoop HBase


This appendix shows how to set up an HVR channel (called hvr_demo01) to replicate from an Oracle database into Hadoop HBase using the Hadoop Distributed File System (HDFS) and the Hive AgentPlugin. In this example HVR replicates from a schema inside a single Oracle instance on the hub machine into HDFS. During refresh the AgentPlugin (re)creates a Hive external table (with suffix __b) that mirrors the structure of the source table, including any optional extra columns. From that Hive structure an HBase table is created. During integration, data is moved from the Hive burst table into the HBase table, after which the Hive burst table and the HDFS CSV file are deleted. The steps below start by creating new users and tables for HVR to replicate between Oracle and HBase; in a real-life situation the tables and data likely already exist. For integration into HBase, HVR has two out-of-the-box setups: one with /TimeKey, where every DML operation from the source is integrated as an insert into HBase, and one with /SoftDelete, where delete operations on the source are integrated as update operations in HBase.

Before following this quickstart, please make sure the requirements have been met; see Requirements for HDFS.

Create Test Schema and Tables

See Create Source Schema and Tables in Oracle.
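That page creates the source schema and tables used for replication. As a minimal sketch only (the user name, password and column names below are assumptions, and the quickstart's source schema contains additional tables), the source side might look like this, based on the dm01_product table referenced later in this guide:

$ sqlplus system/manager
SQL> create user hvr identified by hvr
 2  default tablespace users
 3  temporary tablespace temp
 4  quota unlimited on users;
SQL> grant create session, create table to hvr;

SQL> connect hvr/hvr
SQL> create table dm01_product
 2  (prod_id      number not null,
 3   prod_price   number(10,2),
 4   prod_descrip varchar2(100),
 5   constraint dm01_product_pk primary key (prod_id));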

Install HVR on the hub

First read section Introduction, which explains HVR's terminology and architecture; in particular it explains the importance of a hub database.

Then install the HVR software on the hub machine by following the installation steps in section Installing HVR on Windows or Installing HVR on Unix or Linux. If the hub machine is a Unix or Linux machine, the HVR GUI can either run on a Windows PC (connecting to the Unix hub machine) or run on the hub machine itself and display on an X server running on a PC.

This Quickstart assumes the Oracle Database on the hub server is also the source database. Most real-time integration scenarios use log-based capture. To enable log-based capture, configure the following:

  • The user name that HVR uses must be a member of Oracle's dba group. On Unix and Linux this can be done by adding the user name used by HVR to the line in /etc/group that begins with dba. On Windows, right–click My Computer and select Manage ▶ Local Users and Groups ▶ Groups ▶ ora_dba ▶ Add to Group ▶ Add.
  • The Oracle instance should have archiving enabled. Archiving can be enabled by running the following statement as sysdba against a mounted but unopened database: alter database archivelog. The current state of archiving can be checked with the query select log_mode from v$database.

The current archive destination can be checked with the query select destination, status from v$archive_dest. By default this returns USE_DB_RECOVERY_FILE_DEST, VALID, which points inside the flash recovery area. Alternatively, an archive destination can be defined with the statement alter system set log_archive_dest_1='location=/disk1/arc', followed by an instance restart.
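As a sketch, checking and enabling archiving might look as follows (the archive destination path is illustrative):

$ sqlplus / as sysdba
SQL> select log_mode from v$database;
SQL> select destination, status from v$archive_dest;
SQL> -- enabling archiving requires a mounted but unopened database
SQL> shutdown immediate
SQL> startup mount
SQL> alter database archivelog;
SQL> alter database open;
SQL> -- optionally define an explicit archive destination, then restart the instance
SQL> alter system set log_archive_dest_1='location=/disk1/arc' scope=spfile;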

Install HVR to connect to Hadoop

Install the HVR software on the Linux machine by following the installation steps in section Installing HVR on Unix or Linux. Additionally follow the steps in Requirements for HDFS.

Create the Hub Database

Create the hub database, in which the HVR GUI will store the channel definition. This is actually another user/schema in the Oracle instance.

$ sqlplus system/manager
SQL> create user hvrhub identified by hvr
 2  default tablespace users
 3  temporary tablespace temp
 4  quota unlimited on users;
 
SQL> grant create session to hvrhub;
SQL> grant create table to hvrhub;
SQL> grant create sequence to hvrhub;
SQL> grant create procedure to hvrhub;
SQL> grant create trigger to hvrhub;
SQL> grant create view to hvrhub;
SQL> grant execute any procedure to hvrhub;

Connect To Hub Database

Start the HVR GUI on a PC by clicking on the HVR GUI icon (this is created by the HVR Installer for Windows) or by running hvrgui on Linux.

First, Register the hub database: right–click on hub machines ▶ Register hub. Enter connection details.

SC-Hvr-RegisterHub Oracle remote 4343.png
In this example the hub is a machine called guam, where an INET daemon is listening on port 4343. See section Installing HVR on Unix or Linux for how to configure this.

For a new hub database a dialog will prompt Do you wish to create the catalogs?; answer Yes.

Create Oracle Location

Next create a location for the Oracle source database using right–click on Location Configuration ▶ New Location.

SC-Hvr-Location Oracle db1.png

In this example there is no need to check Connect to HVR on remote machine because testdb1 is on the same machine as the hub.
Ignore the Group Membership tab for now.

Create HDFS Location

Create a location for HDFS using right–click on Location Configuration ▶ New Location.

SC-Hvr-Location HDFS.png

In this example Connect to HVR on remote machine is checked, assuming the Linux environment is not the hub machine. If it is the hub machine, leave this option unchecked.

Ignore the Group Membership tab for now.

To connect to a Hadoop cluster with Kerberos authentication, see HDFS Authentication and Kerberos.

Create a Channel

The next step is to create a channel. For a relational database the channel represents a group of tables that is captured as a unit. Create a channel using right–click on Channel Definitions ▶ New Channel.

Choose any name you like.

Create Location Groups

The channel needs two location groups. Under the new channel: right–click on Location Groups ▶ New Group. Enter a group name (for instance OLTP).
SC-Hvr-LocationGroup demo01 OLTP hdfs.png

Add location db1 as a member of this group by checking the box for db1.
Then create a second location group, called DATALAKE, that has member hdfs.
The new channel also needs a list of tables to replicate. This can be done as follows: right–click on Tables ▶ Table Explore.

  • Choose the database location ▶ Connect.
  • In the Table Explore window, click on both tables and click Add.
  • In new dialog HVR Table Name click OK.
  • Close the Table Explore window.

Define Hive AgentPlugin environment parameters

  • Right–click on Group DATALAKE ▶ New Action ▶ Environment. Select name HVR_HIVE_PORT. Set the value to the HiveServer2 port (the default HiveServer2 port is 10000).
  • Right–click on Group DATALAKE ▶ New Action ▶ Environment. Select name HVR_HIVE_DATABASE. Set the value to the Hive database name (the default Hive database is default).
  • Right–click on Group DATALAKE ▶ New Action ▶ Environment. Select name HVR_HIVE_USER. Set the value to the Hive user name (the default Hive user is hdfs).

When using Kerberos security, one additional environment parameter needs to be added:

  • Right–click on Group DATALAKE ▶ New Action ▶ Environment. Select name HVR_HIVE_KRB_PRINCIPAL. Set the value to the Kerberos principal of the Hive server.
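Before defining the channel, these values can be verified by connecting to HiveServer2 with beeline. A sketch, assuming a HiveServer2 host named hiveserver1 and the defaults above; the Kerberos principal in the second command is illustrative:

$ beeline -u "jdbc:hive2://hiveserver1:10000/default" -n hdfs
$ beeline -u "jdbc:hive2://hiveserver1:10000/default;principal=hive/hiveserver1@EXAMPLE.COM"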

Below, actions for two different setups are defined. Choose either the setup for HBase with /TimeKey or the setup for HBase with /SoftDelete.

Define Actions for HBase tables with /Timekey

The new channel needs actions to define the replication.

  • Right–click on group OLTP ▶ New Action ▶ Capture.
  • Right–click on group DATALAKE ▶ New Action ▶ FileFormat. Select parameters /Csv, /QuoteCharacter=", /Compress=GZIP.
  • Right–click on group DATALAKE ▶ New Action ▶ Integrate. Check /Burst. Check /RenameExpression, and put in {hvr_tbl_name}__b/{hvr_tbl_name}.csv.gz.
  • Right–click on Group DATALAKE ▶ New Action ▶ ColumnProperties. Select /Name=hvr_op_val. Select /Extra. Select /IntegrateExpression={hvr_op}. Select /Datatype=int.
  • Right–click on Group DATALAKE ▶ New Action ▶ ColumnProperties. Select /Name=hvr_integ_key. Select /Extra. Select /IntegrateExpression={hvr_integ_key}. Select /Datatype=varchar. Select /Length=16. Select /Key. Select /TimeKey.
  • Right–click on Group DATALAKE ▶ New Action ▶ ColumnProperties. Select /Name=hvr_integ_tstamp. Select /Extra. Select /IntegrateExpression={hvr_integ_tstamp}. Select /Datatype=datetime.

For a detailed explanation, see ColumnProperties.

When /Compress=GZIP is used, make sure the file extension in the /RenameExpression ends with .gz; otherwise, data is not automatically decompressed when the HDFS files are read from Hive or HBase.

  • Right–click on group DATALAKE ▶ New Action ▶ AgentPlugin. Select /Command=hvrhiveagent.py. Select /UserArgument=-h -k hvr_keys -t hvr_integ_key -x "execution engine". Select /Context=!preserve_during_refr.
  • Right–click on group DATALAKE ▶ New Action ▶ AgentPlugin. Select /Command=hvrhiveagent.py. Select /UserArgument=-h -k hvr_keys -p -t hvr_integ_key -x "execution engine". Select /Context=preserve_during_refr.

/UserArgument=-h sets the AgentPlugin to HBase mode, meaning data is moved from the CSV files into HBase tables.
/UserArgument=-k hvr_keys identifies the key of the HBase table. hvr_keys is the key as specified in the channel definition for the table, plus any additional columns marked /Key in ColumnProperties; in the actions above this includes the extra column hvr_integ_key.
/UserArgument=-t hvr_integ_key specifies that the AgentPlugin runs in TimeKey mode and that the timekey column is hvr_integ_key.
/UserArgument=-p prevents the AgentPlugin from re-creating the Hive table structure during HVR Refresh.
/UserArgument=-x defines the Hive execution engine. When -x is not defined the default execution engine is mr (MapReduce). Other possible execution engines are Tez and Spark.
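To illustrate what the AgentPlugin builds, the generated objects are roughly equivalent to the HiveQL below. This is a sketch only: the actual DDL is generated by hvrhiveagent.py, and the column names, HDFS path and hbase.columns.mapping are assumptions based on the dm01_product example used in this guide.

-- Hive external burst table (suffix __b) over the compressed CSV files written by Integrate /Burst
CREATE EXTERNAL TABLE dm01_product__b (
  prod_id          INT,
  prod_price       DECIMAL(10,2),
  prod_descrip     STRING,
  hvr_op_val       INT,
  hvr_integ_key    STRING,
  hvr_integ_tstamp TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hvr/dm01_product__b';

-- Hive table backed by an HBase table; the row key is built from hvr_keys
CREATE TABLE dm01_product (
  hvr_rowkey       STRING,
  prod_price       DECIMAL(10,2),
  prod_descrip     STRING,
  hvr_op_val       INT,
  hvr_integ_tstamp TIMESTAMP)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:prod_price,cf:prod_descrip,cf:hvr_op_val,cf:hvr_integ_tstamp');

-- Each integrate cycle then amounts to an INSERT ... SELECT from the burst table into the
-- HBase-backed table, after which the burst table and the HDFS CSV file are removed.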

  • Right–click on Group DATALAKE ▶ New Action ▶ Scheduling. Check /IntegrateStartTimes and select from the calendar. For example, for a 10-minute integrate interval starting at the top of the hour, check the multiples of 10, resulting in /IntegrateStartTimes=0,10,20,30,40,50 * * * *.

Define Actions for HBase tables with /SoftDelete

The new channel needs actions to define the replication.

  • Right–click on group OLTP ▶ New Action ▶ Capture.
  • Right–click on group DATALAKE ▶ New Action ▶ FileFormat. Select parameters /Csv, /QuoteCharacter=", /Compress=GZIP.
  • Right–click on group DATALAKE ▶ New Action ▶ Integrate. Check /Burst. Check /RenameExpression, and put in {hvr_tbl_name}__b/{hvr_tbl_name}.csv.gz.
  • Right–click on Group DATALAKE ▶ New Action ▶ ColumnProperties. Select /Name=hvr_is_deleted. Select /Extra. Select /SoftDelete. Select /Datatype=int.


hvr_is_deleted: a record deleted on the source is integrated into HBase as an update that sets this column to 1.
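For example, after a delete on the source (delete from dm01_product where prod_id = 1;), a Hive query against the HBase table still returns that row with the soft-delete flag set. A sketch, assuming the dm01_product column names used earlier in this guide:

hive> select prod_id, prod_descrip, hvr_is_deleted from dm01_product where prod_id = 1;
-- the row is still present, with hvr_is_deleted = 1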


When /Compress=GZIP is used, make sure the file extension in the /RenameExpression ends with .gz; otherwise, data is not automatically decompressed when the HDFS files are read from Hive.

  • Right–click on group DATALAKE ▶ New Action ▶ AgentPlugin. Select /Command=hvrhiveagent.py. Select /UserArgument=-h -k hvr_keys -s hvr_is_deleted -x "execution engine". Select /Context=!preserve_during_refr.
  • Right–click on group DATALAKE ▶ New Action ▶ AgentPlugin. Select /Command=hvrhiveagent.py. Select /UserArgument=-h -k hvr_keys -p -s hvr_is_deleted -x "execution engine". Select /Context=preserve_during_refr.

/UserArgument=-h sets the AgentPlugin to HBase mode, meaning data is moved from the CSV files into HBase tables.
/UserArgument=-k hvr_keys identifies the key of the HBase table. hvr_keys is the key as specified in the channel definition for the table, plus any additional columns marked /Key in ColumnProperties.
/UserArgument=-s hvr_is_deleted specifies that the AgentPlugin runs in SoftDelete mode and that the softdelete column is hvr_is_deleted.
/UserArgument=-p prevents the AgentPlugin from re-creating the Hive table structure during HVR Refresh.
/UserArgument=-x defines the Hive execution engine. When -x is not defined the default execution engine is mr (MapReduce). Other possible execution engines are Tez and Spark.

  • Right–click on Group DATALAKE ▶ New Action ▶ Scheduling. Check /IntegrateStartTimes and select from the calendar. For example, for a 10-minute integrate interval starting at the top of the hour, check the multiples of 10, resulting in /IntegrateStartTimes=0,10,20,30,40,50 * * * *.


Overview of Channel for HBase Tables with TimeKey Column

Group Table Action
OLTP * Capture
DATALAKE * Environment /Name="HVR_HIVE_PORT" /Value=<HiveServer2 port>
DATALAKE * Environment /Name="HVR_HIVE_DATABASE" /Value=<Hive database>
DATALAKE * Environment /Name="HVR_HIVE_USER" /Value=<Hive user>
DATALAKE * Integrate /Burst /RenameExpression="{hvr_tbl_name}__b/{hvr_tbl_name}.csv.gz"
DATALAKE * FileFormat /Csv /QuoteCharacter=" /Compress=GZIP
DATALAKE * ColumnProperties /Name=hvr_op_val /Extra /IntegrateExpression={hvr_op} /Datatype=int
DATALAKE * ColumnProperties /Name=hvr_integ_key /Extra /IntegrateExpression="{hvr_integ_key}" /Datatype=varchar /Length=16 /Key /TimeKey
DATALAKE * ColumnProperties /Name=hvr_integ_tstamp /Extra /IntegrateExpression="{hvr_integ_tstamp}" /Datatype=datetime
DATALAKE * AgentPlugin /Command=hvrhiveagent.py /UserArgument="-h -k hvr_keys -t hvr_integ_key -x 'execution engine'" /Context=!preserve_during_refr
DATALAKE * AgentPlugin /Command=hvrhiveagent.py /UserArgument="-h -k hvr_keys -p -t hvr_integ_key -x 'execution engine'" /Context=preserve_during_refr
DATALAKE * Scheduling /IntegrateStartTimes="0,10,20,30,40,50 * * * *"

Overview of Channel for HBase Tables with SoftDelete Column

Group Table Action
OLTP * Capture
DATALAKE * Environment /Name="HVR_HIVE_PORT" /Value=<HiveServer2 port>
DATALAKE * Environment /Name="HVR_HIVE_DATABASE" /Value=<Hive database>
DATALAKE * Environment /Name="HVR_HIVE_USER" /Value=<Hive user>
DATALAKE * Integrate /Burst /RenameExpression="{hvr_tbl_name}__b/{hvr_tbl_name}.csv.gz"
DATALAKE * FileFormat /Csv /QuoteCharacter=" /Compress=GZIP
DATALAKE * ColumnProperties /Name=hvr_is_deleted /Extra /SoftDelete /Datatype=int
DATALAKE * AgentPlugin /Command=hvrhiveagent.py /UserArgument="-h -k hvr_keys -s hvr_is_deleted -x 'execution engine'" /Context=!preserve_during_refr
DATALAKE * AgentPlugin /Command=hvrhiveagent.py /UserArgument="-h -k hvr_keys -p -s hvr_is_deleted -x 'execution engine'" /Context=preserve_during_refr
DATALAKE * Scheduling /IntegrateStartTimes="0,10,20,30,40,50 * * * *"

Below is a screenshot of the HVR GUI showing an example channel with the actions defined to integrate into HBase tables using the TimeKey option.


SC-Hvr-Integrate to HBase timekey.PNG
Note that the Actions pane only displays actions related to the objects selected in the left–hand pane. So click on channel hvr_demo01 to see all the actions.

Enable Replication with HVR Initialize

Now that the channel definition is complete, create the runtime replication system.
Right–click on channel hvr_demo01 ▶ HVR Initialize. Choose Create or Replace Objects and click HVR Initialize.

SC-Hvr-InitializeFinished demo01.png

From the moment that HVR Initialize is done, all new transactions that start on the database testdb1 will be captured by HVR when its capture job looks inside the transaction logs.
HVR Initialize also creates two replication jobs, which can be seen under the Scheduler node in the GUI.

Start Scheduling of Capture Job

Start the Scheduler on the hub machine by clicking in the HVR GUI on the Scheduler node of the hub database.

SC-Hvr-Gui-Scheduler oracle.png

Next, instruct the HVR Scheduler to trigger the capture job to keep the capture job current with transaction log generation.

SC-Hvr-Gui demo01 Start cap HDFS.png

The capture job inside the Scheduler executes a script under $HVR_CONFIG/job/hvrhub/hvr_demo01 that has the same name as the job. So job hvr_demo01-cap-db1 detects changes on database testdb1 and stores these as transaction files on the hub machine.

Perform Initial Load

Perform the initial load from the OLTP database into HDFS using HVR Refresh. Right–click on the channel hvr_demo01 ▶ HVR Refresh.

SC-Hvr-Gui-Refresh HDFS.png

Make sure the option for Online Refresh is checked and select Skip Previous Capture and Integration. Optionally set Parallelism for Tables.
Run the Refresh.

SC-Hvr-Refresh to HBase.PNG

Start Scheduling of Integrate Job

Instruct the HVR Scheduler to Unsuspend the integrate job to push changes into HDFS according to the defined schedule.

SC-Hvr-Gui demo01 Unsuspend integ HDFS.png

The integrate job inside the Scheduler executes a script under $HVR_CONFIG/job/hvrhub/hvr_demo01 that has the same name as the job. So job hvr_demo01-integ-hdfs picks up transaction files on the hub on a defined schedule and creates files on HDFS containing these changes.
A scheduled job that is not running is in a PENDING state.

Test Replication

To test replication, make a change in testdb1:

testdb1/hvr
SQL> insert into dm01_product values (1, 19.99, 'DVD');
SQL> commit;

Next, instruct the HVR Scheduler to Trigger the integrate job rather than wait for the next scheduled run.

In the HVR log file you can see the output of the jobs by clicking on View Log. This log file can be found in $HVR_CONFIG/log/hubdb/hvr.out.
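The same output can be followed from a shell on the hub machine; a sketch, assuming the hub database directory is named hvrhub as in this guide:

$ tail -f $HVR_CONFIG/log/hvrhub/hvr.out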

SC-Hvr-Gui demo01 OLTP HDFS viewlog.png

The job output looks like this:

hvr_demo01-cap-db1: Scanned 1 transaction containing 1 row (1 ins) for 1 table.
hvr_demo01-cap-db1: Routed 215 bytes (compression=40.6%) from 'db1' into 2 locations.
hvr_demo01-cap-db1: Capture cycle 3.
hvr_demo01-integ-hdfs: Integrate cycle 1 for 1 transaction file (215 bytes).
hvr_demo01-integ-hdfs: Moved 1 file to 'hdfs://cloudera@192.168.127.128/user/hvr'.
hvr_demo01-integ-hdfs: Processed 1 expired 'trigger' control file.
hvr_demo01-integ-hdfs: Finished. (elapsed=4.04s)

This indicates that the jobs replicated the original change to HDFS. A query on HDFS confirms this:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hvr
Found 2 items
drwxr-xr-x   - cloudera supergroup          0 2015-02-06 10:38 /user/hvr/_hvr_state
-rw-r--r--   3 cloudera supergroup         12 2015-02-06 10:38 /user/hvr/dm01_product_20150206183858632.csv
[cloudera@quickstart ~]$ hadoop fs -cat /user/hvr/dm01_product_20150206183858632.csv

1,19.99,DVD
[cloudera@quickstart ~]$

Show content of HBase table using Hive

SC-HIVE select hbase records.PNG
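The screenshot shows a Hive query against the HBase-backed table. A minimal sketch run from beeline or the Hive shell, assuming the dm01_product table from the TimeKey setup:

hive> select * from dm01_product;
-- expected to include the row inserted in Test Replication (1, 19.99, DVD)
-- plus the extra HVR columns (hvr_op_val, hvr_integ_key, hvr_integ_tstamp)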

Select records of HBase table using HBase shell

SC-HBASE scan records.PNG
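The same rows can be listed directly from the HBase shell; a sketch, assuming the HBase table is named dm01_product:

$ hbase shell
hbase> scan 'dm01_product'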

Select specific record in HBase table using HBase shell

SC-HBASE get record.PNG
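A single row can be fetched by its row key; a sketch where the row key value is illustrative (the actual key is built from hvr_keys):

hbase> get 'dm01_product', '<row key>'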