Quick Start for HVR into HDFS


This appendix shows how to set up an HVR channel (called hvr_demo01) to replicate from an Oracle Database into the Hadoop Distributed File System (HDFS). In this example HVR replicates from a schema inside a single Oracle instance on the hub machine into HDFS. The steps below start by creating new users and tables for HVR to replicate between; in a real-life situation the tables and data likely already exist.

Create Test Schema and Tables

  1. Follow the steps in Create Source Schema and Tables in Oracle to create the source schema and tables used in this Quickstart; a hypothetical sketch of such a schema follows.
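
For illustration only, a minimal sketch of what the source schema could look like, assuming a source schema named hvr (consistent with the testdb1/hvr connection used in the replication test at the end of this guide). The table name dm01_product appears later in this guide, but the column definitions shown here are assumptions; follow the referenced Quickstart for the real definitions.

$ sqlplus system/manager
SQL> create user hvr identified by hvr
 2  default tablespace users
 3  temporary tablespace temp
 4  quota unlimited on users;
SQL> grant create session, create table to hvr;
SQL> connect hvr/hvr
SQL> create table dm01_product
 2  (prod_id      number not null,
 3   prod_price   number(10,2),
 4   prod_descrip varchar2(100),
 5   constraint dm01_product_pk primary key (prod_id));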

Install HVR on the hub

First read the Introduction section, which explains HVR's terminology and architecture, and in particular the importance of the hub database.

Then install the HVR software on the hub machine by following the installation steps in section Installing HVR on Windows or Installing HVR on Unix or Linux. If the hub machine is a Unix machine, HVR can also be installed on a Windows PC so that the HVR GUI runs on the PC and connects to the Unix hub machine; alternatively, the HVR GUI can run on the Unix hub machine and display on an X server running on a PC.

This Quickstart assumes the Oracle Database on the hub server is also the source database. Most real-time integration scenarios use log-based capture. To enable log-based capture, configure the following:

  • The user name that HVR uses must be in Oracle's dba group. On Unix and Linux this can be done by adding the user name used by HVR to the line in /etc/group that begins with dba. On Windows right–click My Computer and select Manage ▶ Local Users and Groups ▶ Groups ▶ ora_dba ▶ Add to Group ▶ Add.
  • The Oracle instance should have archiving enabled. Archiving can be enabled by running the following statement as sysdba against a mounted but unopened database: alter database archivelog. The current state of archiving can be checked with the query select log_mode from v$database.

The current archive destination can be checked with the query select destination, status from v$archive_dest. By default this returns USE_DB_RECOVERY_FILE_DEST, VALID, which is inside the fast (flash) recovery area. Alternatively, an archive destination can be defined with the statement alter system set log_archive_dest_1='location=/disk1/arc' and then restarting the instance. An example session combining these steps is sketched below.
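
A sketch of how these checks and changes might look in practice; the OS user name hvruser is a placeholder for the account that runs HVR, and the archive directory is only an example:

$ # Linux: add the HVR OS user to the dba group
$ sudo usermod -a -G dba hvruser

$ sqlplus / as sysdba
SQL> select log_mode from v$database;
SQL> -- if this returns NOARCHIVELOG, enable archiving against a mounted but unopened database
SQL> shutdown immediate
SQL> startup mount
SQL> alter database archivelog;
SQL> alter database open;
SQL> select destination, status from v$archive_dest;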

Install HVR to connect to Hadoop

If you have not already done so, read the Introduction section for HVR's terminology and architecture, in particular the role of the hub database.

To connect to Hadoop as a target, HVR must run in a Linux environment. This can be the HVR hub itself running on Linux, or HVR running as an agent on a Linux server (optionally the Hadoop NameNode) that connects into HDFS. The HDFS interface requires a Java 7 Runtime Environment and the Hadoop connectivity libraries.

Install the HVR software on the Linux machine by following the installation steps in section Installing HVR on Unix or Linux. Follow the steps in Requirements for HDFS for the Java Runtime environment and the Hadoop connectivity libraries.
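
A quick way to sanity-check the Linux environment before installing is sketched below; the JAVA_HOME and HADOOP_HOME paths are assumptions and depend on your distribution:

$ java -version                          # expect a 1.7.x (Java 7) runtime
$ export JAVA_HOME=/usr/java/jdk1.7.0    # path is an assumption
$ export HADOOP_HOME=/usr/lib/hadoop     # path is an assumption
$ hadoop version                         # confirms the Hadoop client is installed
$ hadoop classpath                       # lists the connectivity jars that will be used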

Create the Hub Database

Create the hub database, in which the HVR GUI will store the channel definition. This is actually another user/schema in the Oracle instance.

$ sqlplus system/manager
SQL> create user hvrhub identified by hvr
 2  default tablespace users
 3  temporary tablespace temp
 4  quota unlimited on users;
 
SQL> grant create session to hvrhub;
SQL> grant create table to hvrhub;
SQL> grant create sequence to hvrhub;
SQL> grant create procedure to hvrhub;
SQL> grant create trigger to hvrhub;
SQL> grant create view to hvrhub;
SQL> grant execute any procedure to hvrhub;
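
Optionally, a quick sanity check (not an HVR requirement) that the new hub account can log in and has the granted privileges:

$ sqlplus hvrhub/hvr
SQL> select * from session_privs order by privilege;
SQL> exit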

Connect To Hub Database

Start the HVR GUI on a PC by clicking on the HVR GUI icon (this is created by the HVR Installer for Windows) or by running hvrgui on Linux.

First, Register the hub database: right–click on hub machines ▶ Register hub. Enter connection details.

SC-Hvr-RegisterHub Oracle remote 4343.png
In this example the hub is a machine called guam, where an INET daemon is listening on port 4343. See section Installing HVR on Unix or Linux for how to configure this.

For a new hub database a dialog will prompt Do you wish to create the catalogs?; answer Yes.

Create Oracle Location

Next create a location for the Oracle source database using right–click on Location Configuration ▶ New Location.

SC-Hvr-Location Oracle db1.png

In this example there is no need to check Connect to HVR on remote machine because testdb1 is on the same machine as the hub.
Ignore the Group Membership tab for now.

Create HDFS Location

Create a location for HDFS using right–click on Location Configuration ▶ New Location.

SC-Hvr-Location HDFS.png

In this example Connect to HVR on remote machine is checked, assuming this Linux environment is not the hub machine. If it is the hub machine, leave it unchecked.

Ignore the Group Membership tab for now.

To connect to a Hadoop cluster that uses Kerberos authentication, see HDFS Authentication and Kerberos.
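
Before creating the location it can be helpful to confirm that HDFS is reachable from the Linux machine running HVR. A small sketch, assuming the same cluster address and target directory that appear in the example output at the end of this guide; the Kerberos principal is a placeholder and is only needed on a kerberized cluster:

$ # on a kerberized cluster, obtain a ticket first
$ kinit hvruser@EXAMPLE.COM
$ # check that the target directory is reachable
$ hadoop fs -ls hdfs://192.168.127.128/user/hvr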

Create a Channel

The next step is to create a channel. For a relational database the channel represents a group of tables that is captured as a unit. Create a channel using right–click on Channel Definitions ▶ New Channel.

Choose any name you like.

Create Location Groups

The channel needs two location groups. Under the new channel: right–click on Location Groups ▶ New Group. Enter a group name (for instance OLTP).
SC-Hvr-LocationGroup demo01 OLTP hdfs.png

Add location db1 as a member of this group by checking the box for db1.
Then create a second location group, called DATALAKE that has member hdfs.
The new channel also needs a list of tables to replicate. This can be done as follows: right–click on Tables ▶ Table Explore.

  • Choose the database location ▶ Connect.
  • In the Table Explore window, click on both tables and click Add.
  • In the new dialog HVR Table Name, click OK.
  • Close the Table Explore window.

Define Actions

The new channel needs actions to define the replication.

  • Right–click on group OLTP ▶ New Action ▶ Capture.
  • Right–click on Group DATALAKE ▶ New Action ▶ FileFormat. Select parameter /Csv, and optionally specify arguments to change the format of the output file.
  • Right–click on Group DATALAKE ▶ New Action ▶ Integrate. Check /RenameExpression and set it to {hvr_tbl_name}_{hvr_integ_tstamp}.csv to create new files for every integrate cycle (this matches the file names shown in the replication test at the end of this guide). Also add action ColumnProperties /Name=hvr_op_val /Extra /IntegrateExpression={hvr_op} /TimeKey to include the operation type per record (0 for delete, 1 for insert, 2 for update). For a very busy system check the option /Burst to combine multiple transactions into one (larger) file per table.
  • Right–click on Group DATALAKE ▶ New Action ▶ Scheduling. Check /IntegrateStartTimes and select times from the calendar. For example, for a 10-minute integrate interval starting at the top of the hour, check the multiples of 10, resulting in /IntegrateStartTimes=0,10,20,30,40,50 * * * * (this crontab-style format is illustrated after this list).
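
The start-time value uses a crontab-style format of five fields. A brief illustration of how the 10-minute schedule above reads:

  0,10,20,30,40,50 * * * *
  ^minute          ^ ^ ^ ^
                   | | | day of week (any)
                   | | month (any)
                   | day of month (any)
                   hour (any)

That is: run at minutes 0, 10, 20, 30, 40 and 50 of every hour, every day.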

SC-Hvr-Gui-Channel demo01 HDFS.png
Note that the Actions pane only displays actions related to the objects selected in the left–hand pane, so click on channel hvr_demo01 to see all of the actions defined above.

Enable Replication with HVR Initialize

Now that the channel definition is complete, create the runtime replication system.
Right–click on channel hvr_demo01 ▶ HVR Initialize. Choose Create or Replace Objects and click HVR Initialize.

SC-Hvr-InitializeFinished demo01.png

From the moment HVR Initialize completes, all new transactions that start on database testdb1 will be captured by HVR when its capture job reads the transaction logs.
HVR Initialize also creates two replication jobs, which can be seen under the Scheduler node in the GUI.

Start Scheduling of Capture Job

Start the Scheduler on the hub machine by clicking on the Scheduler node of the hub database in the HVR GUI.

SC-Hvr-Gui-Scheduler oracle.png

Next, instruct the HVR Scheduler to trigger the capture job so that capture keeps up with transaction log generation.

SC-Hvr-Gui demo01 Start cap HDFS.png

The capture job inside the Scheduler executes a script under $HVR_CONFIG/job/hvrhub/hvr_demo01 that has the same name as the job. So job hvr_demo01-cap-db1 detects changes on database testdb1 and stores these as transaction files on the hub machine.
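
For example, the generated job scripts can be inspected directly on the hub (a simple sketch; HVR_CONFIG must point at the hub's configuration directory):

$ # each script under the channel's job directory has the same name as its job
$ ls $HVR_CONFIG/job/hvrhub/hvr_demo01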

Perform Initial Load

Perform the initial load from the OLTP database into HDFS using HVR Refresh. Right–click on the channel hvr_demo01 ▶ HVR Refresh.

SC-Hvr-Gui-Refresh HDFS.png

Make sure the option for Online Refresh is checked and select Skip Previous Capture and Integration. Optionally set Parallelism for Tables.
Run the Refresh.

Start Scheduling of Integrate Job

Instruct the HVR Scheduler to Unsuspend the integrate job to push changes into HDFS according to the defined schedule.

SC-Hvr-Gui demo01 Unsuspend integ HDFS.png

The integrate job inside the Scheduler executes a script under $HVR_CONFIG/job/hvrhub/hvr_demo01 that has the same name as the job. So job hvr_demo01-integ-hdfs picks up transaction files on the hub on a defined schedule and creates files on HDFS containing these changes.
A scheduled job that is not running is in a PENDING state.

Test Replication

To test replication, make a change in testdb1:

Connect to the source schema (hvr) in testdb1, for example with SQL*Plus, and run:
SQL> insert into dm01_product values (1, 19.99, 'DVD');
SQL> commit;

Next, instruct the HVR Scheduler to Trigger the integrate job rather than wait for the next scheduled run.

In the HVR log file you can see the output of the jobs by clicking on View Log. This log file can be found in $HVR_CONFIG/log/hubdb/hvr.out, where hubdb is the name of the hub database (hvrhub in this example).
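
The same log can also be followed from the command line on the hub. A small sketch, assuming the hub database name hvrhub used throughout this guide:

$ tail -f $HVR_CONFIG/log/hvrhub/hvr.out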

SC-Hvr-Gui demo01 OLTP HDFS viewlog.png

The job output looks like this:

hvr_demo01-cap-db1: Scanned 1 transaction containing 1 row (1 ins) for 1 table.
hvr_demo01-cap-db1: Routed 215 bytes (compression=40.6%) from 'db1' into 2 locations.
hvr_demo01-cap-db1: Capture cycle 3.
hvr_demo01-integ-hdfs: Integrate cycle 1 for 1 transaction file (215 bytes).
hvr_demo01-integ-hdfs: Moved 1 file to 'hdfs://cloudera@192.168.127.128/user/hvr'.
hvr_demo01-integ-hdfs: Processed 1 expired 'trigger' control file.
hvr_demo01-integ-hdfs: Finished. (elapsed=4.04s)

This indicates that the jobs replicated the original change to HDFS. A query on HDFS confirms this:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hvr
Found 2 items
drwxr-xr-x   - cloudera supergroup          0 2015-02-06 10:38 /user/hvr/_hvr_state
-rw-r--r--   3 cloudera supergroup         12 2015-02-06 10:38 /user/hvr/dm01_product_20150206183858632.csv
[cloudera@quickstart ~]$ hadoop fs -cat /user/hvr/dm01_product_20150206183858632.csv

1,19.99,DVD
[cloudera@quickstart ~]$