- March 15, 2016 at 2:47 am #8888
Background (for all DBMS)
HVR typically ‘streams’ data into its target databases; the data goes straight over the network pipe into the DBMS’s bulk loading interface (e.g. Oracle direct-path-load) without touching disk.
We dislike putting data into a temporary file before loading data (this is called ‘staging”), but this is necessary for some DBMS’s that lack a high speed streaming interface.
HVR bulk refresh streams data straight over the network into the target table via its bulk-load interface. If bulk-load streaming is not available then a staging file is used.
HVR row-wise compare and refresh streams data straight over the network into row-wise ‘diff’ logic (no table bulk-load). The fact that HVR does this diff without any staging is quite elegant.
HVR integrate with /Burst streams data straight over the network into the ‘burst’ table via bulk-load interface. If bulk-load streaming is not available then a staging file is used. Afterwards burst uses a ‘set-wise’ insert, update or delete to apply these changes to the target table.
HVR integrate without /Burst (we call this ‘trickle’ internally) streams data straight over the network into HVR’s row-wise ‘integrate’ logic (no table bulk-load)
ODBC, LIBPQ and Bulk-load into PostgreSQL
In the past HVR used ODBC to talk to PostresSQL, but in 4.7.3 we switched to using LIBPQ interface.
One reason that we switched to LIBPQ is because we wanted to use (abuse!) PostreSQL’s “copy (c1, c2, c3) from stdin” statement.
This allows HVR to stream data into PostgreSQL at maximum speed, without having to write data to a temporary file (staging ).
Another reason to switch from ODBC to LIBPQ is that we are building log-based capture from PostgreSQL and that requires LIBPQ too.
Work on building log-based capture from PostgreSQL is proceeding now.
Bulk-load into Greenplum
HVR uses ODBC to talk to Greenplum.
Greenplum does not have a suitable (e.g. sufficiently fast) ‘streaming’ bulk-load interface, so does HVR needs to do staging (load from a temp file) instead.
In the past (early HVR 4.7.1 versions) HVR did this staging-bulk-load by writing into a file and then calling ‘copy (c1, c2) from ‘, but this had disadvantageous;
(a) HVR had to be installed on the Greenplum head node machine.
(b) HVR needed special privileges for this ‘copy’ interface (a stored procedure could be used to avoid this privilege).
(c) It was slower.
So HVR has now dropped this interface.
HVR for Greenplum now uses the ‘gpfdist’ interface.
This requires action LocationProperties with parameters /StagingDirectoryHvr and /StagingDirectoryDb (see http://www.hvr-software.com/wiki/locationproperties).
HVR does bulk-load by writing the data into a temporary file in a staging directory on the machine where the HVR process which is connected to GPDB is running. This directory does not have to be on the GPDB machine; typically it’s not.
This temp-file is written in ‘TEXT’ format and is compressed.
Then HVR tells GPDB to bulk-load the data in this file by doing SQL ‘insert into as select * from ‘. This requires that a special Greenplum ‘external table’ exists for each target table which HVR loads data into. HVR will create these when it needs them and it should reuse them afterwards. These external tables are named ‘__x’ or ‘__bx’. They don’t really contain data; instead they are created with a URL which tells them to load data from a URL which points to the staging directory which HVR just filled with new data. The URL used (supplied with /StagingDirectoryDb) should start with ‘gpfdist://’. This requires that the user is running a Greenplum ‘gpfdist’ daemon on the machine that HVR is using; the GPDB nodes then connect to this gpfdist daemon and use it as a file-server to pull the compressed data into the target table.
Bulk-load into Redshift
Redshift is a postgres-based distributed DBMS that is hosted in Amazon’s cloud.
HVR uses ODBC to talk to Redshift.
Redshift does not have a suitable (e.g. sufficiently fast) ‘streaming’ bulk-load interface, so does HVR need to do staging (loading from temp files).
For Redshift HVR these staging files are located in an ‘S3’ directory (S3 is Amazon’s big-data file system).
This requires action LocationProperties with parameters /StagingDirectoryHvr, /StagingDirectoryDb and /StagingDirectoryCredentials (see http://www.hvr-software.com/wiki/locationproperties).
HVR does bulk-load by first writing the data into temporary S3 staging files.
This S3 temp-file is written in a form of CSV format (currently it is not compressed).
Then HVR uses Redshift SQL ‘copy from s3://’. The Redshift data nodes contact S3 to load the data from those temp files.
No ‘external tables’ are necessary.
We recommend that the HVR process which is talking to Redshift and S3 is running inside the Amazon AWS, so that the network performance is optimal.
You must be logged in to reply to this topic.