HVR 5.2: Data Lake

HVR 5.2 provides even faster, more secure, and accurate data movement in a modern environment. Latest enhancements include features that help customers optimize their adoption of data lake technologies. Read on to learn more…

Defining Data Lakes

A few months ago in a joint webinar titled “Five Key Questions About Data Lakes” with TDWI’s Senior Research Director for Data Management, Philip Russom, we defined the concept of the data lake and busted a handful of myths around the use of data lakes. (view the recording of the webinar)

To summarize, data lakes:

  • Organize large, diverse sets of data in their raw, detailed state.
  • Enable access to data with minimal latency.
  • Can be implemented using a variety of technologies or even a combination of data management platforms. According to TDWI’s research, 53% of data lakes are implemented on Hadoop.

Common Challenges with Data Lakes

In the last 8-12 months, we have seen increased interest in continuous integration into data lakes implemented on file systems like Hadoop and S3. While early data lakes may have been centered around sensor-generated data, logs, or social media, we now see growing interest in building solutions on top of data lakes that also use data from traditional relational database applications like ERP systems.

Including data from traditional database applications in a data lake on a file system introduces a number of challenges, including these four:

  1. Dealing with not only inserts but also updates and deletes
  2. Loss of metadata like column definitions and data type information
  3. Different transactional behavior, which makes data integrity harder to manage
  4. Data security and access rules
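To make the first challenge concrete, here is a minimal sketch of what a consumer has to do when a file system only ever receives appended change records: fold the ordered changes into the latest row per key. The record layout (operation, key, changed columns) and the function name are assumptions made for illustration, not HVR's file format.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change record: (operation, primary_key, column_values)
Change = Tuple[str, int, dict]


def apply_changes(changes: Iterable[Change]) -> Dict[int, dict]:
    """Fold an ordered change log into the latest row per primary key."""
    state: Dict[int, dict] = {}
    for op, key, row in changes:
        if op == "insert":
            state[key] = dict(row)
        elif op == "update":
            # Merge only the columns carried by the update record.
            state.setdefault(key, {}).update(row)
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


changes = [
    ("insert", 1, {"name": "Ann", "city": "Oslo"}),
    ("insert", 2, {"name": "Bob", "city": "Rome"}),
    ("update", 1, {"city": "Bergen"}),
    ("delete", 2, {}),
]
current = apply_changes(changes)
print(current)  # {1: {'name': 'Ann', 'city': 'Bergen'}}
```

The result depends on replaying the changes in their original order, which is why transactional ordering matters once updates and deletes are involved.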

HVR 5.2: Simplify Your Data Lake Deployment 

HVR 5.2 introduces a number of exciting new features to overcome these challenges and simplify successful data lake deployments. The new features are: Native Hive Support, S3 Optimizations, Support for Amazon Key Management Service, Metadata Manifest, and Big Data Compare.

Native Hive Support

Hive external tables can now be created directly on top of data integrated into HDFS or S3. This results not only in table definitions with correct data types, but also table changes on the source can automatically be propagated into Hive. Hive external tables facilitate data exploration by data scientists who can directly use off-the-shelf Business Intelligence tools to query data in the data lake.
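As an illustration of what an external table definition on top of lake files looks like, the sketch below builds a Hive CREATE EXTERNAL TABLE statement from a list of column definitions. The helper name, the comma-delimited text format, and the bucket path are assumptions chosen for the example; HVR generates its own definitions, and this is not its output.

```python
from typing import List, Tuple


def hive_external_table_ddl(
    table: str, columns: List[Tuple[str, str]], location: str
) -> str:
    """Build Hive DDL for an external table over delimited text files."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


ddl = hive_external_table_ddl(
    "orders",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)"), ("created", "TIMESTAMP")],
    "s3a://example-bucket/lake/orders/",  # hypothetical location
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the definition and leaves the underlying files in place.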

S3 Optimizations

Integrating changes into the target data store is arguably fast enough when output can be written faster than the source system can generate the changes. However, in our ongoing commitment to high volumes and scale, we optimized the writing of files, which led to up to 5x faster writes into S3.

Amazon Key Management Service Support

Integration with Amazon’s Key Management Service (KMS) to facilitate client-side encryption is now available out of the box. HVR’s communication protocol already supported SSL encryption, but data could still be exposed when writing into a remote file system like S3. Client-side encryption eliminates this exposure by encrypting the data before it is put on the wire.

Metadata Manifest

To address the challenge of managing data consistency and integrity on a file system, HVR now enables data publication through so-called manifest files. Each manifest is created at a transactionally consistent point and contains metadata about the incremental data set that is presented to the consumer.
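The general idea behind manifest-based publication can be sketched as follows: data files are written first, a manifest naming them is published last, and consumers read only files listed in a manifest. The JSON layout, file names, and functions here are hypothetical and do not represent HVR's manifest format. The atomic rename shown works on a local file system; object stores such as S3 behave differently.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(lake_dir: Path, batch_id: str, files: dict) -> Path:
    """Write the data files, then publish a manifest that lists them."""
    names = []
    for name, content in files.items():
        (lake_dir / name).write_text(content)
        names.append(name)
    manifest = {"batch": batch_id, "files": sorted(names)}
    tmp = lake_dir / f".{batch_id}.manifest.tmp"
    tmp.write_text(json.dumps(manifest))
    final = lake_dir / f"{batch_id}.manifest.json"
    os.replace(tmp, final)  # the manifest appears in a single step
    return final


def visible_files(lake_dir: Path) -> list:
    """Return only the data files named in a published manifest."""
    seen = []
    for manifest_path in sorted(lake_dir.glob("*.manifest.json")):
        seen.extend(json.loads(manifest_path.read_text())["files"])
    return seen


with tempfile.TemporaryDirectory() as d:
    lake = Path(d)
    publish(lake, "batch-0001", {"orders-0001.csv": "1,10.00\n"})
    # A file that is still being written and has no manifest yet.
    (lake / "orders-0002.csv").write_text("2,")
    result = visible_files(lake)

print(result)  # ['orders-0001.csv']
```

The half-written file is ignored because no manifest refers to it, so the consumer never sees a partial batch.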

Big Data Compare

It may take time for tech-savvy users, who understand the difference between solid transactional databases and file systems, to start relying on the data lake. Big Data Compare, the ability to compare an audit trail of changes with the current state of a transactional database, helps build trust in the data lake. Big Data Compare is also available for data on HDFS and S3 through Hive.
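A comparison of this kind can be pictured with a small sketch. It assumes the audit trail has already been reduced to one latest row per primary key, and it compares that against the source rows using per-row digests. The function names and the digest scheme are assumptions made for illustration, not a description of how HVR performs the compare.

```python
import hashlib
from typing import Dict


def row_digest(row: dict) -> str:
    """Hash a row in a column-order-independent way."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare(source: Dict[int, dict], lake: Dict[int, dict]) -> dict:
    """Report keys that are missing, extra, or different in the lake."""
    src = {k: row_digest(v) for k, v in source.items()}
    tgt = {k: row_digest(v) for k, v in lake.items()}
    return {
        "missing_in_lake": sorted(src.keys() - tgt.keys()),
        "extra_in_lake": sorted(tgt.keys() - src.keys()),
        "different": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


source_rows = {1: {"name": "Ann", "city": "Bergen"}, 3: {"name": "Cy", "city": "Kyiv"}}
lake_rows = {1: {"name": "Ann", "city": "Oslo"}, 2: {"name": "Bob", "city": "Rome"}}
report = compare(source_rows, lake_rows)
print(report)
# {'missing_in_lake': [3], 'extra_in_lake': [2], 'different': [1]}
```

An empty report for all three lists indicates that the lake and the source agree for the compared keys.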

 

In the coming months, we will be writing more blog posts about individual capabilities in the HVR 5.2 release. However, if you’d like to learn more sooner, contact us and we’ll be more than happy to talk to you about the details, and what else is in the new release.

About Mark

Mark Van de Wiel is the CTO for HVR. He has a strong background in data replication as well as real-time Business Intelligence and analytics.

© 2017 HVR Software
