MPP Databases: Even Data Distribution
Massively Parallel Processing (MPP) databases such as Teradata, Greenplum or Actian Matrix (formerly ParAccel) achieve high performance through parallel execution across multiple servers. For this approach to work well, data must be distributed as evenly as possible across the servers, and the workload must actually run against all of them in parallel.
This sounds a little vague, so let's start from an example of bad distribution…
Imagine a table ORDERS distributed by ORDER_DATE (with no time component), and assume there are many orders per day. Because of this choice, all orders for a particular date are stored in the same partition, i.e. on the same node. If users commonly query orders for a particular date, then any such query hits one and only one partition (and hence only one server). Adding servers in that case won't increase performance; it only adds complexity, and better performance can only come from a faster server.
A better distribution key in this example might be ORDER_ID, assuming this is a numeric column that simply increments for every new order and acts as the primary key. In that case orders are spread evenly across all partitions (typically through a hash function), and a query retrieving all orders for a day (with many orders per day), or all orders for a customer (assuming most customers have multiple orders), reads rows from all partitions, so the system can scale by adding servers.
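To make this concrete, here is a minimal Python sketch of hash distribution. The node count, hash function and table shape are illustrative assumptions, not how any particular MPP database computes its distribution; the point is only to show the skew behavior of the two candidate keys:

```python
import hashlib

# Hypothetical 4-node cluster; a row's node = hash(distribution key) mod node count
NODES = 4

def node_for(key) -> int:
    # Stable hash of the key, standing in for the database's hash function
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NODES

# 10,000 orders spread over 10 days: many orders per day, unique ORDER_ID
orders = [{"order_id": i, "order_date": f"2024-01-{i % 10 + 1:02d}"}
          for i in range(10_000)]

def rows_per_node(key_name):
    counts = [0] * NODES
    for row in orders:
        counts[node_for(row[key_name])] += 1
    return counts

# ORDER_DATE has only 10 distinct values, so at most 10 hash targets, unevenly loaded
print("by ORDER_DATE:", rows_per_node("order_date"))
# ORDER_ID is unique per row, so the load is close to 2,500 rows per node
print("by ORDER_ID:  ", rows_per_node("order_id"))

# And under date distribution, a query for one date touches exactly one node
jan1_nodes = {node_for(r["order_date"]) for r in orders
              if r["order_date"] == "2024-01-01"}
print("nodes hit by a single-date query:", len(jan1_nodes))  # 1
```

Under the ORDER_ID key every node participates in every large scan; under the ORDER_DATE key a single-date query lands on one node and the others sit idle.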
Equally if not more important than even (random) distribution of an individual table's data across all nodes in a cluster is co-partitioning large tables that are joined frequently. If tables are not co-partitioned, data has to be redistributed across nodes before the join, an extra step that, for large tables, may flood the network between the servers (the interconnect). In the example above, if a LINEITEM table is a child of ORDERS with one or more LINEITEM rows per order, then LINEITEM should be distributed by ORDER_ID as well. A join between ORDERS and LINEITEM can then be performed locally by every parallel process, because matching order and line-item rows are stored in the same partition.
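The co-location argument can be sketched the same way: if both tables are hashed on ORDER_ID, every node can join its own slices without touching the interconnect. Again a toy model with made-up node count and data:

```python
import hashlib

NODES = 4

def node_for(key) -> int:
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NODES

# Both tables distributed on ORDER_ID, so matching rows land on the same node
orders = [{"order_id": i, "customer": f"c{i % 50}"} for i in range(1000)]
lineitems = [{"order_id": i // 3, "item": f"sku{i}"} for i in range(3000)]  # ~3 per order

order_slices, li_slices = {}, {}
for o in orders:
    order_slices.setdefault(node_for(o["order_id"]), []).append(o)
for li in lineitems:
    li_slices.setdefault(node_for(li["order_id"]), []).append(li)

# Each node joins only its local slices -- no redistribution step needed
joined = 0
for n in range(NODES):
    local_ids = {o["order_id"] for o in order_slices.get(n, [])}
    joined += sum(1 for li in li_slices.get(n, []) if li["order_id"] in local_ids)

print(joined)  # 3000: every line item found its parent order on its own node
```

Had LINEITEM been distributed on a different column, the same join would first have to ship rows between nodes so that matching keys meet, which is exactly the interconnect traffic co-partitioning avoids.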
You may wonder: what does this have to do with real-time data integration?
A few weeks ago I wrote a blog post about the use of the Burst option in the context of columnar databases (especially columnar databases with no indexes). Like columnar databases, most MPP databases are not very efficient at processing single-row transactions. Rather than applying every single row change as an individual DML operation, with all the overhead that entails (for example a distributed commit across the cluster), you get far greater throughput by bulk loading changes into a staging table and then applying them to the target tables with batch SQL statements, in near real-time. This happens to be exactly what HVR can do behind the scenes using the Burst option.
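A rough sketch of the stage-then-batch idea: collect a burst of captured changes, reduce them to the net effect per row, and apply that in one set-based pass instead of one round-trip per change. This is only an illustration of the pattern with an invented change-log format, not HVR's actual implementation:

```python
# Target "table" keyed by primary key
target = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}

# A burst of captured changes (operation, row) -- hypothetical format
staged = [
    ("insert", {"id": 3, "status": "new"}),
    ("update", {"id": 1, "status": "shipped"}),
    ("update", {"id": 1, "status": "delivered"}),  # supersedes the earlier update
    ("delete", {"id": 2}),
]

def apply_burst(target, staged):
    # Net effect per key: only the last staged change for a row matters
    net = {}
    for op, row in staged:
        net[row["id"]] = (op, row)
    deletes = [k for k, (op, _) in net.items() if op == "delete"]
    upserts = {k: row for k, (op, row) in net.items() if op != "delete"}
    # One set-based delete pass and one upsert pass, instead of four single-row DMLs
    for k in deletes:
        target.pop(k, None)
    target.update(upserts)

apply_burst(target, staged)
print(sorted(target))       # [1, 3]
print(target[1]["status"])  # delivered
```

In a real MPP target the two passes would be a bulk load into a staging table followed by set-based DELETE and INSERT (or MERGE) statements, so the cluster pays the per-statement overhead once per burst rather than once per row.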
Going back to the story about parallelism, scalability and co-partitioned tables: to make that batch statement scale, you want the staging table to be co-partitioned with the target table. In HVR you specify this using the ColumnProperties /DistributionKey action (by default the distribution key is the first column in the table).
Want to see how it is done? Contact us for a demo.