Large data mysqldump to Greenplum

Recently I needed to load a single table with a few hundred million rows from a MySQL transaction system into Greenplum/PostgreSQL. The MySQL schema didn't have many tables, but this one table was large: around 50G including data and indexes. I ended up testing the two different techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straightforward: with mysqldump output I should be able to use the Postgres load utility

> psql -h HOST -U USER -f FILENAME   

but it turned out to be an interesting challenge, with many changes needed to load successfully.
Another minor quirk: the transaction system was using Clustrix, a vendor-specific variant of MySQL, whose dump creates a file that is not fully compatible with a direct load into PostgreSQL. Dumping even with the --compatible=postgresql option didn't help much.

One of the major issues while loading a huge file with the psql utility was the "Out of memory" error, even with a reasonably small file, say 1G:

ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.

As a first step I removed all MySQL comments and anything other than the INSERT INTO data statements.

Example lines removed are below.

-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45


And retained only the data lines, everything from each "INSERT INTO … VALUES" line through its terminating ";". Used a script to perform the filtering.
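
A minimal sketch of such a filter, assuming the dump keeps each INSERT INTO statement on one line, as this dump did (file names here are placeholders):

> grep '^INSERT INTO' dump.sql > data_only.sql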

This gave me all the data I needed in only a few hundred lines, with each line as long as 10 or more MB!  These are long lines holding thousands and thousands of records.  At certain intervals, every 100,000 records or so, Clustrix started a new "INSERT INTO …" statement.  I removed these extra INSERT commands and split the records with a simple perl one-liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g' FILENAME

thus inserting a newline at the end of each record; the new file now had around 200 million lines.

With the continued "Out of memory" error, you are somewhat misled into believing that Greenplum is slurping all the data into memory before loading, which shouldn't be the case in the first place: with an INSERT INTO …. VALUES ( …. ) statement there is no need to do so.  The next option was to find the actual error by splitting the file into smaller files, adding an INSERT INTO statement at the beginning of each file, and then removing the trailing "," at the end of its last line.

After trying 10 million, 1 million and 0.5 million lines per file, Greenplum started throwing appropriate errors, like a non-existent table (because the schema search_path was not set), a missing ",", etc.
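
The table error can be avoided either by schema-qualifying the table name (as in the loop below) or by setting the search path at the top of each SQL file; a minimal sketch, with a placeholder schema name:

SET search_path TO schema_name;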

Split command used

> split --lines=500000 FILENAME

I then added "INSERT INTO …." to each of these files, and instead of seeking to the end of each file and removing the extra ",", I appended a dummy row that I could delete later from the uploaded table.

> for fn in `ls x*`;
    do echo "Working on $fn";
      echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
      cat $fn >> "${fn}_r_l";
      echo "(column_1, column_2, column_3,....column_N);" >> "${fn}_r_l";
    done

This created, for each split file, a corresponding file with the "_r_l" suffix (ready_to_load).

Then loaded the table

> for fn in `ls x*_r_l`;
    do echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "$fn";
    done

Systems and utilities used:

  Greenplum DB – Greenplum Database build 4
  Postgresql – PostgreSQL 8.2.14
  MySQL – 5.0.45-clustrix-v4.1
  Perl – 5.8.8 multithreaded
  All running on linux x86_64 with 24G memory

There were more than 400 files of 500,000 lines each, and all of them loaded in less than three hours.  Still substantial, but this was a one-time load and it was acceptable.

Technique 2:   Using mysqldump and Greenplum gpload

Greenplum's bulk-loading utility, gpload, is excellent for loading large data sets.  After dumping the data, cleaning it, and formatting it into a few files of 10G each, you can use gpload as below

gpload  -f  $gpload_ctl_file

with the control file created dynamically from a template.  For example, in the fragment below, replace all placeholders with their respective values (a sketch of the template substitution follows the fragment).  With a dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.

     - SOURCE:
     - FORMAT: text
     - DELIMITER: '|'
     - NULL_AS: 'NULL'
     - ERROR_LIMIT: 25
     - ERROR_TABLE: sandbox_log.gpload_errors
     - COLUMNS:
         - timestamp: text
         - priority: text
     - TRUNCATE: false
     - TABLE:
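
A minimal sketch of that substitution, assuming a hypothetical template gpload_template.yml containing __SOURCE_FILE__ and __TARGET_TABLE__ placeholders (all file and table names here are illustrative):

> data_file="/data/dump_part_01.txt"
  target_table="schema.table"
  sed -e "s#__SOURCE_FILE__#${data_file}#" \
      -e "s#__TARGET_TABLE__#${target_table}#" \
      gpload_template.yml > gpload_ctl_file.yml
  gpload -f gpload_ctl_file.yml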

This is much faster and more efficient loading than Technique 1.


Greenplum Postgres Regexp fun

In the last few months I started working in a Greenplum/Postgres environment and was pleasantly surprised by the regular expression capabilities.  Compared to MySQL (5.1), it has quite advanced RE features.  For example, I ended up using the following query to strip the decimal numbers out of another string.

Input strings have the format "BucketN_N[ -N | + ]" where N is a decimal number.  For example, "Bucket1_1", "Bucket7_21-50" and "Bucket5_100+" are all valid, and the output should extract the bucket number (first N), the minimum value of the bucket (second N) and the optional maximum value of the bucket (third N).
The query:

SELECT bucket_string
  ,substring(bucket_string, 7, 1)::integer as bucket_num
  ,regexp_replace(bucket_string, '^Bucket\\d{1,}_(\\d{1,})([-+].*)?', E'\\1')::integer as minimum_qty
  ,regexp_replace(regexp_replace(bucket_string, '.*-(\\d{1,})', E'\\1'), '.*_(\\d{1,})([+]?)$', E'\\1')::integer as maximum_qty
FROM a_table
ORDER BY bucket_string
Though I could have used other functions such as substring, it was fun doing it with regular expressions in the smallest amount of code.  regexp_replace is the workhorse, taking a string and replacing the matched portion with the captured group (\\1 or $1).  "^" and "$" are anchors to the beginning and end of the string, \\d matches decimal digits, and {1,} is the one-to-many quantifier.
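
As a quick standalone check, the same pattern can be run against one of the example literals above; it should return 21:

SELECT regexp_replace('Bucket7_21-50', '^Bucket\\d{1,}_(\\d{1,})([-+].*)?', E'\\1')::integer as minimum_qty;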

And more fun with regexp_matches and array selection.  In this simplified example I wanted to add an offset number of days to 2012 December 01.  The offset is text in another external table in Greenplum, and the data is not always clean: it may have some text before the offset.

SELECT d
  ,regexp_matches(d, '(?:.*_)?(\\d+)$') as re_ary
  ,'2012-12-01'::date + ((regexp_matches(d, '(?:.*_)?(\\d+)$'))[1] || 'days')::interval as new_d
FROM (
    SELECT '90' as d
    UNION ALL SELECT 'new_90' as d
    UNION ALL SELECT 'new_90_10' as d
) tt


d          re_ary   new_d
90         {90}     2013-03-01 00:00:00
new_90     {90}     2013-03-01 00:00:00
new_90_10  {10}     2012-12-11 00:00:00

"(?: … )" is a non-capturing group, and the trailing "?" makes it optional: any text before the string-ending number is matched but not captured for later use, since the original string may or may not have such a prefix.  "$" anchors the match to the end of the string, and regexp_matches returns an array of a single element, with the subscript [1] selecting it.