Large data mysqldump to Greenplum

http://cdn.jpillora.com/js/jp-prettify.js Recently I needed to load a single table from a transaction system with few hundred million rows into Greenplum/Postgresql from MySQL.  MySQL schema didn’t have many tables but one single table was large with around 50G size including data and index.  Ended up testing with 2 different techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straight forward with mysqldump I could be able to use postgres load utility

> psql -h HOST -U USER -f FILENAME   

but it turned out be intersting challenge with many changes needed to load successfully.
Another minor quirk was transaction systems was using Clustrix a specific vendor version of MySQL. It’s dump creates a file that is not fully compatible with direct load into postgresql.  Dump even with –compitable=postgresql option didn’t help much.

One of the major issue while loading huge file with psql utility the “Out of memory” error even with reasonably small file, say 1G.

ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.

As a first step I removed all MySQL comments and anything other than data with INSERT INTO statement.

Example lines removed are below.

-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;

And retained any lines between
   INSERT INTO

line and

ENABLE KEYS line.  

Used a script to perform the filtering.

This gave me all the data I needed with only few hundred lines with each line as long as 10 or more MB!   These are long lines with thousands and thousands of records.  At certain intervals, 100,000 or so, Clustrix inserted new row with “INSERT INTO …”.  I removed these extra inserts comands and split the records with perl simpel one liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g'  

thus inserting new line at the end of each record and the new file had around 200 million lines now.

With continued error of “Out of memory” you will be kind of misled to believe that the Greenplum is slurping in all data into memory and trying to load which in first place shouldn’t be the case.  With INSERT INTO …. VALUES ( …. ) statement there is no need to do so.  Next option was to find the possible error by splitting the file into smaller files and adding INSERT INTO statement at the beginning of each line and then removing the trailing “,” at the end of last line.

After trying 10 million, 1 million and 0.5 million, Greenplum started throwing appropriate error like non-existing table (this is because the path was not set for postgresql), missing “,” etc.

Split command used

> split --lines=500000 FILENAME

Adding “INSERT INTO ….” to each of these files and instead of seeking to end of file and removing extra “,”, I added a new dummy line which I can delete later from uploaded table.

> for fn in `ls x*`;
    do echo "Working on $fn";
     echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
      cat $fn >> "${fn}_r_l";
      echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l" ;
   done

This created for each split file corresponding file with “_r_l” suffix (ready_to_load).

Then loaded the table

> for fn in `ls xd*_r_l`;
    do
      echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "FILENAME";
    done

Systems and utilities used:

  Greenplum DB – Greenplum Database 4.0.6.0 build 4
  Postgresql – PostgreSQL 8.2.14
  MySQL – 5.0.45-clustrix-v4.1
  Perl – 5.8.8 multithreaded
  Bash
  All running on linux x86_64 with 24G memory

There were more than 400 files with 0.5G data loaded in less than three hours.  Still substantial but it is one time load and was acceptable.

Technique 2:   Using mysqldump and Greenplum gpload

Greenplum’s bulk loading utility (gpload) is an excellent one to load large data set.  After dumping the data and cleaning, formatting it into a few files of 10G each, you can use gpload as below.

gpload  -f  $gpload_ctl_file

with control file created dynamically from a template.  For example in the below table replace all place holders with respective values.  With dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.

VERSION: 1.0.0.1
DATABASE:
USER:  
HOST:  
PORT:  
GPLOAD:
   INPUT:
     – SOURCE:
         LOCAL_HOSTNAME:
           –
         PORT:
         FILE:
           –
     – FORMAT: text
     – DELIMITER: ‘|’
     – NULL_AS: ‘NULL’
     – ERROR_LIMIT: 25
     – ERROR_TABLE: sandbox_log.gpload_errors
     – COLUMNS:
         – timestamp: text
         – priority: text
     …
     …
 PRELOAD:
    – TRUNCATE: false
   OUTPUT:
     – TABLE:
     – MODE: INSERT

This is a much faster and efficient loading than technique 1.

HTH,
Shiva

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s