Large data mysqldump to Greenplum

Recently I needed to load a single table with a few hundred million rows from a transaction system running MySQL into Greenplum/PostgreSQL. The MySQL schema didn't have many tables, but this one table was large, around 50G including data and indexes. I ended up testing the two techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straightforward: with mysqldump I should be able to use the PostgreSQL load utility

> psql -h HOST -U USER -f FILENAME   

but it turned out to be an interesting challenge with many changes needed to load successfully.
Another minor quirk was that the transaction system was using Clustrix, a vendor-specific variant of MySQL. Its dump creates a file that is not fully compatible with a direct load into PostgreSQL. Dumping even with the --compatible=postgresql option didn't help much.
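
For reference, the dump was produced roughly along these lines (a sketch; only the --compatible flag comes from the text above, while the host, credentials, --no-create-info and file names are placeholders/assumptions):

> mysqldump --compatible=postgresql --no-create-info -h HOST -u USER -p DATABASE TABLE > dump.sql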

One of the major issues while loading a huge file with the psql utility was the "Out of memory" error, even with a reasonably small file, say 1G.

ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.

As a first step I removed all MySQL comments and anything other than the data carried by the INSERT INTO statements.

Example lines removed are below.

-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;

And retained only the lines between the INSERT INTO line and the ENABLE KEYS line.

Used a script to perform the filtering.
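
The filtering can be sketched with sed and grep (dump.sql and data_only.sql are placeholder file names, not the original script):

# keep the lines from each "INSERT INTO" through the matching "ENABLE KEYS",
# then drop the "ENABLE KEYS" marker lines themselves
sed -n '/INSERT INTO/,/ENABLE KEYS/p' dump.sql | grep -v 'ENABLE KEYS' > data_only.sql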

This gave me all the data I needed in only a few hundred lines, with each line as long as 10 MB or more! These are long lines holding thousands and thousands of records. At certain intervals, every 100,000 records or so, Clustrix started a new "INSERT INTO …" statement. I removed these extra INSERT commands and split the records with a simple Perl one-liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g'  

thus inserting a newline at the end of each record; the new file had around 200 million lines.

With the continued "Out of memory" error you could be misled into believing that Greenplum is slurping all the data into memory before loading, which shouldn't be the case in the first place: with an INSERT INTO …. VALUES ( …. ) statement there is no need to do so. The next option was to find the actual error by splitting the file into smaller files, adding an INSERT INTO statement at the beginning of each file and removing the trailing "," at the end of its last line.

After trying 10 million, 1 million and 0.5 million lines per file, Greenplum started throwing meaningful errors, like a non-existent table (because the schema search path was not set for PostgreSQL), a missing ",", etc.

Split command used

> split --lines=500000 FILENAME

I added "INSERT INTO …." to each of these files, and instead of seeking to the end of each file and removing the extra ",", I appended a dummy row that I could delete later from the uploaded table.

> for fn in x*;
    do echo "Working on $fn";
       echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
       cat "$fn" >> "${fn}_r_l";
       echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l";
    done

This created, for each split file, a corresponding file with the "_r_l" suffix (ready_to_load).

Then loaded the table

> for fn in x*_r_l;
    do
      echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "${fn}";
    done

Systems and utilities used:

  Greenplum DB – Greenplum Database 4.0.6.0 build 4
  Postgresql – PostgreSQL 8.2.14
  MySQL – 5.0.45-clustrix-v4.1
  Perl – 5.8.8 multithreaded
  Bash
  All running on linux x86_64 with 24G memory

There were more than 400 files of 0.5 million rows each, all loaded in less than three hours. Still substantial, but it was a one-time load and acceptable.

Technique 2:   Using mysqldump and Greenplum gpload

Greenplum's bulk loading utility (gpload) is excellent for loading large data sets. After dumping the data, cleaning it, and formatting it into a few files of 10G each, you can use gpload as below.

gpload  -f  $gpload_ctl_file

with a control file created dynamically from a template. For example, in the template below replace all placeholders with their respective values. With a dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.

VERSION: 1.0.0.1
DATABASE:
USER:
HOST:
PORT:
GPLOAD:
   INPUT:
     - SOURCE:
         LOCAL_HOSTNAME:
           -
         PORT:
         FILE:
           -
     - FORMAT: text
     - DELIMITER: '|'
     - NULL_AS: 'NULL'
     - ERROR_LIMIT: 25
     - ERROR_TABLE: sandbox_log.gpload_errors
     - COLUMNS:
         - timestamp: text
         - priority: text
     …
     …
   PRELOAD:
     - TRUNCATE: false
   OUTPUT:
     - TABLE:
     - MODE: INSERT
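
A minimal sketch of how the control file could be generated from such a template and then run (the {{...}} placeholder names, the template file name and the shell variables are mine, not part of gpload):

# Substitute placeholders in the template; '|' is used as the sed delimiter for the file path
sed -e "s/{{DATABASE}}/$DB_NAME/" \
    -e "s/{{TABLE}}/$TARGET_TABLE/" \
    -e "s|{{FILE}}|$DATA_FILE|" \
    gpload_template.yml > "$gpload_ctl_file"
gpload -f "$gpload_ctl_file"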

This is much faster and more efficient loading than technique 1.

HTH,
Shiva

Simple time series metric

Here is an example of a chart recently created using MySQL, Perl, PHP and FusionCharts, with fudged data. The X axis is date & hour for the last 7 days. The data is pulled from MySQL and an XML file is generated for consumption by PHP and FusionCharts.

The time data is in UTC (a Unix timestamp at the start of each hour; for example, 1262307600 => 2010-01-01 01:00:00) summarized with other dimensions and metrics. The simplified fact table looks like

ts_begin_hr, dim_1, dim_2, metric_1, metric_2, etc.
The query used to pull the required report data:
         SELECT 'dim1' dim, concat(D_H.d, ' ', D_H.hr) pst,
             CASE WHEN P.met1 IS NULL THEN 0.0 ELSE P.met1 END metric
         FROM
            (
             SELECT d.d, h.hr
             FROM
              (SELECT '00' hr UNION SELECT '01' UNION SELECT '02' UNION SELECT '03' UNION SELECT '04'
               UNION SELECT '05' UNION SELECT '06' UNION SELECT '07' UNION SELECT '08'
               UNION SELECT '09' UNION SELECT 10 UNION SELECT 11 UNION SELECT 12
               UNION SELECT 13 UNION SELECT 14 UNION SELECT 15 UNION SELECT 16
               UNION SELECT 17 UNION SELECT 18 UNION SELECT 19 UNION SELECT 20
               UNION SELECT 21 UNION SELECT 22 UNION SELECT 23
              ) h
              CROSS JOIN
              (SELECT full_date d FROM DW_DB.dim_date
               WHERE full_date BETWEEN date_add(current_date(), interval -6 day) AND current_date()
              ) d  -- 1
            ) D_H
            LEFT OUTER JOIN
            (
              SELECT 'dim1' dim
                  , date(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) d
                  , hour(convert_tz(from_unixtime(ts_begin_hr), 'GMT', 'America/Los_Angeles')) hr
                  , sum(met1) met1
              FROM DW_DB.FACT_TABLE FT
              JOIN DW_DB.DIM_TABLE DM
                ON FT.dim_id = DM.dim_id
              WHERE from_unixtime(ts_begin_hr) >= date_add(now(), interval -7 day)
              GROUP BY dim, d, hr
            ) P  -- 2
         ON D_H.d = P.d AND D_H.hr = P.hr
         WHERE D_H.d < date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
            OR (D_H.d = date(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
                AND D_H.hr <= hour(convert_tz(now(), 'GMT', 'America/Los_Angeles'))
              )
         ORDER BY D_H.d, D_H.hr
        ;
Subquery 1 (aliased D_H) builds a cross join of the 24 hours (0..23) with the last 6 dates, producing values of the form YYYY-MM-DD HH. Single-digit hours are generated as two-digit strings (SELECT '0n'). The system and MySQL were running in the UTC timezone while business users needed the report in PST. Subquery 2 (aliased P) generates the required metric in PST and is joined with subquery 1. The WHERE clause filters out any extra hours (later than the current hour) produced by the left outer join.
Subquery 1 is joined with a left outer join because during a given hour there might be no metric (say, dollar amount) available: there might not have been any transactions during that hour for a given dimension.
The result set is dumped to an XML file for FusionCharts consumption, producing the output above through a PHP wrapper. The chart is updated every hour.

User traversal digraphs

You may need to visualize users' traversal of a website, the paths they take through an application, or the nodes they visit, in order to improve the user experience, make the traversed path more efficient, or improve the conversion rate by getting the user to the end state. For example, it may be that users visiting one set of nodes (a node meaning a particular state of the application, like a tutorial state machine) convert better than users going through a different set of states. There are many cases where a graph or directed graph is a necessity.

You can use free tools like 'dot' (from Graphviz) to draw directed graphs. In the following example, I assume a state/node is represented by a simple integer and the path taken is represented by an edge (directed line).

Here I show a simplified version of a project I worked on, quickly generating a digraph that can be sent or uploaded to a site for internal business users at regular intervals in near real time. A few assumptions are made to help explain the main idea: each visit to a node is timestamped, and a user can only traverse forward (that is, visit a node numbered higher than the current one). To start with, the data is summarized in a table with a unique identifier (user or device), the node visited, and the time of visit.

Order the data chronologically for each unique id, so that with a self-join in SQL (see the code snippet below, and the row_id sketch after it) we can simply find the next node the user visited.

SELECT T1.node_id current_node, T2.node_id next_node, count(1) ct
FROM table T1 JOIN table T2 ON T1.row_id+1 = T2.row_id
WHERE T1.unique_id = T2.unique_id # say, user_id
GROUP BY T1.node_id, T2.node_id
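
For completeness, here is one way the per-user chronological row_id used above could be generated on MySQL 5.x, which has no window functions (a sketch using user variables; the visits_raw/visits_ordered table and column names are assumptions):

-- Number the rows so that each user's visits are contiguous and in time order
SET @r := 0;
CREATE TABLE visits_ordered AS
SELECT (@r := @r + 1) AS row_id, t.unique_id, t.node_id, t.visit_ts
FROM (SELECT unique_id, node_id, visit_ts
      FROM visits_raw
      ORDER BY unique_id, visit_ts) t;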

You can also assign a weight/cost to each edge by normalizing the count, which will result in a set of rows similar to
      current_node, next_node, count, count/max_count

This is all the data we need to generate the input file for the 'dot' application. Write a program that takes the above input and dumps it into a file with content like

digraph G {
   # splines=false;
   rankdir=LR

   2 -> 3 [penwidth=4.9, label="1190"]
   3 -> 4 [penwidth=4.9, label="1150"]
   4 -> 5 [penwidth=4.8]
   5 -> 6 [penwidth=4.8]
   6 -> 7 [penwidth=4.8]
   7 -> 8 [penwidth=4.8]
   …
   …
}

By providing this as input you can generate the output in multiple formats including PDF, PS, etc. See graph 1 below. You can provide more input parameters in the file to make the graph fancier or to add more information, like the drop-off (% of users dropped) between states; see graph 2. In essence you are generating a digraph to visualize the data in a more meaningful way.
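
For example, to render the input file (traversal.dot is a placeholder name):

> dot -Tpdf traversal.dot -o traversal.pdf
> dot -Tpng traversal.dot -o traversal.png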

Digraph 1 – with sample weights between states 2 & 3, 3 & 4

With dot input file content like

  subgraph cluster1 {
    rankdir="LR"
    2 -> 3   [penwidth="23", label="23.48%",  taillabel="4450",  headlabel="3405"]
    3 -> 4   [penwidth="25", label="24.9%",   taillabel="3405",  headlabel="2557"]
    4 -> 5   [penwidth="18", label="18.34%",  taillabel="2557",  headlabel="2088"]
    5 -> 6   [penwidth="19", label="19.3%",   taillabel="2088",  headlabel="1685"]
    6 -> 7   [penwidth="20", label="20.18%",  taillabel="1685",  headlabel="1345"]
    7 -> 8   [penwidth="26", label="26.47%",  taillabel="1345",  headlabel="989"]
    8 -> 9   [penwidth="35", label="35.29%",  taillabel="989",   headlabel="640"]
    9 -> 10  [penwidth="39", label="38.59%",  taillabel="640",   headlabel="393"]
    10 -> 11 [penwidth="36", label="35.88%",  taillabel="393",   headlabel="252"]
  }

Digraph 2 – with users drop-off between states in %

Joins between tables of different charset

Not too long ago I got bitten by a slow-performing MySQL query even though I had the right indexes (in fact a unique index) and the tables were not big. The query, similar to the one below, was taking more than 20 minutes and I ended up killing it.

The issue was that one table was storing data in "latin1" while the other used "utf8".

select count(1)
from table_1 T1
left outer join table_2 T2
  on T1.id = T2.id
where T2.id is null

Note: the "id"s here are varchars, not integers.
T1 had around 100K rows and T2 around 800K, joined on MySQL 5.1 on a quad-core machine with 16G RAM. I expected a much quicker response, and on running the explain plan I could see the query doing a full table scan! At first I wasn't sure why this was happening, but after digging a little I found that one table was defined with "charset=latin1" and the other with "charset=utf8".
MySQL was converting latin1 to utf8 and then performing the join, which ended up as a full table scan. On converting T2 to utf8 with collation utf8_unicode_ci, the query time dropped to about 10 seconds!
With that said, by default we now create all tables with charset=utf8.
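
The conversion itself is a one-liner along these lines (the table name is a placeholder):

ALTER TABLE table_2 CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;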

Timezone conversion for hourly report

We recently built a near real-time hourly report where data is streamed in, summarized through multiple jobs, and pushed into reporting tables. The business users wanted the reports as quickly as possible and the first phase of the project needed to be implemented in a short time. With very limited time from concept to design and implementation, I utilized a 3rd-party web charting service (www.chart.io) and pushed highly summarized data into a set of tables.

For the final report, I needed to convert date and hour in UTC to PST. The table had date (date type) and hour (tinyint) columns. Chart.io uses a simple x, y chart (line, bar, etc.) and displays whatever coordinates are passed to it. See below for one of the final charts.
The X axis is date and hour with format 'YYYY-MM-DD HH'. The following SQL converts the date and hour from the corresponding two columns (formats YYYY-MM-DD and HH respectively) in the UTC timezone to PST.
SELECT substr(convert_tz(concat(date, ' ', if(length(hour) = 1, concat('0', hour), hour)), 'GMT', 'America/Los_Angeles'), 1, 13) as x,
       ...
FROM ...
WHERE ...
GROUP BY x ...
ORDER BY x ASC
Maintaining the order is important: for example, the '2011-11-11 02' hour comes right after '2011-11-11 01' and before '2011-11-11 03', so single-digit hours (1 through 9) are prefixed with '0'. Then convert_tz converts the GMT time to PST (America/Los_Angeles). There was a suggestion to simply hardcode a -8:00 timezone offset, but that produces an incorrect hour during the daylight saving (summer) months. Since data is stored for each hour over many months and years, with the above technique one doesn't need to worry about this issue; the system automatically and correctly offsets the hours in the report. Handling timezone conversion is not trivial; see the timezone wiki for the different timezones around the world. For the timezone conversion to work in MySQL, one should first load the appropriate data into the timezone tables, as described in "Loading timezones into MySQL or InfiniDB" below.

The reports/charts generated were very useful to the business users and helped them make some quick decisions, build a better user experience, and improve sales and revenue!

Converting 12-hour time to Unix timestamp

Many a time I need to transform time in AM/PM format into a Unix timestamp for better performance. The external data coming in has date and time in two separate columns, like '2011-01-01' as the date and '11:59:59 PM' as the time. At other times I also need to convert timezones, typically from UTC to PST; for that, see the timezone conversion sections above and below.

For example, converting "2011-01-01 11:59:59 PM" into the corresponding Unix timestamp results in the integer 1293926399.

# Create a temp table
create table tmp.test_dt (
  d date,
  t varchar(20)
) engine=InnoDB;


# Insert a value to test
insert into tmp.test_dt
(d, t) values
('2011-01-01', '11:59:59 PM');


# Run the conversion SQL
SELECT
   unix_timestamp(concat(cast(d as char), ' ', cast(time_format(str_to_date(t, '%r'), '%T') as char)))
FROM tmp.test_dt;
# Output: 1293926399


# To test
SELECT from_unixtime(1293926399);
# Output: "2011-01-01 23:59:59"


In the conversion SELECT statement, str_to_date parses the time column (t) in 12-hour format and time_format turns 11:59:59 PM into 23:59:59, which is then cast as char and concatenated with the date. After the concat the string looks like "2011-01-01 23:59:59", which becomes the input to unix_timestamp.

Loading timezones into MySQL or InfiniDB

If you are running queries or reports that use the MySQL timezone conversion function (CONVERT_TZ), make sure you have first run the MySQL utility (mysql_tzinfo_to_sql) that loads timezone information from the system into the MySQL server. Without the timezone information loaded you will simply get a NULL return value and no message or error. This is expected behavior because the required MySQL tables already exist; they are just not populated!

In many threads and forums I have seen developers asking how to convert one timezone to another while all they get is a NULL return value. I myself have been bitten by it a couple of times when I moved to new MySQL servers where the following steps had not been done.

Here I am using Ubuntu, which has the timezone files under the /usr/share/zoneinfo directory. In many flavors of Unix, including Mac OS, you will find them there; if not, search for them with the find command 😉

Also, note that you can load all the timezones of the world in one go. The SQL created by this utility is large, with more than 142K lines of SQL code! After completing the task the following tables should have been populated.

  1. mysql.time_zone
  2. mysql.time_zone_name
  3. mysql.time_zone_transition
  4. mysql.time_zone_transition_type

There is another related table, mysql.time_zone_leap_second, which we don't need at this point.

# To load all timezones
shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql

# You may see output like:
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh87' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh88' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/Asia/Riyadh89' as time zone. Skipping it.
# On my system some specific Riyadh timezones are not defined; you can ignore these warnings.
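
If you only need a single timezone, the same utility also accepts a zone file and a zone name, for example:

shell> mysql_tzinfo_to_sql /usr/share/zoneinfo/America/Los_Angeles America/Los_Angeles | mysql -u root mysql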

Restart the MySQL or InfiniDB server to bring in the timezones and run a test.

# In the case of InfiniDB, idbmysql is just an alias
shell> alias idbmysql
alias idbmysql='/usr/local/Calpont/mysql/bin/mysql --defaults-file=/usr/local/Calpont/mysql/my.cnf -u root'

shell> idbmysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 16
Server version: 5.1.39 MySQL Community / Calpont InfiniDB Community 2.2-2 Final (COSS LA)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 

mysql> SELECT utc_timestamp(), CONVERT_TZ(utc_timestamp(),'UTC','MET');
+---------------------+-----------------------------------------+
| utc_timestamp()     | CONVERT_TZ(utc_timestamp(),'UTC','MET') |
+---------------------+-----------------------------------------+
| 2011-07-29 01:14:40 | 2011-07-29 03:14:40                     |
+---------------------+-----------------------------------------+
1 row in set (0.00 sec)

MySQL Connection over SSH

When the MySQL server is set up to accept only localhost connections through bind_address in /etc/mysql/my.cnf (configuration)

# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address          = 127.0.0.1

or you want to connect to MySQL as "root" and use GUI clients like Workbench, Toad, etc., you can use SSH tunneling.

Operating systems that provide built-in SSH support make this a little easier. Mac OS is one such OS, and I will use it as the example below.

Assuming you have already set up SSH access, run the following command from a local terminal. If you have not set up SSH, see this brief introductory ssh article, or run > man ssh for more details.

prompt>  ssh -L 3307:localhost:3306  user@remote.com

and then leave this terminal running. Here I am using the ssh -L [bind_address:]port:host:hostport option and skipping bind_address, which is needed only when you have more than one IP attached (multi-IP) to your local system, which is typically not the case.

    3307 – The port that you will use in the local clients that need to connect to the remote MySQL, for example Workbench.

    localhost:3306 – The remote end of the tunnel, as if you had connected to the remote host through ssh and then to the MySQL instance running on port 3306. If you had ssh-ed to the remote host successfully, at its prompt you would have run > mysql -h localhost -u user_name -P 3306

    user@remote.com – Remote user name and remote host

Now start Workbench and set up a new connection by providing the details as if you were connecting to a MySQL instance on your local machine running on port 3307. See figure 1. ssh will automatically tunnel your connection to the remote host's MySQL! See figure 2. Closing the terminal where the "ssh -L" command is running, or killing that command, tears down the tunnel and disconnects Workbench.

Figure 1.

Figure 2.
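
Any other local client can use the tunnel the same way; for example, the stock mysql client from another terminal (note 127.0.0.1 rather than localhost, so the client uses TCP instead of the local socket):

prompt>  mysql -h 127.0.0.1 -P 3307 -u root -p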

Note: The same technique can be used for other client applications to connect to any remote application/database over SSH.

Have fun,

Shiva

DW solution – IP mapping to Geo Locations

In my previous few posts on this blog, I covered how to convert IP addresses to their corresponding integers and back. When performing a Data Warehouse (DW) join between tables on IP addresses for location analysis, joining on an integer IP column is much faster than joining on a string column. As noted in the other posts, this article covers IPv4, but the technique could be used for IPv6 addresses by using 2 BIGINTs.

For geo location info, you can use MaxMind's Geo-City data, which is provided as two entities: one mapping an IP block to a location ID, and another mapping the location ID to city/state/country. It is a one-to-many (1:M) relationship, and to find the location of an IP address one needs to find the matching IP block and then map it to a city or country. The posts below describe in detail how to do this efficiently for a given IP. These techniques work well in an OLTP environment where you pass a known IP address from the UI or application layer to the database and get back its city/country.

  1. Fast MySQL Range Queries on MaxMind GeoIP Tables by Andy Skelton
  2. MaxMind GeoIP Install Setup Tutorial by Bartomedia
  3. Finding a user’s country/region based on IP by Shlomi Noach
  4. On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole

Geo-City data has the following fields, separated by commas, with rows terminated by a newline (\n):

  • Location ID
  • Country Code
  • State/Region
  • City Name
  • Postal Code (US only)
  • Latitude
  • Longitude
  • Metro Code
  • Area Code

IP Block has:

  • Start IP Address
  • End IP Address
  • Location ID

While loading this dimensional data during the ETL process, you should also do the IP address transformation to integer, as explained in "Converting an IP address to an integer". The Geo-City dimensional table will also have a partition range (the reason for this post) and a surrogate key (id). You should also perform other data validation, like checking for non-overlapping ranges, valid country codes, etc., which are all standard activities of a DW ETL process. The time, resources, and effort put into cleaning and validating the data before pushing it into the final fact tables pay off within a short period of time; fixing a data issue and reconciling reports in a DW/BI environment always leads to lower ROI.
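
For IPv4, the integer conversion itself can be done in MySQL with built-in functions (a quick illustration; the address is arbitrary):

SELECT INET_ATON('74.125.224.72');  -- 1249763400
SELECT INET_NTOA(1249763400);       -- 74.125.224.72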

You can also denormalize the above tables and create a single table. In either case, joining on a range of IP addresses for a given set of IPs results in the same slow performance.

dim_geo_ip_city_location holds around 5.9 million rows with the following attributes:

CREATE TABLE dim_geo_ip_city_location (
    location_id int NOT NULL,
    start_ip varchar(16) NOT NULL,
    end_ip varchar(16) NOT NULL,
    country char(2),
    region varchar(64),
    city varchar(128),
    postal_code varchar(32),
    latitude decimal(8,4),
    longitude decimal(8,4),
    metro_code varchar(8),
    area_code varchar(8),
    ...
) ENGINE=...

CREATE TABLE dim_ip_block (
    start_ip_num bigint NOT NULL,
    end_ip_num bigint NOT NULL,
    location_id int NOT NULL,
        ...
)ENGINE=...

The MySQL (5.0) optimizer performs range operations efficiently with a constant, meaning you have given the query an IP address, for example "WHERE ip_addr BETWEEN 1230 AND 1234", but not with range operations without constants. For more info on range optimization see the MySQL documentation, and for indexing see "Mastering the Art of (MySQL) Indexing".

I did not find any article providing insight into doing DW/BI analysis on IP addresses. In a DW environment the business user is interested in slicing and dicing hundreds of thousands to millions of IPs. Just trying to find a few thousand IP address locations (see query 1 below) was taking more than an hour when joining the lookup table (a table similar to dim_geo_ip_city_location but without the partition column (ip_range_partition) and id) with the IP address table, even with appropriate indexes like an index on (start_ip_num, end_ip_num) or on start_ip_num and end_ip_num separately.

Query 1:

-- Simple query to select IP block
SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num
FROM ip_addresses IA use index(...)
JOIN dim_ip_block B use index(...)
WHERE IA.ip_num between B.start_ip_num and B.end_ip_num

This leads MySQL to perform full scans, and in the case of the ip_block table (B) it doesn't use any index. To avoid this, I partitioned the dimension (lookup) table into 1000 parts; since this table has ~5.9 million rows, each partition ends up with ~5,900 rows. For more on data set partitions see its wiki. The partitions should be numbered in ascending order with respect to the ascending range of IP blocks (start_ip_num). Then also create another table (dim_ip_range_partition) with the start_ip_num and end_ip_num of each such range, basically creating super IP blocks.

CREATE TABLE dim_ip_range_partition (
     partition_id int NOT NULL AUTO_INCREMENT,
     start_ip_num bigint NOT NULL,
     end_ip_num bigint NOT NULL,
     ....
) ENGINE=InnoDB ;

Also, create a primary key on partition_id and an index on (start_ip_num, end_ip_num) on this table.
The solution is based on two facts: there are no overlapping IP blocks, and the IP blocks are ordered (in ascending order).

To populate the above table,

  1. Add an id (row_id) column to dim_ip_block
  2. Add a partition_id integer column to dim_ip_block
  3. Update id from 1 to n in ascending order of start_ip_num (id = n for the highest start_ip_num)
  4. Partition the "n" ids into P parts (n/P ids each) where n > 1 and 1 < P < n
  5. Assign each part/partition an ascending integer value (steps 3 through 5 are sketched after the INSERT below)
  6. Now update the above table dim_ip_range_partition P_TABLE from dim_ip_block B_TABLE, where P_TABLE partition_id runs from 1 to P, P_TABLE.start_ip_num is the lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num is the highest of the corresponding partition.  In other words…

         INSERT INTO dim_ip_range_partition
            (partition_id, start_ip_num, end_ip_num, …)
         SELECT partition_id, min(start_ip_num), max(end_ip_num), …
         FROM dim_ip_block
         GROUP BY partition_id
         ORDER BY partition_id asc
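
Steps 3 through 5 could be done along these lines (a sketch using MySQL user variables; 5,900 rows per partition as in the text, column names follow the DDL above):

         -- Step 3: number the blocks in ascending order of start_ip_num
         SET @row := 0;
         UPDATE dim_ip_block
         SET id = (@row := @row + 1)
         ORDER BY start_ip_num ASC;

         -- Steps 4 and 5: bucket ~5,900 consecutive ids into each partition (P = 1000)
         UPDATE dim_ip_block
         SET partition_id = FLOOR((id - 1) / 5900) + 1;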

By doing so, we have created super IP blocks, each containing n/P of the original IP blocks. The following query is one way to extract city/country info for a given set of IP addresses.

SELECT t.ip_num, t.ip, B.start_ip_num, B.start_ip, B.end_ip_num, B.end_ip
     , L.city, L.country
FROM
(
   -- Get the super IP block (partition)
   SELECT C.ip_num, C.ip, max(partition_id) partition_id, max(P.start_ip_num) start_ip_num
   FROM ip_addresses C use index(...)
   JOIN dim_ip_range_partition P
   WHERE C.ip_num >= P.start_ip_num
   GROUP BY C.ip_num, C.ip
) t
LEFT OUTER JOIN dim_ip_block B use index(...)
  ON t.partition_id = B.partition_id     -- And join on this partition
  AND t.ip_num >= B.start_ip_num
  AND t.ip_num <= B.end_ip_num
JOIN dim_geo_ip_city_location L          -- Get city/country
  ON B.location_id = L.location_id

The query that would have taken more than an hour now returns within 30 seconds! You can further optimize the above design with, say, more denormalized relations, or by tuning the number of partitions (this is a trade-off: at the extreme P = n the partition table has as many rows as the original table and you are back to the original problem). Or, by extending the partitioning concept to more levels (super super blocks, ….), you can reduce the number of joined rows with a slightly more complex query.

The solution is a typical computer science way of handling a complex problem: divide it into smaller ones and solve those. Here it is applied specifically to MySQL for large data sets with range operations.

Cheers,
Shiva

Special characters in table names

In our DW environment, we allow business users to upload data into new tables and run their own ad hoc queries when needed. The user interface allows a user to upload up to 10,000 (10K) rows in either comma- or tab-delimited format into a temporary database/table. The user can specify the table name they want the data uploaded to. Many savvy business users can thus run their own SQL queries joining their data with the DW tables.

All this works fine, but many a time users provide table names with non-alphanumeric characters, including spaces, forward or backward slashes, dollar signs, etc. At some point the DBAs noted that managing these tables was becoming an issue, especially the table names with special characters. The following monitoring tool helped alleviate or eliminate the problem, and the UI was also modified to check for any special characters and remove them before creating users' temporary tables.

Since MySQL doesn't provide regular expressions (RLIKE or REGEXP) in the SHOW TABLES syntax, one is forced to use Unix tools or another language (awk, perl, python, etc.) for regular-expression features.

The following pulls all table names with special characters including $, \, space, ?, ! and @. You can add other characters of interest to this set.

linux> mysql --host=hostname_or_ip  --database=db_name  --user=root  --password=your_pw  --execute="SHOW TABLES" | egrep '[$\\ ?!@]'

# This shows tables like (not including double quotes)
"non_alphanum_table_name test it 2"
"non_alphanum_table_name; test it"
"non_alphanum_table_name; test it $"
"non_alphanum_table_name; test it $ \"
"non_alphanum_table_name_test_$_!_2"

Here you have a choice: either delete those tables, or store their names in a separate table called "ERROR_TABLE_NAME" and work with the business users to modify them later. Option 2 is friendly and professional, and you won't annoy users as in option 1 :).

Quickly put together an awk script like

{ print "INSERT INTO ERROR_TABLE_NAME(name) VALUES('" $0 "'); COMMIT;" }

to insert each table name into ERROR_TABLE_NAME table.

You need to store the script in a file (say, error_tbl_name.awk) because passing a single quote in the above string at the command line doesn't work: when you pass a table name to MySQL you need to quote it, like 'non_alphanum_table_name test it 2', but the single quote is used by awk itself and is not easily escaped, so it errors out.

Now piping all three commands above will result in

linux> mysql --host=hostname_or_ip  --database=db_name  --user=root  --password=your_pw  --execute="SHOW TABLES" | egrep '[$\\ ?!@]' | awk -f error_tbl_name.awk
-- The INSERT statements generated look like
INSERT INTO ERROR_TABLE_NAME(name) VALUES('non_alphanum_table_name test it 2'); commit;
INSERT INTO ERROR_TABLE_NAME(name) VALUES('non_alphanum_table_name; test it'); commit;
INSERT INTO ERROR_TABLE_NAME(name) VALUES('non_alphanum_table_name; test it $'); commit;
INSERT INTO ERROR_TABLE_NAME(name) VALUES('non_alphanum_table_name; test it $ \\'); commit;
INSERT INTO ERROR_TABLE_NAME(name) VALUES('non_alphanum_table_name_test_$_!_2'); commit;

Also, note the COMMIT after each INSERT, due to the fact that auto-commit is turned off and, when piping, the mysql command executes the statements for each table, which would otherwise generate an error.

As the final step, execute

linux> mysql --host=hostname_or_ip  --database=db_name  --user=root  --password=your_pw  --execute="SHOW TABLES" | egrep '[$\\ ?!@]' | awk -f error_tbl_name.awk | mysql --host=hostname_or_ip  --database=db_name  --user=root  --password=your_pw

Change these table names to new, standard table names and send an email to the business users with the list of the above table names.

There you have it.
HTH,
Shiva