DW solution – IP mapping to Geo Locations

In my previous few posts on the blog, I covered how to convert IP addresses to corresponding integers and back. When performing a Data Warehouse (DW) tables join between IP addresses for location analysis joining on integer IP column is much faster than joining on string column.  As noted in other posts, this article covers IPv4 but technique could be used for IPv6 addresses on using 2 BIGINT.

For geo location info, you can use MaxMind’s Geo-City data that is provided in two entities – one with mapping IP block to a location ID and another one with location ID to city/state/country.  It is one to many (1:M) relationship and to find an IP address location one need to find the matching IP block and then map it to city or country.  There are posts that describe in detail how to perform it effeciently for a given IP.  These techniques will work for an OLTP environment where you pass known IP address from UI or application layer to database and gets its city/country.

  1. Fast MySQL Range Queries on MaxMind GeoIP Tables byAndy Skelton
  2. MaxMind GeoIP Install Setup Tutorial by Bartomedia
  3. Finding a user’s country/region based on IP by Shlomi Noach
  4. On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole

Geo-City data has following fields seperated by comma and row terminated by new line (\n):

  • Location ID
  • Country Code
  • State/Region
  • City Name
  • Postal Code (US only)
  • Latitude
  • Longitude
  • Metro Code
  • Area Code

IP Block has:

  • Start IP Address
  • End IP Address
  • Location ID

While loading this dimensional data during ETL process, you should also do the IP address transformation to integer as explained in “Converting an IP address to an integer“.  And Geo-City dimensional table will also have partition range (the reason for this post) and surrogate key (id).  You should also perform other data validation like non-overlapping ranges, valid country codes, etc. which are all typical standard activities of DW ETL process.  The time, resource and effort put into cleaning and validating the data before pushing the data into final fact tables will pay well within short period of time.  To fix a data issue and reconcile the reports in DW/BI environment always leads to lower ROI.

You can also denormalize the above tables and create a single table.  In either case by joining on range of IP address for a given set of IPs will result in the same slow performance.

dim_geo_ip_city_location holds around 5.9million rows with following attributes:

CREATE TABLE dim_geo_ip_city_location (
location_id int NOT NULL,
    start_ip varchar(16) NOT NULL,
    end_ip  varchar(16) NOT NULL,
    country char(2),
    region varchar(64),
    city varchar(128),
    postal_code varchar(32),
    latitude decimal(8,4),
    longitude decimal(8,4),
    metro_code varchar(8),
    area_code varchar(8),
...
)ENIGNE=...

CREATE TABLE dim_ip_block (
    start_ip_num bigint NOT NULL,
    end_ip_num bigint NOT NULL,
    location_id int NOT NULL,
        ...
)ENGINE=...

In MySQL (5.0) optimizer performs range operation effeciently with a constant meaning that you have given the query an IP address like “WHERE ip_addr between 1230 and 1234”, for example, but not with range operations without constants.  For more info on using range optimization see mysql documentation and for indexing see “Mastering the Art of (MySQL) Indexing“.

I did not find any article providing some insights into doing DW/BI analysis on IP addresses.  In a DW environment the business user is interested in slicing and dicing the hundreds of thousands to million of IPs.  Just trying to find few thousand IP address locations (See query 1 below) was taking more than an hour when joined between lookup table (a table similar to dim_geo_ip_city_location but without partition column (ip_range_partition) and id) and ip address join table even with appropriate indexes like index on (start_ip_num, end_ip_num) or start_ip_num and end_ip_num.

Query 1:

-- Simple query to select IP block
SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num
FROM ip_addresses IA use index(...)
JOIN dim_ip_block B use index(...)
WHERE IA.ip_num between B.start_ip_num and B.end_ip_num

This leads to MySQL performing full scans. And in the case of ip_block table (B) it doesn’t use any index. For this to be avoided, I partitioned the dimension (lookup) table into 1000 parts and since this table has ~5.9 Million rows each partition will result in ~5,900 rows in each partition.  For more on data set partitions see its wiki.  The partitions should be numbered in the ascending order with respect to ascending range of ip blocks (start_ip_num).  Then also create a another table (dim_ip_range_partition) with start_ip_num and end_ip_num for this each range basically creating super IP blocks.

CREATE TABLE dim_ip_range_partition (
     partition_id int NOT NULL AUTO_INCREMENT,
     start_ip_num bigint NOT NULL,
     end_ip_num bigint NOT NULL,
     ....
) ENGINE=InnoDB ;

Also, create a primary key on patition_id and index on (start_ip_num, end_ip_num) on this table.
The solution is based on two things that there is no overlapping IP block and IP blocks are ordered (in ascending order).

To populate the above table,

  1. Add id (row_id) column to dim_ip_block
  2. Add partition_id integer column to dim_ip_block
  3. Update id from 1 to n (highest start_ip_num) in ascending order of start_ip_num
  4. Partition “n” ids into P parts (n/P) where n > 1 and 1 < P < n
  5. Assign each part/partition ascending integer value
  6. Now update above table dim_ip_range_partition P_TABLE from dim_ip_block B_TABLE where P_TABLE partition id between 1 and P of B_TABLE and P_TABLE.start_ip_num is lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num is highest of corresponding partition.  In other words…

         INSERT INTO dim_ip_range_partition
            (partition_id, start_ip_num, end_ip_num, …)
         SELECT partition_id, min(start_ip_num), max(end_ip_num), …
         FROM dim_ip_block
         GROUP BY partition_id
         ORDER BY partition_id asc

By doing so, we have created a super IP blocks each having n/P blocks from the original ip_blocks.   And the following query is one way to extract city/country info for given set of IP addresses.

SELECT t.ip_num, t.ip, B.start_ip_num, B.start_ip, B.end_ip_num, B.end_ip
,L.city, L.country
FROM
(
   --- Get the super ip block (partition)
   SELECT C.ip_num, max(partition_id) partition_id, max(P.start_ip_num) start_ip_num
   FROM ip_addresses C use index(...)
   JOIN dim_ip_range_partition P
   WHERE C.ip_num >= P.start_ip_num
   GROUP BY C.ip_num
) t
LEFT OUTER JOIN mktg_ip_block B use index(...) 
  ON t.partition_id = B.partition_id     --- And join on this partition
  AND t.ip_num >= B.start_ip_num
  AND t.ip_num <= B.end_ip_num
JOIN dim_geo_ip_city_location L        --- Get city/country
  ON B.location_id = L.location_id

The query that would have taken more than hour returns within 30 seconds now!  You can further optimize the above design with say more denormalized relations, having a large number of partitions (this is a trade-off, with highest partition n = P, each partition is same as original table which will result in worse off performance!) on both lookup table and lookedup table.  Or extending the partitioning concept to more levels (super super block, ….) you can reduce the number of joining rows with little bit more complex query.

The solution is a typical Computer Science way of handling the complex problem by dividing into smaller ones and solving it.  In this it is more specific to MySQL solution for large data sets with range operations.

Cheers,
Shiva

Tools to convert IP address

Following tool converts an ip address to other formats. For more on IP address conversion check out part 1 and 2.


Converting an IP address to an integer and reverse (Part 1)
IP Address conversions (Part 2)


IP address converter:

Enter IP:
(e.g. 192.168.100.200 or 3232261320 etc)
Powered by: WebToolHub.com

http://www.webtoolhub.com/plugin.axd

You can also check out 2’s complement of a number here. This will be handy when one needs to check for signed numbers (2’s complement).

http://www.wolframalpha.com/widget/widget.jsp?id=52cf49fea5ff66588408852f65cf8272

IP Address conversions

This is the part 2 of converting an IP address to an integer or vice versa. For more details and little introdution see Part 1.

Converting an ip address to integer and back is little straight forward in MySQL compared to SQL Server since SQL Server (2008) doesn’t provide bit shift operators.

In short you can’t do,

SELECT 1<<2 

nor

SELECT 3232261320 -- IP '192.168.100.200'
* (256*256*256*256*256*256) -- Left shift 6 bytes
/ (256*256*256*256*256*256*256) -- Right shift 7 bytes to get 2nd octet

One will encounter “Arithmetic overflow error converting expression to data type int”. Even casting each integer to bigint will not solve the problem as the multiplication results in overflow of signed bigint. And there is no unsigned bigint in SQL Server.

Another way to solve this is to do the following in SQL Server by using bit AND. Retain only the byte that is off interest and then right shift the bits appropriately.

#-- IP '192.168.100.200'
SELECT (cast(3232261320 as bigint) & cast(4278190080 as bigint))/(256*256*256);
SELECT (cast(3232261320 as bigint) & 16711680)/(256*256);
SELECT (cast(3232261320 as bigint) & 65280)/(256);
SELECT (cast(3232261320 as bigint) & 255);

#- And the output will be
192
168
100
200

On systems with unsigned bigint you can also do (MySQL)

SELECT cast(3232261320*256*256*256*256 as unsigned)
/ (256*256*256*256*256*256*256) 4th_octet;

SELECT cast(3232261320*256*256*256*256*256*256*256 as unsigned)
/ (256*256*256*256*256*256*256 as unsigned) as 1st_octet;

They will result in outputs 192 and 200 respectively.

With given ip you can simply run the following to get its integer value,

mysql toad>  
SELECT ( ((substring_index(ip, '.',1))<<24)
+ ((substring_index(substring_index(ip, '.', 2), '.', -1))<<16)
+ ((substring_index(substring_index(ip, '.', 3), '.', -1))<<8)
+ ((substring_index(substring_index(ip, '.', 4), '.', -1)<<0))
)
FROM
( SELECT '192.168.100.200' as ip
UNION ALL
SELECT ip FROM som_ip_table;
) t;

Cheers,
Shiva

Converting an IP address to an integer and reverse

In dealing with IP address analysis one may required to map IPs to geo location. There are third party applications or software that perform IP to city, state, country mapping and businesses, for example, can make use of them to see where their web traffic is coming from. 

Article covers IPv4 addresses which use 32 bits and similar technique can be used to IPv6 addresses which use 128 bits by using 2 BIGINTs to store (64bits each).

It is easy to handle reasonable load when one is querying for few IPs or few hundred IPs either from simple user interface entering one or more IP addresses or through APIs. In these cases IP addresses can simply be matched in string format without much of performance issue but it can become a performance bottleneck when dealing with millions of rows loaded through ETL process into Data Warehouse environment and reports are run against fact tables with hundreds of millions of rows. Under these circumstances converting IP to an integer will boost the performance substantially.

MySQL provides functions to convert IP string to integer and backwards. For example, inet_aton converts string to a number (integer) and inet_ntoa converts number to integer.

inet_aton and inet_ntoa:

mysql> select inet_aton('192.168.100.200');
+------------------------------+
| inet_aton('192.168.100.200') |
+------------------------------+
| 3232261320 |
+------------------------------+

mysql> select inet_ntoa(3232261320);
+-----------------------+
| inet_ntoa(3232261320) |
+-----------------------+
| 192.168.100.200 |
+-----------------------+
1 row in set (0.00 sec)


Bit operations:

The functions are simple to use and at the lower level they basically perform bit shifting. In the above example, ‘192.168.100.200’ is represented as 11000000 10101000 01100100 11001000 in binary form. For more on binary system or to convert a number from decimal to binary see wiki here.

To convert string IP to integer function inet_aton multiplies 192 by 256*256*256, 168 by 258*256, 100 by 256 and then adding 200 (last octet in the above ip address), that is,

and it is same as 192<<24 i.e., left shift 192 by 24 bits, 168 << 16 and 100 << 8. Bit shifting is a much faster multiplication since CPU can use its register to move the bit in simple operation.

In reverse, one can not simply divide the ip integer (3232261320) to get back the ip address in octet string format.  But one need to appropriately left and right shift the number (or do multiply and then divide) to get the octet that is of interest. To get the most highest valued position (192), shift the bits to right. For example,

11000000 10101000 01100100 11001000 >> 24 (right shift 24 bits) gives you

00000000 00000000 00000000 11000000

Therefore, all the right most 24 bits dropped off the register (cliff) and we are left with 4th octet 11000000 prefixed with 24 zeros.

mysql> SELECT 3232261320>>24 as 4th_octet;
+-----------+
| 4th_octet |
+-----------+
| 192 |
+-----------+
1 row in set (0.00 sec)

By the same logic to get the 3rd octet, we can move the bits 16 times to the right but we will be left with 4th octet as well. Doing

11000000 10101000 01100100 11001000 >>16 will result in

00000000 00000000 11000000 10101000

So we should first shift the bits to left so that the most highest octet (4th octet) drops off the register.  We can do so by shifting left by 8 bits.

mysql> SELECT 3232261320<<8 as 8blsn;  -- 8blsn: 8 bit left shift number
+--------------+
| 8blsn |
+--------------+
| 827458897920 |
+--------------+
1 row in set (0.00 sec)

Instead of getting a smaller number than before we got a larger number! And obviously, right shifting this number by 16bits to right will not give the 3rd octet of IP.   This happens due to fact that MySQL converted int to bigint silently.  The 4th octet moved into 5th octet resulting in appending 8 zeros to the right as below.

11000000 10101000 01100100 11001000 00000000

Since bigint is 8 bytes long one need to shift not by 1 byte (8bits) but 5 bytes (40 bits) to delete the highest octet (#192).

Just with 4 byte (32 bit) left shift the register would look like below.

11000000 10101000 01100100 11001000 00000000 00000000 00000000 00000000

And with 40 bit left shift the register would look like thus removing (#192 – 11000000 ),

10101000 01100100 11001000 00000000 00000000 00000000 00000000 00000000

Now right shift 56 bits to get the 3rd octet (#168).

00000000 00000000 00000000 00000000 00000000 00000000 00000000 10101000

This is same as
(3232261320 *256 *256 *256 *256 *256) / (256 *256 *256 *256 *256 *256 *256)

 mysql> SELECT 3232261320><>56;
+--------------------+
| 3232261320<>56 |
+--------------------+
| 168 |
+--------------------+
1 row in set (0.00 sec)

To get the 2nd octet first left shift the original number 48 bits and then right shift by 56 bits. For the 1st octet left shift by 56 bits both ways.

mysql> SELECT 3232261320<>56;
+--------------------+
| 3232261320<>56 |
+--------------------+
| 100 |
+--------------------+
1 row in set (0.00 sec)

mysql> SELECT 3232261320<>56;
+--------------------+
| 3232261320<>56 |
+--------------------+
| 200 |
+--------------------+
1 row in set (0.00 sec)

With all imperative/procedural languages providing the bit operators (AND, OR, bit shift, etc), one can and should convert the IP string to integer before being loaded to staging. Thus the parsing or conversion can also be easily parallel processed for large loads.

An example in C sharp (C#) is shown below.

// 
// C# program to convert IP address string to an integer.
//
using System;

namespace IP_addr_to_int
{
class Program
{
static void Main(string[] args)
{
// Infinite loop until user enters control-c.
for (; ; ) {
string ip = "";
Console.Write("Enter ip address: ");
ip = Console.ReadLine();

ulong ipNum = 0;
string[] ipArr = ip.Split('.');

if (ipArr.Length != 4) {
Console.WriteLine("Error. Enter four octet IP address only.");
// Environment.Exit(-1);
continue;
}
for (int ii = ipArr.Length - 1; ii >= 0; ii--)
{
ulong ipOctet = Convert.ToUInt64(ipArr[3 - ii]);
// ipOctet * 256 * 256 * 256;
if (ii == 3) { ipNum += ipOctet << 24; }
// ipOctet * 256 * 256;
else if (ii == 2) { ipNum += ipOctet << 16; }
// ipOctet * 256;
else if (ii == 1) { ipNum += ipOctet << 8; }
// ipOctet * 1
else { ipNum += ipOctet }
// System.Threading.Thread.Sleep(1000);
}
Console.WriteLine
(Convert.ToString(ip)+ " converted to integer " + Convert.ToString(ipNum));
}
}
}
}

For more ways to convert an IP see Part 2  and for tools see Part 3.
Cheers,
Shiva