URL parsing in Python, Perl, MySQL

There are many situations when one needs to parse the URL. It could be in web log parsing or Google Keyword or other performance data that is downloaded through APIs. Application languages like Python, Perl, PHP etc provide many modules to parse or you can use raw regular expression to parse the URLs. Below I have shown few ways of doing it with a simple example – either in applications before loading into staging area of data warehouse or parsing it in SQL (MySQL) after the load.

In the example url http://http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get sub-domain name and file name parts i.e; http://www.hiregion.com and sem-google-aswords-api-data-retriever.html.

Other posts that may be of interest:
1. Regular expressions – An intro
2. 99-bottles of bear in many languages

import time
import re

t2 = time.time()
for ii in range(1000000): #-- Parse it million times...
m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)', 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
#- Saved to m.groups
t3 = time.time()

print 'regex took:', t3 - t2, 'seconds'

2. Using Python urlparse module

import re
from urlparse import urlparse

t2 = time.time()
for ii in range(1000000): #-- Parse a million times...
parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
file = re.sub('/\d{4}/\d{2}/', '',parsed.path)
#- Saved to parsed.netloc and file

t3 = time.time()
print 'urlparse took:', t2 - t1, 'seconds'

3. Using Perl Regular Expression

Regular expression is simple and same as the one used above in Python.


use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.htm';
my $th1 = [gettimeofday()];

for my $ii (1..1000000) { #-- Parse a million times
$str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
#- Saved to $1 and $2

my $th2 = [gettimeofday()];

print "tv_interval", tv_interval($th1), "\n";

If you would like to transform url after loading the data into staging tables (say, after the bulk load), the following one can be used.

4. Using MySQL (5.1 version)

SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html' ;

WHEN ? REGEXP ''^http://'' then substring_index(?, ''/'', -1)
ELSE NULL end file,
WHEN ? REGEXP ''^http://'' then substring_index(substring_index(?, "/", 3), ''/'', -1)
ELSE NULL end sub_domain
' ;
EXECUTE select_q USING @str, @str, @str, @str;

# Output...
# file sub_domain
# sem-google-adwords-api-data-retriever.html www.hiregion.com

With Perl (5.10.1) and Python (2.6) running on the same system, perl took 0.999 seconds, Python regular expressions took 2.641 seconds and urlparse took 6.5 seconds.



8 thoughts on “URL parsing in Python, Perl, MySQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s