URL parsing in Python, Perl, MySQL

There are many situations when one needs to parse the URL. It could be in web log parsing or Google Keyword or other performance data that is downloaded through APIs. Application languages like Python, Perl, PHP etc provide many modules to parse or you can use raw regular expression to parse the URLs. Below I have shown few ways of doing it with a simple example – either in applications before loading into staging area of data warehouse or parsing it in SQL (MySQL) after the load.

In the example url http://http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get sub-domain name and file name parts i.e; http://www.hiregion.com and sem-google-aswords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions – An intro
2. 99-bottles of bear in many languages


import time
import re




t2 = time.time()
for ii in range(1000000): #-- Parse it million times...
m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)', 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
#- Saved to m.groups
t3 = time.time()

print 'regex took:', t3 - t2, 'seconds'




2. Using Python urlparse module


import re
from urlparse import urlparse

t2 = time.time()
for ii in range(1000000): #-- Parse a million times...
parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
file = re.sub('/\d{4}/\d{2}/', '',parsed.path)
#- Saved to parsed.netloc and file

t3 = time.time()
print 'urlparse took:', t2 - t1, 'seconds'

3. Using Perl Regular Expression


Regular expression is simple and same as the one used above in Python.

#!/cygdrive/c/perl/bin/perl

use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.htm';
my $th1 = [gettimeofday()];

for my $ii (1..1000000) { #-- Parse a million times
$str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
#- Saved to $1 and $2
}

my $th2 = [gettimeofday()];

print "tv_interval", tv_interval($th1), "\n";

If you would like to transform url after loading the data into staging tables (say, after the bulk load), the following one can be used.

4. Using MySQL (5.1 version)


SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html' ;

PREPARE select_q FROM
'SELECT CASE
WHEN ? REGEXP ''^http://'' then substring_index(?, ''/'', -1)
ELSE NULL end file,
CASE
WHEN ? REGEXP ''^http://'' then substring_index(substring_index(?, "/", 3), ''/'', -1)
ELSE NULL end sub_domain
' ;
EXECUTE select_q USING @str, @str, @str, @str;

# Output...
# file sub_domain
# sem-google-adwords-api-data-retriever.html www.hiregion.com

With Perl (5.10.1) and Python (2.6) running on the same system, perl took 0.999 seconds, Python regular expressions took 2.641 seconds and urlparse took 6.5 seconds.

Cheers,
Shiva

~
~

Regular Expressions

MySQL regular expressions (RE) are a powerful tools that can be very useful in SQL string searches. They enable software engineer to build statements that are very concise and to handle complex string operations that otherwise wouldn’t be possible.

If you are new to regular expressions or would like to know more about them, http://www.regular-expressions.info/ is a good site to visit. You can also get a RE tutorial at net|tuts+. Following is quick list of meta characters that can get you started in using them.
. => A dot matches single character
* => An asterisk matches zero or more of previous matched tokens
? => A question mark matches zero or one time the previous matched token
$ => A dollar at the end anchors the search to the end of string
^ => A caret symbol anchors the search to the beginning of the string
| => A pipe matches either of the two. Example: abc|xyz => either ‘abc’ or ‘xyz’
{m,n} => A quantifier matching between ‘m’ and ‘n’ times. m & n are integers.
Different computer languages have some variations when it comes to more advanced searches and how they handle given character sets. MySQL uses REGEXP string function to implement and matches the string in case “in-sensitive” mode and to match otherwise see this blog.

It is important to note that MySQL REGEXP returns 1 or 0 depending on whether it finds the match in the pattern or not. If a match that your looking for in the pattern or the pattern itself is null, then it returns NULL. Other application languages may return an array or list or true/false. Thus use of REGEXP in where clause is more appropriate than in either SELECT or others.

For example:

SELECT 'AdWords' REGEXP 'A*W*';    

returns 1

Trying the following on MySQL Sakila database:

SELECT * FROM country  WHERE country REGEXP '^U.*' ;

and it returns ‘Ukraine, United Arab Emirates, United Kingdom & United States’; all the country names that begin with ‘U’ and followed by 0 (zero) or more characters.
For demonstration purpose, let us assume that we would like to get a name of the European explorer whose first name starts with ‘M’ and ends with ‘O’. Also, first and last name are not normalized and stored in single column.

SELECT name FROM (
SELECT substring_index(name, ' ', 1) REGEXP '^M.*O' YN, name
FROM (
SELECT 'Christopher Columbus' name
UNION
SELECT 'Marco Polo' name
UNION
SELECT 'Vasco Da Gama' name
UNION
SELECT 'Francisco Pizarro' name
) x
) y
WHERE YN = 1

‘^M.*O’ pattern matches any expression that begins (^) with ‘F’ and may have any character (.) zero or more times (*) and ending with ‘D’. The result is Marco Polo.
There are also many special constructs one can use to handle group of specific characters or positions in the string, etc. Below are the ones I end up using more often than others.

  • [[:<:]]token[[:>:]]

Searches the token in the string that is neither preceded nor followed by any alpha-numerals or “_” (underscore). For example,

SELECT 'Here is a token in a string' REGEXP '[[::]]'

Returns 1 (true), where as the following ones return 0 (false)

SELECT 'Here is a _token in a string' REGEXP '[[::]]'
SELECT 'Here is a tokens in a string' REGEXP '[[::]]'

  • [:character_class:]
This would match any group of characters that belong to predefined character class. Few examples of classes are “alpha” – alpha characters, “digit” – digit characters, “space” – any space, tab, newline, etc.

SELECT column1, length(column1) from table
WHERE column1 REGEXP '[[:space:]]' ;

will return only those column1 and its length if they have any space embedded in them – quite helpful when you want to identify that any space character has crept in when it is not suppose to.

REGEXP’s complex searches come with a caveat of the way the search is done at the low levels – bits and bytes. Both RLIKE and REGEXP work in byte-wise fashion and can result in unexpected results when used on multi-byte strings. For any LATIN character set one will be safe to use them but for UNICODE need to be careful.

Hope that helps,
Shiva