There are many situations where one needs to parse URLs: web log analysis, Google keyword or other performance data downloaded through APIs, and so on. Application languages like Python, Perl and PHP provide modules for parsing, or you can use raw regular expressions. Below I show a few ways of doing it with a simple example, either in the application before loading into the staging area of the data warehouse, or in SQL (MySQL) after the load.
In the example URL http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get the sub-domain name and the file name parts, i.e., www.hiregion.com and sem-google-adwords-api-data-retriever.html.
Other posts that may be of interest:
1. Regular expressions – An intro
2. 99 bottles of beer in many languages
1. Using Python regular expression

import time
import re

t2 = time.time()
for ii in range(1000000):   #-- Parse it a million times...
    m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)',
                  'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
    #- Sub-domain and file name are saved to m.group(1) and m.group(2)
t3 = time.time()
print 'regex took:', t3 - t2, 'seconds'
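For clarity, here is a small sketch (not from the original snippet; the None check and variable names are mine) showing how to read the two captured pieces out of the match object:

m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)',
              'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
if m:   #- the pattern may not match every URL, so guard before reading groups
    sub_domain = m.group(1)   #- 'www.hiregion.com'
    file_name = m.group(2)    #- 'sem-google-adwords-api-data-retriever.html'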
2. Using Python urlparse module
import time
import re
from urlparse import urlparse

t2 = time.time()
for ii in range(1000000):   #-- Parse a million times...
    parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
    file = re.sub(r'/\d{4}/\d{2}/', '', parsed.path)
    #- Sub-domain and file name are saved to parsed.netloc and file
t3 = time.time()
print 'urlparse took:', t3 - t2, 'seconds'
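As a quick reference, these are the components urlparse returns for the example URL (an illustrative sketch of mine, using the same variable name as above):

parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
#- parsed.scheme -> 'http'
#- parsed.netloc -> 'www.hiregion.com'  (the sub-domain part)
#- parsed.path   -> '/2009/12/sem-google-adwords-api-data-retriever.html'
file = re.sub(r'/\d{4}/\d{2}/', '', parsed.path)   #- 'sem-google-adwords-api-data-retriever.html'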
3. Using Perl Regular Expression
The regular expression is simple and the same as the one used above in Python.
#!/cygdrive/c/perl/bin/perl
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html';
my $th1 = [gettimeofday()];
for my $ii (1..1000000) {   #-- Parse a million times
    $str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
    #- Sub-domain and file name are saved to $1 and $2
}
my $th2 = [gettimeofday()];
print "tv_interval: ", tv_interval($th1, $th2), " seconds\n";
If you would like to transform the URL after loading the data into staging tables (say, after a bulk load), the following approach can be used.
4. Using MySQL (5.1 version)
SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html';

PREPARE select_q FROM
'SELECT CASE
          WHEN ? REGEXP ''^http://'' THEN substring_index(?, ''/'', -1)
          ELSE NULL
        END AS file,
        CASE
          WHEN ? REGEXP ''^http://'' THEN substring_index(substring_index(?, ''/'', 3), ''/'', -1)
          ELSE NULL
        END AS sub_domain';

EXECUTE select_q USING @str, @str, @str, @str;
# Output...
# file sub_domain
# sem-google-adwords-api-data-retriever.html www.hiregion.com
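To see what the two SUBSTRING_INDEX calls are doing, here is a rough Python equivalent of the same split-by-'/' logic (a sketch of mine, not part of the SQL above):

url = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html'
if url.startswith('http://'):   #- mirrors the REGEXP ''^http://'' check
    file_part = url.split('/')[-1]   #- substring_index(url, '/', -1) -> 'sem-google-adwords-api-data-retriever.html'
    sub_domain = '/'.join(url.split('/')[:3]).split('/')[-1]   #- keep the first 3 '/'-delimited pieces, then take the last -> 'www.hiregion.com'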
With Perl (5.10.1) and Python (2.6) running on the same system, Perl took 0.999 seconds, Python regular expressions took 2.641 seconds, and urlparse took 6.5 seconds.
Cheers,
Shiva