There are many situations when one needs to parse the URL. It could be in web log parsing or Google Keyword or other performance data that is downloaded through APIs. Application languages like Python, Perl, PHP etc provide many modules to parse or you can use raw regular expression to parse the URLs. Below I have shown few ways of doing it with a simple example – either in applications before loading into staging area of data warehouse or parsing it in SQL (MySQL) after the load.
In the example url http://http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get sub-domain name and file name parts i.e; http://www.hiregion.com and sem-google-aswords-api-data-retriever.html.
Other posts that may be of interest:
1. Regular expressions – An intro
2. 99-bottles of bear in many languages
import time
import re
t2 = time.time()
for ii in range(1000000): #-- Parse it million times...
m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)', 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
#- Saved to m.groups
t3 = time.time()
print 'regex took:', t3 - t2, 'seconds'
2. Using Python urlparse module
import re
from urlparse import urlparse
t2 = time.time()
for ii in range(1000000): #-- Parse a million times...
parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
file = re.sub('/\d{4}/\d{2}/', '',parsed.path)
#- Saved to parsed.netloc and file
t3 = time.time()
print 'urlparse took:', t2 - t1, 'seconds'
3. Using Perl Regular Expression
Regular expression is simple and same as the one used above in Python.
#!/cygdrive/c/perl/bin/perl
use strict;
use Time::HiRes qw(gettimeofday tv_interval);
my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.htm';
my $th1 = [gettimeofday()];
for my $ii (1..1000000) { #-- Parse a million times
$str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
#- Saved to $1 and $2
}
my $th2 = [gettimeofday()];
print "tv_interval", tv_interval($th1), "\n";
If you would like to transform url after loading the data into staging tables (say, after the bulk load), the following one can be used.
4. Using MySQL (5.1 version)
SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html' ;
PREPARE select_q FROM
'SELECT CASE
WHEN ? REGEXP ''^http://'' then substring_index(?, ''/'', -1)
ELSE NULL end file,
CASE
WHEN ? REGEXP ''^http://'' then substring_index(substring_index(?, "/", 3), ''/'', -1)
ELSE NULL end sub_domain
' ;
EXECUTE select_q USING @str, @str, @str, @str;
# Output...
# file sub_domain
# sem-google-adwords-api-data-retriever.html www.hiregion.com
With Perl (5.10.1) and Python (2.6) running on the same system, perl took 0.999 seconds, Python regular expressions took 2.641 seconds and urlparse took 6.5 seconds.
Cheers,
Shiva
~
~
There are lots of information about latest technology, like Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. This information seems to be more unique and interesting. Thanks for sharing.
Big Data Training in Chennai | Hadoop Course in Chennai | Big Data Training in Chennai
Excellent post!!! Your article helped to under the future of java development. Being an open source platform, java is integrated in most of the software development industries to create rich featured applications. Java Course in Chennai | Best JAVA Training in Chennai
Hadoop is one of the best tool which is used to handle the big data in the IT industy and it is the fastest growing field in information technology.
hadoop training in Chennai | hadoop training chennai
Excellent post!!!. The strategy you have posted on this technology helped me to get into the next level and had lot of information in it.
salesforce training in chennai | salesforce training institute in chennai
Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.
Node JS training in chennai | Node JS training institute in chennai
Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.
Hadoop Training in Chennai | Hadoop Training Chennai
Excellent content. Thanks for sharing content which is very useful that provided me the required information.
Cloud Computing Training in Chennai | Cloud Computing Courses in Chennai
This is my first visit to your blog, your post made productive reading, thank you. dot net training in chennai