In a couple of my earlier posts I talked about pulling data from search engines and then loading it into staging tables of a data warehouse. You can load data in a couple of ways, including a database's bulk-load tools – for example, MySQL's "LOAD DATA INFILE" statement – or an ETL tool such as Talend Open Studio or similar (Pentaho, Informatica, etc.)
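As a point of comparison, the bulk-load route looks roughly like this; the file path, table name, and column names below are illustrative only, not the actual schema used in this post:

```sql
-- Hypothetical example: bulk-load a tab-delimited keyword file into a
-- staging table. Path, table, and column names are assumptions.
LOAD DATA INFILE '/tmp/keyword_perf.tsv'
INTO TABLE stg_keyword_performance
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(report_date, account_id, campaign_id, ad_group_id, keyword_id,
 cpc, ctr, cost, impressions, url);
```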
In this article, I will load a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes. The setup used was:
MySQL 5.1 (Community version)
Talend Open Studio (3.2.2)
The data flows through six components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput): the input data file is read, transformed into another output file, and then loaded into the database. Component #3 (the sorting component) is not strictly needed, but is shown here to stress the ease with which a developer can quickly put together an ETL process. The Palette tab on the right-hand side houses these components and many more. If you use Java instead of Perl, you will likely have even more components available; for a list of all available components, take a look at Talend Forge.
The input record format is tab-separated fields with one record per line, as is the output file written by component 4.
# Typical fields in the file are date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url, etc.
# Example record:
# 2009-12-12  9876543210  34987650  1147382910  1234567890  0.07  0.226632  247.98  15444  http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html
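Outside of Talend, splitting one of these tab-separated records into named fields takes only a few lines. This is a minimal sketch; the field names follow the layout described above and are illustrative:

```python
# Minimal sketch: split one tab-separated record into named fields.
# Field names follow the layout described in the post; they are illustrative.
FIELDS = ["date", "account_id", "campaign_id", "ad_group_id",
          "keyword_id", "cpc", "ctr", "cost", "impressions", "url"]

def parse_record(line):
    """Return a dict mapping field names to the tab-separated values."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

record = parse_record(
    "2009-12-12\t9876543210\t34987650\t1147382910\t1234567890\t"
    "0.07\t0.226632\t247.98\t15444\t"
    "http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html"
)
print(record["date"], record["cpc"], record["url"])
```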
I use the following regular expression in component 2 to parse each URL and extract the domain name.
This parses any correctly formatted URL, such as http://www.amazon.com, https://amazon.com, http://www.amazon.co.uk, or amazon.com, and more.
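A pattern along these lines handles the URL shapes listed above; this is my own sketch in Python, not necessarily the exact expression used inside the tPerlRow component:

```python
import re

# Sketch of a domain-extraction pattern: optional scheme, optional "www.",
# then capture everything up to the first "/" (or end of string).
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^/\s]+)')

def extract_domain(url):
    """Return the domain portion of a URL, or None if nothing matches."""
    m = DOMAIN_RE.match(url)
    return m.group(1) if m else None

for u in ("http://www.amazon.com", "https://amazon.com",
          "http://www.amazon.co.uk", "amazon.com",
          "http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html"):
    print(u, "->", extract_domain(u))
```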
Component 3 sorts the input stream coming from the regexp output, and component 4 dumps it into an output file; that file is then read by the filter (component 5), which drops any record missing the primary key before loading into the staging database. You can also load into the database without the primary-key constraint to speed things up further, and clean up the duplicates before pushing into either dimension or fact tables.
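The filter step can be sketched in a few lines. Here I assume the primary key is the (date, keyword_id) pair; that key choice is an assumption for illustration, not necessarily the job's actual key:

```python
# Sketch of the filter step: drop records whose primary-key fields are
# missing or empty. The key fields (date, keyword_id) are assumed.
def has_primary_key(record, key_fields=("date", "keyword_id")):
    """True if every key field is present and non-empty."""
    return all(record.get(f) for f in key_fields)

rows = [
    {"date": "2009-12-12", "keyword_id": "1234567890", "cpc": "0.07"},
    {"date": "",           "keyword_id": "1234567890", "cpc": "0.05"},  # no date
    {"date": "2009-12-12", "keyword_id": "",           "cpc": "0.09"},  # no key
]
clean = [r for r in rows if has_primary_key(r)]
print(len(clean))  # -> 1
```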
The whole run took a little over a minute (~63 seconds) on a test system running all three (MySQL, Talend, Perl). For daily loads, you can use a scheduler to kick off the job and automate the loading process.
Hope that helps,