Yahoo SEM daily report – Currency and Dups handling

If you are pulling daily keyword performance data from major search engines into your data warehouse, you need to perform many transformations before daily, weekly, or monthly summaries are available to business (end) users. In the case of Yahoo, I needed to make two transformations: one to handle currency conversion, and another to handle Yahoo sending more than one keyword when only one is expected. The latter seems to happen when the query or landing page URL is modified for the same bid unit, for high-impression queries, and for disabled queries/keywords.

The following SQL handles both: data is read from a staging table, transformed and cleaned, and loaded into the final table. It performs currency conversion for all countries except the US and Mexico (whose units are already in US dollars), which need no conversion.

The sub-query below uses MySQL GROUP_CONCAT to find any duplicate keywords for a bid unit. dim_currency holds the daily currency exchange rates pulled from an external third-party service early in the morning, and the se_query_group table holds the grouping of keywords based on company requirements and search engine. The query also filters out any keyword with zero impressions.
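To make the de-duplication logic concrete outside SQL, here is a minimal Python sketch (for illustration only) of the same first/second-keyword split that the SUBSTRING_INDEX calls in the query perform on the GROUP_CONCAT string:

```python
def split_dup_keywords(query2, sep="---"):
    """Mimic the SQL: 1st keyword via SUBSTRING_INDEX(query2, '---', 1),
    2nd via SUBSTRING_INDEX(SUBSTRING_INDEX(query2, '---', 2), '---', -1)."""
    if sep not in query2:
        return query2, None          # normal single keyword
    parts = query2.split(sep)
    return parts[0], parts[1]        # 1st and 2nd keyword

print(split_dup_keywords("cheap flights"))                  # ('cheap flights', None)
print(split_dup_keywords("cheap flights---cheap flight"))   # ('cheap flights', 'cheap flight')
```

As in the SQL, any third or later duplicate is simply dropped; only the first two keywords for a bid unit survive.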

The query can be further modified for better performance by joining with the data warehouse query tables and using query_id in the joins, and by using the search engine bid unit as the surrogate key.

INSERT INTO keyword_perf #-- Final (target) table; name yours accordingly.
 (se_date, search_engine, account_id, campaign_id, ad_group_id, keyword_id
 ,domain, url, query_group_id
 ,query, query_num_2
 ,impressions, clicks, ctr
 ,cpc_orig, cpc
 ,revenue_orig, revenue
 ,cost_orig, cost
 ,avg_position)
SELECT KP.se_date, KP.search_engine, KP.account_id, KP.campaign_id
 ,KP.ad_group_id, KP.keyword_id
 ,KP.domain, KP.url, KP.query_group_id
 ,CASE
    WHEN query2 NOT RLIKE '---' THEN query2 #-- Normal single kw
    WHEN query2 RLIKE '---' THEN SUBSTRING_INDEX(query2, '---', 1) #-- Get 1st kw
  END query
 ,CASE
    WHEN query2 RLIKE '---' THEN SUBSTRING_INDEX(SUBSTRING_INDEX(query2, '---', 2), '---', -1)
  END query_num_2 #-- Get 2nd kw; NULL when there is no duplicate
 ,KP.impressions, KP.clicks, KP.ctr
 ,cpc cpc_orig
 ,CASE
    WHEN C.country_code NOT IN ('US', 'MX') THEN cpc / C.currency_rate
    ELSE cpc
  END cpc
 ,revenue revenue_orig
 ,CASE
    WHEN C.country_code NOT IN ('US', 'MX') THEN revenue / C.currency_rate
    ELSE revenue
  END revenue
 ,cost cost_orig
 ,CASE
    WHEN C.country_code NOT IN ('US', 'MX') THEN cost / C.currency_rate
    ELSE cost
  END cost
 ,CASE
    WHEN avg_position > 0 AND avg_position < 1 THEN 1
    ELSE avg_position
  END avg_position
FROM (
  SELECT se_date, search_engine
   ,account_id, campaign_id, ad_group_id, keyword_id
   ,domain, url, query_group_id, query
   ,GROUP_CONCAT(query SEPARATOR '---') query2
   ,SUM(impressions) impressions, SUM(clicks) clicks
   ,AVG(ctr) ctr, AVG(cpc) cpc
   ,AVG(revenue) revenue
   ,SUM(cost) cost, AVG(avg_position) avg_position
  FROM stg_keyword_perf
  WHERE se_date >= '' #-- Date here
  GROUP BY se_date, account_id, campaign_id, ad_group_id, keyword_id
 ) KP #-- stg_kw_perf
JOIN se_query_group QG
  ON KP.query_group_id = QG.query_group_id AND KP.search_engine = QG.search_engine
JOIN dim_currency C
  ON KP.se_date = C.currency_date AND QG.country_code = C.country_code
WHERE KP.se_date = '' #-- Date here
  AND KP.impressions > 0;
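The currency conversion in the query reduces to a simple rule: divide by the day's exchange rate unless the row is already in US dollars. A hedged Python equivalent (for illustration; country codes and rates here are made up):

```python
def to_usd(amount, country_code, currency_rate):
    """Mirror the SQL CASE expressions: US and MX rows are already in
    US dollars and pass through; everything else is divided by the
    day's exchange rate from dim_currency."""
    if country_code in ("US", "MX"):
        return amount
    return amount / currency_rate

print(to_usd(150.0, "GB", 0.75))   # 200.0 -- converted to dollars
print(to_usd(99.0, "US", 1.0))     # 99.0  -- unchanged
```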


SEM – Google Adwords API Data Retriever

If you are running many large campaigns or thousands of campaigns and want to track keyword performance in Google Adwords, the following application helps retrieve the data at regular intervals (daily, weekly, monthly, etc.). Using your data warehouse and reporting infrastructure, you can then slice and dice the data based on your business requirements.

The application is a multi-threaded implementation of pulling data through the API. Each thread gets a chunk of campaign IDs and pulls a keyword performance report for each. The code included here is skeletal: you need to add error handling, code to suit your database schema, and some ETL. In my environment I used MySQL 5.0, Linux (2.6) and Windows, Perl 5.10.1, and Google Adwords API v13. You can also use Google’s client libraries, available in many languages including Java, Python, DotNet, Perl, and PHP.

Other posts that may be of interest:
1. Google Analytics Data Retriever
2. Bulk data loading using Talend
3. Google Keyword Performance data pull

Campaign Class:

package Campaign;

# Campaign Package

use strict;
use DBI;

sub new {
    my $class = shift;

    my $self = {};
    $self->{campIds} = [];   #-- Master campaign ID array

    bless ($self, $class);
    $self->_populateCampaignIds();

    return $self;
}

sub _populateCampaignIds {
    my $self = shift;

    #- Connect to the database to get the list of active campaign IDs.
    #- my $dbh = DBI->connect($data_source, $user_name, $pw);
    #- my $sth = $dbh->prepare(
    #-     "SELECT campaign_id FROM campaign
    #-      WHERE search_engine = 'google'
    #-      AND status = 'active'");
    #- $sth->execute() or die "Unable to execute a query.\n";
    #- while ( my @id = $sth->fetchrow_array ) {
    #-     push (@{$self->{campIds}}, $id[0]);
    #- }
    #- $dbh->disconnect;

    #---- For demo, I have hard-coded some IDs.
    push( @{$self->{campIds}}, '1111');
    push( @{$self->{campIds}}, '2222');
    push( @{$self->{campIds}}, '3333');

    return 1;
}

#- Returns master list of campaign IDs.
sub getCampaignIds {
    my $self = shift;

    return @{ $self->{campIds} };
}

#- Returns a chunk of campaign IDs from the master array.
sub getCampaignIdChunk {
    my $self = shift;
    my $chunks = shift;
    my $index = shift;

    if (!defined $index or !defined $chunks) {
        return undef;
    }

    my $chunkSize = int (scalar(@{$self->{campIds}}) / $chunks);
    if ($chunkSize < 1) {
        print "Error: no campaigns to process for thread $index\n";
        return undef;
    }

    my $lIx = $chunkSize * ($index - 1);   #-- low index

    my @campIds = ();
    if ($index < $chunks) {
        @campIds = splice (@{$self->{campIds}}, $lIx, $chunkSize);
    } else {
        #- Last thread also picks up the remainder.
        @campIds = splice (@{$self->{campIds}}, $lIx);
    }

    return @campIds;
}

1;

Prototype application:

# Adwords Data Retriever
# This is a skeleton application to pull Adwords Keyword Performance Report data
# from Google through Adwords API.
# Uses multi-threaded programming technique for simultaneous pull of many campaigns.
# Campaign ids in the application are hard-coded with dummy values; replace them by
# connecting to your database and getting actual campaign ids (code shown).
# Also, for authentication use your user name, password, developer & application code.
# Valid for Google Adwords v13.
# Author: Shiva M.

#- use lib "Add directory where you installed";
use strict;
use Benchmark;
use Getopt::Std;
use Data::Dumper;

use XML::Simple;
use LWP::Simple;
use Crypt::SSLeay;
use SOAP::Lite;
use IO::Uncompress::Gunzip qw();

use threads;
use threads::shared;

use Campaign;

my %opts;
my @worker;
my $NUM_THREADS = 3; #-- Modify this to suit your environment. Or better, use the Config module with an ini file for application configuration.
my $GOOGLE_API = '';

# Main program.
getOpts();

if ($opts{'s'} == 1) {
for (my $i = 1; $i < ($NUM_THREADS+1); $i++) {
print "MASTER: Starting thread " . $i, "\n";
$worker[$i] = threads->new(\&pullCampaignData, $i);
sleep 5;
}
}

while (anyRunningThread()) {
for (my $ii = 1; $ii < scalar(@worker); $ii++) {
my $thr = $worker[$ii];
if ($thr->is_joinable()) {
my $ret = eval { $thr->join() };
print "Thread $ii returned: $ret\n";
}
}
sleep 5;
}

exit (0);

# Subroutines.
sub getOpts {
if (!getopts('s:d:', \%opts)) {
print "Invalid option. Exiting\n";
exit (-1);
}

return 1;
}

#- Any thread still running? Returns Yes/No (1/0).
sub anyRunningThread {
my $runThrs = threads->list(threads::running);
print "Still $runThrs threads are running.\n";
if ($runThrs) { return 1; }

return 0;
}

#- Each thread pulls keyword performance reports for its list of campaigns.
sub pullCampaignData {
my ($tId) = @_;
my $thr = $worker[$tId];

my $to_dollar_conv_factor = 1 / (1000 * 1000); #-- Currency is in micros: $1 == 1000 * 1000 micros.
my $yesterday_date = ''; #-- Report date (YYYY-MM-DD) here.

my $log = "thread_$tId.log";
open (TFH, ">$log") or die "Unable to open file $log\n"; #-- Log file for this thread.

#- Get this thread share of load - list of campaign IDs
my $campaign = Campaign->new();
my @tCampIds = $campaign->getCampaignIdChunk($NUM_THREADS,$tId);

print TFH "\npullCampaignData \$tId: $tId: campaignIDs:", map {"$_ "} @tCampIds;

my ($email, $pw, $ua_name, $developer_token, $application_token); #-- Set your credentials here.
my @headers = (SOAP::Header->name('email' => $email),
SOAP::Header->name('password' => $pw),
SOAP::Header->name('useragent' => $ua_name),
SOAP::Header->name('developerToken' => $developer_token),
SOAP::Header->name('applicationToken' => $application_token)
);
my $url = sprintf('', 'adwords');
my $wsdl = $url . '?wsdl';
my $service = SOAP::Lite->service($wsdl)->autotype(0)->readable(1)->proxy($url);
print TFH "\ndebug:service=$service" if $opts{d};

$service->on_fault(sub {
my ($soap, $response) = @_;
die('The following SOAP fault occurred:', "\n",
'faultcode: ', $response->faultcode(), "\n",
'faultstring: ', $response->faultstring(), "\n");
});
SOAP::Lite->import(+trace => 'debug') if $opts{d};

# Make SOAP call.
# Get one campaign data at a time. Generate a file to bulk load into database.

foreach my $cId (@tCampIds) {

my $AWJobParams = {
'selectedReportType' => 'Keyword',
'name' => 'AW Keyword Perf Daily Report',
'aggregationTypes' => ['Daily'],
'startDay' => $yesterday_date,
'endDay' => $yesterday_date,
'selectedColumns' => [ qw (
KeywordId Keyword CampaignId CampaignStatus
AdGroupId AdGroupStatus AdGroupMaxCpa AdWordsType
Impressions Clicks CTR CPC Conversions ConversionRate Cost
AveragePosition QualityScore AverageConversionValue BottomPosition
CPM CostPerConversion CostPerTransaction DailyBudget DefaultCount
DefaultValue FirstPageCpc KeywordDestUrlDisplay KeywordStatus
KeywordTypeDisplay LeadCount LeadValue MaxContentCPC MaximumCPC
MaximumCPM PageViewCount PageViewValue SalesCount SalesValue
TopPosition TotalConversionValue Transactions ValuePerClick
ValuePerCost CustomerName ExternalCustomerId ) ],
'crossClient' => 1,
'campaigns' => [$cId]
};

# Step 1. Define a job
my $report_job = SOAP::Data->name('job' => $AWJobParams);
$report_job->attr({'xsi:type' => 'DefinedReportJob', 'xmlns' => "$GOOGLE_API"});

# Output XML file after unzipped.
my $output = "${cId}_kw_perf.xml";

# Catch any exception
eval {

# Step 2. Validate report.
$service->call('validateReportJob' => $report_job, @headers);
print TFH "debug: about to schedule report \n";

# Step 3. Schedule report.
my $job_id = $service->call('scheduleReportJob' => $report_job, @headers)->result();
$job_id = SOAP::Data->name('reportJobId' => $job_id);

# Step 4. Get report status. Wait until finish or error.
my $status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
while ($status ne 'Completed' and $status ne 'Failed') {
print TFH "Report job status is $status\n";
sleep 30; #-- Give the job time to run before polling again.
$status = $service->call('getReportJobStatus' => $job_id, @headers)->result();
}
if ($status eq 'Failed') {
die("Report job generation failed.\n");
}
# Step 5. Get report URL to download the file.
my $report_url = $service->call('getGzipReportDownloadUrl' => $job_id, @headers)->result();
print TFH "Report is available at $report_url\n" if $opts{d};

# Step 6. Get the data file (report).
my $gzip_report_data = LWP::Simple::get($report_url);

# Step 6a. Save it to local file.
my $localFile = "${cId}_kw_perf.xml.gz";
open(LFH, ">$localFile") or die "Could not open $localFile: $!\n";
print LFH $gzip_report_data . "\n";
close LFH;

# Step 7. Unzip the file and build xml hash.
IO::Uncompress::Gunzip::gunzip(\$gzip_report_data => $output);

my $xml_hash = XMLin($output);
print TFH Dumper($xml_hash) if $opts{d};

# Step 8. Create database bulk load file.
open (BLFH, ">${cId}_bulk_load_input.txt") or die "Unable to create bulk load file: $!\n";

#-- Uncomment the fields that are of interest to you.
foreach my $row (@{$xml_hash->{table}->{rows}->{row}}) {
print BLFH
$row->{keywordid},"\t", $row->{keyword},"\t",
# $row->{campaign},"\t",
$row->{campStatus},"\t", $row->{adgroupid},"\t",
# $row->{adgroup},"\t",
# $row->{agstatus},"\t",
# $row->{agMaxCpa} * $to_dollar_conv_factor,"\t",
# $row->{adwordsType}, $row->{keywordid}, $row->{keyword}, $row->{kwStatus},
# $row->{customerid}, $row->{acctname}, $row->{kwDestUrl},
$row->{imps},"\t", $row->{clicks},"\t", $row->{ctr},"\t",
# $row->{cpc} * $to_dollar_conv_factor, $row->{conv},
# $row->{convRate},
$row->{cost} * $to_dollar_conv_factor,"\t",
# $row->{avgConvValue} * $to_dollar_conv_factor,
# $row->{bottomPosition}, $row->{cpm} * $to_dollar_conv_factor,
# $row->{costPerConv} * $to_dollar_conv_factor,
# $row->{cpt} * $to_dollar_conv_factor,
# $row->{budget} * $to_dollar_conv_factor, $row->{defaultConv},
# $row->{defaultConvValue} * $to_dollar_conv_factor,
# $row->{firstPageCpc} * $to_dollar_conv_factor,
$row->{kwDestUrl},"\t", $row->{kwType},
# $row->{leads}, $row->{leadValue} * $to_dollar_conv_factor,
# $row->{maxContentCpc} * $to_dollar_conv_factor,
# $row->{maxCpc} * $to_dollar_conv_factor,
# $row->{maxCpm} * $to_dollar_conv_factor, $row->{pageviews},
# $row->{pageviewValue} * $to_dollar_conv_factor,
# $row->{purchases},
# $row->{purchaseValue} * $to_dollar_conv_factor,
# $row->{topPosition},
# $row->{convValue} * $to_dollar_conv_factor,
# $row->{transactions}, $row->{convVpc},
# $row->{valueCostRatio},
"\n";
}
close (BLFH);
}; #-- End of eval.

if ($@) {
print("Report job failed due to the following error: ", $@, "\n");
exit (-1);
}
} #-- End of foreach campaign.

close (TFH);
return 1;
}

Bulk load the data into data warehouse staging and perform summarization:

#- Pseudo code
#- 1. Cleanup or perform any needed transformation of data in the application.
#- 2. Concatenate all the campaign keyword performance data to a file(s).
#- 3. Bulk load file into a staging table.
#- 4. Transform and summarize the data through SQL.
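As a sketch of the load step above, assuming the per-campaign `*_bulk_load_input.txt` files produced by the retriever and a MySQL staging table named stg_keyword_perf (the statement details are illustrative, not a fixed recipe):

```python
def bulk_load_sql(paths, table="stg_keyword_perf"):
    """Emit one MySQL LOAD DATA statement per campaign bulk-load file
    (step 3 of the pseudo code). Files are tab-delimited, one row per
    keyword, as written by the retriever."""
    stmts = []
    for path in sorted(paths):
        stmts.append(
            f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n';"
        )
    return stmts

# One file per campaign, named after the campaign ID.
for stmt in bulk_load_sql(["1111_kw.txt", "2222_kw.txt"]):
    print(stmt)
```

Step 4 is then plain SQL against the staging table, along the lines of the Yahoo summarization query earlier in this post.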


Google NS Individual Search Box

Today, while searching for the keyword “cygwin”, I was presented with results that caught my eye. The first result had its own search box. See the picture below. Has anyone seen this before? I like the feature; it is pretty handy if you are looking for specific information from a specific site. I entered “latest version” and hit “Search”, which carried the query “latest version” into the next search. Nice.

Other posts that may be of interest:
1. Best Google home page logos
2. Better quicker SEM ROI
3. 99 bottles of beer in different languages


Forecasting SEM ROI with Eureqa

There is a powerful tool to detect equations and hidden relationships in data sets. What makes it irresistible is its cost: FREE! It is Eureqa. In this blog I will explain how to find the relationship between cost and long-term revenue for a keyword campaign in Google Adwords. This is a simplified version: identifying a relationship at the macro level and then drilling down further for more detail.

The sample I have taken is over two months of data, summarized at the daily level, for one campaign. Since cost and revenue are affected by other factors, including the keywords or ad words that comprise a campaign, competitors’ bidding, landing page or page optimizations, etc., it is important to note that the following is just a quick insight into how cost is related to revenue for a particular campaign. It is one of several useful weapons in the arsenal for achieving better positive ROI.

The data set is shown below in the application’s tab 1 (Enter Data). It looks very much like an Excel spreadsheet! The first column is cost and the second is long-term revenue.

Related Posts: Generating quick positive SEM ROI

The numbers above (from 1 to 5) point to the different tabs I refer to in this article. Once the data is loaded into the respective columns, name the cost column x and the revenue column y, and then move to tab 2 (Smooth Data). You can enable smoothing, and in the lower panel you will see the y vs. x curve. In tab 3 (Pick Modeling Task), select typical operations like subtract, multiply, divide, power, exponential, and logarithm to be used in building the formula. I typically leave out sine, cosine, and absolute value.

Also, note that for some reason the application’s default formula selection is x = f(y), while in a typical situation the formula is written as y = f(x). So if you leave the selection as x = f(y), you can solve for y once the application provides the final formula (option 1). Or, in tab 1 (Enter Data) above, simply reverse x and y, which is much simpler (option 2). Here I will use option 1.

Click on tab 4 (Start Search) and then on the “Start” button. This begins the evaluation, churning through a lot of candidate formulas to fit the data; you can watch the running performance in the five windows. As it progresses, running for a few minutes, you will see the curve fitting the data in tab 5 (Solution Statistics). With more dispersed data, you can expect a more complicated formula. You can pick any formula that interests you and evaluate it. See below with tabs 4 and 5.

The formula I chose for illustration purposes is
f(y) = 0.31y – 557.42, that is x = 0.31y – 557.42, and so to get y I would do y = (x + 557.42) / 0.31.
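Inverting x = f(y) is simple algebra; a few lines of Python (coefficients taken from the fitted formula above) confirm the round trip:

```python
def cost_from_revenue(y):
    """x = f(y) = 0.31*y - 557.42, as fitted by Eureqa."""
    return 0.31 * y - 557.42

def revenue_from_cost(x):
    """Inverse: y = (x + 557.42) / 0.31."""
    return (x + 557.42) / 0.31

y = 10000.0
x = cost_from_revenue(y)                 # 2542.58
print(round(revenue_from_cost(x), 2))    # 10000.0 -- round-trips exactly
```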

I cross-checked the formula in an Excel sheet, and the summed revenue (over two months) was off by -3.4%. Running it again at another time, I got a slightly different formula, f(y) = 0.31y – 519.16; with this, the difference above dropped to -1.18%. Pretty good.

So, with this tool you now have a formula you can use for forecasting or predictive analysis! And you can use a similar technique to find relationships among other dimensions that are of interest to you.

Shiva M.