Compiling a Python Redshift DB adapter for the AWS Lambda environment

AWS Lambda has gained huge momentum in the last couple of years and has enabled software architects and developers to build FaaS (Function as a Service) applications.  As much as Lambda helps in scaling applications, it has some limitations, such as maximum execution duration and available memory.  For long-running jobs, typically backend or batch processing, the 5-minute execution limit can be a deal breaker.  But with appropriate data partitioning and architecture it is still an excellent, cost-effective option for enterprises to scale their applications.

In a recent project, I architected a pipeline that loads data from a data lake into Redshift.  The data is produced by an engine in batches and pushed to S3, partitioned on a time scale, and a consumer Python application loads it at regular intervals into a Redshift staging environment.  For a scalable solution, the data lake can be populated by multiple producers, and similarly one or more consumers can drain the data lake queue and load into Redshift.  The data from multiple staging tables is then loaded into a final table after deduplication and data augmentation.
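The loader code itself is not part of this excerpt, but a minimal sketch of such a consumer (not from the original post) might look like the following. It assumes a Lambda-compatible build of psycopg2 is bundled with the function, and the connection string, bucket, table and IAM role names are placeholders.

import psycopg2  # assumes a Lambda-compatible build of the adapter is packaged with the function

# All connection details and object names below are placeholders, not values from the post.
REDSHIFT_DSN = "host=example-cluster.redshift.amazonaws.com port=5439 dbname=dw user=loader password=secret"
COPY_SQL = """
    COPY staging.events
    FROM 's3://example-datalake/events/dt={partition}/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto'
"""

def handler(event, context):
    """Load one time partition from the data lake (S3) into a Redshift staging table."""
    partition = event["partition"]          # e.g. '2017-05-01-13', produced upstream
    conn = psycopg2.connect(REDSHIFT_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL.format(partition=partition))
        conn.commit()
    finally:
        conn.close()
    return {"loaded_partition": partition}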


Simple puzzle and memory usage

Recently there was a simple puzzle in one of my WhatsApp groups.  It was:
if 1 => 8, 2 => 18, 3 => 30, 4 => 44, 5 => 60, then what is 9?

It was simple to calculate by hand and provide the answer.  You can also formulate the above deduction and apply it to any number with

6 + (M + (2 x n))
    where n ranges over the integers 0 < n < N,
    and M is the cumulative sum from the previous step, with an initial value of 0

This problem lends itself well to a reduction function, and with the given formula it was easy to check it for any integer n.  Out of curiosity I first computed it in a Python for-loop, appending each intermediate result to a list while monitoring CPU load and memory usage.  Then I used the reduce function with xrange to do the same.

# Python

>>> a = []; M = 0

>>> for n in range(0, 100000000):
...        M = 6+M+(2*n)
...        a.append(M)

# [ CPU load went above 97%, memory usage kept climbing, and I ended up killing the process ]



>>> print reduce(lambda x,y: 6+x+(2*y), xrange(0,100000000))

10000000499999994

# With xrange, the result came back in a few seconds; CPU usage jumped above 97% but there was no discernible change in memory.
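As a quick aside not in the original post: in Python 3 the same comparison uses functools.reduce and range (which is already lazy like xrange), so the constant-memory version looks like this:

from functools import reduce  # Python 3: reduce moved to functools; range is lazy like xrange

N = 100000000
# Same recurrence, constant memory: no list of 100M intermediate values is built.
result = reduce(lambda x, y: 6 + x + (2 * y), range(N))
print(result)  # 10000000499999994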

Visualizing data processed consciously in a moment

Recently Google released training material to its employees on how unconscious bias can happen in the workplace.  For more details see
http://www.businessinsider.com/google-unconscious-bias-training-presentation-2015-12, which refers to the research paper at http://rd.springer.com/chapter/10.1007%2F978-3-642-82598-9_3

It is very interesting, and one of the slides (see below) specifically caught my attention with respect to how huge an amount of data is NOT processed consciously.  That is, we consciously handle only 40 bits out of every 11 million bits per second!

That is just 1 part for every 275,000 parts!!

As interesting as that is, I thought the impact would be even more dramatic when visualized in some way.  Following are a couple of attempts at it using Python and ImageMagick.

Each tiny square (green or red) in the following picture represents 40 bits, and each row has 275 of those squares. With 1,000 rows (10 blocks of 100 rows each) we get 11 million bits of data represented.  This is just for one second of duration!

Just imagine the scale for one's lifetime (80 years) of data: all those 275,000 squares above are to be repeated 2,522,880,000 times (once per second for 80 years)!!!
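The post shows the rendered image but not the script; a minimal sketch of how such a grid could be generated (using Pillow here rather than the ImageMagick pipeline mentioned above, with the cell size and colors as assumptions) might be:

from PIL import Image, ImageDraw  # Pillow; the original used Python + ImageMagick

CELL, COLS, ROWS = 4, 275, 1000          # 275 * 1000 squares = 275,000 squares = 11M bits / 40 bits
img = Image.new("RGB", (COLS * CELL, ROWS * CELL), "white")
draw = ImageDraw.Draw(img)

for row in range(ROWS):
    for col in range(COLS):
        # Mark a single square as the consciously processed 40 bits, the rest as unconscious input.
        color = "green" if (row == 0 and col == 0) else "red"
        draw.rectangle(
            [col * CELL, row * CELL, (col + 1) * CELL - 1, (row + 1) * CELL - 1],
            fill=color,
        )

img.save("one_second_of_input.png")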

Another way: a tiny string, say 5 pixels long, represents 40 bits; the total length for 11M bits will then be 275K times 5 pixels long!  See below.  The small green line at the center of the diagram is a string of a few pixels, around which the really long string, i.e., the one 275K times as long, is wound around a square peg. Every 25 rounds the string changes color for better visibility. In total there are 742 sides.

  • at the end of the 25th round the length is 5,050 times longer
  • at the end of the 50th round it is 20,100 times
  • at the end of the 100th round it is 80,200 times
  • at the end of the 150th round it is 180,300 times
  • and finally at 185.25 rounds it is 274,111 times longer

Note: keep in mind that it is not the size (area) of the squares but the total perimeter length of all these squares that represents the ratio of the data.

NASA Orion Journey To Mars Data Analysis

I have always been very interested in physics and enjoy reading related books and articles or watching shows like Carl Sagan's Cosmos, Morgan Freeman's Through the Wormhole, etc.  For that matter, this site's name, *hiregion.com*, is derived from H-I-Region (an interstellar cloud).

When I saw NASA's "Send Your Name on NASA's Journey to Mars, Starting with Orion's First Flight", I was excited to submit my family's, relatives' and friends' names, along with a few charity names.  The names will be placed on a microchip aboard Orion's test flight on Dec. 4, 2014, which orbits the Earth, and on the future journey to Mars!  The following quote is from the NASA site:

Your name will begin its journey on a dime-sized microchip when the agency’s Orion spacecraft launches Dec. 4 on its first flight, designated Exploration Flight Test-1. After a 4.5 hour, two-orbit mission around Earth to test Orion’s systems, the spacecraft will travel back through the atmosphere at speeds approaching 20,000 mph and temperatures near 4,000 degrees Fahrenheit, before splashing down in the Pacific Ocean.

But the journey for your name doesn’t end there. After returning to Earth, the names will fly on future NASA exploration flights and missions to Mars.

More info at

The Orion test flight uses the big Delta IV Heavy (the biggest expendable launch system), and Orion, after orbiting Earth twice, will reenter and splash down in the Pacific Ocean.

Courtesy NASA/ Wikipedia.org

Some sample boarding passes:

By the time the entries were closed (I think it was on Oct. 31), there were nearly 1.4 million (1,379,961 exactly) names, and the top countries by count were the United States, India and the United Kingdom, with people from almost all countries having submitted names.  For more details see http://mars.nasa.gov/participate/send-your-name/orion-first-flight/world-participation-map/.  The bar chart below shows the same info.

Though the US, India and the UK were the top three by number of names submitted, I was curious how countries did when adjusted for population size, GDP and area (sq. miles).  With that in mind I pulled NASA data and country data from the following web sites.

I built a quick Python script to do the data pull, join the country data and perform some minor calculations.  The code is located at Gist, or see the end of this post.

I ran the result through a few R scripts, clustering the countries based on each country's

  • Orion passenger count/ 10K people
  • Orion passenger count/ 1K sq. miles
  • Orion passenger count/ Billion $ GDP

and then normalized through R's scale function for cluster selection.  The optimal number of clusters seems to be 7 or 8. Monaco and Singapore are major outliers due to the skew caused by their small geographical area (sq. miles). See below: Monaco is the single dangler at the top right, and Singapore and Hungary are at the bottom right but above the rest of the countries.
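The normalization and clustering in the post were done in R; purely as an illustration of the same idea in Python, a sketch with pandas and scikit-learn could look like the following. The file layout matches the script at the end of this post, while the column names are my own labels.

import pandas as pd
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans

# One row per country with the three per-country metrics listed above.
df = pd.read_csv("nasa_metrics_by_country.txt", sep=":", header=None,
                 names=["country", "passengers", "population", "passengers_per_10K_population",
                        "area_sq_miles", "passengers_per_1K_sq_miles",
                        "gdp_millions", "passengers_per_1Billion_gdp"])

X = scale(df[["passengers_per_10K_population",
              "passengers_per_1K_sq_miles",
              "passengers_per_1Billion_gdp"]])

# Inspect within-cluster sum of squares for a range of k to pick the "elbow" (7 or 8 in the post).
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))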

The scatter plot shows much more clearly the two countries standing out, especially in the middle tiles below: passengers_per_1K_sq_miles vs. the other two metrics (passengers_per_10K_population and passengers_per_1Billion_gdp).

Removing those two countries from the data frame and clustering again results in the following:

That is an interesting cluster.  Among countries with the highest entries adjusted for population, GDP and geographic size, Hungary tops the list! Maldives, Hong Kong, the UK and Malta take the other top four places.  Quick normalized scores look like:

country          Score(/Pop.)   Score(/Area)   Score(/GDP)    Score_ABS
Hungary           5.783493976    1.560361327    4.485219257   11.82907456
Maldives          0.715814116    4.784567704    4.43908513     9.939466951
Hong Kong        -0.217141885    7.8493819     -0.59223565     8.658759434
United Kingdom    3.957774546    2.869764313    1.288187419    8.115726277
Malta             1.085016478    5.903919255    0.393610721    7.382546454
Bangladesh       -0.195758981    1.116466958    4.697494631    6.00972057

Cluster (optimal) size analysis:

It is always fun playing around with different ways to slice and dice data, and the bubble world map below shows a simple metric: passenger count per billion dollars of GDP.

The top 5 countries, in this case, are:

Bangladesh 133.95982
Hungary 128.75381
Maldives 127.62238
Philippines 125.95591
Kosovo 106.8

It will be more interesting to see how the numbers relate to each country's science and technology budget.  I will try doing that in the next few days, as some of the data is already available in the wild.  In an ideal world a good percentage of the yearly budget would be allocated to science & tech.

Data pull Python code:

#!/Users/shiva/anaconda/bin/python
# -*- coding: utf-8 -*-

import os
import sys
import re
import locale
import pprint
import scraperwiki
from bs4 import BeautifulSoup
from collections import defaultdict


class NasaData():
    nasa_file_path = "/tmp/nasa_orion_reg_by_country.txt"
    ctry_file_path = "/tmp/countrycode_org_data.txt"
    nasa_site = "http://mars.nasa.gov/participate/send-your-name/orion-first-flight/world-participation-map/"
    ctry_site = "http://countrycode.org/"
    metrics_file_path = "/tmp/nasa_metrics_by_country.txt"

    def __init__(self):
        pass


def get_nasa_entries():
    '''
    Scrape NASA Orion participant count by country data.
    Output goes to file nasa_orion_reg_by_country.txt
    Args: None
    '''

    html = scraperwiki.scrape( NasaData.nasa_site )
    soup = BeautifulSoup( html )

    out_file = NasaData.nasa_file_path
    if os.path.exists( out_file ) and os.path.getsize( out_file ) > 10:
        print "Warning: " + out_file + " exists. Continuing without scraping NASA data.\n"
        return False

    countries = soup.find( 'ul', class_='countryList' )
    with open( out_file, 'wt' ) as fh:
        for country in countries.findAll('li'):
            c_name = country.find('div', class_='countryName').text
            c_num = country.find('div', class_='countNumber').text.strip()
            # line = c_name + "," + c_num + "\n"
            line = ''.join([c_name, ',', c_num, '\n'])
            fh.write(line)

    return True


def get_country_details():
    '''
    Scrape countrycode.org data including population, gdp, area, etc.
    Dump output to file countrycode_org_data.txt
    Args: None
    '''

    html = scraperwiki.scrape(NasaData.ctry_site)
    soup = BeautifulSoup(html)

    out_file = NasaData.ctry_file_path
    if os.path.exists( out_file ) and os.path.getsize( out_file ) > 10:
        print "Warning: " + out_file + " exists. Continuing without scraping COUNTRY_CODE data.\n"
        return False

    cnty_table = soup.find( lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "main_table_blue" )
    countries = cnty_table.findAll( lambda tag: tag.name == 'tr' )
    with open( out_file, 'wt' ) as fh:
        for country in ( countries ):
            cnty_str = '|'

            cnty_attr = country.findAll( lambda tag: tag.name == 'th' )
            if ( cnty_attr ):
                for attr in ( cnty_attr ):
                    cnty_str += attr.contents[0] + "|"
            else:
                cnty_attr = country.findAll( lambda tag: tag.name == 'td' )
                if ( cnty_attr ):
                    for ix, val in ( enumerate(cnty_attr) ):
                        if ix == 0:
                            cnty_str += val.findAll( lambda tag: tag.name == 'a' )[0].string + "|"  # Get country name
                        else:
                            cnty_str += val.contents[0].strip() + "|"  # Get country attrs

            # print cnty_str
            fh.write( cnty_str + "\n" )

    return True


def join_country_data():
    '''
    Join the two data sets by country name and write each country name and its
    metrics to file nasa_metrics_by_country.txt
    Args: None
    '''
    fh = open( NasaData.metrics_file_path, 'wt' )
    # Country names lowercased, leading "The " removed, leading/trailing and extra spaces removed
    nasa_data = defaultdict(list)
    cc_org_data = {}

    for line in open( NasaData.nasa_file_path, 'rt' ):
        ln_els = line.strip('\n').split(',')
        ln_els[0] = ln_els[0].lower()
        ln_els[0] = re.sub(r'(^[Tt]he\s+)', '', ln_els[0])
        ln_els[0] = re.sub(r'(\s{2,})', ' ', ln_els[0])
        nasa_data[ln_els[0]].append(ln_els[1])  # orion_vote appended

    # nasa_data dict appended with country data. key:country => values[orion_votes, pop., area, gdp]
    for l_num, line in enumerate( open( NasaData.ctry_file_path, 'rt') ):
        # line: |Afghanistan|AF / AFG|93|28,396,000|652,230|22.27 Billion|
        if l_num == 0: continue  # Skip header

        ln_els = line.strip('\n').split('|')
        ln_els[1] = ln_els[1].lower()
        ln_els[1] = re.sub(r'(^[Tt]he\s+)', '', ln_els[1])
        ln_els[1] = re.sub(r'(\s{2,})', ' ', ln_els[1])

        # Strip out comma in pop (element 4) and area (element 5)
        nasa_data[ln_els[1]].append( ln_els[4].translate(None, ',') )  # pop appended
        nasa_data[ln_els[1]].append( ln_els[5].translate(None, ',') )  # area appended

        # Normalize gdp to millions
        gdp = re.match( r'(\d+\.?\d*)', ln_els[6] ).group(0)
        gdp = float(gdp)
        if re.search( r'(Billion)', ln_els[6], re.I ):
            gdp = gdp * 1000
        elif re.search( r'(Trillion)', ln_els[6], re.I ):
            gdp = gdp * 1000000
        nasa_data[ln_els[1]].append( gdp )  # gdp appended

    # TODO: Some country names are not standard in NASA data. Example: French Guiana appears as Guiana or Guyana.
    # Delete what is not found in country code data or match countries with hard coded values.

    locale.setlocale(locale.LC_ALL, '')
    for cn in sorted(nasa_data):  # country name
        # Array has nasa_votes, pop., sq miles, gdp and has pop > 0 and gdp > 0. Capitalize name.
        if len(nasa_data[cn]) > 3 and int(nasa_data[cn][1]) > 0 and int(nasa_data[cn][3]) > 0:
            l = ( cn.title() + ":" + nasa_data[cn][0]
                + ":" + locale.format( '%d', int(nasa_data[cn][1]), 1 )                              # pop
                + ":" + str( round( float( nasa_data[cn][0] ) * 10000 / int(nasa_data[cn][1]), 5 ))  # per 10K pop
                + ":" + locale.format( '%d', int(nasa_data[cn][2]), 1 )                              # area
                + ":" + str( round( float( nasa_data[cn][0]) * 1000 / int(nasa_data[cn][2]), 5 ))    # per 1K sq mile
                + ":" + locale.format( '%d', int(nasa_data[cn][3]), 1 )                              # gdp
                + ":" + str( round( float( nasa_data[cn][0]) * 1000 / nasa_data[cn][3], 5 ))         # per Billion $ gdp
                + "\n"
                )
            fh.write(l)

    return True


if __name__ == "__main__":
    get_nasa_entries()
    get_country_details()
    join_country_data()
    exit( 0 )

World Cup Data Analysis For Fun – Part II

Continuing from Part I (http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-i.html), the following chart shows the density of the number of goals scored by a country in a World Cup tournament.  The black line in the foreground is the average density of goals.

Some interesting facts:
* The purple peak is Wales, with four goals in 1958, the only year they played.
* The orange-yellowish peak is Bolivia, scoring no goals twice and one goal once.
* A large percentage (~80%) of countries score no more than 10 goals in each tournament.

Goals For Summary (per country per cup):

  • Min.    :  0.0
  • 1st Qu. :  2.0
  • Median  :  4.0
  • Mean    :  5.7
  • 3rd Qu. :  8.0
  • Max.    : 27.0

Goals Against Summary (per country per cup):

  • Min.    :  0.0
  • 1st Qu. :  4.0
  • Median  :  5.0
  • Mean    :  5.7
  • 3rd Qu. :  7.0
  • Max.    : 17.0
While the number of goals scored in each World Cup is low (see chart above), it is also interesting to see the trend over many decades of all goals (scored + allowed) per game.  Here I applied LOWESS (locally weighted scatterplot smoothing), a non-parametric regression, to better fit the data (blue line).
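The analysis in this post was done in R; just as an illustration, a LOWESS fit of goals per game by year can be sketched in Python with statsmodels. The file name and column names here are assumptions, not the post's actual data set.

import pandas as pd
import statsmodels.api as sm

# Assumed layout: one row per World Cup match with 'year' and total 'goals' columns.
matches = pd.read_csv("world_cup_matches.csv")
per_game = matches.groupby("year")["goals"].mean().reset_index()

# frac controls the smoothing window; ~0.4 gives a gently varying trend line.
smoothed = sm.nonparametric.lowess(per_game["goals"], per_game["year"], frac=0.4)
for year, fitted in smoothed:
    print(int(year), round(fitted, 2))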

  

Though in the early years there were a lot more goals each game, in the recent past (after 1970) it has stabilized around 2.7 goals per game.  But how do soccer powerhouses (Argentina, Brazil, Germany, etc.) compare with seven other countries chosen from another cluster (see Part I)?  As one would expect, you have to score more than you allow 🙂, represented by the gray dashed line on the Y-axis, i.e.,

Goals Scored / Goals Allowed > 1

The colored year shows the winner of the World Cup in that year, while the size of the bubble shows the total goals (scored plus allowed).  Six countries won all the World Cups between 1930 and 2006, except for 1930 and 1950 when Uruguay won; there were no World Cups in 1942 and 1946.

The outlier you see at the top left (BR, 10) is when Brazil scored 10 goals but allowed only 1 goal in 5 matches in 1986, while Argentina was the World Cup winner that year, scoring 14 goals and allowing 5 in 7 matches.

And the big dot at the bottom (US, 0.14) is when the US scored 1 goal and allowed 7 in 1934.

Your genome data through API

Recently 23andme.com reduced the price of their DNA test kit to $99, and now you get your hereditary and other related information for less than $100!  That is a big drop from where it started, $999, and then $299.  I know little about genomes/genomics, but it is a very interesting, dynamic, fast-growing field that has the potential to change the way we view health (one is now empowered to know whether they are at risk of potential congenital diseases now or in the future) or one's ancestry!

My interest was in the data that you can play with.  With the API in place you can pull your own or demo data.  To do that I first needed to set up the data pull through the API, and the following is a quick summary of the setup I had on my Mac.

Some important links:
23andme API
OAuth Introduction
OAuth flow

After creating a developer login account, you can set up the application/client with the given credentials: client_id and client_secret.  See below.

For quick testing and data pull, I used the Dancer web framework on my local Mac; 23andme uses OAuth2 with 3-legged authentication.  As a first step, get the 'code' by creating a simple page with a link to their login page.

For example, the link below takes the user to authorize your client, and once the user successfully logs in, the authentication dance happens between the client and server.

<a href="https://api.23andme.com/authorize/?redirect_uri=http://localhost:5000/receive_code/&response_type=code&client_id=YOUR_CLIENT_ID&scope=basic rs123">Connect with 23andMe. [pid: $$]</a>

Note: the pid above is the process id, so I could tell when I had killed and restarted Dancer.

User clicks on the link

and then logs in to authorize your client to access her resources.  This is the point where the 'code' is received and exchanged for an access_token.

After a successful OAuth dance you can call any of the endpoints (https://api.23andme.com/docs/reference/).  Here is the demo user call (https://api.23andme.com/1/demo/user/).
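The post does the exchange inside a Perl Dancer handler; purely as an illustration of the same 3-legged flow, a Python sketch with requests might look like this. The /token/ endpoint and field names follow my reading of the 23andMe docs of that era, so treat them as assumptions.

import requests

CLIENT_ID = "YOUR_CLIENT_ID"          # from the developer dashboard
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
REDIRECT_URI = "http://localhost:5000/receive_code/"

def exchange_code_for_token(code):
    """Swap the one-time 'code' from the redirect for an access token."""
    resp = requests.post("https://api.23andme.com/token/", data={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "scope": "basic rs123",
    })
    resp.raise_for_status()
    return resp.json()["access_token"]

def get_demo_user(access_token):
    """Call the demo user endpoint with the bearer token."""
    resp = requests.get(
        "https://api.23andme.com/1/demo/user/",
        headers={"Authorization": "Bearer %s" % access_token},
    )
    resp.raise_for_status()
    return resp.json()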

Download Google keywords through SUDS

There are situations where you may want to download all the keywords (criteria) uploaded to Google AdWords, and if you have a substantial keyword basket (say, more than a million), then either downloading manually or checking through the tools provided by Google becomes prohibitive.  Also, the Google daily keyword performance report only provides metrics for keywords that have had at least one impression.

The following application is a simple tool that downloads all keywords and dumps the data to different files.  I have removed error handling code and hardcoded many values to simplify it; you can modify it to your requirements, and don't forget to add your account and API details.  I have been able to download a few million keywords using multiple processes.

Environment:
Ubuntu Karmic running Python 2.6.4, AdWords API version 13 (v13), using the Python SUDS SOAP module.

#-- Python application to pull all your campaign keywords.
#-- Uses suds SOAP module to connect to Google Services and writes to files (one for each client email).
#-- Get list of all client emails (typically multiple accounts created to handle large set of keywords)
#-- For each account get its campaigns
#-- For each campaign get its adgroups
#-- For each adgroup get all keywords
#-- print ids and keyword attributes

import sys
import time
import re

import traceback as tb
import suds.metrics as metrics

from suds import WebFault
from suds.client import Client

#-- A class to handle user details...
#--------------------------------------
class GoogleSoapUser:
    def __init__(self):
        self.user = {'email'            : 'your_developer_api_email_addr',
                     'password'         : 'your_developer_password',
                     'useragent'        : 'Google Adwords KW pull V0.1',
                     'developerToken'   : 'your_developer_token',
                     'applicationToken' : 'your_application_token'
                    }
        #-- Note: connecting to production (adwords); for sandbox testing replace it.
        self.wsdl = {'campaign' : 'https://adwords.google.com/api/adwords/v13/CampaignService?wsdl',
                     'adGroup'  : 'https://adwords.google.com/api/adwords/v13/AdGroupService?wsdl',
                     'keyword'  : 'https://adwords.google.com/api/adwords/v13/CriterionService?wsdl'
                    }

    def getUser(self):
        return self.user

    def getWSDL(self, k=None):
        return self.wsdl[k]

    def getClientEmails(self):  #-- Client emails to pull data for
        emails = ['your_campaign_ac_email_addr_1',
                  'your_campaign_ac_email_addr_2',
                  'your_campaign_ac_email_addr_3'
                 ]
        return emails

    def setHeaderEmail(self, clientEmail=None):
        self.user['clientEmail'] = clientEmail


#-- Main program
#------------------------------
if __name__ == '__main__':
    gUser = GoogleSoapUser()

    #-- For each client email...
    for cEmail in gUser.getClientEmails():
        #-- Open an output file to dump to...
        print "Checking account: ", cEmail, "\n"
        file = 'google_' + re.split('@', cEmail)[0] + '.txt'
        try:
            f = open('/tmp/' + file, 'w')
        except IOError as err:
            print "Unable to open file", file, format(err)

        #-- Set the SOAP header with client email
        gUser.setHeaderEmail(cEmail)

        #-- Get all campaigns...
        campClient = Client(gUser.getWSDL('campaign'))
        campClient.set_options(soapheaders=gUser.getUser())
        campaigns = campClient.service.getAllAdWordsCampaigns(0)  #-- Pass a dummy number...

        #-- For each campaign...
        for campaign in campaigns:
            campaignID = campaign['id']
            print "Checking for campaign: ", campaignID, "\n"
            adClient = Client(gUser.getWSDL('adGroup'))
            adClient.set_options(soapheaders=gUser.getUser())
            adGroups = adClient.service.getAllAdGroups(campaignID)

            #-- For each AdGroup...
            for adGroup in adGroups:
                adGroupID = adGroup['id']
                print "Checking for adGroup: ", adGroupID, "\n"
                kwClient = Client(gUser.getWSDL('keyword'))
                kwClient.set_options(soapheaders=gUser.getUser())
                keywords = kwClient.service.getAllCriteria(adGroupID)

                #-- For each keyword...
                for kw in keywords:
                    f.write(str(campaignID) + '\t' +
                            str(adGroupID) + '\t' +
                            str(kw['id']) + '\t' +
                            kw['type'] + '\t' +
                            kw['text'] + '\t' +
                            str(kw['qualityScore']) + '\t' +
                            kw['destinationUrl'] + '\n'
                           )
                    # sys.exit()

        #-- Close this account output file...
        f.close()
        print "Data pull complete for account ", cEmail, "\n"

Cheers,
Shiva

fizzbuzz in Ruby, Python, Perl, more

Recently, I ran into a couple of articles on codinghorror.com that covered interviews with programmers/software engineers.  Many of the commenters answered the programming (fizzbuzz) question using their favorite language, like VBA, Ruby, Perl, etc.  Since the post was no longer allowing new comments, I thought of posting my versions here on my blog instead.

fizzbuzz question: For all numbers between 1 and 100, print 'fizz' for any number divisible by 3, 'buzz' for any number divisible by 5, 'fizzbuzz' for any number divisible by both 3 and 5, and the number itself otherwise.

Article 1:  Why Can’t Programmers.. Program?
Article 2: The Non-Programming Programmer

I have used the 'map' function to solve it; mapping functions provide a nice way to work through lists. There are a few other ways you can solve it as well, including the ones already given in the original articles above.

Perl:
sub fb {
    my ($n) = @_;
    if    ($n % 15 == 0) { return "fizzbuzz\n" }
    elsif ($n % 3 == 0)  { return "fizz\n" }
    elsif ($n % 5 == 0)  { return "buzz\n" }
    return "$n\n";
}

print map { fb($_) } (1..100);

Ruby:

You can also use the select method:

(1..100).map { |n| if n % 15 == 0 then puts 'fizzbuzz'
                   elsif n % 5 == 0 then puts 'buzz'
                   elsif n % 3 == 0 then puts 'fizz'
                   else puts n
                   end
             }
puts "----------------------------"

(1..100).select { |n| if n % 15 == 0 then puts 'fizzbuzz'
                      elsif n % 5 == 0 then puts 'buzz'
                      elsif n % 3 == 0 then puts 'fizz'
                      else puts n
                      end
                }

Python:

def fizzbuzz(x):
    if x % 15 == 0: return 'fizzbuzz'
    elif x % 3 == 0: return 'fizz'
    elif x % 5 == 0: return 'buzz'
    return x

def p(x):
    print x

map(p, map(fizzbuzz, range(1, 101)))

SQL (MySQL):

SELECT  n,
        CASE
            WHEN n % 3 = 0 AND n % 5 = 0 THEN 'fizzbuzz'
            WHEN n % 5 = 0 THEN 'buzz'
            WHEN n % 3 = 0 THEN 'fizz'
        END
FROM
    ( SELECT @n := @n + 1 AS n
      FROM some_small_table,      -- A table with more than 100 rows
           (SELECT @n := 0) x     -- Initialize the variable to 0 at the beginning
      LIMIT 100                   -- Limit to 100 rows
    ) y;

Hope you enjoyed it.
Shiva

Log Parsing through Hadoop, Hive & Python

One of the primary analyses done on web access logs is cohort analysis, where one needs to pull user access date/time along with other dimensions like user, IP, geo data, etc. Here I will be using Hadoop/Hive/Python to pull date and IP data from an access log into Hadoop and run some queries. The example illustrates Hadoop (version 0.20.1) streaming, a regex SerDe, and Hive's (version 0.4.0) custom mapper plugin (get_access_log_ip).

The steps below load a few thousand rows into a source table "access_log_2010_01_25", then extract the date (from a format like DD/Mon/YYYY:HH:MM:SS -0800 down to 'DD/Mon/YYYY') along with the remote IP address through a Python streaming script into a target table (dw_log_ip_test, a data warehouse access log table).

Step 1: First create a table for the access log (access_log_2010_01_25) and then load data into it.


hive> 
CREATE TABLE access_log_2010_01_25 (
request_date STRING,
remote_ip STRING,
method STRING,
request STRING,
protocol STRING,
user STRING,
status STRING,
size STRING,
time STRING,
remote_host STRING,
ts STRING,
perf STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*) time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s time=%9$s host=%10$s timestamp=%11$s perf=%12$s"
)
STORED AS TEXTFILE;


hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'
> OVERWRITE INTO TABLE access_log_2010_01_25;
#- After load the data in one of the record would look like:
#- 25/Jan/2010:13:14:05 -0800 123.123.123.123 GET /xmls/public/thumbnail.xml HTTP/1.1 - 302 250 0 abcd.com 1264454045 -

Step 2: Create a target test table


hive>  CREATE  TABLE  dw_log_ip_test (dt string, remote_ip string);

Step 3: In an editor of your choice, build a simple Python script (get_access_log_ip.py) that extracts the "date" string from the "date/time" string and the "remote_ip" address, as below.


#!/usr/bin/python
import sys

for line in sys.stdin.readlines():
    line = line.strip()
    fields = line.split('\t')
    dt = fields[0].split(':')[0]  #-- Get date 25/Jan/2010
    ip = fields[1]                #-- Get remote IP
    print dt, "\t", ip

Step 4: Now extract data into the dw_log_ip_test table, loading only a limited amount of data (10 seconds' worth)


hive>  FROM access_log_2010_01_25  L
> INSERT OVERWRITE TABLE dw_log_ip_test MAP L.request_date, L.remote_ip
> USING '/home/hadoop/hive/etl/scripts/get_access_log_ip.py' AS dt, remote_ip
> WHERE L.request_date > '25/Jan/2010:13:11:40'
> and L.request_date < '25/Jan/2010:13:11:50';

# Hive outputs some information like:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Ended Job = job_201001251623_0094, Tracking URL = http://hadoop_main:50030/jobdetails.jsp?jobid=job_201001251623_0094
Kill Command = /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=hdfs://hadoop_main:9001 -kill job_201001251623_0094
2010-02-03 18:42:40,793 Stage-1 map = 0%, reduce = 0%
2010-02-03 18:42:53,874 Stage-1 map = 50%, reduce = 0%
2010-02-03 18:43:05,981 Stage-1 map = 100%, reduce = 0%
2010-02-03 18:43:09,012 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201001251623_0094
Ended Job = -416940662, job is filtered out (removed at runtime).
Launching Job 2 out of 2


Loading data to table dw_log_ip_test
11110 Rows loaded to dw_log_ip_test
OK

#-- Now check the results...
hive > select dt, remote_ip from dw_log_ip_test;
hive > select dt, remote_ip, count(1)
> from dw_log_ip_test
> group by dt, remote_ip;

Enjoy,
Shiva

Installing MySQL Python Package on Windows

I work in a *nix environment, but sometimes you may want to test some features on your own Windows desktop, for example due to access constraints on test or development systems, a company not allowing installation of the latest packages or non-standard tools, or because you have set up your own little world on a desktop for your tests.  Anyway, I needed to install the Python MySQL package, and installing it is not as straightforward as on Linux systems.

> python setup.py install

The above typically takes care of it in a *nix environment. But on Windows you may see the following error:
"Unable to find vcvarsall.bat" — if so, you should download and install Microsoft Visual C++ Express Edition.  Installing the whole Visual Studio SDK seems like overkill for getting one .BAT file.

Now, to install the MySQL Python package, download it from SourceForge, gunzip it, untar it and make the following change before running the install.

Change the registry_key entry in the site.cfg file to the MySQL entry in the Windows registry (SOFTWARE\MySQL AB\MySQL Server X.X), where X.X is the installed MySQL Server version.  Open the Windows regedit and check for the entry (and, as always, back up the registry first).
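For example, the relevant line in site.cfg would look something like the following; the version number here is just an assumption, so use whatever regedit shows on your machine.

registry_key = SOFTWARE\MySQL AB\MySQL Server 5.1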

Now run

> python setup.py install

and during the installation you should see something like this:

running install
running bdist_egg
running egg_info
writing MySQL_python.egg-info\PKG-INFO
writing top-level names to MySQL_python.egg-info\top_level.txt
writing dependency_links to MySQL_python.egg-info\dependency_links.txt
reading manifest file ‘MySQL_python.egg-info\SOURCES.txt’
reading manifest template ‘MANIFEST.in’
writing manifest file ‘MySQL_python.egg-info\SOURCES.txt’
installing library code to build\bdist.win32\egg
running install_lib


Processing MySQL_python-1.2.3c1-py2.6-win32.egg
Copying MySQL_python-1.2.3c1-py2.6-win32.egg to c:\python26\lib\site-packages
Adding MySQL-python 1.2.3c1 to easy-install.pth file
Installed c:\python26\lib\site-packages\mysql_python-1.2.3c1-py2.6-win32.egg
Processing dependencies for MySQL-python==1.2.3c1
Finished processing dependencies for MySQL-python==1.2.3c1

Change directory to the tests sub-directory and run

> python test_MySQLdb_capabilities.py

HTH,
Shiva