Hive Metastore Derby DB

If you are using Hive in its default configuration, you may notice the following behavior: you start the Hive client from different directories and see different results when you run a query like "show tables". For example, suppose Hive is installed in /usr/local/hive and you are currently in your home directory and run

~> /usr/local/hive/bin/hive    #-- start the hive client
hive> create table new_table (c1 string);
hive> show tables;

Now you will see “new_table” in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #-- start the hive client
hive> show tables;

Now you don't see "new_table" in your list of tables. Those who come from a typical SQL background may find this a little weird at first, because the results seem to differ depending on where you started the Hive client. The reason is that Hive uses an embedded Derby database to store metadata, and the default configuration stores the metastore_db in the current working directory.

Having started Hive from two different directories as above, you will see two "metastore_db" directories created, one in home (~) and one in /tmp. You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in "/usr/local/hive/conf/hive-default.xml" as below.

Default setting:

   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
   </property>

Update it to:

   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
   </property>

"/home/hadoop" is an example; change it to a suitable directory. If you run Derby as a network server rather than embedded, the URL takes a host/port form instead, e.g.:
jdbc:derby://localhost:1527/metastore_db;create=true
(this also requires the Derby client driver, org.apache.derby.jdbc.ClientDriver, set as "javax.jdo.option.ConnectionDriverName" and its JAR on Hive's classpath).

Cheers,
Shiva

Installing MySQL Python Package on Windows

I work in a *nix environment, but sometimes you may want to test some features on your own Windows desktop, either due to access constraints on test or development systems (for example, the company not allowing you to install the latest packages or non-standard tools), or because you have set up your own little world on a desktop to run your tests. In any case, I needed to install the Python MySQL package, and installing it on Windows is not as straightforward as on Linux systems.

> python setup.py install

The above typically takes care of it in a *nix environment. But on Windows you may see the following error:
"Unable to find vcvarsall.bat". If so, download and install Microsoft Visual C++ Express Edition. Installing the whole Visual Studio SDK seems like overkill for getting one .BAT file.

Now to install the MySQL Python package, download it from SourceForge, gunzip it, untar it, and make the following change before running the install.

Change the registry_key entry in the site.cfg file to the MySQL entry in the Windows registry (SOFTWARE\MySQL AB\MySQL Server X.X, where X.X is the installed MySQL Server version). Open the Windows regedit and check for the entry (and, as always, back up the registry first).
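For example, with MySQL Server 5.1 installed, the line in site.cfg would look something like this (5.1 is just an assumption here; use whatever version regedit shows):

registry_key = SOFTWARE\MySQL AB\MySQL Server 5.1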

Now run

> python setup.py install

and during the installation you should see something like this:

running install
running bdist_egg
running egg_info
writing MySQL_python.egg-info\PKG-INFO
writing top-level names to MySQL_python.egg-info\top_level.txt
writing dependency_links to MySQL_python.egg-info\dependency_links.txt
reading manifest file 'MySQL_python.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'MySQL_python.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib


Processing MySQL_python-1.2.3c1-py2.6-win32.egg
Copying MySQL_python-1.2.3c1-py2.6-win32.egg to c:\python26\lib\site-packages
Adding MySQL-python 1.2.3c1 to easy-install.pth file
Installed c:\python26\lib\site-packages\mysql_python-1.2.3c1-py2.6-win32.egg
Processing dependencies for MySQL-python==1.2.3c1
Finished processing dependencies for MySQL-python==1.2.3c1

Change directory to the tests sub-directory and run

> python test_MySQLdb_capabilities.py
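If the capability tests pass, a quick sanity check that the module loads and can talk to your server is a small script like the one below (the connection parameters are placeholders; substitute your own):

import MySQLdb

#-- Hypothetical connection parameters; adjust for your local server.
conn = MySQLdb.connect(host='localhost', user='root', passwd='secret', db='test')
cur = conn.cursor()
cur.execute('SELECT VERSION()')
print cur.fetchone()[0]    #-- prints the server version, e.g. 5.1.x
conn.close()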

HTH,
Shiva

Data loading through Talend ETL Studio

In a couple of my earlier posts I talked about pulling data from search engines and then loading it into staging tables of a data warehouse. You can load data in a couple of ways, including database bulk-load tools (for example, MySQL's "LOAD DATA INFILE" statement) or using Talend Open Studio or similar ETL tools (Pentaho, Informatica, etc.).
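To give a feel for the bulk-load route, here is a minimal sketch using the MySQL-Python package from the earlier post; the table name kw_perf_stage and file name keyword_perf.txt are made up for illustration:

import MySQLdb

#-- local_infile=1 allows LOAD DATA LOCAL INFILE from the client side.
conn = MySQLdb.connect(host='localhost', user='etl', passwd='secret',
                       db='staging', local_infile=1)
cur = conn.cursor()
#-- Bulk-load a tab-separated file into a (hypothetical) staging table.
cur.execute("""
    LOAD DATA LOCAL INFILE 'keyword_perf.txt'
    INTO TABLE kw_perf_stage
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
""")
conn.commit()
conn.close()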

In this article, I will be loading a simple data file with more than 200,000 (200K) rows of keyword performance data for illustration purposes. The setup used was:
  MySQL 5.1 (Community version)
  Talend Open Studio (3.2.2)
  Perl 5.10.1


Other posts that may be of interest:
1. URL parsing in Python, Perl, MySQL – an example
2. Google Adwords Data Retriever
3. Google Analytics Data Retriever


The data flows through 6 components (tFileInputDelimited, tPerlRow, tSortRow, tFileOutputDelimited, tFilterRow and tMysqlOutput), starting with reading the input data file, transforming it into another output file, and then loading it into the database. Component #3 (the sorting component) is not strictly needed, but is shown here to stress how easily a developer can put together an ETL process. The Palette tab on the right-hand side houses these components and many more. If you use Java instead of Perl, you will likely have even more components available; for the list of all available components, take a look at Talend Forge.

The input record format is tab-separated fields with one record per line, and so is the output file written by component 4.

# Typical fields in the file are date, account_id, campaign_id, ad_group_id, keyword_id, CPC, CTR, Cost, Impressions, url etc.
# 2009-12-12   9876543210   34987650   1147382910  1234567890  0.07  
# 0.226632  247.98   15444 http://hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html

In component 2, I provide the following regular expression to parse each URL and extract the domain name:
     /(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?/ig

This parses URLs in common formats like http://www.amazon.com, https://amazon.com, http://www.amazon.co.uk, or plain amazon.com.
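You can sanity-check the expression outside Talend with a rough Python port (a sketch only; component 2 itself runs the Perl version above):

import re

#-- Rough Python port of the component-2 expression; group 3 holds the bare domain.
pat = re.compile(r'(https?://)?(www\.)?([a-zA-Z0-9_%]*)\b\.[a-z]{2,4}(\.[a-z]{2})?',
                 re.IGNORECASE)

for url in ('http://www.amazon.com', 'https://amazon.com',
            'http://www.amazon.co.uk', 'amazon.com'):
    m = pat.search(url)
    print url, '->', (m.group(3) if m else 'no match')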

Component 3 sorts the input stream coming from the regexp output, and component 4 dumps it to an output file, which is then read by the filter (component 5), which drops any record lacking the primary key before the load into the staging database (component 6). You can also load into the database without the primary key constraint to speed things up further, and clean up the duplicates before pushing into dimension or fact tables.

The whole run took a little over a minute (~63 seconds) on a single test system running all three (MySQL, Talend, Perl). For daily loads, you can use a scheduler to kick off the job and automate the loading process.

 Hope that helps,
Shiva

URL parsing in Python, Perl, MySQL

There are many situations in which one needs to parse URLs, whether in web log parsing or in Google keyword or other performance data downloaded through APIs. Application languages like Python, Perl, PHP, etc. provide modules to parse URLs, or you can use raw regular expressions. Below I show a few ways of doing it with a simple example, either in the application before loading into the staging area of a data warehouse, or in SQL (MySQL) after the load.

In the example URL http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html, I would like to get the sub-domain name and the file name parts, i.e., www.hiregion.com and sem-google-adwords-api-data-retriever.html.


Other posts that may be of interest:
1. Regular expressions – An intro
2. 99 bottles of beer in many languages


1. Using Python regular expressions

import time
import re

t2 = time.time()
for ii in range(1000000):  #-- Parse it a million times...
    m = re.search(r'http://(.*?)/\d{4}/\d{2}/(.*$)',
                  'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
    #- Sub-domain and file name end up in m.group(1) and m.group(2)
t3 = time.time()

print 'regex took:', t3 - t2, 'seconds'

2. Using Python urlparse module


import time
import re
from urlparse import urlparse

t2 = time.time()
for ii in range(1000000):  #-- Parse a million times...
    parsed = urlparse('http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html')
    fname = re.sub(r'/\d{4}/\d{2}/', '', parsed.path)
    #- Sub-domain and file name end up in parsed.netloc and fname

t3 = time.time()
print 'urlparse took:', t3 - t2, 'seconds'

3. Using Perl regular expressions

The regular expression is the same as the one used above in Python.

#!/cygdrive/c/perl/bin/perl

use strict;
use Time::HiRes qw(gettimeofday tv_interval);

my $str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html';
my $th1 = [gettimeofday()];

for my $ii (1..1000000) {  #-- Parse a million times
    $str =~ m#http://(.*?)/\d{4}/\d{2}/(.*$)#;
    #- Sub-domain and file name end up in $1 and $2
}

my $th2 = [gettimeofday()];

print 'regex took: ', tv_interval($th1, $th2), " seconds\n";

If you would like to transform URLs after loading the data into staging tables (say, after the bulk load), the following approach can be used.

4. Using MySQL (5.1 version)


SET @str = 'http://www.hiregion.com/2009/12/sem-google-adwords-api-data-retriever.html';

PREPARE select_q FROM
'SELECT CASE
          WHEN ? REGEXP ''^http://'' THEN substring_index(?, ''/'', -1)
          ELSE NULL
        END file,
        CASE
          WHEN ? REGEXP ''^http://'' THEN substring_index(substring_index(?, ''/'', 3), ''/'', -1)
          ELSE NULL
        END sub_domain';

EXECUTE select_q USING @str, @str, @str, @str;

# Output...
# file                                         sub_domain
# sem-google-adwords-api-data-retriever.html   www.hiregion.com

With Perl (5.10.1) and Python (2.6) running on the same system, Perl took 0.999 seconds, Python regular expressions took 2.641 seconds, and urlparse took 6.5 seconds.

Cheers,
Shiva
