Hourly web log analysis through Hadoop

Many a time one may want to parse the weblogs for doing some quick analysis on AB tests or for security/fraud alerts or recent advertisement or campaign analysis. There are many applications or utilities that perform web log analysis but more often than not regular expressions provide a powerful and elegant ways to analyse these logs and especially handy when one is dealing with massive and quickly rotating web logs.   Check out this wiki for more general info on web analytics.

When each weblog is multi-gigabytes and moved to archive every couple of hours and there is farm of web servers (hundreds or thousands of web servers), many of the vendor or 3rd party applications don’t scale up either. Hadoop streaming with simple utilities can provide insights into what otherwise would have been a costly experiment.

Here is an regular expession used to extract HOUR and the string that is of interest from the Apache web logs. Each entry in the log has the format similar to the ones below.

01/Jun/2010:07:09:26 -0500] – “GET /apache_pb.gif HTTP/1.1” status=200 size=2326 ab=test1AB ….

01/Jun/2010:07:09:26 -0500] – “GET /apache_pb.gif HTTP/1.1” status=200 size=2326 ab=test-2CD ….

For more details on Apache log configuration, take a look at its specs and custom logs.  Along with time of the day of the site hit, user IP address, http request (GET or POST), page requested, protocol used, one can configure the web server to log many more details including referer, user agent (browser), environment variables, etc.


Though the above regular expression may look cryptic, it is straight forward to extract hour and string of interest to us.

(? Matches any four digit string (in this case year) FOLLOWED BY “:” and two digits (HOUR) .  The beginning “(?<=" is a positive-look-behind assertion engine anchors to look for.  If anywhere in the log line two digits appear without preceded by four digits (year) and colon (:), then it is a non-match.

(\d{2})  => Matches two digit string (HOUR).  The parenthesis around the two digits activates storing these two digits in a special variable for future use.  As I would like to summarize at the hourly level, I concatenate these digits with other matched string (1AB) and create simple key=value pair.  Value is simply an identity (1) which later then passed to hadoop reducer to count.

(?=:\d{2}:\d{2}\s+[-\d{4}])  => Similar to positive-look-behind, this tells the regular expression engine to look-ahead for positive match from the 2 digit hour string.  “(?=” indicates from the current position look-ahead and get colon (:) followed by 2 digits (minutes), another colon (:) 2 digits (seconds) further followed by one or more spaces (\s+).
      [-\d{4}]  =>  Tries to matches optional ([ ]) “- and 4 digits”. This is the token to match time zone (-5000 above).

Once the string string is matched, and hour is extracted by the engine, it then

.*?  => matches any character (denoted by “.”) followed by any thing by zero or more (denoted by “*”) other characters in non-greedy mode (denoted by “?”) until it gets to “ab=” string.  Engine actually traces all the way to the end of string and then backtracks it to find “ab=” to meet the minimal match of “.*?”.

ab=  => environment variable that is of interest to us followed by

\D+  => any non-digit character followed by one or more times until it trys to match

(\d{1}[a-zA-Z]{2,})  => a string to match that has single digit followed by two or more characters (a to z, lower or upper case) and store this match for future use until

\b  => a word boundary.  A boundary between a word character that includes “a to z, 1 to 9 and _”.  You can visualize this as a thin line-boundary between word characters and any non-word characters like “.” or “-” or “$”, etc.

In the parser application, you can concatenate first match (Hour – 07) and second matched string (Test string – 1AB) like “1AB_07” and set the value to “1”.

When the Hadoop mapper passes this to reducer, reducer simply has to count the 1s and to get sum of hourly hits of “1AB”.  You can then load this data to some table at (date and) hourly level and maintain its history for further analysis.  Similar analysis on web logs can yield IP level, page level and many other useful info.

Note: While using regular expression it helps to understand how the backtracking technique works and its implications on performance.  Backtracking happens when one uses quantifiers like “*, *?, +, +?, {n,m}, and {n,m}?”.   Take a look at interesting articles here and here.

Adding disk space to Ubuntu (guest OS)

Recently, in one of my VMWare system (with Windows XP host OS and Ubuntu guest OS), I ran into disk space shortage while installing a package (Thrift – A cross language service).  Though there were many articles on the Internet in helping extending the guest OS partitions, they were for older versions or not applicable any more or had many complex steps.  On spending some time and digging more on the issue, I found pretty straight forward steps in achieving what I wanted to do – doubling the disk space for Ubuntu.

Initial setup:
  Host OS: Windows OS
  Guest OS: Ubuntu Karmic  (had 8G pre-allocated)
  Disk: 148G Total and 60G Free out of which 8G to be allocated to Ubuntu

   Double Ubuntu disk space to 16G by adding a new partition

Step 1:

  Run “disk analyzer” and get current disk space usage data that provides total, used, available and usred.  See below.  The system has less than 10% free disk space and needs more.

Step 2:

On deciding to go with allocating new disk space, you must first back up the disk and turn-off the virtual machine.  Select the virtual machine, if you have more than one in VMPlayer, and click “Edit virtual machine settings” and click Hard Disk item and set the expand disk capacity to 16G. (8 + 8) as show below.  Also make sure that you have enough free space available (in host) to allocated to guest OS.

Step 3:

Download “gparted” – a GUI disk utility by running
> sudo apt-get install gparted
> sudo gparted

Select the “unallocated” partition and select “Create partition”. In the pending operations pane you will see “Create Primary Patition …” and click “Apply All Operations” (A green check mark in tool bar).

Step 4:

Now mount this new partition. Say for example, I am going use this for /src2  directory, then
> sudo  mkdir   /src2
> sudo  mount  -t  ext4   /dev/sda3   /src2

Move some big directories to this partition or new directory and thus releasing some disk space free on /dev/sda1 in my screen shots above.  And after the move I had 60% free on /dev/sda1 partition (an important boot partition)
> mv   /big_directory   /src2/big_directory

Also for more on partitioning see http://help.ubuntu.com/community/HowtoPartition  and if you like to create a bootable USB Linux distro check out this post.


Log Parsing through Hadoop, Hive & Python

One of the primary analysis done on web access logs is some cohort analysis where one need to pull user access date time and along with other dimensions like user, ip, geo data, etc. Here I will be using Hadoop/ Hive/ Python to pull date, ip data from access log into Hadoop and run some queries. The example illustrates using Hadoop (version 0.20.1) streaming, SERDE, Hive’s (version 0.40) plugin customer mapper (get_access_log_ip).

The steps below load few thousand rows into a target table (dw_log_ip_test – data warehouse access log) “access_log_2010_01_25” then extract date from format like DD/Mon/YYYY:HH:MM:SS -800 to ‘DD/Mon/YYYY’ along with remote ip address through a Python streaming script.

Step 1: First create a table to access log (access_log_2010_01_25) and then load data into it.

CREATE TABLE access_log_2010_01_25 (
request_date STRING,
remote_ip STRING,
method STRING,
request STRING,
protocol STRING,
user STRING,
status STRING,
size STRING,
time STRING,
remote_host STRING,
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
"input.regex" = "\\[([^]]+)\\] ([^ ]*) \"([^ ]*) ([^ ]*) ([^ \"]*)\" user=([^ ]*) status=([^ ]*) size=([^ ]*) time=([^ ]*) host=([^ ]*) timestamp=([^ ]*) perf=([^ ]*)",
"output.format.string" = "%1$s %2$s \"%3$s %4$s %5$s\" user=%6$s status=%7$s size=%8$s time=%9$s host=%10$s timestamp=%11$s perf=%12$s"

hive> LOAD DATA LOCAL INPATH '/mnt/web_ser101/weblog_server101_20100125_1'
> OVERWRITE INTO TABLE access_log_2010_01_25;
#- After load the data in one of the record would look like:
#- 25/Jan/2010:13:14:05 -0800 GET /xmls/public/thumbnail.xml HTTP/1.1 - 302 250 0 abcd.com 1264454045 -

Step 2: Create a target test table

hive>  CREATE  TABLE  dw_log_ip_test (dt string, remote_ip string);

Step 3: In an editor of your choice, build a simple Python script (get_access_log_ip.py) that gets “date” string from “date/time” string and “remote_ip” address as below.

import sys
for line in sys.stdin.readlines():
line = line.strip()
fields = line.split('\t')
dt = fields[0].split(':')[0] #-- Get date 25/Jan/2010
ip = fields[1] #-- Get remote IP
print dt,"\t",ip

Step 4: Now extract data to dw_log_ip table and load only some limited data (10 seconds data)

hive>  FROM access_log_2010_01_25  L
> INSERT OVERWRITE TABLE dw_log_ip MAP L.request_date, L.remote_ip
> USING '/home/hadoop/hive/etl/scripts/get_access_log_ip.py' AS dt, remote_ip
> WHERE L.request_date > '25/Jan/2010:13:11:40'
> and L.request_date < '25/Jan/2010:13:11:50';

# Hive outputs some information like:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Ended Job = job_201001251623_0094, Tracking URL = http://hadoop_main:50030/jobdetails.jsp?jobid=job_201001251623_0094
Kill Command = /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=hdfs://hadoop_main:9001 -kill job_201001251623_0094
2010-02-03 18:42:40,793 Stage-1 map = 0%, reduce = 0%
2010-02-03 18:42:53,874 Stage-1 map = 50%, reduce = 0%
2010-02-03 18:43:05,981 Stage-1 map = 100%, reduce = 0%
2010-02-03 18:43:09,012 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201001251623_0094
Ended Job = -416940662, job is filtered out (removed at runtime).
Launching Job 2 out of 2

Loading data to table dw_log_ip_test
11110 Rows loaded to dw_log_ip_test

#-- Now check the results...
hive > select dt, remote_ip from dw_log_ip;
hive > select dt, remote_ip, count(1)
> from dw_log_ip
> group by dt, remote_ip;


Hive Metastore Derby DB

If you are using Hive in default mode, you may see the following behavior – you get to hive client from different directories and see different results when you run a query like “show tables”.  For example, you have hive installed in /usr/local/hive and your are currently in your home directory and run

~> /usr/local/hive/bin/hive    #– get to hive
hive> create table new_table (c1 string);
hive> show tables;

Now you will see “new_table” in the list.

~> cd /tmp
/tmp> /usr/local/hive/bin/hive   #– get to hive
hive> show tables;

Now you don’t see “new_table” in your list of tables.  Those who come from typical SQL background may find it little weird in the beginning due to fact that results seem different depending on from where you started the hive client.  The reason  is because hive uses “embedded Derby” database to store meta data and one of the default configuration property is to use the current directory to store metastore_db.

On starting the hive from two different directories like above, one would see two “metastore_db” directories  created in home (~) and /tmp directories.  You can change this and use one metastore_db by updating “/usr/local/hive/conf/hive-default.xml” file’s “javax.jdo.option.ConnectionURL” as below.

Default setting:
      JDBC connect string for a JDBC metastore

Update it to:
      JDBC connect string for a JDBC metastore

“/home/hadoop” is an example and one can appropriately change it to suitable host and directory.  Say,