Email parsing with regular expressions

Email addresses pose interesting challenges to load and handle for any kind of analysis – counting users from email providers like Yahoo, Google or Hotmail, finding the top country-specific domains, or finding users matching a specific email pattern.

RFC 822, RFC 2822 (April 2001) and RFC 5322 (October 2008) define the specification and syntax for text messages exchanged between computers.  The RFCs cover not just the email address specification but also the envelope and contents.  Most applications adhere to a subset of these specs, with more stringent rules for email addresses.  For example, the address spec allows ALPHA (a-z, A-Z), DIGIT (0-9) and special characters like "!", "#", "$", "%", etc., and even more special characters within quoted strings; but typically email service providers won't let you create an email address with special characters like "$" or "!".  For more spec details see sections 3.4 and 3.4.1 of RFC 5322.  RFC 3696 explains email addresses, URIs and HTTP URLs in much easier terms.

One can load email addresses into the staging area of a data warehouse (DW) and then perform validation and cleaning through the ETL tool's built-in utilities or components.  But to handle the many different possibilities, and to be able to quickly handle any new format encountered, it is better to validate and clean addresses in application programs before loading them into staging or final tables.  That way the load can also be distributed by splitting the input files across multiple threads, or across processes running on different systems.
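As a minimal sketch of that approach (not the article's actual ETL code; the simplified pattern and all names here are illustrative), validation can be done in application code with the input split into chunks handled by concurrent workers:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified placeholder pattern; the full RE is discussed in the next section.
EMAIL_RE = re.compile(r"^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$")

def clean_chunk(lines):
    """Return (valid, rejected) lists for one chunk of raw input lines."""
    valid, rejected = [], []
    for line in lines:
        email = line.strip().lower()
        (valid if EMAIL_RE.match(email) else rejected).append(email)
    return valid, rejected

def clean_all(lines, workers=4, chunk_size=10000):
    """Split the input into chunks and validate them concurrently."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    valid, rejected = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so output order matches input order
        for ok, bad in pool.map(clean_chunk, chunks):
            valid.extend(ok)
            rejected.extend(bad)
    return valid, rejected
```

For CPU-bound validation at real DW volumes, processes (or separate machines reading separate file splits) would replace the thread pool, but the chunking structure stays the same.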

The following regular expression handles almost all email addresses (more than 99.9%), and you can tweak it to suit your needs for performance or for handling specific emails only. For example, you may not want to allow any email that has special characters like #, %, $, {, }, /, *, etc.  Below I explain in detail how this RE parses an email.



The address has two main parts – one before the "@" (Part 1) and one after it (Part 2).

Part 1:

  1.  [a-z0-9!#$%&'*+/=?^_`{|}~-]  — A character set, i.e., everything between the square brackets [ ]; in this case the letters "a" to "z" (upper or lower case), the numerals "0" to "9" and the special characters !#$%&'*+/=?^_`{|}~- are allowed.
  2.  + — One or more occurrences of the above character set, meaning "joe", "!j!" and "abc" are all valid.  But this will not match "(abc)" or a quoted string like ""joe"" because "(", ")" and the double quote are not in the character set.
  3.  ( — Starts a group and also starts remembering the matched string.  To avoid storing the match, use "?:" right after "(".
  4.  ?: — Indicates there is no need to remember or store what is matched.  When grouping is done with "(", the match is normally also captured into variables $1, $2, $3, etc.  A "?:" following "(" indicates the match should not be stored.  This helps performance, but if you want to use the matched string for further processing, you can remove "?:" from the regular expression.
  5.  \. — Indicates a "." can appear in the email after the characters in step 1.  But "." has significance to the regular expression (RE) engine itself – in an RE, "." means any single character – so we escape it as "\." to pass a literal dot down to the parsing engine.
  6.  [a-z0-9!#$%&'*+/=?^_`{|}~-]  — See step 1 above.
  7.  + — See step 2 above.  Also, if a "." appears, it must be followed by at least one of the characters in step 6.
  8.  * — Means zero or more occurrences.  Here the whole group in steps 5–7 is optional, but it must be complete when present: "joe.a" is valid but "joe." is not.
  9.  @ — The "@" sign is a must in an email address, and it should appear exactly once.
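Assembled from steps 1–8, the local part can be tried out on its own.  A quick Python sketch (illustrative; the character class is copied from step 1, and the pattern is anchored so only the part before "@" is tested):

```python
import re

# One "atom": the character set of step 1, repeated one or more times (step 2)
ATOM = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"
# Atoms separated by literal dots, the dot group optional (steps 5-8)
LOCAL = re.compile(r"^" + ATOM + r"(?:\." + ATOM + r")*$", re.IGNORECASE)

def is_valid_local(s):
    """True if s is a valid Part 1 (local part) under the rules above."""
    return bool(LOCAL.match(s))
```

As the explanation predicts, "joe", "!j!" and "joe.a" pass, while "joe." and "(abc)" fail.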

Part 2:

     Part 2A: (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+

  1.  + — Everything within the outer parentheses must appear at least once, enforcing the rule that after the "@" sign there must be at least one domain label followed by ".".
  2.  ?: — No need to remember or store what is matched.  Helps performance when parsing millions of emails.
  3.  [a-z0-9] — The first character of each label after the "@" must be alphanumeric (a to z or 0 to 9).
  4.  (?:[a-z0-9-]*[a-z0-9])? — The "?" at the end indicates 0 or 1 occurrence of this group.  That is, after the first character in step 3, the remaining characters may be alphanumerics or hyphens, but the label must end with an alphanumeric; the whole group is optional, so a single-character label is fine.
  5.  \. — See step 5 in Part 1.

     Part 2B:  (?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)

  1.  [A-Z]{2} — Exactly two letters after the final ".".  This covers country top-level domains (TLDs) like "uk" (United Kingdom), "in" (India), "ca" (Canada), etc.  See this list of standard country codes.
  2.  | — The "OR" operator.  It matches "com" or "org" or "museum", etc.

And the final "\b" at the end is a word-boundary anchor.

During this validation and clean-up you can also add length checks on email parts 1 and 2.  The RFCs specify a maximum length of 64 characters (octets) for part 1 (before the "@") and 255 characters (octets) for part 2 (after the "@").  These implementations can effectively handle more than 99.999% of emails.
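Putting Parts 1, 2A and 2B together with the length checks just mentioned gives a complete validator.  The Python sketch below is assembled from the pieces explained above; treat it as illustrative rather than canonical:

```python
import re

# Part 1 (local part), Part 2A (dotted labels), Part 2B (TLD alternation)
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
    r"(?:[a-z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b",
    re.IGNORECASE,
)

def is_valid_email(addr):
    """Apply the RFC length limits, then the full pattern."""
    local, sep, domain = addr.partition("@")
    if sep != "@" or len(local) > 64 or len(domain) > 255:
        return False
    return EMAIL_RE.fullmatch(addr) is not None
```

Note that `fullmatch` anchors both ends here, so the trailing "\b" is kept only to mirror the original expression; in a scanning context (`search` over free text) that word boundary does the anchoring work.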

Let me know if you have different regular expression or different way of parsing emails.


Data Profiling – An example in Talend Profiler

Data is loaded into a Data Warehouse (DW) from disparate systems, and sometimes from external partners who have their own coding standards.  In any case, the quality of the data loaded into the data warehouse is often variable, and especially while discovering the data one may need to load some sample data and do some analysis, including initial data profiling.  During this process one may discover differences which, when resolved, result in a much smoother data flow along the process.  Or it may be at a later stage – say after summarization is completed – that one needs to analyze the shape of the data.  In all these cases data profiling helps, and Talend provides a tool (Talend Open Profiler – TOP) to perform profiling quickly and efficiently.

Data profiling – the process of examining the data available in different data sources (databases, applications, files, data transfers from external systems, etc.) and collecting statistics and information about it – improves data quality and enables better reporting.

In the date dimension, we have nearly 220,000 rows covering Jan. 01, 1900 to Dec. 31, 2500 (six hundred years of dates), and one of the columns is 'day_of_week_name' (cardinality 7 – MONDAY, TUESDAY, …).  The table has 70 columns including dates, weeks, months, names, etc.  For testing purposes, I wanted to check the nulls and the pattern frequency (distribution) for the 'day_of_week_name' column.

To do so, select the column to profile (day_of_week_name) and drag and drop it into "Analyzed columns" in the "Analysis Settings" tab.  Then pick the indicators, i.e., how you want the column measured (count, range, stats, etc.); I picked row count and NULL count along with "Pattern Frequency Table".  Pattern frequency counts the distinct patterns.  The results appear in the "Analysis Results" tab as below.

There were 219,146 rows with no NULLs, and the pattern frequency indicates 31,307 rows of pattern AAAAAAAAA (nine uppercase letters, "9A"), 31,307 of the 7A pattern, 62,613 of the 8A pattern and 93,919 of the 6A pattern.

The 9A pattern count is the 'WEDNESDAY' rows; the 6A pattern covers 'SUNDAY', 'MONDAY' and 'FRIDAY' (six letters each).  Similarly for the other days.

You can also have your own UDIs – User-Defined Indicators – that add functionality beyond the existing indicators.  You build them as a Java jar and import them.  But any processing done in the profiler itself requires the data to be transferred from the database, possibly slowing down the profiling.  For a small data set this may not be noticeable, but profiling large tables can bog down due to memory limitations and/or network delay.

The profiler converts your indicators to an appropriate query and runs it.  For example, for the above pattern match, the MySQL query looks like:

SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(`day_of_week_name`,'a','a'),'b','a'),'c','a'),'d','a'),'e','a'),'f','a'),'g','a'),'h','a'),'i','a'),'j','a'),'k','a'),'l','a'),'m','a'),'n','a'),'o','a'),'p','a'),'q','a'),'r','a'),'s','a'),'t','a'),'u','a'),'v','a'),'w','a'),'x','a'),'y','a'),'z','a'),'ç','a'),'â','a'),'ê','a'),'î','a'),'ô','a'),'û','a'),'é','a'),'è','a'),'ù','a'),'ï','a'),'ö','a'),'ü','a'),'A','A'),'B','A'),'C','A'),'D','A'),'E','A'),'F','A'),'G','A'),'H','A'),'I','A'),'J','A'),'K','A'),'L','A'),'M','A'),'N','A'),'O','A'),'P','A'),'Q','A'),'R','A'),'S','A'),'T','A'),'U','A'),'V','A'),'W','A'),'X','A'),'Y','A'),'Z','A'),'Ç','A'),'Â','A'),'Ê','A'),'Î','A'),'Ô','A'),'Û','A'),'É','A'),'È','A'),'Ù','A'),'Ï','A'),'Ö','A'),'Ü','A'),'0','9'),'1','9'),'2','9'),'3','9'),'4','9'),'5','9'),'6','9'),'7','9'),'8','9'),'9','9'), COUNT(*) c 
FROM `dw_mktg`.`dim_date` t
GROUP BY REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(`day_of_week_name`,'a','a'),'b','a'),'c','a'),'d','a'),'e','a'),'f','a'),'g','a'),'h','a'),'i','a'),'j','a'),'k','a'),'l','a'),'m','a'),'n','a'),'o','a'),'p','a'),'q','a'),'r','a'),'s','a'),'t','a'),'u','a'),'v','a'),'w','a'),'x','a'),'y','a'),'z','a'),'ç','a'),'â','a'),'ê','a'),'î','a'),'ô','a'),'û','a'),'é','a'),'è','a'),'ù','a'),'ï','a'),'ö','a'),'ü','a'),'A','A'),'B','A'),'C','A'),'D','A'),'E','A'),'F','A'),'G','A'),'H','A'),'I','A'),'J','A'),'K','A'),'L','A'),'M','A'),'N','A'),'O','A'),'P','A'),'Q','A'),'R','A'),'S','A'),'T','A'),'U','A'),'V','A'),'W','A'),'X','A'),'Y','A'),'Z','A'),'Ç','A'),'Â','A'),'Ê','A'),'Î','A'),'Ô','A'),'Û','A'),'É','A'),'È','A'),'Ù','A'),'Ï','A'),'Ö','A'),'Ü','A'),'0','9'),'1','9'),'2','9'),'3','9'),'4','9'),'5','9'),'6','9'),'7','9'),'8','9'),'9','9')

In other DBMSs like Oracle, PostgreSQL and DB2 it will be a much smaller query, since those systems provide a TRANSLATE function.
The profiler provides many built-in indicators like Soundex Frequency, Mean, Median, Mode, Range, Quartile Range, etc.
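For illustration, here is a Python analogue (not Talend's code) of what the pattern-frequency indicator computes – the same letter-to-"A", digit-to-"9" mapping that the nested REPLACE (or a single TRANSLATE) performs in SQL; the accented-character folding in the generated query is omitted for brevity:

```python
import string
from collections import Counter

# Map every letter to 'a'/'A' and every digit to '9' (ASCII only here)
PATTERN_MAP = str.maketrans(
    string.ascii_lowercase + string.ascii_uppercase + string.digits,
    "a" * 26 + "A" * 26 + "9" * 10,
)

def pattern_frequency(values):
    """Count how many values share each shape, like the profiler indicator."""
    return Counter(v.translate(PATTERN_MAP) for v in values)

days = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY"]
freq = pattern_frequency(days)
```

Running this on the seven day names gives one nine-letter pattern (WEDNESDAY), three six-letter ones (SUNDAY, MONDAY, FRIDAY), two of eight and one of seven – the distribution the profiler reported, scaled down by the per-day row count.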


Logging queries for performance monitoring

Before making any performance improvements, one needs to measure – whether you are trying to improve computer applications, DW queries, etc.  In one Data Warehouse subject area where I run more than 120K (120,000) queries every day in an ETL environment, I log all queries to a table and track them over time to see which queries are showing performance issues.  With the following table, I was sometimes able to improve performance by more than 50%.  For a query that runs repeatedly with different parameters in the WHERE clause, a small improvement adds up quickly; other times, due to data skew, you can clearly see what changes are needed to improve performance.  And when there are mistakes in indexes (the DBA dropped one 🙂 or an application user modified their query, you will have a reference point to check why a particular query or queries are slow.

Query Performance Monitoring table definition
Column Null? Type Comment
ID No int (11) Auto Increment
RUN_NUMBER No tinyint Useful when job is recovered or re-ran
APPLICATION No varchar(128) Application name to track
DOMAIN Yes varchar(128) Application domain or subject area
QUERY No varchar(4096) Actual query
DURATION No decimal(10,5) How long did query run?
ROW_COUNT Yes int Number of rows affected
RESULT No varchar(32) Query result type – succeeded or failed?
COMMENT Yes varchar(128) User comment that can be logged with each query
CREATED_DT No date Query run date
CREATED_DT_TM No timestamp/ datetime Query run date and time
CREATED_BY Yes varchar(64) Query user name

In the above table, one could use CREATED_DT_TM wherever CREATED_DT is needed for query performance analysis, but most of these analyses are done at the date level (as most DW ETL and summarization are daily processes).  Having a separate, indexed date column makes the analysis queries much faster, as there is no need to apply a date function to each row.
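As a sketch of how such logging can be wired into ETL code (hypothetical helper names; the table above is the schema reference, and here the record is appended to a list rather than INSERTed, for brevity):

```python
import time
from datetime import datetime

def run_logged(cursor, log_rows, application, query,
               params=None, domain=None, comment=None):
    """Execute a query and append a log record shaped like the table above."""
    started = datetime.now()
    t0 = time.perf_counter()
    try:
        cursor.execute(query, params or ())
        result, row_count = "succeeded", cursor.rowcount
    except Exception:
        result, row_count = "failed", None
        raise
    finally:
        log_rows.append({
            "RUN_NUMBER": 1,                 # bump when a job is recovered/re-run
            "APPLICATION": application,
            "DOMAIN": domain,
            "QUERY": query[:4096],           # truncate to the column width
            "DURATION": round(time.perf_counter() - t0, 5),
            "ROW_COUNT": row_count,
            "RESULT": result,
            "COMMENT": comment,
            "CREATED_DT": started.date(),
            "CREATED_DT_TM": started,
        })
```

In production the `finally` block would insert the record into the monitoring table through a separate connection, so failed queries are logged too.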

More interesting analysis, including job failures, errors and query result sizes (ROW_COUNT), can be done with this data.  You can also analyze performance for specific databases.  For example, when the same query runs against many different (MySQL) databases (say, more than 100), some databases show performance getting worse due to table size or load, etc.

Once you know there is a query performance issue, you can take remedial action.  Let me know how you have implemented query performance monitoring.


Google Analytics Plugin – Very useful one

Google Analytics (GA) is invaluable when you need to slice and dice web traffic data along numerous dimensions – by date, time and location, to name a few – measuring visits, visitors, pageviews, etc.  For the full list of possible combinations, take a look at this documentation.

While many business users and SEO/SEM analysts use GA on a regular basis to manage their Internet Marketing spend and improve site traffic for better ROI, in GA they can chart a single measure across segments and analyse only one metric at a time.  Users often download their data from AdWords or pull GA data through an in-house API setup.  Since almost all business users have Excel as one of their main analytical tools, they prefer it over other tools (whether in-house or 3rd party).  Also, given how quickly things change in the Internet Marketing (IM) world, quick analysis that business users can do on their own lets them respond rapidly to market changes.  You may find bullets 3 and 4 in this article from Avinash Kaushik relevant.

One freeware plug-in that provides this is the ExcellentAnalytics plug-in for Excel.

ExcellentAnalytics’ Plug-in

It is pretty simple to use, and a user can start with it immediately.  Once the data you are interested in is downloaded through the plug-in, you can exploit Excel's abilities to analyze it further without needing any other (IT) application.

After downloading and installing the plug-in, open MS Excel 2007, select the "ExcellentAnalytics" menu at the top and click the "Account" button; then enter your GA email and password so it can connect to GA.  Click "New Query", select the profile of interest and pick valid dimensions and metrics.  If the combination or query is invalid, the result set comes back empty; hopefully a future release of the plug-in will provide more user-friendly error information.

Some queries didn't return the right result sets – for example, "Visitor Recency" metrics like "most visits: the previous visit happened X days ago".  Also, in the Filter section dimensions are assumed to be string type, so a dimension that is a number can't be entered: if you select the "visit count" dimension and want to filter out any visit count less than 5, you can't.  Still, it is freeware and does a decent job for quick analysis.  Use it with caution; you can always check the numbers against GA's own reports to make sure the plug-in didn't mess anything up.

Sample reports: