Regular Expressions

MySQL regular expressions (RE) are a powerful tool that can be very useful in SQL string searches. They enable software engineers to build very concise statements and to handle complex string operations that otherwise wouldn’t be possible.

If you are new to regular expressions or would like to know more about them, http://www.regular-expressions.info/ is a good site to visit. You can also get an RE tutorial at net|tuts+. The following is a quick list of metacharacters to get you started.
. => A dot matches any single character
* => An asterisk matches zero or more occurrences of the preceding token
? => A question mark matches zero or one occurrence of the preceding token
$ => A dollar sign at the end of the pattern anchors the match to the end of the string
^ => A caret at the start of the pattern anchors the match to the beginning of the string
| => A pipe matches either of two alternatives. Example: abc|xyz => either ‘abc’ or ‘xyz’
{m,n} => A quantifier that matches the preceding token between ‘m’ and ‘n’ times, where m and n are integers

Different computer languages vary in how they handle more advanced searches and particular character sets. MySQL implements regular expressions through the REGEXP operator, which matches strings in case-insensitive mode; to match otherwise, see this blog.
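
The metacharacters listed above can be tried directly from the MySQL prompt. The following is a minimal sketch; the sample strings are illustrative and not from the original post, and the last query assumes the default case-insensitive collation.

```sql
-- Each query returns 1 (match) or 0 (no match).
SELECT 'cat'     REGEXP 'c.t';         -- 1: the dot matches the 'a'
SELECT 'ct'      REGEXP '^ca*t$';      -- 1: the asterisk allows zero 'a' characters
SELECT 'caaat'   REGEXP '^ca*t$';      -- 1: the asterisk allows several 'a' characters
SELECT 'color'   REGEXP '^colou?r$';   -- 1: the 'u' is optional
SELECT 'colouur' REGEXP '^colou?r$';   -- 0: at most one 'u' is allowed
SELECT 'xyz'     REGEXP 'abc|xyz';     -- 1: the second alternative matches
SELECT 'aaa'     REGEXP '^a{2,3}$';    -- 1: three 'a' characters fall within 2..3
SELECT 'aaaa'    REGEXP '^a{2,3}$';    -- 0: four 'a' characters exceed the upper bound
SELECT 'ADWORDS' REGEXP '^adwords$';   -- 1: matching ignores case under a _ci collation
```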

It is important to note that MySQL REGEXP returns 1 or 0 depending on whether the string matches the pattern. If either the string being searched or the pattern itself is NULL, it returns NULL. Other application languages may return an array, a list, or true/false. Thus REGEXP is more appropriate in a WHERE clause than in the SELECT list or elsewhere.

For example:

SELECT 'AdWords' REGEXP 'A*W*';    

returns 1
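
The three possible results described above can be seen side by side in a short sketch. The literals other than 'AdWords' are illustrative. Note also that 'A*W*' is satisfied by an empty match, so it returns 1 for any non-NULL string; an anchored pattern is a stricter test.

```sql
-- REGEXP yields 1 (match), 0 (no match), or NULL (either operand is NULL).
SELECT 'AdWords' REGEXP 'Words';   -- 1: 'Words' occurs in the string
SELECT 'AdWords' REGEXP 'xyz';     -- 0: no match
SELECT NULL      REGEXP 'Words';   -- NULL: the searched string is NULL
SELECT 'AdWords' REGEXP NULL;      -- NULL: the pattern is NULL

-- 'A*W*' can match zero characters, so it succeeds on any non-NULL string.
SELECT 'xyz'     REGEXP 'A*W*';    -- 1
-- Anchoring the pattern and requiring both letters gives a meaningful test.
SELECT 'AdWords' REGEXP '^A.*W';   -- 1
SELECT 'xyz'     REGEXP '^A.*W';   -- 0
```

In a WHERE clause a NULL result is treated as not true, so rows that produce NULL are simply filtered out.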

Try the following on the MySQL Sakila database:

SELECT * FROM country WHERE country REGEXP '^U.*';

It returns ‘Ukraine, United Arab Emirates, United Kingdom & United States’: all the country names that begin with ‘U’ followed by 0 (zero) or more characters.

For demonstration purposes, let us assume that we would like to get the name of the European explorer whose first name starts with ‘M’ and ends with ‘O’. Also, first and last names are not normalized and are stored in a single column.

SELECT name FROM (
SELECT substring_index(name, ' ', 1) REGEXP '^M.*O$' YN, name
FROM (
SELECT 'Christopher Columbus' name
UNION
SELECT 'Marco Polo' name
UNION
SELECT 'Vasco Da Gama' name
UNION
SELECT 'Francisco Pizarro' name
) x
) y
WHERE YN = 1;

The ‘^M.*O$’ pattern matches any string that begins (^) with ‘M’, may have any character (.) zero or more times (*), and ends ($) with ‘O’. substring_index(name, ' ', 1) extracts the first name, and because the match is case-insensitive the lowercase ‘o’ in ‘Marco’ qualifies. The result is Marco Polo.

There are also many special constructs one can use to handle groups of specific characters, positions in the string, etc. Below are the ones I end up using more often than others.

  • [[:<:]]token[[:>:]]

Matches the token only where it is neither preceded nor followed by an alphanumeric character or “_” (underscore), that is, as a whole word. For example,

SELECT 'Here is a token in a string' REGEXP '[[:<:]]token[[:>:]]';

returns 1 (true), whereas the following return 0 (false):

SELECT 'Here is a _token in a string' REGEXP '[[:<:]]token[[:>:]]';
SELECT 'Here is a tokens in a string' REGEXP '[[:<:]]token[[:>:]]';
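
The word-boundary markers belong to the older regex library that MySQL used. As far as I know, MySQL 8.0.4 and later use the ICU library, which does not support them and offers \b instead. The sketch below assumes such a server and the default SQL mode, where the backslash must be doubled inside a string literal.

```sql
-- Assumption: MySQL 8.0.4 or later (ICU regex), backslash escapes enabled.
-- '\\b' in the literal reaches the regex engine as \b, a word boundary.
SELECT 'Here is a token in a string'  REGEXP '\\btoken\\b';  -- 1: 'token' stands alone
SELECT 'Here is a _token in a string' REGEXP '\\btoken\\b';  -- 0: '_' counts as a word character
SELECT 'Here is a tokens in a string' REGEXP '\\btoken\\b';  -- 0: the trailing 's' is a word character
```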

  • [:character_class:]

This matches any character that belongs to a predefined character class. A few examples of classes are “alpha” (alphabetic characters), “digit” (digits), and “space” (any space, tab, newline, etc.).

SELECT column1, length(column1) FROM table1
WHERE column1 REGEXP '[[:space:]]';

returns column1 and its length only for rows where column1 contains a whitespace character, which is quite helpful for spotting whitespace that has crept in where it is not supposed to be.
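
The character classes can also be tried on literals without any table. A minimal sketch with illustrative strings:

```sql
-- A class name goes inside a bracket expression, hence the double brackets.
SELECT 'abc123'  REGEXP '[[:digit:]]';     -- 1: contains at least one digit
SELECT 'abcdef'  REGEXP '[[:digit:]]';     -- 0: contains no digit
SELECT 'abc def' REGEXP '[[:space:]]';     -- 1: contains a space
SELECT 'abcdef'  REGEXP '[[:space:]]';     -- 0: contains no whitespace
SELECT 'abc'     REGEXP '^[[:alpha:]]*$';  -- 1: every character is alphabetic
SELECT 'abc1'    REGEXP '^[[:alpha:]]*$';  -- 0: the '1' is not alphabetic
```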

REGEXP’s complex searches come with a caveat about how the search is done at the low level of bits and bytes. Both RLIKE and REGEXP work in a byte-wise fashion and can produce unexpected results when used on multi-byte strings. With any LATIN character set they are safe to use, but with UNICODE you need to be careful.
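
One way to check how a given server treats multi-byte text is to compare byte length with character length and then test a single-character pattern. This is only a sketch: it assumes a utf8 or utf8mb4 connection, and the outcome of the last query depends on the server version. On a byte-wise implementation the dot consumes one byte, so the query can return 0 even though the string holds a single character; a Unicode-aware implementation (which the MySQL manual describes for 8.0.4 and later) returns 1.

```sql
-- Assumption: utf8/utf8mb4 connection, so 'é' occupies two bytes.
SELECT LENGTH('é'), CHAR_LENGTH('é');  -- byte length versus character length
-- Does one dot cover the whole one-character string?
SELECT 'é' REGEXP '^.$';
```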

Hope that helps,
Shiva


