Character set, Collation

Here is an interesting challenge that I ran into in the Data Warehouse environment. While debugging a particular business user query, I joined between a string built in a sub-query to a column in a particular fact table with some 20million rows. BTW, I was using MySQL 5.0 and Toad 4.5 to run this query and instead of getting any result I got the following error!

MySQL Database Error: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation '='    22    0

The query was simple and similar to

SELECT s.str, s.d1, s.d2
FROM (
SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
FROM table_1
WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str = t2.str
AND ...

The db error was indicating me that I was using two different collations for string comparison and shouldn’t be doing so. That left me with either forcing one collation to be converted to other by force or look closely at server, database, table, column character set or collations and change it appropriately, if needed.

By default Toad sets the character_set_client, character_set_collation to ‘utf8’ and the column that joined in table 2 was in ‘latin1’. Here is the first solution:

SELECT s.str, s.d1, s.d2
FROM (
SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
FROM table_1
WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str COLLATE utf8_general_ci = t2.str
AND ...

Here the COLLATE is used in WHERE clause but it can also be used in SELECT or GROUP BY or other aggregate functions.

Then looking at server, database’s related variables indicated that they were in default setting.

character_set_server = latin1
collation_server = latin_swedish_ci
character_set_database = latin1
collation_database = latin_swedish_ci

and so was the table 2.

By setting the collation_connection and character_set_client to latin, I could effectively use the latin collation. Here is solution #2.

SET character_set_client = 'latin1';
COMMIT;
SET collation_connection = 'latin1_swedish_ci';
COMMIT;

Note, by setting the collation_connection I also set the character_set_connection to respective character set. After this, the query ran successfully.

Enjoy,
Shiva



Regular Expressions

MySQL regular expressions (RE) are a powerful tools that can be very useful in SQL string searches. They enable software engineer to build statements that are very concise and to handle complex string operations that otherwise wouldn’t be possible.

If you are new to regular expressions or would like to know more about them, http://www.regular-expressions.info/ is a good site to visit. You can also get a RE tutorial at net|tuts+. Following is quick list of meta characters that can get you started in using them.
. => A dot matches single character
* => An asterisk matches zero or more of previous matched tokens
? => A question mark matches zero or one time the previous matched token
$ => A dollar at the end anchors the search to the end of string
^ => A caret symbol anchors the search to the beginning of the string
| => A pipe matches either of the two. Example: abc|xyz => either ‘abc’ or ‘xyz’
{m,n} => A quantifier matching between ‘m’ and ‘n’ times. m & n are integers.
Different computer languages have some variations when it comes to more advanced searches and how they handle given character sets. MySQL uses REGEXP string function to implement and matches the string in case “in-sensitive” mode and to match otherwise see this blog.

It is important to note that MySQL REGEXP returns 1 or 0 depending on whether it finds the match in the pattern or not. If a match that your looking for in the pattern or the pattern itself is null, then it returns NULL. Other application languages may return an array or list or true/false. Thus use of REGEXP in where clause is more appropriate than in either SELECT or others.

For example:

SELECT 'AdWords' REGEXP 'A*W*';    

returns 1

Trying the following on MySQL Sakila database:

SELECT * FROM country  WHERE country REGEXP '^U.*' ;

and it returns ‘Ukraine, United Arab Emirates, United Kingdom & United States’; all the country names that begin with ‘U’ and followed by 0 (zero) or more characters.
For demonstration purpose, let us assume that we would like to get a name of the European explorer whose first name starts with ‘M’ and ends with ‘O’. Also, first and last name are not normalized and stored in single column.

SELECT name FROM (
SELECT substring_index(name, ' ', 1) REGEXP '^M.*O' YN, name
FROM (
SELECT 'Christopher Columbus' name
UNION
SELECT 'Marco Polo' name
UNION
SELECT 'Vasco Da Gama' name
UNION
SELECT 'Francisco Pizarro' name
) x
) y
WHERE YN = 1

‘^M.*O’ pattern matches any expression that begins (^) with ‘F’ and may have any character (.) zero or more times (*) and ending with ‘D’. The result is Marco Polo.
There are also many special constructs one can use to handle group of specific characters or positions in the string, etc. Below are the ones I end up using more often than others.

  • [[:<:]]token[[:>:]]

Searches the token in the string that is neither preceded nor followed by any alpha-numerals or “_” (underscore). For example,

SELECT 'Here is a token in a string' REGEXP '[[::]]'

Returns 1 (true), where as the following ones return 0 (false)

SELECT 'Here is a _token in a string' REGEXP '[[::]]'
SELECT 'Here is a tokens in a string' REGEXP '[[::]]'

  • [:character_class:]
This would match any group of characters that belong to predefined character class. Few examples of classes are “alpha” – alpha characters, “digit” – digit characters, “space” – any space, tab, newline, etc.

SELECT column1, length(column1) from table
WHERE column1 REGEXP '[[:space:]]' ;

will return only those column1 and its length if they have any space embedded in them – quite helpful when you want to identify that any space character has crept in when it is not suppose to.

REGEXP’s complex searches come with a caveat of the way the search is done at the low levels – bits and bytes. Both RLIKE and REGEXP work in byte-wise fashion and can result in unexpected results when used on multi-byte strings. For any LATIN character set one will be safe to use them but for UNICODE need to be careful.

Hope that helps,
Shiva



MySQL User Defined Variable

This is a brief introduction to user defined variables in MySQL as many times I have gotten queries regarding how to use them in dynamic queries.

In MySQL you can easily create user defined variable and use it through out that particular session or client connection. A variable is a alphanumeric characters following a “@”. In the versions 5 and above the name is case-insensitive and hence @VAR, @vaR all mean the same. For example:
set @var = 1;
The above statement creates a variable called “var” and sets it to 1. Also note you don’t need to declare it before using it.
The statement,
set @var1 = 1, @var2 = 2, @var3 = 3;
sets all three variables in single statement. You can then select the variables from
select @var1, @var2, @var3;
Now, let us say that you would like to select particular row(s) from a given table. To do so you need to first build the sql string, prepare it and then execute it. This allows you to pass the variables one or more times to the same statement.
For this let us assume we have a table (table_t) that has two columns – id and name and has 3 rows (1, ‘one’), (2, ‘two’) and (3, ‘three’). To select row with id = 1
set @var = 1;
set @st = concat(‘select id, name from table_t where id = ‘, @var);
prepare stmt from @st;
execute stmt ;
And to select with string variable like names you need to escape the single quote as below.
set @var = ‘one’
set @st = concat(‘select id, name from table_t where name = ‘, ””, @var, ””);
prepare stmt from @st;
execute stmt ;
This is a trivial example but you get the idea of how to use user defined variables. The same technique can be used to build more complex stored procedures and statements that are executed often with different variables.
Cheers,
Shiva