Character set, Collation

Here is an interesting challenge that I ran into in the Data Warehouse environment. While debugging a particular business user query, I joined between a string built in a sub-query to a column in a particular fact table with some 20million rows. BTW, I was using MySQL 5.0 and Toad 4.5 to run this query and instead of getting any result I got the following error!

MySQL Database Error: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation '='    22    0

The query was simple and similar to

SELECT s.str, s.d1, s.d2
FROM (
SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
FROM table_1
WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str = t2.str
AND ...

The db error was indicating me that I was using two different collations for string comparison and shouldn’t be doing so. That left me with either forcing one collation to be converted to other by force or look closely at server, database, table, column character set or collations and change it appropriately, if needed.

By default Toad sets the character_set_client, character_set_collation to ‘utf8’ and the column that joined in table 2 was in ‘latin1’. Here is the first solution:

SELECT s.str, s.d1, s.d2
FROM (
SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
FROM table_1
WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str COLLATE utf8_general_ci = t2.str
AND ...

Here the COLLATE is used in WHERE clause but it can also be used in SELECT or GROUP BY or other aggregate functions.

Then looking at server, database’s related variables indicated that they were in default setting.

character_set_server = latin1
collation_server = latin_swedish_ci
character_set_database = latin1
collation_database = latin_swedish_ci

and so was the table 2.

By setting the collation_connection and character_set_client to latin, I could effectively use the latin collation. Here is solution #2.

SET character_set_client = 'latin1';
COMMIT;
SET collation_connection = 'latin1_swedish_ci';
COMMIT;

Note, by setting the collation_connection I also set the character_set_connection to respective character set. After this, the query ran successfully.

Enjoy,
Shiva



Case Sensitive SQL Search

MySQL on default installation uses the case insensitive configuration meaning the data you stored and searched will not be case sensitive.

Let us first check our current server and database’s appropariate variables.

SHOW VARIABLES
WHERE variable_name = 'character_set_server'

> latin1

SHOW VARIABLES
WHERE variable_name = 'collation_database'

> latin1_swedish_ci

Since we are using latin based language settings, the cases are different unlike some languages that don’t have cases. And with the database collation set to latin_swedish_ci the string search will be case insensitive (ci).

For this illustration, I have created a table “country” with single column “country_name” and inserted 3 rows.
CREATE TABLE country (
country_name varchar(32) NOT NULL
);
INSERT INTO country
VALUES ('US'), ('UK'), ('ukraine')
;
COMMIT;
With the following SELECT I can look up the table and get all country names that begin with u or U.

SELECT country_name FROM country
WHERE country_name like 'u%';
I need to correct ‘ukraine’ to ‘Ukraine’ and can do so by the following statement.

UPDATE country
SET country_name = concat('U', substring_index(country_name, 'u', -1))
WHERE BINARY country_name like 'u%';
COMMIT;

NOTE the use of ‘BINARY’ in the where clause above.

This is a simple example which effectively demonstrates the use of case sensitive matching.
Enjoy,

Shiva