In this blog post, I answer the questions left over from the Troubleshooting Issues with MySQL Character Sets webinar.
First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:
Q: We’ve had some issues converting tables from utf8 to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_520_ci – did not distinguish between spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations for the varchar fields when two entries had the same text with different kinds of spaces. Have you seen this problem and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.
A: Unfortunately, this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use the default utf8 or utf8mb4 collations. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.
mysql> show collation like '%0900%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
21 rows in set (0,03 sec)
In 8.0.1 they promised new case-sensitive and Japanese collations. Please see this blog post for details. The note about the planned Japanese support is at the end.
Meanwhile, I can only suggest that you implement your own collation as described here. You may use utf8_russian_ci collation from Bug #51976 as an example.
Although the user manual does not list utf8mb4 as a character set for which it’s possible to create new collations, you can actually do it. What you need to do is add a record about the character set utf8mb4 and the new collation into Index.xml, then restart the server.
<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>

mysql> show collation like 'utf8mb4_russian_ci';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_russian_ci | utf8mb4 | 1033 |         |          |       8 |
+--------------------+---------+------+---------+----------+---------+
1 row in set (0,03 sec)

mysql> create table test_yo(gen varchar(100) CHARACTER SET utf8mb4, yo varchar(100) CHARACTER SET utf8mb4 collate utf8mb4_russian_ci) engine=innodb default character set=utf8mb4;
Query OK, 0 rows affected (0,25 sec)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0,02 sec)

mysql> insert into test_yo values('ел', 'ел'), ('ель', 'ель'), ('ёлка', 'ёлка');
Query OK, 3 rows affected (0,05 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> insert into test_yo values('Ел', 'Ел'), ('Ель', 'Ель'), ('Ёлка', 'Ёлка');
Query OK, 3 rows affected (0,06 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from test_yo order by gen;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ел       | ел       |
| Ел       | Ел       |
| ёлка     | ёлка     |
| Ёлка     | Ёлка     |
| ель      | ель      |
| Ель      | Ель      |
+----------+----------+
6 rows in set (0,00 sec)

mysql> select * from test_yo order by yo;
+----------+----------+
| gen      | yo       |
+----------+----------+
| ел       | ел       |
| Ел       | Ел       |
| ель      | ель      |
| Ель      | Ель      |
| ёлка     | ёлка     |
| Ёлка     | Ёлка     |
+----------+----------+
6 rows in set (0,00 sec)
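Coming back to the ideographic-space collisions mentioned in the question: the effect can be reproduced outside MySQL with Unicode compatibility normalization, which is roughly what accent- and width-insensitive collations approximate. This is a standalone Python sketch, not MySQL code:

```python
import unicodedata

ascii_space = "\u0020"        # ordinary ASCII space
ideographic_space = "\u3000"  # Japanese full-width (ideographic) space

# They are distinct code points...
assert ascii_space != ideographic_space

# ...but NFKC compatibility normalization folds the ideographic space into
# the ASCII space, which is essentially why a width-insensitive collation
# treats two strings differing only in space kind as duplicates.
assert unicodedata.normalize("NFKC", ideographic_space) == ascii_space
```

A unique index built under such a collation therefore behaves as if both strings had been normalized first, producing the constraint violations described above.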
Q: If we receive utf8 data into a latin1 table, it will be corrupted. Just to confirm: can you reformat the column as utf8 and un-corrupt the data? Also, is there a time limit on how quickly this needs to be done?
A: It will be corrupted only if you store utf8 data in the latin1 column. For example, if you have a table, defined as:
create table latin1(
  f1 varchar(100)
) engine=innodb default charset=latin1;
And then insert a word in utf8 format into it that contains characters that are not in the latin1 character set:
mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> set sql_mode='';
Query OK, 0 rows affected, 1 warning (0,00 sec)

mysql> insert into latin1 values('Sveta'), ('Света');
Query OK, 2 rows affected, 1 warning (0,04 sec)
Records: 2  Duplicates: 0  Warnings: 1
The data in UTF8 will be corrupted and can never be recovered:
mysql> select * from latin1;
+-------+
| f1    |
+-------+
| Sveta |
| ????? |
+-------+
2 rows in set (0,00 sec)

mysql> select f1, hex(f1) from latin1;
+-------+------------+
| f1    | hex(f1)    |
+-------+------------+
| Sveta | 5376657461 |
| ????? | 3F3F3F3F3F |
+-------+------------+
2 rows in set (0,01 sec)
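What happened here can be reproduced with Python's latin-1 codec (an analogy to the server's conversion, not MySQL code): characters that have no latin1 representation are replaced by '?', which is why the hex output above shows only 3F bytes and why the original text cannot be recovered.

```python
# 'Света' has no latin1 representation, so each character is replaced
# by '?' during conversion, just like the server did above.
stored = "Света".encode("latin-1", errors="replace")

# Five '?' bytes, matching the 3F3F3F3F3F seen in hex(f1); the
# original Cyrillic text is irretrievably gone.
assert stored == b"?????"
assert stored.hex() == "3f3f3f3f3f"
```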
However, if your data is stored in the UTF8 column and you use latin1 for a connection, you will only get a corrupted result set. The data itself will be left untouched:
mysql> create table utf8(f1 varchar(100)) engine=innodb character set utf8;
Query OK, 0 rows affected (0,18 sec)

mysql> insert into utf8 values('Sveta'), ('Света');
Query OK, 2 rows affected (0,15 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> set names latin1;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+-------+----------------------+
| f1    | hex(f1)              |
+-------+----------------------+
| Sveta | 5376657461           |
| ????? | D0A1D0B2D0B5D182D0B0 |
+-------+----------------------+
2 rows in set (0,00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> select f1, hex(f1) from utf8;
+------------+----------------------+
| f1         | hex(f1)              |
+------------+----------------------+
| Sveta      | 5376657461           |
| Света      | D0A1D0B2D0B5D182D0B0 |
+------------+----------------------+
2 rows in set (0,00 sec)
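The difference from the previous case can be sketched in Python (again only an analogy): the stored bytes are valid UTF-8, so interpreting them as latin1 merely garbles the display, while the bytes themselves stay intact and decode back perfectly.

```python
data = "Света".encode("utf-8")    # the bytes actually stored in the column
garbled = data.decode("latin-1")  # what a latin1 connection shows

# The display looks corrupted...
assert garbled != "Света"

# ...but the underlying bytes are untouched, so re-interpreting them
# as UTF-8 recovers the original text completely.
assert garbled.encode("latin-1").decode("utf-8") == "Света"
```

This is also why there is no time limit in this scenario: nothing is lost until the bytes themselves are rewritten through a lossy conversion.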
Q: Can you discuss how charsets affect mysqldump? Specifically, how do we dump a database containing tables with different default charsets?
A: Yes, you can dump tables with different default character sets in a single run. MySQL can successfully convert data between character sets, so your only job is to specify the option --default-character-set for mysqldump. Strings in any character set you use are then converted to the character set specified. For example, if you use cp1251 and latin1, you may set --default-character-set to cp1251, utf8 or utf8mb4. However, you cannot set it to latin1, because Cyrillic characters exist in the cp1251 character set but do not exist in latin1.
The default value for mysqldump is utf8. You only need to change this default if you use values that are outside of the range supported by utf8 (for example, the smileys in utf8mb4).
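A rough check outside MySQL, using Python's UTF-8 codec, shows which strings fit within the 3-byte-per-character limit of MySQL's legacy utf8 (utf8mb3) and which need utf8mb4 (a standalone sketch, not MySQL itself):

```python
# MySQL's legacy "utf8" (utf8mb3) stores at most 3 bytes per character.
# Cyrillic letters encode to 2 bytes each in UTF-8, so they fit...
assert all(len(ch.encode("utf-8")) <= 3 for ch in "Света")

# ...while emoji are 4-byte UTF-8 sequences and therefore need utf8mb4.
assert len("😀".encode("utf-8")) == 4
```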
Q: But if you use the --single-transaction option for mysqldump, you can only specify one character set in the default?
A: Yes, and this is OK: all data will be converted into this character set. Then, when you restore the dump, it will be converted back to the character set specified in the column definitions.
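This round trip can be illustrated outside MySQL with Python's codecs (an analogy, not the server's conversion code): text that exists in cp1251 converts to utf8 for the dump file and back to cp1251 on restore without loss.

```python
original = "Света"                     # stored in a cp1251 column

as_cp1251 = original.encode("cp1251")  # on-disk bytes, 1 byte per character
as_utf8 = original.encode("utf-8")     # what ends up in the dump file

# Restoring converts the dump back into the column's character set;
# since every cp1251 character exists in utf8, nothing is lost.
assert as_utf8.decode("utf-8").encode("cp1251") == as_cp1251
```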
Q: I noticed that MySQL doesn’t support case-sensitive UTF-8 character sets. What do you recommend for implementing case-sensitive UTF-8, if it’s at all possible?
A: In the link I provided earlier, Oracle promises to implement case-sensitive collations for utf8mb4 in version 8.0.1. Until that happens, I recommend implementing your own case-sensitive collation.
Q: How are tools like pt-table-checksum affected by charsets? Is it safe to use a 4-byte charset (like utf8mb4) as the default charset for all comparisons? Assuming our tables are a mix of latin1, utf8 and utf8mb4.
A: With this combination, you won’t have any issues: pt-table-checksum uses a complicated set of functions that joins columns and calculates a crc32 checksum on them. In your case, all data will be converted to utf8mb4 and no conflicts will happen.
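The idea behind the tool's checksum query can be sketched in Python (a loose illustration only: `row_checksum` is a made-up helper, and the real tool runs CONCAT_WS plus CRC32 with BIT_XOR aggregation inside MySQL):

```python
import zlib

def row_checksum(row):
    # Join the column values with a separator, similar in spirit to
    # CONCAT_WS('#', ...), then take a CRC32 of the UTF-8 bytes.
    return zlib.crc32("#".join(row).encode("utf-8"))

# Identical rows produce identical checksums; any difference changes the CRC.
assert row_checksum(("Sveta", "Света")) == row_checksum(("Sveta", "Света"))
assert row_checksum(("Sveta", "Света")) != row_checksum(("Sveta", "Cveta"))
```

Because every value is first converted to one connection character set before hashing, mixed-but-compatible column charsets (latin1, utf8, utf8mb4) compare cleanly.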
However, if you use incompatible character sets in a single table, you may get the error "Illegal mix of collations for operation 'concat_ws' ":
mysql> create table cp1251(f1 varchar(100) character set latin1, f2 varchar(100) character set cp1251) engine=innodb;
Query OK, 0 rows affected (0,32 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0,00 sec)

mysql> insert into cp1251 values('Sveta', 'Света');
Query OK, 1 row affected (0,07 sec)

sveta@Thinkie:~/build/mysql-8.0/mysql-test$ ~/build/percona-toolkit/bin/pt-table-checksum h=127.0.0.1,P=13000,u=root,D=test
Diffs cannot be detected because no slaves were found. Please read the --recursion-method documentation for information.
03-18T03:51:58 Error executing EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "EXPLAIN SELECT COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*explain checksum table*/"] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 11351.
03-18T03:51:58 Error checksumming table db1.cp1251: Error executing checksum query: DBD::mysql::st execute failed: Illegal mix of collations for operation 'concat_ws' [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `f1`, `f2`, CONCAT(ISNULL(`f1`), ISNULL(`f2`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `db1`.`cp1251` /*checksum table*/" with ParamValues: 0='db1', 1='cp1251', 2=1, 3=undef, 4=undef, 5=undef] at /home/sveta/build/percona-toolkit/bin/pt-table-checksum line 10741.
            TS ERRORS  DIFFS  ROWS  CHUNKS SKIPPED    TIME TABLE
03-18T03:51:58      2      0     0       1       0   0.003 db1.cp1251
03-18T03:51:58      0      0     2       1       0   0.167 db1.latin1
03-18T03:51:58      0      0     6       1       0   0.198 db1.test_yo
...
The tool continues working, and will process the rest of your tables. I reported this behavior as Bug #1674266.
Thanks for attending the Troubleshooting Issues with MySQL Character Sets webinar.
To be more precise: the data will be corrupted only when character_set_client and character_set_connection differ from the table's character set. If the table uses latin1 and you also set your session character set to latin1, you can insert Arabic or Cyrillic characters without losing anything.
Thank you for the addition!
I just want to alert readers that keeping data in the wrong character set can create more issues than it solves.