In this post, I’ll discuss how MySQL 5.7 handles UTF8MB4 and the load data infile.
Many of my clients have told me that they do not like using the LOAD DATA INFILE statement and prefer to manually parse and load the data. The main reason they do it are issues with the character sets, specifically UTF8MB4 and the load data infile. This was surprising to me as nowadays everyone uses UTF8. MySQL 5.7 (as well as 5.6) has full support for UTF8MB4, which should fix any remaining issues (i.e., you can now load new emoji, like ?).
Last week I was investigating an interesting case where we were loading data and got the following error:
1 2 3 4 5 6 7 8 | mysql -e 'select version()' +-----------+ | version() | +-----------+ | 5.7.12 | +-----------+ $ mysql -vvv testdb < load_data.sql ERROR 1300 (HY000) at line 1: Invalid utf8mb4 character string: 'Casa N' |
The load data statement:
1 2 3 4 5 6 7 8 9 10 11 12 | LOAD DATA LOCAL INFILE 'input.psv' REPLACE INTO TABLE input CHARACTER SET utf8mb4 FIELDS TERMINATED BY '|' LINES TERMINATED BY 'rn' IGNORE 1 LINES |
The table uses the correct character set (global character set applied to all varchar fields):
1 2 3 4 5 6 | CREATE TABLE `input` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, ... `address` varchar(255) DEFAULT NULL, ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
The string looked like “Casa Nº 24”. So this should be N + U+00BA (MASCULINE ORDINAL INDICATOR, hex code: c2ba). When I do “less input.tsv”, it shows N<BA> 24. So why can’t MySQL load it?
After further investigation, we discovered the original encoding is not UTF8. WE found out by running:
1 2 | $ file -i input.tsv input.tsv: text/plain; charset=iso-8859-1 |
So the code <BA> was misleading. Also, when I got the actual character from the file, it was just one byte (UTF8 for this character should be two bytes). When MySQL parsed the UTF8 input file, it found only the first part of the multibyte UTF8 code and stopped with an error.
The original character in hex is “ba”:
1 2 | xxd -p char_ascii ba0a |
(0a is a carriage return, and “ba” is “masculine ordinal indicator”)
The UTF8 equivalent:
1 2 | $ xxd -p char_utf8 c2ba0a |
This is now two bytes (+ carriage return): c2ba
To solve the problem we can simply change the CHARACTER SET utf8mb4 to CHARACTER SET latin1 when doing a load data infile. This fixed the issue:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | Query OK, 2 rows affected (0.00 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> set names utf8mb4; Query OK, 0 rows affected (0.00 sec) mysql> select address from input; +--------------------------------+ | consignee_address | +--------------------------------+ | Casa Nº 24 ................... | ... +--------------------------------+ 2 rows in set (0.00 sec) |
Another option will be to detect the character set encoding (iconv can do it) and convert to UTF8.
But it worked before…?
It worked a bit differently in MySQL 5.6:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | $ mysql -e 'select version()' +-------------+ | version() | +-------------+ | 5.6.25-73.0 | +-------------+ $ mysql -vvv testdb < load_data.sql ... Query OK, 2 rows affected, 2 warnings (0.00 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 2 -------------- show warnings -------------- +---------+------+---------------------------------------------------------------------+ | Level | Code | Message | +---------+------+---------------------------------------------------------------------+ | Warning | 1366 | Incorrect string value: 'xBA 24 ...' for column 'address' at row 1 | | Warning | 1366 | Incorrect string value: 'xBA 24 ...' for column 'address' at row 2 | +---------+------+---------------------------------------------------------------------+ 2 rows in set (0.00 sec) |
MySQL 5.7 is more strict and doesn’t allow you to insert data in the wrong format. However, it is not 100% consistent. For some characters, MySQL 5.7 will also throw a warning if disabling strict SQL mode.
Another character that caused the same issue was xC9. When loading to MySQL 5.7 with the default sql_mode (ONLY_FULL_GROUP_BY, STRICT_TRANS_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION) it throws an error:
1 | ERROR 1366 (HY000) at line 1: Incorrect string value: 'xC9' for column 'address' at row 1 |
When disabling the strict mode it now defaults to warnings:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> set global sql_mode = ''; Query OK, 0 rows affected (0.00 sec) Query OK, 2 rows affected, 1 warning (0.00 sec) Records: 1 Deleted: 1 Skipped: 0 Warnings: 1 -------------- show warnings -------------- +---------+------+--------------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------------+ | Warning | 1366 | Incorrect string value: 'xC9' for column 'address' at row 1 | +---------+------+--------------------------------------------------------------+ 1 row in set (0.00 sec) |
Emoji in MySQL
With UTF8MB4 support (in MySQL 5.6 and 5.7), you can also insert a little dolphin into a MySQL table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | CREATE TABLE `test_utf8mb4` ( `id` int(11) NOT NULL AUTO_INCREMENT, `v` varchar(100) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; mysql> set names utf8mb4; Query OK, 0 rows affected (0.00 sec) mysql> insert into test_utf8mb4 (v) values ('Dolphin:?'); Query OK, 1 row affected (0.00 sec) mysql> select * from test_utf8mb4; +----+--------------+ | id | v | +----+--------------+ | 1 | Dolphin:? | +----+--------------+ 1 row in set (0.00 sec) |
This should help you clear up issues with UTF8MB4 and the load data infile. Have fun!
It is really good information. I guess if we don’t mention character set in load data command then by default it loads the data with the database character set. Please correct me if I am wrong. Also le me know, Is it a good idea to use iconv linux command to convert the character set ?
Thanks a lot!
GREAT INFO and Thank you! We had a situation where a client upgraded from MySQL 5.1 to MariaDB through MySQLDump but did not do any upgrades to their application. They were getting insert errors with utf8mb4 complaints into the (default) latin1 charset types of the MySQL 5.1 database. Nobody really talked about what to do in this situation anywhere. We solved it by telling MariaDB to look at the MySQL 5.1 database with this in the mariadb.cnf (e.g., my.cnf) file:
[mysqld]
character-set-server=latin1
character-set-server=latin1
collation-server = latin1_general_ci
This also solved copy/paste issues in the web-browser.
Down mariadb. Add this to my.cnf in the [mysqld] section. Comment out any utf8 or utf8mb4 stuff. Save the file then restart the database. Watch for errors. Start your application and voila, it should work.
Thanks
David
thanks a lot men you save my life 🙂