MySQL and Unicode
Daniël van Eeden
Percona Live Amsterdam, 23 September 2015

Welcome. My name is Daniël and I work for Booking.com. This presentation is about MySQL and Unicode.

Booking.com is available in more than 40 languages
● So Unicode is important to us.
With more than 40 languages, Unicode is of critical importance to us. This is the Booking.com website in Arabic. Note that the text also flows from right to left.

Also my name is Daniël, not Daniel
My name is Daniël. There are dots on the 'e'. Those are important to me.

Also my name is Daniël, not Daniël
The image here shows examples from letters I got in the mail. This happens quite often, also on websites. Marking every special character as illegal is not a solution.

First some history
Let's first start with some history about character sets.

ASCII
Let's start with ASCII. ASCII was invented in 1963 to allow communication between systems of different vendors. A little-known fact is that ASCII was made not for computers but for teleprinters. One of the interesting decisions in ASCII is that it does not require state: it does not encode the 'shift'. Before ASCII there was Baudot (5-bit, 1870).
● Encodes characters as 7-bit
In ASCII, 1 character equals 1 byte.
● The 8th bit can be used as parity, but that was never common.
● The “3568 ASCII” asteroid is named after it.
Fun fact.

ISO-8859
ISO-8859 was created in 1985 and is a set of 16 character sets. The best known one is ISO-8859-1. It covers more than just English. When the euro was introduced, ¤ was replaced with € and the result was named ISO-8859-15.
● Uses the extra bit to be able to store a second set of 128 characters
ISO-8859 replaces ISO 646 (1972), which was a 7-bit mess. ECMA is the European Computer Manufacturers Association.
● The base characters (0 to 127) are shared between ASCII and ISO-8859-?
● The other characters differ per country/region
● Windows-1252 (CP1252) is mostly identical to ISO-8859-1
● ISO-8859-1 is also known as Latin1
● Latin1 in MySQL is not ISO-8859-1, but CP1252.

mysql> SHOW CHARSET LIKE 'latin1';
+---------+----------------------+-------------------+--------+
| Charset | Description          | Default collation | Maxlen |
+---------+----------------------+-------------------+--------+
| latin1  | cp1252 West European | latin1_swedish_ci |      1 |
+---------+----------------------+-------------------+--------+
1 row in set (0.00 sec)

Unicode
The work on Unicode started in 1987.
● Allows you to store text in any language, both living and dead
● Allows you to store text combining multiple languages in the same file
● Each character gets a number (a.k.a. code point) and a description.
● That doesn't guarantee your font will display it.
You might see a replacement character instead. This can be a question mark or some square.

UTF-8
UTF stands for Unicode Transformation Format.
● This is a character encoding for Unicode.
It is not the only Unicode encoding. Others are UTF-16 (the fixed-width variant is ucs2) and UTF-32 (ucs4).
● This translates from code points to a binary string.
● UTF-8 and ASCII share the same characters for code points 0 to 127.
● Non-ASCII characters are stored as 2, 3 or 4 bytes.

[Chart: storage needed per character, on a scale from 0 to 32, for UTF-32, UTF-16, UTF-8, ISO-8859-1, ASCII and Baudot]
Here you can see the minimum and maximum storage required for one character. The blue part shows the minimum and the red part shows the variable part.
This shows that UTF-8 is efficient in terms of storage for Latin scripts.

● If a byte starts with '0' (0xxxxxxx), it is a 1-byte character.
● If a byte starts with '110', it is the start of a 2-byte character.
● If a byte starts with '10', it is a continuation of a multibyte character.
● If a byte starts with '1110', it is the start of a 3-byte character.
● If a byte starts with '11110', it is the start of a 4-byte character.
Examples:
a = 01100001
ë = 11000011 10101011
Here you can see the letter a and the letter e with the dots (diaeresis, trema).

UTF-8 And MySQL
Now we get into MySQL specifics.

Some reasons to use UTF-8 in MySQL
● Non-English scripts like Chinese, Cyrillic or Greek.
Think of names, comments, URLs and e-mail addresses.
● Emoji (including ☰ in the help text of your mobile app)
That is the hamburger icon.

● utf8 in MySQL is an alias for utf8mb3
● utf8mb3 can store 3-byte UTF-8
● utf8mb4 can store 4-byte UTF-8
utf8mb4 has existed since MySQL 5.5.3.
Best practice: Always use utf8mb4, don't use utf8

Where to set the encoding?
● It is set on a per-column basis
● There is a per-table default
● There is a per-database default
It is stored in db.opt. Use ALTER DATABASE to change it.
● There is a per-server default: character_set_server
● Connections also have a character set
Set the character set in your connection properties. If that isn't possible, use “SET NAMES utf8mb4”.

Drawbacks of UTF-8

So just set everything to utf8mb4?
The question we want to answer is...

It depends
The answer is...
● Does your application support it?
Think of input validation, character length and security.
● CHAR(10) suddenly needs 40 bytes!
● TINYTEXT has a size limit in bytes
With utf8mb4 you can store between 63 and 255 characters. The same applies to the other TEXT types and to the BLOB types.
● The MEMORY storage engine expands VARCHAR(10) to 40 bytes
This affects both user-created tables and internal temporary tables.
● With InnoDB your index grows over 767 bytes.
Use innodb_large_prefix with the COMPRESSED or DYNAMIC row format.
Best practice: Use latin1 for server, database and table default.
Enable Unicode on columns which need it.
Or use utf8mb4 all the way if you don't need the efficiency and performance of latin1. Changing everything to VARBINARY and BLOB will not solve your issue.

Converting your data

How to convert from latin1 to utf8mb4?
ALTER TABLE t1 MODIFY COLUMN c1 VARCHAR(100) CHARACTER SET utf8mb4;

But I have many columns!
● Use CONCAT() and information_schema to generate the statements
● Or convert all columns:
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4;
Also for INSERTs!

Change defaults
● Set character_set_server
● ALTER SCHEMA s1 DEFAULT CHARACTER SET utf8mb4;
● ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4;

Common failures

Application looks okay, but in MySQL the data looks wrong
'Search' in the application might not function correctly.
● The latin1 column was holding utf8 data already
● Wrong conversion == garbage
Don't just convert this data. Running a latin1 to UTF-8 conversion on data which already was UTF-8 will result in garbage.
● Change the column to varbinary and then to utf8mb4 to not convert the data.

The conversion fails and eats your data
MySQL tries really hard to convert your data, but this might not be possible.
● Use sql_mode='STRICT_ALL_TABLES'
● Now the operation will fail instead of truncating your data
This also applies to inserts.

Connection set to utf8, but data is 4-byte UTF-8.
You can't insert or request 4-byte characters.

Collation support
● There is no utf8mb4_general_cs (case sensitive)
● There is utf8mb4_unicode_ci
● And utf8mb4_unicode_520_ci
● And utf8mb4_bin
utf8mb4_unicode_ci is based on UCA 4.0.0 and utf8mb4_unicode_520_ci on UCA 5.2.0. The latest UCA version is 8.0.0. Here we compare the sun and moon emoji.

Special collations get lost during conversion
Collation = sorting and equality.
● ALTER TABLE…CONVERT TO… only supports one collation
● Save the collation before the ALTER and then restore it for columns which have a non-default collation.

Collation mismatch
Here MySQL does not know which collation to use.
● Use COLLATE to set the desired collation for the operation.
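The "wrong conversion == garbage" failure described under Common failures can be reproduced outside MySQL. The following is a minimal Python sketch, not MySQL's actual conversion code: it assumes Python's cp1252 codec is a close enough stand-in for MySQL's latin1, and the function names are made up for illustration.

```python
def misread_as_latin1(utf8_bytes: bytes) -> str:
    """Read UTF-8 bytes as if each byte were one cp1252 (MySQL latin1) character.

    Assumption: Python's cp1252 codec stands in for MySQL's latin1. Python
    rejects a few undefined bytes (for example 0x81); this input has none.
    """
    return utf8_bytes.decode("cp1252")


def convert_latin1_to_utf8(text: str) -> bytes:
    """Encode the misread text as UTF-8, as a character set conversion would."""
    return text.encode("utf-8")


name = "Dani\u00ebl"                 # Daniël with a precomposed ë (U+00EB)
stored = name.encode("utf-8")        # bytes the application wrote: 7 bytes

# The column claims latin1, so the two bytes of ë are seen as two characters.
garbled = misread_as_latin1(stored)

# Converting that text to UTF-8 encodes the garbage once more: 9 bytes.
double_encoded = convert_latin1_to_utf8(garbled)

# The varbinary route keeps the bytes untouched and only changes the label.
relabelled = stored.decode("utf-8")

print(garbled)              # DaniÃ«l
print(relabelled == name)   # True
```

The last step mirrors the advice to go through varbinary: the bytes were already correct, so only their interpretation has to change.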
é and e + ◌́ are not identical
This is about combining characters.

Unicode normalization forms
● NFC Composed
● NFD Decomposed
● NFKC Composed
● NFKD Decomposed
The NFK forms remove compatibility distinctions and will lose information. But this is useful for search etc.
Best practice: Normalize strings in your application

4-byte characters get silently lost on dump/restore
● Set utf8mb4 as default charset for the connection mysqldump uses

This shows what we can do with a patched mysql client. It uses Unicode drawing characters.

This shows the Unicode character database imported into MySQL.

Questions?
[email protected]
@dveeden

Did you know a BOM can be in the middle of a string?
Also, MySQL doesn't handle BOMs well.

Further topics
● Fulltext search & CJK
● Not every character has the same width, not even if we use a monospace font. A character can have a width of 0, 1, 2 or -1 positions.
● Punycode
● Ligatures, Glyphs, Characters
● Using Unicode: Fonts, Typing, Virtual keyboards
● Control characters
● Charset is not a constraint
● Replacement characters
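The earlier point that é and e followed by a combining accent are not identical, and the advice to normalize strings in the application, can be sketched in Python with the standard unicodedata module. The helper name normalize is made up for illustration, and the choice of NFC as the default is an assumption, not something the talk prescribes.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def normalize(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Both strings render the same but compare as different.
same_before = composed == decomposed                       # False

# After normalizing both sides to one form they compare as equal.
same_after = normalize(composed) == normalize(decomposed)  # True

# The NFK forms drop compatibility distinctions: the "fi" ligature
# (U+FB01) becomes the two letters f and i, which helps search but
# loses the information that a ligature was used.
ligature = "\ufb01"
folded = normalize(ligature, "NFKC")

print(same_before, same_after, folded)   # False True fi
```

Normalizing both the stored value and the search term to the same form before they reach the database keeps the comparison independent of how the user's keyboard or operating system produced the character.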