MySQL and Unicode

Daniël van Eeden
Percona Live Amsterdam
23 September 2015
Welcome.
My name is Daniël and I work for
Booking.com
This presentation is about MySQL
and Unicode.
Booking.com is available in more than 40
languages
So Unicode is important to us.
Booking.com is available in more than
40 languages, so Unicode is of
critical importance to us.
This is the booking.com website in Arabic
Note that the text is also flowing from right to left
Also my name is Daniël, not Daniel
My name is Daniël. There are dots on
the 'e'. Those are important to me.
Also my name is Daniël, not DaniÃ«l
The image here shows examples from
letters I got in the mail. This happens
quite often. Also on websites.
Marking every special character as
illegal is not a solution.
First some history
Let's first start with some history
about character sets
ASCII
Let's start with ASCII.
ASCII was invented in 1963 to allow
communication between systems of
different vendors.
A little known fact is that ASCII was
not made for computers, but for
teleprinters.
One of the interesting decisions of
ASCII was that it does not require
state. It does not encode the 'shift'.
Before ASCII: Baudot (5-bit, 1870)
Encodes characters as 7-bit
In ASCII 1 character equals 1 byte.
The 8th bit can be used as parity, but that
was never common.
The “3568 ASCII” asteroid is named after it.
Fun fact
ISO-8859
ISO-8859 was created in 1985 and is
a set of 16 character sets.
The best known one is ISO-8859-1.
This included more than just English.
When the euro was introduced they
replaced ¤ with € and named it ISO-8859-15.
Uses the extra bit to be able to store a
second set of 128 characters
ISO-8859 replaces ISO 646 (1972),
which was a 7-bit mess.
ECMA, the European Computer
Manufacturers Association
The base characters (0 to 127) are shared
between ASCII and ISO-8859-?
The base characters are shared
between ASCII and ISO-8859
The other characters differ per
country/region
Windows-1252 (CP1252) is mostly identical
with ISO-8859-1
ISO-8859-1 is also known as Latin1
Latin1 in MySQL is not ISO-8859-1, but
CP1252.
mysql> SHOW CHARSET LIKE 'latin1';
+---------+----------------------+-------------------+--------+
| Charset | Description          | Default collation | Maxlen |
+---------+----------------------+-------------------+--------+
| latin1  | cp1252 West European | latin1_swedish_ci |      1 |
+---------+----------------------+-------------------+--------+
1 row in set (0.00 sec)
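The difference between ISO-8859-1 and CP1252 is easy to see outside MySQL. The sketch below uses Python's own `iso-8859-1` and `cp1252` codecs as stand-ins; it is an assumption that they behave like MySQL's latin1 for the bytes shown, and they are not guaranteed to agree on every byte.

```python
# Byte 0x80 is a C1 control code in ISO-8859-1 but the euro sign in CP1252.
raw = b"\x80"

as_iso_8859_1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

print(repr(as_iso_8859_1))        # '\x80'
print(as_cp1252 == "\u20ac")      # True: U+20AC is the euro sign

# The range 0xA0-0xFF is the same in both, for example 0xEB is ë in each.
same_in_both = b"\xeb".decode("iso-8859-1") == b"\xeb".decode("cp1252")
print(same_in_both)               # True
```

Only the range 0x80 to 0x9F differs, which is why the two are "mostly identical".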
Unicode
The work on Unicode started in 1987
Allows you to store text in any language
Both alive and dead
Allows you to store text combining multiple
languages in the same file
Each character gets a number (a.k.a. code
point) and a description.
That doesn't guarantee your font will
display it.
You might see a replacement
character instead. This can be a
question mark or some square.
UTF-8
Unicode Transformation Format
This is a character encoding for Unicode.
It is not the only Unicode encoding.
UTF-16 (fixed-width variant: UCS-2)
UTF-32 (UCS-4)
This translates from code points to a binary
string.
UTF-8 and ASCII share the same characters
for code points 0 to 127.
Non-ASCII characters are stored as 2, 3 or
4 bytes.
[Chart: storage needed per character for Baudot, ASCII, ISO-8859-1, UTF-8, UTF-16 and UTF-32, on an axis from 0 to 32]
Here you can see the minimum and maximum
number of bytes required to store one character.
The blue shows the minimum and the red shows the
variable part.
This shows that UTF-8 is efficient in terms of storage
for Latin scripts.
If a byte starts with '0' (0xxxxxxx) then it is a
1-byte character.
If a byte starts with '110' it is the start of a
2-byte character.
If a byte starts with '10' then it is a
continuation of a multibyte character.
If a byte starts with '1110' it is the start of a
3-byte character.
If a byte starts with '11110' it is the start of
a 4-byte character.
Examples:
a = 01100001
ë = 11000011 10101011
Here you can see the letter a and the
letter e with the dots (diaeresis,
trema)
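The prefix rules and the two examples above can be checked with a few lines of Python. This is a minimal sketch; the helper names `utf8_bits` and `classify_byte` are made up for this illustration.

```python
def utf8_bits(text):
    """Return the UTF-8 encoding of text as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def classify_byte(byte):
    """Describe a single UTF-8 byte based on its leading bits."""
    if byte >> 7 == 0b0:
        return "1-byte character"
    if byte >> 6 == 0b10:
        return "continuation byte"
    if byte >> 5 == 0b110:
        return "start of 2-byte character"
    if byte >> 4 == 0b1110:
        return "start of 3-byte character"
    if byte >> 3 == 0b11110:
        return "start of 4-byte character"
    return "invalid in UTF-8"


print(utf8_bits("a"))        # 01100001
print(utf8_bits("\u00eb"))   # 11000011 10101011  (U+00EB is ë)

# Walk over the two bytes of ë and label each one.
for byte in "\u00eb".encode("utf-8"):
    print(format(byte, "08b"), classify_byte(byte))
```

The first byte of ë is labelled as the start of a 2-byte character and the second as a continuation byte, matching the bit patterns on the slide.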
UTF-8 And MySQL
Now we get into MySQL specifics
Some reasons to use UTF-8 in MySQL
Non-English scripts like Chinese, Cyrillic or
Greek.
Names
Comments
URLs
E-mail addresses
Emoji (including ☰ in the help text of your
mobile app)
Hamburger icon
utf8 in MySQL is an alias for utf8mb3
utf8mb3 can store UTF-8 characters of up to 3 bytes
utf8mb4 can store UTF-8 characters of up to 4 bytes
utf8mb4 has existed since MySQL 5.5.3
Best practice:
Always use utf8mb4, don't use utf8
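To see which characters fall outside utf8mb3, it is enough to look at the UTF-8 length or the code point. The sketch below is an illustration in Python, not MySQL's own check; `utf8_length` and `fits_in_utf8mb3` are hypothetical helper names, and the sample emoji is chosen only as an example.

```python
def utf8_length(char):
    """Number of bytes a single character needs in UTF-8."""
    return len(char.encode("utf-8"))


def fits_in_utf8mb3(text):
    """True if every character needs at most 3 bytes (code point <= U+FFFF)."""
    return all(ord(char) <= 0xFFFF for char in text)


hamburger = "\u2630"      # the trigram often used as a "hamburger" menu icon
emoji = "\U0001F600"      # an emoji outside the Basic Multilingual Plane

print(utf8_length(hamburger))               # 3
print(utf8_length(emoji))                   # 4
print(fits_in_utf8mb3("Dani\u00ebl"))       # True
print(fits_in_utf8mb3("hello " + emoji))    # False
```

Anything for which `fits_in_utf8mb3` returns False needs a utf8mb4 column and a utf8mb4 connection.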
Where to set the encoding?
It is set on a per-column basis
There is a per-table default
There is a per-database default
Stored in db.opt
Use ALTER DATABASE to change it
There is a per-server default:
character_set_server
Connections also have a character set
Set the character set in your
connection properties
If that isn't possible:
Use “SET NAMES utf8mb4”
Drawbacks of UTF-8
So just set everything to utf8mb4?
The question we want to answer is...
It depends
The answer is ...
Does your application support it?
Input validation
Character length
Security
CHAR(10) suddenly needs 40 bytes!
TINYTEXT has a size limit in bytes
With utf8mb4 you can store between
63 and 255 characters.
This also happens to other TEXT
types and BLOB types
The MEMORY storage engine expands
VARCHAR(10) to 40 bytes
This affects:
- User created tables
- Internal temporary tables
With InnoDB your index grows over 767
bytes.
Use innodb_large_prefix with
ROW_FORMAT COMPRESSED or DYNAMIC
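The byte limits mentioned above translate into character counts by simple division. A minimal sketch of that arithmetic, assuming the worst case of 1, 3 and 4 bytes per character; the function names are made up for this illustration.

```python
# Worst-case bytes per character for the character sets discussed here.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def worst_case_bytes(char_count, charset):
    """Bytes to reserve for char_count characters in the worst case."""
    return char_count * MAX_BYTES_PER_CHAR[charset]


def guaranteed_chars(byte_limit, charset):
    """Characters that always fit within byte_limit bytes."""
    return byte_limit // MAX_BYTES_PER_CHAR[charset]


print(worst_case_bytes(10, "utf8mb4"))    # 40  (CHAR(10))
print(guaranteed_chars(255, "utf8mb4"))   # 63  (TINYTEXT, 255 bytes)
print(guaranteed_chars(767, "utf8mb3"))   # 255 (767-byte index limit)
print(guaranteed_chars(767, "utf8mb4"))   # 191 (767-byte index limit)
```

This is why an index on a VARCHAR(255) column stops fitting in 767 bytes once the column is utf8mb4.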
Best practice:
Use latin1 for server, database and table
default. Enable Unicode on columns which
need it.
Or use utf8mb4 all the way if you
don't need the efficiency and
performance of latin1
Changing everything to VARBINARY
and BLOB will not solve your issue.
Converting your data
How to convert from latin1 to utf8mb4?
ALTER TABLE t1 MODIFY COLUMN c1
VARCHAR(100) CHARACTER SET utf8mb4;
But I have many columns!
Use CONCAT() and information_schema to
generate the statements
Or convert all columns:
ALTER TABLE t1 CONVERT TO CHARACTER
SET utf8mb4;
Also for INSERTs!
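The slide suggests generating the statements with CONCAT() and information_schema. The same idea can be sketched in Python, assuming the rows were already fetched from information_schema.COLUMNS (TABLE_NAME, COLUMN_NAME, COLUMN_TYPE). The function name and the sample rows are hypothetical. Note that MODIFY COLUMN replaces the whole column definition, so NOT NULL, DEFAULT and comments are not carried over by this sketch and would have to be added.

```python
def modify_statements(columns, charset="utf8mb4"):
    """Build one ALTER TABLE ... MODIFY COLUMN statement per column.

    columns: iterable of (table_name, column_name, column_type) tuples.
    Identifiers are wrapped in backticks; backticks inside names are not
    escaped here, so this is only suitable for trusted metadata.
    """
    statements = []
    for table_name, column_name, column_type in columns:
        statements.append(
            "ALTER TABLE `{}` MODIFY COLUMN `{}` {} CHARACTER SET {};".format(
                table_name, column_name, column_type, charset
            )
        )
    return statements


# Hypothetical rows, shaped like a SELECT from information_schema.COLUMNS.
example_columns = [
    ("t1", "c1", "varchar(100)"),
    ("t1", "c2", "text"),
]

for statement in modify_statements(example_columns):
    print(statement)
```

Review the generated statements before running them; they are a starting point, not a finished migration.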
Change defaults
Set character_set_server
ALTER SCHEMA s1 DEFAULT CHARACTER
SET utf8mb4;
ALTER TABLE t1 DEFAULT CHARACTER SET
utf8mb4;
Common failures
Application looks okay, but in MySQL the
data looks wrong
'Search' in the application might not
function correctly
The latin1 column was holding utf8 data
already
Wrong conversion == garbage
Don't just convert this data.
Running a latin1 to UTF-8 conversion on
data which already was UTF-8 will
result in garbage.
Change column to varbinary and then to
utf8mb4 to not convert the data.
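What goes wrong here can be reproduced without a database. The sketch below uses Python's `cp1252` codec as a stand-in for MySQL's latin1 (an assumption). It shows both the garbage from a real conversion and why relabelling the raw bytes keeps the data intact.

```python
original = "Dani\u00ebl"   # Daniël

# What the application sent: UTF-8 bytes, stored as-is in a "latin1" column.
stored_bytes = original.encode("utf-8")

# A latin1 -> utf8 conversion first reads those bytes as CP1252 ...
misread = stored_bytes.decode("cp1252")
# ... and then encodes each resulting character as UTF-8 again.
double_encoded = misread.encode("utf-8")

print(stored_bytes)      # b'Dani\xc3\xabl'
print(double_encoded)    # b'Dani\xc3\x83\xc2\xabl'

# Treating the column as raw bytes and only relabelling them as UTF-8
# (the varbinary route) changes nothing in the stored bytes.
relabelled = stored_bytes.decode("utf-8")
print(relabelled == original)   # True
```

The two bytes of ë became four bytes after the wrong conversion, which is the familiar "Ã«" garbage.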
The conversion fails and eats your data
MySQL tries really hard to convert
your data but this might not be
possible.
Use sql_mode='STRICT_ALL_TABLES'
Now the operation will fail instead of
truncating your data
Also for inserts
Connection set to utf8, but data is 4-byte
UTF-8.
You can't insert or retrieve 4-byte characters
Collation support
● There is no utf8mb4_general_cs (case sensitive)
● There is utf8mb4_unicode_ci
● And utf8mb4_unicode_520_ci
● And utf8mb4_bin
unicode_ci = UCA 4.0.0
unicode_520_ci = UCA 5.2.0
Latest UCA version = 8.0.0
Here we compare the sun and moon
emoji.
Special collations get lost during conversion
Collation = Sorting & Equality
ALTER TABLE…CONVERT TO… only supports
one collation
Save the collation before the ALTER and then
restore it for columns which have a non-default collation.
Collation mismatch
Here MySQL does not know which
collation to use.
Use COLLATE to set the desired collation for
the operation.
é and e + ◌́ are not identical
Combining characters
Unicode normalization forms
● NFC Composed
● NFD Decomposed
● NFKC Composed
● NFKD Decomposed
The NFK forms remove compatibility distinctions and
lose information. But this is useful for search
etc.
Best practice:
Normalize strings in your application
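Normalizing in the application can be done with the standard library of most languages. A minimal sketch in Python using `unicodedata.normalize`; the ligature is only an example of a compatibility character.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two strings look the same but compare as different.
print(composed == decomposed)                                 # False

# NFC composes, NFD decomposes; after normalizing they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# The NFK forms also fold compatibility characters, which loses information:
ligature = "\ufb01"       # the single-character "fi" ligature
print(unicodedata.normalize("NFC", ligature) == ligature)     # True
print(unicodedata.normalize("NFKC", ligature))                # fi
```

Picking one form (NFC is a common choice) and applying it before every INSERT and every comparison keeps equal-looking strings equal in the database.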
4-byte characters get silently lost on
dump/restore
Set utf8mb4 as default charset for the
connection mysqldump uses
This shows what we can do with a
patched mysql client.
This uses unicode drawing characters
This shows the unicode character
database imported into MySQL
Questions?
@dveeden
Did you know a BOM can be in the middle
of a string?
Also MySQL doesn't handle BOMs well
Fulltext search & CJK
Not every character has the same width
Not even if we use a monospace font
A character can have a width of 0, 1, 2 or -1
positions
Punycode
Ligatures, Glyphs, Characters
Using Unicode
Fonts
Typing
Virtual keyboards
Control characters
Charset is not a constraint
Replacement characters