Unicode and UTF-8

What are Unicode and UTF-8?

Unicode is the universal list of characters, maintained by the Unicode Consortium. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

UTF-8 is a character encoding that can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII and is the encoding used by the Chameleon apps.

Sample

Iñtërnâtiônàlizætiøn with emojis:☃💩😂❤🔥🔀 and German: ä, ö, ü End

The emojis included in the above sample are:
- snowman
- pile of poo
- face with tears of joy
- heavy black heart
- fire
- twisted rightwards arrows

Viewing Unicode Text

Even if you configure the Chameleon platform correctly, you may still encounter situations where it appears not to be working. For example, when opening the results of a data query through BLADE or otherwise, you may see a ?, a square, or a question mark inside a box instead of the character you expected.

In this case it is likely that the font you are using does not support the character in the text, and selecting a different font can correct this. Another possible cause is that the application you are viewing the text with did not know the text was in UTF-8 and assumed a different encoding.

Configure for Unicode

The Chameleon platform can support Unicode using the UTF-8 encoding, but some configuration is required.

MySQL Database

The easiest time to configure the database is when it is first installed.

In the my.ini file, add or update the settings below. (Look for each one in the file first, and add it only if it does not already exist.)

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
skip-character-set-client-handshake

The skip-character-set-client-handshake option makes the server ignore the character set requested by the client and use the server character set instead. This should cause apps like Weather Reader, RSS Reader and Twitter Reader to connect to the database using the correct character set, allowing them to save all Unicode characters successfully.
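After the server has been restarted, a quick check that the settings took effect might look like the sketch below. It only reads system variables and assumes the values from the my.ini settings above.

```sql
-- Server-wide defaults; these should show utf8mb4 and utf8mb4_unicode_ci
-- if the [mysqld] settings were picked up.
SHOW GLOBAL VARIABLES LIKE 'character_set_server';
SHOW GLOBAL VARIABLES LIKE 'collation_server';

-- Values in effect for the current connection.
SHOW SESSION VARIABLES LIKE 'character_set_%';
SHOW SESSION VARIABLES LIKE 'collation_%';
```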

Realtime Changes

You can change some of these settings without restarting the server by setting the corresponding global variables with SET GLOBAL. (The SUPER privilege is required to set global variables.)
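A minimal sketch of such commands, assuming the goal is the same character set and collation as in the my.ini settings above:

```sql
-- Change the server defaults at runtime (requires the SUPER privilege).
SET GLOBAL character_set_server = 'utf8mb4';
SET GLOBAL collation_server = 'utf8mb4_unicode_ci';
```

Global changes apply to connections opened afterwards; sessions that are already connected keep their current values.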

In MySQL versions 5.6 and 5.7 you still need to make the same changes in the my.ini file so that the settings survive the next server restart.

In MySQL version 8.0 you can save changes using the SET PERSIST command (needs the SUPER privilege) and the setting will be saved into the mysqld-auto.cnf file found in the data folder: C:\ProgramData\MySQL\MySQL Server 8.0\Data
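As a sketch for MySQL 8.0, assuming the same two server settings as in my.ini, the persisted form could look like this:

```sql
-- Change the running value and also save it to mysqld-auto.cnf (MySQL 8.0 only).
SET PERSIST character_set_server = 'utf8mb4';
SET PERSIST collation_server = 'utf8mb4_unicode_ci';

-- List the settings that have been persisted so far.
SELECT * FROM performance_schema.persisted_variables;
```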

The default value for character_set_server is already utf8mb4 in version 8.0

Database Backups / Dumps

Database backups need to have the correct runtime settings to allow the Unicode characters to be saved correctly to the backup files.

NOTE: the MySQL Workbench (even Workbench version 8.0) and SqlYog tools currently do not support a full Unicode character set export. To export the data from a database correctly, use the mysqldump.exe tool directly instead.

For more help on doing a backup see Running A Backup Before Updating Your Flow Data Server

Using MySQL Tools

If you are using MySQL Workbench or SqlYog and working with the extended Unicode characters, you will need to set the connection character set to a compatible one.

Run for each launch/connection

At this time it seems you need to set the connection character set each time you launch these applications or start a new connection. As far as we know, there is no way to have them apply it automatically as the default.
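A minimal sketch of setting the connection character set by hand, assuming the utf8mb4_unicode_ci collation from the server settings above. Run it in the query window once the connection is open:

```sql
-- Use utf8mb4 for statements sent to the server and for results sent back.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check: character_set_client, character_set_connection and
-- character_set_results should now all report utf8mb4.
SHOW SESSION VARIABLES LIKE 'character_set_%';
```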

NOTE: the default character set in MySQL 8 is utf8mb4, so this step may not be necessary there.

The available collations and character sets in the server can be listed with the SHOW COLLATION and SHOW CHARACTER SET statements.
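For example, the following statements list them. The utf8 filters are an assumption added here only to keep the output short:

```sql
-- Every character set the server knows about.
SHOW CHARACTER SET;

-- Only the utf8 family (utf8, utf8mb3, utf8mb4, depending on version).
SHOW CHARACTER SET LIKE 'utf8%';

-- Collations available for utf8mb4.
SHOW COLLATION WHERE Charset = 'utf8mb4';
```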

A Brief History of utf8 in MySQL

With excerpts from: https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

There are two varieties of utf8 support in MySQL: utf8mb3 and utf8mb4.

  • MySQL 4.1 (2004) was the first version to support character sets and collations.  The default character set was latin1, but utf8[mb3] was available as an option.  An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost all modern languages

  • MySQL 5.5.3 (2010) added support for up to 4 byte utf8 using the new utf8mb4 character set.

  • MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC.  This allows for indexes on VARCHAR(255) with utf8mb4; something that made migrations more difficult prior.

  • MySQL 8.0 (GA Release April 2018) vastly improves the performance of utf8mb4, as well as adding several new collations.  It is now the default character set for MySQL.

  • MySQL 8.0.19 (and possibly earlier)

    • 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. They recommend: "Please consider using UTF8MB4 in order to be unambiguous."

    • 'utf8mb3' is deprecated and will be removed in a future release. Please use utf8mb4 instead

References

The following are some links that were useful in researching all this and may still be available and relevant as a reference and for more detailed information.

UTF-8 vs Unicode

http://www.polylab.dk/utf8-vs-unicode.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

MySQL and Unicode

utf8mb4 first appeared in MySQL 5.5.3, released in early 2010
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Upgrade Notes - Converting Between 3-Byte and 4-Byte Unicode Character Sets
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html

Change the default character set
https://stackoverflow.com/questions/3513773/change-mysql-default-character-set-to-utf-8-in-my-cnf

Forcing Utf-8 Compliance for All Connections
http://blog.oneiroi.co.uk/mysql/mysql-forcing-utf-8-compliance-for-all-connections/

How to support full Unicode in MySQL databases
https://mathiasbynens.be/notes/mysql-utf8mb4

Connections, Character Sets and Collations

https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.

https://dev.mysql.com/doc/connector-net/en/connector-net-connection-options.html

Functions and Collations

https://dev.mysql.com/doc/refman/5.7/en/string-functions-charset.html