MySQL’s utf8 is broken
MySQL really made a mess here. What they call utf8 really isn’t UTF-8. Hidden away in the MySQL manual we can read this:
“The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.”
Loosely translated: MySQL utf8 is broken. Don’t use it.
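To see which of your strings would run into this limit, you can check whether they contain anything outside the BMP. The following is only a minimal sketch: the class name BmpCheck and the method fitsInThreeByteUtf8 are made up for illustration, and it relies on nothing but the standard Java library (Java 8 or later).

```java
public class BmpCheck {

    /**
     * Returns true if every code point of the text lies in the Basic
     * Multilingual Plane (U+0000..U+FFFF), i.e. needs at most 3 bytes in UTF-8.
     */
    public static boolean fitsInThreeByteUtf8(String text) {
        return text.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String plain = "Gr\u00fc\u00dfe";   // "Grüße": BMP characters only
        String emoji = "ok \uD83D\uDE00";   // U+1F600, stored as a surrogate pair

        System.out.println(fitsInThreeByteUtf8(plain)); // true
        System.out.println(fitsInThreeByteUtf8(emoji)); // false
    }
}
```

Anything for which this check returns false needs 4 bytes per character in UTF-8, so the 3-byte utf8 character set cannot hold it. MySQL's own 4-byte variant is called utf8mb4.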
A single UTF-8 encoded Unicode character takes at most 4 bytes.
In a nutshell, this is how the encoding scheme works:
Bits of code point | First code point | Last code point | Bytes in sequence
Source: Wikipedia (which confusingly also shows sequences of up to 6 bytes, although 4 is the true maximum)
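If you would rather see the byte counts for yourself, you can let Java do the encoding. This is a small sketch, not a reference implementation: the class Utf8Lengths and the sample code points are chosen for illustration only, and everything else comes from the standard library.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    /** Number of bytes the given code point occupies when encoded as UTF-8. */
    public static int utf8Length(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Samples: 'A', 'é', the euro sign, an emoji, the last valid code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The five samples take 1, 2, 3, 4 and 4 bytes respectively. Even U+10FFFF, the highest code point Unicode defines, fits in 4 bytes.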
Wait, I heard there could be 6?
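You heard wrong. The original UTF-8 design did allow sequences of up to 6 bytes, but since RFC 3629 UTF-8 is restricted to code points up to U+10FFFF, and those never need more than 4 bytes.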
ASCII is dead! Long live Unicode!
You may not yet be convinced that you should abandon all the legacy encodings used for text in the world’s different languages and switch to Unicode instead. If so, please first read this excellent article from Joel on Software, which explains why. Don’t worry, I’ll be waiting here patiently for you to come back and read how to make the transition in your Eclipse Java projects.
Joel on Software: Unicode and Character Sets
Changing to Unicode? Yes, we can!
When first reading about Unicode, code points, character sets, character encodings and byte order marks, you might feel overwhelmed and start wondering whether this thing is even worth your while and how difficult it will be to convert your projects to use it. Are you really going to sell your software in Asia? Maybe you can do without it after all?
Don’t worry. In fact, you can immediately forget about almost everything you read, especially when you work in Java, which already uses Unicode internally. It is not that difficult at all. I will go one step further and proclaim that the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things. Stick to Unicode everywhere and you won’t have any problems with character encodings ever again. I promise!
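In practice, "Unicode everywhere" mostly means naming the encoding explicitly at every point where text becomes bytes or bytes become text, instead of relying on the platform default. Here is a minimal sketch of that idea. The class Utf8RoundTrip and its two methods are hypothetical names for illustration, and the temp file merely stands in for whatever files your project reads and writes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    /** Writes the text to a temp file as UTF-8 and reads it back as UTF-8. */
    public static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try {
                // Name the charset explicitly on the way out and on the way in.
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Shows what a legacy encoding creeping in does: UTF-8 bytes read as Latin-1. */
    public static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \uD83D\uDE00"; // "café" plus an emoji (U+1F600)
        System.out.println(text.equals(roundTrip(text)));       // true
        System.out.println(text.equals(misreadAsLatin1(text))); // false
    }
}
```

The round trip is lossless because both sides agree on UTF-8. The second method decodes the same bytes with a legacy encoding, which is exactly the kind of mismatch that garbles text.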