Use MySQL utf8mb4 if you want full Unicode support

MySQL’s utf8 is broken

MySQL really made a mess here. What they are calling utf8 really isn’t. Hidden away in the MySQL manual we can read this:

“The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.”

Loosely translated: MySQL utf8 is broken. Don’t use it.
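To see what the three-byte limit means in practice, here is a minimal Java sketch. The class and method names (BmpCheck, fitsInThreeByteUtf8) are made up for illustration and are not part of MySQL or any driver. It only checks whether a string stays inside the BMP, which is the limit the manual quote describes for MySQL's utf8.

```java
import java.nio.charset.StandardCharsets;

public class BmpCheck {
    // Hypothetical helper: true if every code point lies in the Basic
    // Multilingual Plane (U+0000..U+FFFF), i.e. needs at most 3 bytes in UTF-8.
    public static boolean fitsInThreeByteUtf8(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    // Number of bytes a single code point occupies when encoded as UTF-8.
    public static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmpOnly = "h\u00e9llo \u20ac";   // e-acute and the euro sign are BMP characters
        String withEmoji = "hi \uD83D\uDE00";   // U+1F600 lies outside the BMP

        System.out.println(fitsInThreeByteUtf8(bmpOnly));
        System.out.println(fitsInThreeByteUtf8(withEmoji));
        System.out.println(utf8Length(0x20AC));
        System.out.println(utf8Length(0x1F600));
    }
}
```

A string for which the check returns false contains at least one character that needs four bytes, which is exactly the kind of text that calls for utf8mb4.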



Max. bytes in a UTF-8 char?


A single UTF-8 encoded Unicode character takes at most 4 bytes.

And this is how the encoding scheme works in a nutshell.

Bits of code point  First code point  Last code point  Bytes in sequence  Byte 1    Byte 2    Byte 3    Byte 4
7                   U+0000            U+007F           1                  0xxxxxxx
11                  U+0080            U+07FF           2                  110xxxxx  10xxxxxx
16                  U+0800            U+FFFF           3                  1110xxxx  10xxxxxx  10xxxxxx
21                  U+10000           U+10FFFF         4                  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

Source: Wikipedia (also confusingly showing 6 possible bytes when truly 4 is the maximum)
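The bit patterns in the table can be turned into code almost line by line. The following is a rough Java sketch, not a replacement for the JDK's own encoder. The class name Utf8Encoder and its encode method are invented for this example. It assumes the input is a valid Unicode scalar value (Unicode ends at U+10FFFF, and the surrogate range U+D800..U+DFFF is excluded) and compares its result with what the JDK produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Encoder {
    // Hypothetical hand-rolled encoder following the table row by row.
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp <= 0x7F) {                       // 7 bits  -> 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 11 bits -> 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            // Reference result from the JDK's built-in UTF-8 encoder.
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```

There is no branch that produces a fifth or sixth byte: the largest valid code point already fits in the four-byte pattern.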

Wait, I heard there could be 6?

You heard wrong.


Unicode/UTF-8 in your Eclipse Java projects

[Image: "I love Unicode", with the heart symbol ironically replaced by the broken character (question mark) symbol]

ASCII is dead! Long live Unicode!

You may not yet be convinced that you should abandon all the legacy encodings used for text in the different languages around the world and switch to Unicode instead. If so, please first read this excellent article from Joel on Software, which explains why. Don’t worry, I’ll be waiting here patiently for you to come back and read how to make the transition in your Eclipse Java projects.

Joel on software: Unicode and Character sets

Changing to Unicode? Yes, we can!

When you first read about Unicode, code points, character sets, character encodings and byte order marks, you might feel overwhelmed and start wondering whether this is even worth your while and how difficult it will be to convert your projects. Are you really going to sell your software in Asia? Maybe you can do without it after all?

Don’t worry. In fact, you can immediately forget about almost everything you read, especially when you work in Java, which already uses Unicode internally. It is not that difficult at all. I will go one step further and proclaim that the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things. Stick to Unicode everywhere and you won’t have any problems with character encodings ever again. I promise!
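As a small illustration of how a legacy encoding breaks things, here is a Java sketch that converts a string to bytes and back. The class name ExplicitCharset and its roundTrip helper are hypothetical. ISO-8859-1 is only picked as one example of a legacy encoding; which one creeps into a real project will vary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: encode text to bytes and decode it again
    // using the same, explicitly named charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // German and Chinese text, written with escapes so the source file's
        // own encoding cannot interfere with the demonstration.
        String text = "Gr\u00fc\u00dfe, \u4e16\u754c";

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));

        // ISO-8859-1 has no Chinese characters; they are silently replaced.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Passing the charset explicitly, rather than relying on a platform default, is one way to keep such silent losses out of the conversion step.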
