MySQL’s utf8 is broken
MySQL really made a mess here. What they are calling utf8 really isn’t. Hidden away in the MySQL manual we can read this:
“The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.”
Loosely translated: MySQL utf8 is broken. Don’t use it.
As I explained in a previous post, UTF-8 uses up to 4 bytes per character. Since MySQL’s utf8 encodes at most three, it can’t store any character outside the Basic Multilingual Plane, such as most emoji.
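To make the byte counts concrete, here is a small Python sketch. The helper names `utf8_length` and `fits_in_mysql_utf8` are made up for this post and are not part of any library; the check simply assumes that "fits in MySQL's utf8" means "every code point is at or below U+FFFF", which is what the manual quote above implies.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes the single character `ch` needs in real UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_mysql_utf8(text: str) -> bool:
    """Hypothetical helper: True if every character lies in the BMP
    (code point <= U+FFFF), i.e. needs at most three bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# One sample character for each UTF-8 length from 1 to 4 bytes.
for ch in ["a", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

Running a check like `fits_in_mysql_utf8` over incoming text is one way to spot strings that a `utf8` column cannot hold before they reach the database.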
The problem is that the name `utf8` for the broken encoding is so misleading that hardly any developer will think twice about it unless it’s pointed out to them. That is why I wrote this post: so that I won’t have to type this again and again in all the forums and Stack Overflow posts where people are confused by it.
Fixing MySQL’s Unicode support
Fortunately for us it’s possible (and relatively easy) to fix this situation. Just remember this:
MySQL’s utf8 is really not utf8. It’s fake utf8. MySQL calls the real utf8 `utf8mb4`.
So just use `utf8mb4` whenever you need utf8 and you’ll be fine.
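For tables that already exist, switching usually means converting them. The Python sketch below only builds the SQL text and does not talk to a database. The function name `convert_table_sql`, the table name `posts`, and the default collation `utf8mb4_unicode_ci` are assumptions for illustration; `ALTER TABLE ... CONVERT TO CHARACTER SET` is the MySQL statement it produces.

```python
def convert_table_sql(table: str, collation: str = "utf8mb4_unicode_ci") -> str:
    """Build an ALTER TABLE statement that converts `table` to utf8mb4.

    Hypothetical helper for illustration; it only returns a string.
    """
    # Only accept plain utf8mb4 collation names, since the value is
    # interpolated into the statement as-is.
    if not collation.startswith("utf8mb4_") or not collation.replace("_", "").isalnum():
        raise ValueError(f"unexpected collation: {collation!r}")
    # Quote the identifier MySQL-style, doubling any embedded backticks.
    quoted = "`" + table.replace("`", "``") + "`"
    return (
        f"ALTER TABLE {quoted} "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
    )


# `posts` is a made-up table name.
print(convert_table_sql("posts"))
```

The connection between your application and MySQL has its own character set setting too, so it is worth checking that it also uses `utf8mb4`, and trying the conversion on a copy of the data first.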
Read more about it on Mathias Bynens’ blog and Joni Salonen’s blog.
Oh, and if you need to sort in Java the same way that MySQL sorts in the DB (using a collation), refer to this great blog post by Stefan Fußenegger:
Using MySQL Collations in Java