4.
There are a maximum of 4 bytes in a single UTF-8 encoded Unicode character.
And this is how the encoding scheme works in a nutshell.
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Source: Wikipedia (also confusingly showing 6 possible bytes when truly 4 is the maximum)
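The table translates fairly directly into code. Here is a minimal Python sketch that packs a code point into bytes by hand. The function name `utf8_encode` is made up for this illustration, and it deliberately stops at U+10FFFF (the highest valid code point) rather than the U+1FFFFF that the 21-bit pattern could technically hold. In real code you would simply call `str.encode("utf-8")`, which the loop at the end uses as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point by hand, following the table's bit layouts."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point <= 0x7F:        # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-rolled result with Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Every sample, including the very last code point, comes out at 4 bytes or fewer.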
Wait, I heard there could be 6?
No.
You heard wrong.
Many people, including the highly esteemed Joel Spolsky from Joel on Software, think that UTF-8 characters can contain up to 6 bytes.
Why this confusion?
This confusion happened because of the history of Unicode.
When it started out, Unicode was supposed to remain within 16 bits. The fixed-length UCS-2 encoding could be used to store all possible 65536 code points and life would be good…
Except it wasn’t, as people realized that Unicode would not be able to fit all characters and symbols and thus wouldn’t be very Universal anymore… So they decided to increase the width of their fixed-width encoding to 32 bits, any other number between 16 and 32 bits not being very practical.
2 billion is a crowd
With 32-bit numbers, and one bit reserved for the sign in the integer types most programming languages use, you’d get 31 usable bits, or about 2 billion possible code points. Trying to cram all that into a variable-length encoding where all of ASCII fits in a single byte, you would need… *drumroll* … 6 bytes at most. This led to early specs for UTF-8 talking about a maximum of 6 bytes per character.
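To see where the 6 comes from, here is a small Python sketch of the arithmetic. It assumes the lead-byte patterns of the original UTF-8 design, which continued the table's scheme with `111110xx` for 5 bytes and `1111110x` for 6 bytes (those two patterns are not shown in this post). The helper name `payload_bits` is made up for this example.

```python
def payload_bits(n_bytes: int) -> int:
    """Code point bits that fit in an n-byte sequence of the original 1-to-6-byte scheme."""
    if n_bytes == 1:
        return 7                          # 0xxxxxxx
    lead_bits = 7 - n_bytes               # 110xxxxx carries 5 bits ... 1111110x carries 1
    return lead_bits + 6 * (n_bytes - 1)  # every 10xxxxxx continuation byte adds 6


for n in range(1, 7):
    print(f"{n} byte(s): {payload_bits(n)} bits")

# Shortest sequence that can hold a 31-bit value (32 bits minus the sign bit).
needed = next(n for n in range(1, 7) if payload_bits(n) >= 31)
print("bytes needed for 31 bits:", needed)
```

Only the 6-byte form reaches 31 bits, which is why the early specs allowed for it.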
However, people quickly realized that even though 64K characters might be too few for a universal character set, 2 billion was, well, overkill. So they settled on a compromise, one that also solved the problem they had created when their 16-bit fixed-length encoding turned out not to be able to encode all characters anymore.
Settle for less
They limited Unicode to 1,112,064 valid code points. Note that even today, at Unicode 7.0.0, only 112,218 characters are actually defined. All other positions are unused. They even reserved an enormous 137,468 code points for private-use characters.
The highest possible valid code point is U+10FFFF, and they reserved 66 positions for ‘non-characters’ and 2,048 positions for ‘surrogate’ code points. Using these they could create ‘surrogate pairs’, a trick to create a new variable-length encoding named UTF-16 that was backwards compatible with the old 16-bit fixed-length UCS-2, but still able to encode all characters in the now 1.1 million+ character space of Unicode 2 and up.
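Both the 1,112,064 figure and the surrogate trick are easy to check. The Python sketch below recomputes the count from the numbers above and splits a code point into a surrogate pair. The helper name `to_surrogate_pair` and the sample code point U+1F600 are picked just for this illustration; the result is compared against Python's own UTF-16 encoder.

```python
CODE_SPACE = 0x10FFFF + 1            # U+0000..U+10FFFF: 17 planes of 65,536 positions
SURROGATES = 0xDFFF - 0xD800 + 1     # U+D800..U+DFFF: 2,048 positions

valid_code_points = CODE_SPACE - SURROGATES


def to_surrogate_pair(code_point: int):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against the built-in big-endian UTF-16 encoder.
assert chr(0x1F600).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(f"{valid_code_points:,} valid code points")
print(f"U+1F600 -> {high:04X} {low:04X}")
```

Two 16-bit units of 10 payload bits each give 20 bits, which is exactly enough to cover the 16 planes above U+FFFF.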
All’s well that ends well
The moral of the story is that although it may at first sound like Unicode is 32 bits, if we look closer we see that the characters defined today are few enough that 17 bits would suffice to number them all. Even if, by the time we are at Unicode 99.0.0 or something, all 1.1+ million code points were actually assigned, they would still fit in just 21 bits.
And lo and behold… If we figure out the maximum number of bytes needed per character when we only need to encode 1.1 million possible different characters instead of 2 billion, we don’t need 6 bytes anymore. We can settle for 4. And so we did. It was made final in RFC 3629.
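You can watch the 4-byte limit in action with a few lines of Python. This is only a quick check against CPython's built-in `utf-8` codec, not a conformance test; the helper `is_valid_utf8` and the two hand-built byte sequences are made up for this example.

```python
highest = 0x10FFFF
bits = highest.bit_length()               # bits needed for the highest code point
longest = chr(highest).encode("utf-8")    # its UTF-8 encoding


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What U+200000 would look like in the old 5-byte form (111110xx lead byte).
five_byte_form = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
# A 4-byte sequence whose bits spell out U+110000, one past the limit.
beyond_limit = bytes([0xF4, 0x90, 0x80, 0x80])

print(f"U+10FFFF needs {bits} bits and {len(longest)} bytes: {longest.hex(' ')}")
print("5-byte form accepted:", is_valid_utf8(five_byte_form))
print("U+110000 accepted:", is_valid_utf8(beyond_limit))
```

The decoder accepts the 4-byte encoding of U+10FFFF and rejects both of the other sequences.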
The Joel on Software post (October 8, 2003) was made before the date on RFC 3629 (November 2003), so I’m not sure if it is fair to call him out on that.
Wow great detective work Matthew! Thanks for pointing that out!