Max. bytes in a UTF-8 char?

4.

A single UTF-8 encoded Unicode character takes at most 4 bytes.

And this is how the encoding scheme works in a nutshell.

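| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|--------------------|------------------|-----------------|-------------------|----------|----------|----------|----------|
| 7                  | U+0000           | U+007F          | 1                 | 0xxxxxxx |          |          |          |
| 11                 | U+0080           | U+07FF          | 2                 | 110xxxxx | 10xxxxxx |          |          |
| 16                 | U+0800           | U+FFFF          | 3                 | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 21                 | U+10000          | U+10FFFF        | 4                 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |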

Source: Wikipedia (also confusingly showing 6 possible bytes when truly 4 is the maximum)
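To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. The function name `utf8_encode` is made up for this post; in real code you would simply call Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, one branch per row of the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:        # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The hand-rolled result should match the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Note that there is no fifth or sixth branch: the highest code point, U+10FFFF, already fits in the 4-byte form.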

Wait, I heard there could be 6?

No.
You heard wrong.

Many people, including the highly esteemed Joel Spolsky from Joel on Software, think that UTF-8 characters can contain up to 6 bytes.

Why this confusion?

This confusion happened because of the history of Unicode.

[Image: "I love Unicode", with the heart symbol ironically replaced by the broken-character (question mark) symbol]

When it started out, Unicode was supposed to remain within 16 bits. The fixed-length UCS-2 encoding could be used to store all 65,536 possible code points and life would be good…

Except it wasn’t. People realized that 16 bits could not hold every character and symbol, so Unicode wouldn’t be very Universal anymore… So they decided to increase the width of their fixed-width encoding to 32 bits, since no width between 16 and 32 bits is very practical.

2 billion is a crowd

With 32-bit numbers, and with one bit reserved as the sign bit of the integers most programming languages use, you’d get 31 usable bits, or roughly 2 billion possible code points. Trying to cram all that into a variable-length encoding where all of ASCII fits in a single byte, you would need… *drumroll* … 6 bytes at most. This led to early specs for UTF-8 talking about a maximum of 6 bytes per character.
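The arithmetic behind that 6 can be sketched in a few lines. This assumes the lead-byte layout of the original scheme, where a sequence of n bytes (n from 2 to 6) starts with n one-bits and a zero, and every following byte carries 6 payload bits. The helper names `payload_bits` and `bytes_needed` are made up for illustration.

```python
def payload_bits(length: int) -> int:
    """Payload bits in a sequence of `length` bytes under the original scheme."""
    if length == 1:
        return 7  # 0xxxxxxx
    # Lead byte: `length` one-bits, a zero, then (7 - length) payload bits.
    # Each of the (length - 1) continuation bytes (10xxxxxx) adds 6 more.
    return (7 - length) + 6 * (length - 1)


def bytes_needed(bits: int) -> int:
    """Shortest sequence length that can hold a code point of `bits` bits."""
    for length in range(1, 7):
        if payload_bits(length) >= bits:
            return length
    raise ValueError("does not fit in 6 bytes")


print([payload_bits(n) for n in range(1, 7)])  # [7, 11, 16, 21, 26, 31]
print(bytes_needed(31))                        # 6
print(bytes_needed(21))                        # 4
```

So a 31-bit code space needs the full 6 bytes, while anything that fits in 21 bits never needs more than 4.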

However, people quickly realized that even though 64K characters might be too few for a universal character set, 2 billion was, well, overkill. So they settled on a compromise. One that solved the problem they created when their 16-bit fixed-length encoding wasn’t actually able to encode all characters anymore.

Settle for less

They limited Unicode to a possible 1,112,064 valid code points: the 1,114,112 positions from U+0000 up to U+10FFFF, minus 2,048 positions that can never hold a character of their own. Note that even today, at Unicode 7.0.0, there are only 112,218 characters actually defined. All other positions are unused. They even reserved the enormous number of 137,468 code points for private use characters.

The highest possible valid code point is U+10FFFF, and they reserved 66 positions for ‘non-characters’ and 2,048 positions for ‘surrogate’ code points. Using these they could create ‘surrogate pairs’, a trick to create a new variable-length encoding named UTF-16 that was backwards compatible with the old 16-bit fixed-length UCS-2, but still able to encode all characters in the now 1.1 million+ character space of Unicode 2 and up.
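Those numbers are easy to check. The sketch below recomputes them from the range boundaries and shows how a surrogate pair is built; the constant names and the helper `to_surrogate_pair` are made up for this post, and the private use ranges are the three areas U+E000 to U+F8FF, U+F0000 to U+FFFFD and U+100000 to U+10FFFD.

```python
TOTAL_POSITIONS = 0x10FFFF + 1            # 1,114,112 positions, U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1          # 2,048 surrogate code points
VALID_CODE_POINTS = TOTAL_POSITIONS - SURROGATES

PRIVATE_USE = ((0xF8FF - 0xE000 + 1)          # BMP private use area
               + (0xFFFFD - 0xF0000 + 1)      # plane 15
               + (0x10FFFD - 0x100000 + 1))   # plane 16

print(VALID_CODE_POINTS)  # 1112064
print(PRIVATE_USE)        # 137468


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Two surrogates carry 10 bits each, which gives exactly the 16 extra planes above the original 16-bit range. That is where the odd-looking upper limit of U+10FFFF comes from.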

All’s well that ends well

The moral of the story is that although it may at first sound like Unicode is 32 bits, if we look closer we see that in fact we could cram all Unicode characters in use today in just 17 bits. Even if, by the time we are at Unicode 99.0.0 or something, all 1.1+ million code points would actually be assigned, we could still fit it in just 21 bits.

And lo and behold… If we figure out the maximum number of bytes needed per character when we only need to encode 1.1 million possible different characters instead of 2 billion, we don’t need 6 bytes anymore. We can settle for 4. And so we did. It was made final in RFC 3629.
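As a quick sanity check, here is a small sketch that confirms the bit counts and asks Python's built-in encoder how long its output gets at the edges of each range. The character count is the figure quoted earlier in this post; the constant names are made up for illustration.

```python
DEFINED_CHARACTERS = 112_218   # figure quoted in this post for Unicode 7.0.0
MAX_CODE_POINT = 0x10FFFF

# Bits needed to number every defined character, and every code point.
print((DEFINED_CHARACTERS - 1).bit_length())  # 17
print(MAX_CODE_POINT.bit_length())            # 21

# Encoded length at the first and last code point of each UTF-8 range.
BOUNDARIES = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT]
lengths = [len(chr(cp).encode("utf-8")) for cp in BOUNDARIES]
print(lengths)       # [1, 2, 2, 3, 3, 4, 4]
print(max(lengths))  # 4
```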
