Max. bytes in a UTF-8 char?

4. A single UTF-8 encoded Unicode character takes at most 4 bytes.

And this is how the encoding scheme works in a nutshell.

Bits of code point	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4
7	U+0000	U+007F	1	0xxxxxxx
11	U+0080	U+07FF	2	110xxxxx	10xxxxxx
16	U+0800	U+FFFF	3	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000	U+10FFFF	4	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

Source: Wikipedia (also confusingly showing 6 possible bytes when truly 4 is the maximum)
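To make the table a bit more concrete, here is a minimal Python sketch that packs a code point into the bit patterns shown above and checks the result against Python's built-in encoder. The function name `utf8_encode` is made up for this illustration; in real code you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the bit layout from the table (illustrative only)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character per row of the table.
for ch in ["A", "é", "€", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The four sample characters come out as 1, 2, 3 and 4 bytes respectively, one for each row of the table.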

Wait, I heard there could be 6?

No.
You heard wrong.

Many people, including the highly esteemed Joel Spolsky from Joel on Software, think that UTF-8 characters can contain up to 6 bytes.

Why this confusion?

This confusion happened because of the history of Unicode.

[Image: "I love Unicode", with the heart symbol ironically replaced by the broken character (question mark) symbol]

When it started out, Unicode was supposed to remain within 16 bits. The fixed-length UCS-2 encoding could be used to store all 65,536 possible code points and life would be good…

Except it wasn’t, as people realized that 16 bits could not hold all characters and symbols, so Unicode wouldn’t be very Universal anymore… So they decided to increase the width of their fixed-width encoding to 32 bits, since no other width between 16 and 32 bits is very practical.

2 billion is a crowd

With 32-bit numbers, and one bit reserved as the sign bit in the integer types most programming languages use, you get 31 usable bits, or roughly 2 billion possible code points. Trying to cram all that into a variable-length encoding where all of ASCII fits in a single byte, you would need… *drumroll* … 6 bytes at most. This led to early specs for UTF-8 talking about a maximum of 6 bytes per character.
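The arithmetic behind that drumroll can be sketched in a few lines of Python. This assumes the original scheme simply continues the pattern of the table above: a lead byte with as many leading one-bits as there are bytes in the sequence, then a zero bit, and 6 payload bits in every continuation byte. The names `payload_bits` and `bytes_needed` are invented for this example.

```python
def payload_bits(length: int) -> int:
    """Code point bits that fit in a sequence of `length` bytes (original 1-6 byte scheme)."""
    if length == 1:
        return 7  # 0xxxxxxx
    # Lead byte: `length` one-bits, then a zero bit; the rest is payload.
    lead_bits = 8 - (length + 1)
    # Each continuation byte (10xxxxxx) carries 6 bits.
    return lead_bits + 6 * (length - 1)


def bytes_needed(bits: int) -> int:
    """Shortest sequence length whose payload can hold `bits` bits."""
    for length in range(1, 7):
        if payload_bits(length) >= bits:
            return length
    raise ValueError("does not fit in 6 bytes")


print([payload_bits(n) for n in range(1, 7)])  # [7, 11, 16, 21, 26, 31]
print(bytes_needed(31))                        # 6
```

So 31 bits is exactly what a 6-byte sequence can carry, which is where the early 6-byte maximum came from.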

However, people quickly realized that even though 64K characters might be too few for a universal character set, 2 billion was, well, overkill. So they settled on a compromise, one that also solved the problem they had created when their 16-bit fixed-length encoding turned out to be unable to encode all characters.

Settle for less

They limited Unicode to 1,112,064 valid code points. Note that even today, at Unicode 7.0.0, there are only 112,218 characters actually defined. All other positions are unused. They even reserved the enormous number of 137,468 code points for private use characters.

The highest possible valid code point is U+10FFFF, and they reserved 66 positions for ‘non-characters’ and 2,048 positions for ‘surrogate’ code points. Using these they could create ‘surrogate pairs’, a trick to create a new variable-length encoding named UTF-16 that was backwards compatible with the old 16-bit fixed-length UCS-2, but still able to encode all characters in the now 1.1 million+ character space of Unicode 2 and up.
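For the curious, the numbers above can be re-derived, and the surrogate pair trick sketched, in a little Python. This is only an illustration: the constant names and `to_surrogate_pair` are my own, and the ranges used (surrogates at U+D800..U+DFFF, private use at U+E000..U+F8FF plus planes 15 and 16, non-characters at U+FDD0..U+FDEF plus the last two positions of each of the 17 planes) are the standard Unicode ones rather than something spelled out in this post.

```python
TOTAL = 0x10FFFF + 1                       # every position from U+0000 to U+10FFFF
SURROGATES = len(range(0xD800, 0xE000))    # U+D800..U+DFFF
VALID = TOTAL - SURROGATES

# Private use: one block in the BMP plus (almost) all of planes 15 and 16.
PRIVATE_USE = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFFD - 0xF0000 + 1)

# Non-characters: U+FDD0..U+FDEF plus the last two code points of each plane.
NONCHARACTERS = len(range(0xFDD0, 0xFDF0)) + 2 * 17

print(TOTAL, SURROGATES, VALID)    # 1114112 2048 1112064
print(PRIVATE_USE, NONCHARACTERS)  # 137468 66


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)   # the grinning face emoji
print(f"{high:04X} {low:04X}")           # D83D DE00
assert bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]) == "😀".encode("utf-16-be")
```

Two surrogates of 10 payload bits each give 20 bits, which on top of the 16-bit range is exactly enough to reach U+10FFFF.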

All’s well that ends well

The moral of the story is that although it may at first sound like Unicode is 32 bits, if we look closer we see that in fact we could cram all Unicode characters defined today into just 17 bits. Even if, by the time we are at Unicode 99.0.0 or something, all 1.1+ million code points were actually assigned, we could still fit them in just 21 bits.

And lo and behold… If we figure out the maximum number of bytes needed per character when we only need to encode 1.1 million possible different characters instead of 2 billion, we don’t need 6 bytes anymore. We can settle for 4. And so we did. It was made final in RFC 3629.
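If you want to check the bit counts and the 4-byte maximum yourself, here is a quick Python sketch. It leans on Python's built-in UTF-8 encoder and only probes the highest code point of each sequence length; the variable names are just for this example, and the 112,218 figure is the count quoted above.

```python
HIGHEST = 0x10FFFF

# Bits needed to number every position up to the highest code point.
bits_for_range = HIGHEST.bit_length()
# Bits needed if we only had to number the characters defined so far.
bits_for_assigned = (112_218).bit_length()

# Encode the last code point of each sequence length and take the longest.
longest = max(len(chr(cp).encode("utf-8"))
              for cp in (0x7F, 0x7FF, 0xFFFF, HIGHEST))

print(bits_for_range, bits_for_assigned, longest)  # 21 17 4

# Anything beyond U+10FFFF is not a code point at all as far as Python is concerned.
try:
    chr(HIGHEST + 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```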

5 comments

Use MySQL utf8mb4 if you want full Unicode support | Stijn de Witt's Blog says:

June 15, 2015 at 12:48

[…] I explained in a previous post, utf-8 characters contain [max. 4 bytes per character](https://stijndewitt.wordpress.com/2014/08/09/max-bytes-in-a-utf-8-char/). As MySQL encode only three, it can’t encode any characters outside of the Basic […]

Breaking down Unicode characters encoded in UTF-8 – matthewtruty says:

August 20, 2018 at 23:23

[…] note: There are two camps when it comes to UTF-8 characters having a 4 byte or 6 byte maximum. https://stijndewitt.com/2014/08/09/max-bytes-in-a-utf-8-char/. I assume a 4 byte characters maximum for this […]

Matthew Sorenson says:

November 6, 2018 at 23:56

The Joel on Software post (October 8, 2003) was made before the date on RFC 3629 (November 2003), so I’m not sure if it is fair to call him out on that.

Stijn de Witt says:

November 7, 2018 at 21:49

Wow, great detective work Matthew! Thanks for pointing that out!

UTF-8 Encoding in Go – Sendinblue Engineering says:

February 4, 2022 at 08:09

[…] stored in a single byte. The code points 128 and above are stored using 2, 3, in fact, up to 4 bytes. Thus, all the accented characters and other miscellaneous symbols are represented by multiple […]

Stijn de Witt

Web Wizard
