How Unicode Works: What Every Developer Needs to Know About Strings and 🦄
What is Unicode?
Waaay back in 2003 Joel Spolsky wrote about Unicode and why every developer should understand what it is and why it’s important. I remember reading that article (and have since forgotten most of it) but it really struck me how important character sets and Unicode are. So two years ago we published the first version of this blog post about Unicode. Now we figured it is about time to revisit our old friend Unicode and see why it’s important in today’s emoji-filled world 🦄💩. You might not realize it, but you’re already working with Unicode if you’re working with WordPress! So let’s see what it is and why it matters to developers.
To answer the question “What is Unicode?” we should first take a look at the past.
Before we get into Unicode we need to do a little bit of history (my 4-year history degree is finally getting some use 🎉). Back in the day when Unix was being invented, characters were represented with 8 bits (1 byte) of memory. In those days memory usage was a big deal since, you know, computers had so little. David C. Zentgraf has a great example of how this works on his blog:
01100010 01101001 01110100 01110011
   b        i        t        s
All those 1s and 0s are binary, and they represent each character beneath. But writing in binary is hard work, and uh, would suck if you had to do it all the time. ASCII was created to help with this and is essentially a lookup table of bytes to characters.
The ASCII table has 128 standard characters, covering upper and lower case a–z, the digits 0–9, punctuation, and a handful of control codes. Only 95 of those are actually printable characters, which sounds fine if you speak English. In actual fact each character only requires 7 bits, so there’s a whole bit of the byte left over! This led to the creation of the extended ASCII tables, which add 128 more fancy things like Ç and Æ as well as other characters. Unfortunately that’s not enough to cover the wide variety of characters used in languages throughout the world, so people created their own encodings.
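You can watch the lookup table in action from JavaScript. This is just a sketch using the standard `charCodeAt` method, which returns the character’s number; for ASCII characters that number matches the binary example above:

```javascript
// ASCII maps each character to a number that fits in 7 bits.
// charCodeAt() gives us that number back, and toString(2)
// shows it in binary, matching the "bits" example above.
const word = 'bits';
for (const ch of word) {
  const code = ch.charCodeAt(0);
  console.log(ch, code, code.toString(2).padStart(8, '0'));
}
// b 98 01100010
// i 105 01101001
// t 116 01110100
// s 115 01110011
```

Going the other way, `String.fromCharCode(98)` looks up the character for the number 98 and gives back `'b'`.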
At the end of the 90s, there were at least 60 standardized (and a few less so) extended ASCII tables to keep track of. We should probably be thankful that they all at least shared the first 128 characters. But out of necessity they were using the additional 128 characters very differently, so differently that accidentally selecting the wrong table could make a text unreadable. Awesome.
Character encodings broke the internet
Alright, so now we kind of know what’s up with all those bajillion character encodings you may have encountered, like Microsoft’s Windows-1252 and Big5 – people needed to represent their own language and unique set of characters. And this mostly worked OK when documents weren’t shared with other computers. You know, the time before the internet.
The internet broke all of this because people started sending documents encoded in their native encoding to other people. Sometimes people weren’t using the same encoding and they’d see something like this as an email subject line:
�����[ Éf����Õì ÔÇµÇ���¢!!
To further complicate things, some encodings used 16 bits rather than 8, making for massive lookup tables, far larger than ASCII’s.
ASCII? Which ASCII?
For a long time the 256-character table worked well. It was simple and efficient. There was really only one problem: which ASCII?
When we’re sending stuff to each other via the Internet, it’s important to give the recipient a chance to guess which ASCII encoding we’re using. Over the years a lot of energy has been poured into trying to make all these encodings play along nicely with email, spreadsheets, documents and web pages.
When you’re visiting a simple web page today there are a bunch of different techniques in use, including qualified guessing, that try to make sure you see the correct characters.
The most obvious place you as a developer will notice this is in the HTML document itself. You could add a `<meta charset="ISO-8859-1">` tag in the `<head>` of the HTML to tell the browser that you are using the Western European Latin character set. If that tag is missing, the browser will look at the response headers from the web server and may find an additional charset declaration in the `Content-Type` header. The HTML document can also override the `Content-Type` sent by the web server by adding a `<meta http-equiv="content-type">` tag.
But the real fun starts when these three different places say different things about the charset in use, or when it turns out that the declared character set isn’t what’s actually used in the rest of the document. There was a reason old versions of Chrome and Firefox allowed the user to change the encoding manually.
The email system had its own sad story behind supporting international characters. Much of the sadness stems from the fact that the underlying SMTP protocol still requires that the transferred content be 7-bit. This little issue is often solved using quoted printable encoding, a technique to transfer 8-bit characters over a 7-bit protocol so that extended ASCII characters can be sent in an email. You’ve probably seen quoted printable encoding go wrong before.
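The core trick is simple enough to sketch in a few lines of JavaScript. This is a rough illustration, not a full RFC 2045 implementation (real quoted printable also handles line-length limits and trailing whitespace), and it assumes the content is UTF-8:

```javascript
// Sketch of quoted-printable encoding: any byte outside printable
// 7-bit ASCII (plus '=' itself) becomes '=' followed by two hex
// digits. Assumes the text's bytes are UTF-8.
function toQuotedPrintable(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  let out = '';
  for (const b of bytes) {
    if (b === 0x3d || b < 0x20 || b > 0x7e) {
      out += '=' + b.toString(16).toUpperCase().padStart(2, '0');
    } else {
      out += String.fromCharCode(b);
    }
  }
  return out;
}

console.log(toQuotedPrintable('räksmörgås'));
// r=C3=A4ksm=C3=B6rg=C3=A5s
```

Those `=C3=A4` runs are exactly what you see when an email client decodes quoted printable with the wrong character table.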
Although quoted printable encoding is a clever solution, it only solves part of the problem. The receiving email client still needs to figure out which of all the possible ASCII tables to use.
Almost all email sent today uses the MIME standard for the actual email content. MIME allows us to send attachments, HTML email and very often an additional plain text version of the email intended for basic, less capable email clients. In each of these MIME parts, the email client needs to add headers for `Content-Transfer-Encoding` and `Content-Type` and be sure to add the proper character set. Wikipedia lists more than 50 different email clients in a feature comparison chart. Would you bet that all of these clients handle international characters the exact same way? I wouldn’t.
In the mid 90s, people started thinking about allowing international characters in domain names.
The DNS system originally allowed (and still allows) only 7-bit ASCII in domain names, which means no international characters are really possible. So the same old problem needed to be solved again. But instead of reusing quoted printable, the IETF thought this one through and came up with Punycode, which is one very important step smarter.
Punycode allows encoding of any 8, 16 or 32-bit (yes, thirty-two) character using only the letters, digits, and hyphens found in the original 7-bit ASCII table. For instance, the Swedish word for shrimp sandwich is “räksmörgås”. In Punycode this would be represented as “xn--rksmrgs-5wao1o”.
So if you were to go out and buy the domain name räksmörgås.com (available at the moment) you would actually be buying xn--rksmrgs-5wao1o.com. But all modern browsers would correctly show it as “räksmörgås.com”.
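You can see this conversion happen in Node.js, whose WHATWG `URL` implementation applies Punycode to hostnames automatically:

```javascript
// Node's URL parser converts international domain names to their
// Punycode (xn--) form, which is what actually goes over DNS.
const url = new URL('https://räksmörgås.com/');
console.log(url.hostname); // xn--rksmrgs-5wao1o.com
```

The browser does the reverse trick in the address bar, showing you the pretty Unicode form while talking Punycode to the DNS system.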
The clever thing here was to allow representing characters from a much larger table of characters than just the 256 characters possible with extended ASCII. Such a character table had just begun getting some real traction.
In other parts of the industry, someone finally got fed up with seeing gobbledygook in their documents, emails and web pages and decided to create Unicode to unify all these encodings.
Unicode is really just another type of character encoding; it’s still a lookup of bits -> characters. The main difference between Unicode and ASCII is that Unicode allows characters to be up to 32 bits wide. That’s over 4 billion unique values. But for various reasons not all of that space will ever be used; there will actually only ever be 1,111,998 characters in Unicode. But that should be enough for anyone.
But with Unicode, won’t all my documents, emails and web pages take up 4x the amount of space compared with ASCII? Well, luckily no. Together with Unicode comes several mechanisms to represent or encode the characters. These are primarily the UTF-8 and UTF-16 encoding schemes which both take a really smart approach to the size problem.
Unicode encoding schemes like UTF-8 are more efficient in how they use their bits. With UTF-8, if a character can be represented with 1 byte, that’s all it will use. If a character needs 4 bytes, it’ll get 4 bytes. This is called a variable-length encoding and it’s more efficient memory-wise. Unicode encodings are simply how a piece of software implements the Unicode standard.
UTF-8 saves space. In UTF-8, common characters like “C” take 8 bits, while rare characters like “💩” take 32 bits. Other characters take 16 or 24 bits. A blog post like this one takes about a quarter of the space in UTF-8 that it would in UTF-32, which means it downloads faster too.
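You can check these sizes yourself. `TextEncoder` converts a string to its UTF-8 bytes, so the length of the result is the number of bytes each character needs:

```javascript
// TextEncoder produces the UTF-8 bytes of a string, so the byte
// count shows UTF-8's variable-length encoding at work.
const utf8 = new TextEncoder();
for (const ch of ['C', 'ä', '€', '💩']) {
  console.log(ch, utf8.encode(ch).length, 'byte(s)');
}
// C 1 byte(s)
// ä 2 byte(s)
// € 3 byte(s)
// 💩 4 byte(s)
```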
UTF-8 is by far the most common encoding you’ll come across on the web. The great thing about UTF-8 is that its first 128 code points are encoded exactly like ASCII. So a document containing only ASCII characters is byte-for-byte identical in UTF-8.
This is all important in our day and age because of the emoji 🚀. Emoji after all, are just characters – like the letter ‘a’ or ‘Z’. Because Unicode is flexible enough to use whichever amount of bits it needs, emoji can be added to Unicode character sets quite easily.
Unicode Code Points
Unicode characters can be referenced by their code point. This Stack Overflow article does a good job of explaining what a code point is:
A code point is the atomic unit (irreducible unit) of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
The current Unicode standard defines 1,114,112 code points – that’s a lot of 🍝. Unicode further divides all those code points into 17 planes, or groupings. We don’t need to know all about the internal workings of Unicode, but it’s helpful to understand where it’s coming from.
To reference a code point we use the following syntax: `U+` followed by the code point’s hexadecimal number. Hexadecimal is used as it’s a shorter way to write large numbers. That’s why you’ll see things like `U+1F4A9` in emoji tables.
| Character | Code point | Binary |
|-----------|------------|--------|
| 💩 | U+1F4A9 | 0001 1111 0100 1010 1001 |
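Tying the planes and code points together: each plane is a block of 0x10000 (65,536) code points, so integer-dividing a code point by 0x10000 tells you which plane a character lives in. A quick JavaScript sketch:

```javascript
// Each Unicode plane holds 0x10000 code points, so the plane
// number is just the code point divided by 0x10000.
const plane = (ch) => Math.floor(ch.codePointAt(0) / 0x10000);
console.log(plane('a'));  // 0 - the Basic Multilingual Plane
console.log(plane('💩')); // 1 - the Supplementary Multilingual Plane
```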
To make things more complex, some characters can be expressed as a combination of code points.
é can be represented in Unicode as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
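A short JavaScript sketch shows both ideas: reading a full code point with `codePointAt`, and the fact that the two spellings of é really are different strings:

```javascript
// codePointAt() reads a whole code point, even one that doesn't
// fit in 16 bits:
console.log('💩'.codePointAt(0).toString(16)); // 1f4a9

// The two ways to write é from the example above:
const composed = '\u00E9';   // U+00E9, precomposed
const combining = 'e\u0301'; // U+0065 + U+0301, combining accent
console.log(composed, combining);    // both render as é
console.log(composed === combining); // false - different code points!
```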
Problems with Unicode
Different programming languages, operating systems, even iOS Apps handle Unicode differently, and there’s still a lot of confusion out there about what Unicode actually is. Let’s look at some examples that are close to home.
We’ll start with the ElePHPant in the room, PHP. PHP claims on its strings documentation page that it only supports a 256-character set. What this really means is that PHP assumes 1 byte = 1 character for strings. This is actually something I came across while working on the batching feature for the Theme & Plugin Files Addon in WP Migrate DB Pro.
If you want to get the size, in bytes, of a string, just count the characters! `strlen()` for a string in PHP is essentially how many bytes it takes up. Cool.
Buuuut, what about a string that contains this bad boy – 🔥. How many bytes would that be? One?
echo strlen( '🔥' ); // 4
Go home PHP you’re drunk.
This is where PHP’s multibyte string functions come in. To get the legit string length of 🔥, in characters, you’d need to use `mb_strlen()`:

echo mb_strlen( '🔥' ); // 1
Cool! So that works. But what was the length of 4 about with the standard `strlen()`? As I mentioned earlier, PHP thinks 1 character = 1 byte, so internally it checks the memory size of a string. The 🔥 emoji actually takes up 4 bytes of memory!
What a memory hog 🐷.
In reality though, PHP only messes up Unicode if you’re manipulating strings. If you’re simply getting or outputting strings, PHP doesn’t care and will work just fine. But if you’re trying to get substrings or lengths of strings, stick with the multibyte functions.
And worth mentioning, even in PHP 8, the multibyte string library is still delivered via the mbstring extension, which is not enabled by default. Be sure to enable it when installing PHP yourself. Most respectable WordPress hosts and control panels will have it enabled, including our own control panel, SpinupWP.
False friends in PHP
The PHP functions `utf8_encode()` and `utf8_decode()` do sound like they would be really useful when working with Unicode strings in PHP. Well, they are, as long as you are 100% sure that you only ever work with the ISO-8859-1 (Latin-1) encoding, since that’s the only encoding these functions convert to and from UTF-8. As the PHP manual correctly points out:
Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding…
If you need to be absolutely certain you correctly convert strings to and from UTF-8, you should have a look at the `mb_convert_encoding()` function instead, as it allows explicitly defining the character encodings in use.
Just as with the other multibyte string functions, `mb_convert_encoding()` is delivered via the mbstring extension.
JavaScript has its own Unicode quirks. JavaScript strings are made up of 16-bit units, and the `length` property counts those units, not characters:

let poop = '💩';
console.log( poop.length ); // 2

Similar to PHP’s `strlen()`, `length` reports 2 here because 💩 doesn’t fit in 16 bits and is stored as a surrogate pair, two 16-bit units that together represent one character. You can even write the pair out yourself:

let poop = '\uD83D\uDCA9';
console.log( poop );        // 💩
console.log( poop.length ); // 2
You can use this handy tool to convert emoji or other characters to their hex escaped values.
When comparing or manipulating strings like this, it’s good to know that `String.prototype.normalize` is available. It allows you to convert strings to a standardized Unicode form. This is helpful if you have strings that could have been encoded differently or you are comparing string lengths.
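For example, `normalize('NFC')` composes combining sequences into precomposed characters where possible, so two equivalent spellings of é compare as equal:

```javascript
// Without normalization, the combining and precomposed forms of
// é are different strings; normalize('NFC') makes them comparable.
const combining = 'e\u0301'; // e + combining acute accent
const composed = '\u00E9';   // precomposed é
console.log(combining === composed);                  // false
console.log(combining.normalize('NFC') === composed); // true
```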
MySQL’s issues with Unicode are where I first encountered character encoding compatibility issues. It’s also when I first started losing my hair 😢.
Like PHP, MySQL doesn’t fully support UTF-8, or really, Unicode at all. MySQL’s `utf8` encoding isn’t really UTF-8 at all. The `utf8` encoding that we were all using back in the day only uses up to 3 bytes per character. Why? Well, who on earth would need more than 3 bytes, 24 WHOLE BITS, to represent a single character! The why is a long story (I suggest you read Adam’s article if you’d like to hear it) but a fix was rolled out in 2010 that brought us the `utf8mb4` character set:
The utf8mb4 character set has been added. This is similar to utf8, but its encoding allows up to four bytes per character to enable support for supplementary characters.
Nice. So if you’re still using the old `utf8` character set, you won’t see a fancy 😬.
The WordPress core peeps realized this in 2015 and made `utf8mb4` the default for new installs, as well as upgrading tables to use the new encoding where possible. Pro tip: for a deeper dive into WordPress database knowledge, check out our guide to the WordPress database.
As someone who works on a database migration plugin, this one has bitten me more than once, and we often have customers email us with issues migrating from a `utf8mb4`-encoded database to a `utf8`-encoded database. We have a workaround, but your best bet is to make sure both sides involved in a migration use the `utf8mb4` character set.
Unicode is a common, massive character set for all the world’s languages, glyphs and emoji. The UTF encoding family is how computers know which sequence of bits should be represented as which character. However, every programming language, app and OS implements and supports Unicode differently (if at all). This is where the developer’s job gets fun 😬.
Protip: Know what encoding your strings are using, and you know, use the same encoding everywhere!
Have you had issues with Unicode in your work? Anything I’ve missed in the above? Let us know in the comments.