Vikram Mandyam

I write about software, cloud, books, emacs, productivity, photography and fitness

Explain in simple terms: Unicode

[Vikram Mandyam] / 2020-11-15


Unicode is an important area that programmers need to know, particularly if you are doing some form of text processing, internationalization, or localization. Even if you are a seasoned coder, you may feel like you’re on thin ice with Unicode.

Some weeks ago, I tried to get to the bottom of Unicode, and I’m writing this article to document my understanding of it.

Common terms

First, let us look at some terms which are commonly known, but may not be as well understood.

Glyphs and Fonts

A font is a collection of patterns indicating what the characters of a text will look like when rendered. The pattern for a given character in the set is called a glyph. Depending on the font, multiple glyphs can exist for the same character.

Character set and character encoding

Every character in a character set must be identifiable by the computer. For this reason, every character is assigned a numeric code. The American Standards Association created ASCII, the American Standard Code for Information Interchange, to achieve this.

Not only does a computer need to know the code for a character, it also needs to know how that code is to be represented in bytes. This is called encoding. For instance, in ASCII the code for the lowercase ‘a’ is 97, which is encoded as the single byte 0x61.
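Python (used here purely for illustration) makes the distinction between a character, its code, and its encoded bytes easy to see with the built-ins `ord`, `chr`, and `str.encode`:

```python
# A character's code is its position in the character set...
code = ord('a')                 # 97
char = chr(97)                  # 'a'

# ...while the encoding is how that code is written as bytes.
encoded = 'a'.encode('ascii')   # the single byte 0x61

print(code, char, encoded)
```

Here the code (97) and its one-byte ASCII encoding (0x61) happen to be the same number, but as we will see below, that stops being true once we go beyond ASCII.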

Why Unicode?

Since ASCII uses a single byte (in fact, only 7 bits of it) to represent the code for a character, it cannot possibly support all the different characters that we have in all the languages around the world.

For this reason, a better way to codify characters and represent them in encodings was sought. Thus, Unicode was born.

In Unicode, code points are depicted as “U+AAAA”, where AAAA is a hexadecimal number of at least four digits. For code points beyond U+FFFF (Unicode extends all the way up to U+10FFFF), the minimum number of digits needed to express the code point is used. For example, the code point for the letter “A” of the Kharoshthi script is expressed as U+10A00.
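A small sketch of this convention in Python, where `code_point` is a hypothetical helper name chosen for this post:

```python
def code_point(ch: str) -> str:
    """Format a character's code point in the conventional U+ notation:
    at least four hex digits, more only when the code point needs them."""
    return f"U+{ord(ch):04X}"

print(code_point('A'))           # U+0041
print(code_point('\U00010A00'))  # U+10A00, the Kharoshthi letter "A"
```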

Note that Unicode/ASCII specify characters in an abstract way, without specifying how the characters are rendered on screen. Rendering is done by the rendering engine, based on information such as font/glyphs, size, shape, color, style etc. Unicode is used only for encoding.

Unicode was designed to be compatible with ASCII, and so the first 128 Unicode code points are identical to ASCII.

Let’s talk about Unicode and its encodings

Next, we need to deal with how to store and transport these Unicode code points on disk and over the wire. This is achieved with Unicode encodings. An encoding defines how a Unicode code point is written in terms of raw bits. There are many types of encoding defined for storing Unicode data, both fixed and variable width.

A fixed-width encoding is one in which every code point is represented by the same number of bytes, while a variable-width encoding is one in which different code points can be represented by different numbers of bytes. UTF-32 and UCS-2 are fixed width, UTF-7 and UTF-8 are variable width, and UTF-16 is a variable-width encoding that looks deceptively similar to a fixed-width one.
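The difference is easy to see by counting encoded bytes in Python. (The `utf-32-le` codec is used here only because Python’s plain `utf-32` codec prepends a BOM, which would skew the counts; BOMs are covered later in this post.)

```python
# Byte counts for the same characters under different encodings.
for ch in ('a', '\u00e9', '\u20ac', '\U00010A00'):   # a, é, €, Kharoshthi A
    print(f"U+{ord(ch):04X}",
          len(ch.encode('utf-8')),      # 1 to 4 bytes: variable width
          len(ch.encode('utf-16-le')),  # 2 or 4 bytes: variable width too
          len(ch.encode('utf-32-le')))  # always 4 bytes: fixed width
```

Note how UTF-16 uses 2 bytes for the first three characters but 4 for the last one; that jump is exactly why it only looks like a fixed-width encoding.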

UTF-8 Encoding

UTF-8 (Unicode Transformation Format 8-bit), a variable-width encoding, is by far the most popular encoding. Compared to a fixed-width encoding such as UTF-32, UTF-8 saves a considerable amount of space for typical text. It is also very flexible, as it supports encoding of the entire range of code points. UTF-8 handles compatibility with ASCII by keeping all ASCII encodings untouched, and using multi-byte sequences for higher code points.

Since UTF-8 is a variable-width encoding, it must encode the length of the code point’s representation into the first byte of the sequence itself, and then use subsequent bytes to carry the remaining bits.

Each byte of a UTF-8 sequence contributes up to 7 bits of the final code point: the lead byte contributes between 3 and 7 bits, and each continuation byte contributes 6. Read left to right, these bits form one long binary number. The bits that make up the binary representation of each code point are laid out according to the bit masks shown below:

Bytes Bits Byte Mask
1 7 0bbbbbbb
2 11 110bbbbb 10bbbbbb
3 16 1110bbbb 10bbbbbb 10bbbbbb
4 21 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb

(The original UTF-8 design also defined 5- and 6-byte sequences, but RFC 3629 restricts UTF-8 to at most four bytes, which is enough to cover every code point up to U+10FFFF. The bytes 0xFE and 0xFF never appear in valid UTF-8.)

In the above representation, the ‘1’s and ‘0’s are fixed, while the ‘b’s are placeholders for bits of the code point’s binary representation.
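The bit-mask table translates almost directly into code. Here is a minimal sketch of a UTF-8 encoder following that table (real software should use a library codec, not this):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point per the bit-mask table above (RFC 3629: max 4 bytes)."""
    if cp < 0x80:                        # 1 byte:  0bbbbbbb
        return bytes([cp])
    if cp < 0x800:                       # 2 bytes: 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                   # 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

# Sanity check against Python's built-in codec:
assert utf8_encode(0x10A00) == '\U00010A00'.encode('utf-8')
```

Each branch shifts the code point’s bits into the ‘b’ slots of the corresponding mask row: `>> 6`, `>> 12`, and `>> 18` peel off 6 bits per continuation byte, and the `0xC0`/`0xE0`/`0xF0` constants supply the fixed lead-byte prefixes.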

For example, let us look at the code point U+041F, the Cyrillic capital letter PE. This is represented in binary as 100 0001 1111, which is 11 bits in length, so row 2 of the bit-mask table applies. Replacing the ‘b’s with these bits, we get the final encoded bytes 0xD0 0x9F.
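You can verify this worked example against Python’s built-in codec:

```python
# U+041F encoded by the standard UTF-8 codec matches the hand calculation.
encoded = '\u041f'.encode('utf-8')
print(encoded)  # b'\xd0\x9f'

# And decoding the two bytes gets the code point back.
print(hex(ord(b'\xd0\x9f'.decode('utf-8'))))  # 0x41f
```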

Byte Order mark

The big question that remains is how to determine which encoding a given Unicode stream uses, e.g. UTF-8 or UTF-16. This is where the byte order mark comes in. A byte order mark (BOM) is an encoding of the code point U+FEFF placed at the beginning of a Unicode stream. Multi-byte encodings such as UTF-16 store the bytes that constitute the BOM in either order (highest or lowest byte first), depending on the endianness of the system (since systems can be big endian or little endian).

For a UTF-8 stream, encoding U+FEFF yields the bytes 0xEF,0xBB,0xBF, which are placed at the beginning of the stream. Since the byte sequence depends on both the encoding and the endianness, after reading the first four bytes of a Unicode stream you can easily figure out the encoding type:

Encoding Byte Order Mark
UTF-16 big endian 0xfe 0xff
UTF-16 little endian 0xff 0xfe
UTF-32 big endian 0x00 0x00 0xfe 0xff
UTF-32 little endian 0xff 0xfe 0x00 0x00
UTF-8 0xef 0xbb 0xbf

Note that UTF-8 is a byte-oriented encoding and has no endianness: its BOM is always the same three bytes and serves only as an encoding signature.
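The table above can be turned into a small BOM-sniffing sketch (`sniff_encoding` is a hypothetical helper name, not a standard API). One subtlety: the UTF-32 LE mark must be checked before the UTF-16 LE mark, because 0xFF 0xFE is a prefix of 0xFF 0xFE 0x00 0x00.

```python
# Longest marks first, so UTF-32 LE wins over its UTF-16 LE prefix.
BOMS = [
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\xef\xbb\xbf',     'utf-8'),
    (b'\xfe\xff',         'utf-16-be'),
    (b'\xff\xfe',         'utf-16-le'),
]

def sniff_encoding(data: bytes, default: str = 'utf-8') -> str:
    """Guess the encoding of a stream from its first few bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default  # the BOM is optional, so fall back to a default

print(sniff_encoding(b'\xef\xbb\xbfhello'))  # utf-8
```

Remember that a BOM is optional, so the fallback branch matters in practice; many UTF-8 files carry no BOM at all.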

In closing

Since UTF-8 is the most commonly used encoding, I will stop here and let you mull over what you’ve just read. I hope I’ve managed to simplify Unicode for you!