Unicode is a word that I hear and see everywhere: in computer books, in a government's unveiling of a new currency symbol, and even in the simple Windows Notepad. One term in particular keeps coming up: UTF-8. What the heck does it mean, and what is its significance?
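I know that UTF stands for Unicode Transformation Format (it is also expanded as UCS Transformation Format, where UCS is the Universal Coded Character Set), and that Unicode is a standard that assigns a number, called a code point, to every character. The word Universal seems interesting!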
Interestingly, UTF-8 can encode every character that Unicode defines, which covers practically every written language in the world, if not languages from outside our world for now: the Unicode Consortium rejected the proposal to include Klingon in the character set, saying that it is not popular enough. But the battle is still on.
What does the 8 in UTF-8 mean, though? It means that the encoding works in 8-bit units: a document encoded in UTF-8 stores each ASCII character in a single 8-bit byte.
Note, however, that UTF-8 uses extra bytes for characters beyond standard ASCII; not everything can be coded in 8 bits. UTF-8 encodes each character in one to four bytes, depending on the number of significant bits in the numerical value of the character.
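To see the one-to-four-byte rule in action, here is a small Python sketch. The helper name `utf8_bytes` and the sample characters are my own picks for illustration; the only thing it relies on is Python's built-in `str.encode`.

```python
def utf8_bytes(ch: str) -> str:
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# One sample character for each encoded length, from 1 to 4 bytes.
for ch in ["A", "é", "€", "😀"]:
    hex_bytes = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(hex_bytes.split())} byte(s): {hex_bytes}")
```

The plain ASCII letter takes a single byte, while the accented letter, the currency symbol, and the emoji take two, three, and four bytes respectively.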
EXAMPLE: Unicode code point from the UTF-8 byte pattern
My zsh terminal always uses the ➜ (Heavy Round-Tipped Rightwards Arrow) as its prompt. The problem is that not all fonts can display this arrow. But before I go looking for a replacement arrow, I need to know what the Unicode code point of the current one is. So here is what I did.
1) Saved the arrow in a file and used ‘od’ to get the hex bytes for it.
The byte pattern of the arrow is: E2 9E 9C. This is how UTF-8 stores the character, but it is not the Unicode code point itself.
2) I began by representing the hex in binary
E2 -> 1110 0010
9E -> 1001 1110
9C -> 1001 1100
3) Looked at the Wikipedia page and found out that I need to strip some prefix
bits from each byte to get to the code point. (See table below)
a) The 1110 is the prefix (used to recognize a 3-byte sequence) in the first
byte. I strip it and the remaining bits are part of the code point.
Thus I got 0010 (0x:2)
b) The 10 is the prefix in the 2nd byte; it marks a continuation byte. The
rest is part of the code point. Hence I got 011110. (0x:1E)
c) The 10 is again the prefix in the 3rd byte.
Hence I got 011100 (0x:1C)
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
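The table can also be read in the other direction, from code point to bytes. Below is a rough Python sketch of that, written only to make the bit layout concrete. The function name `encode_utf8` is my own, and it skips checks a real encoder would make (for example, it does not reject the surrogate range U+D800 to U+DFFF or negative numbers); in practice you would just call `str.encode("utf-8")`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand, following the table's bit layout."""
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is outside the Unicode range")


# Cross-check the hand-rolled encoder against Python's built-in one.
for ch in ["A", "é", "€", "😀"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

Each branch corresponds to one row of the table: the first byte gets the length prefix, and every following byte gets the 10 prefix plus six bits of the code point.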
4) Now we combine the bits taken from the 3 bytes back to back and we get the code point.
2 1E 1C
0010 011110 011100 -> 0010 0111 1001 1100 (0x:279C)
5) Thus the Unicode code point for ➜ is 0x279C, written U+279C. When used in a
UTF-8 document, this character is stored in 3 bytes as E2 9E 9C. Preview available here https://unicode-table.com/en/279C/
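The manual steps above can be sketched in Python with a few shifts and masks. This is only an illustration of the bit stripping, under my own naming (`decode_utf8_char` is a made-up helper); it does not reject overlong encodings or surrogates the way a strict decoder would, and in everyday code `bytes.decode("utf-8")` does the job.

```python
def decode_utf8_char(data: bytes) -> int:
    """Recover one code point from a single 1- to 4-byte UTF-8 sequence."""
    first = data[0]
    # The prefix of the first byte gives the sequence length;
    # the bits after the prefix are the top bits of the code point.
    if first >> 7 == 0b0:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this leading byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        # Drop the 10 prefix and append the remaining six bits.
        value = (value << 6) | (byte & 0x3F)
    return value


arrow = bytes.fromhex("E2 9E 9C")
print(f"U+{decode_utf8_char(arrow):04X}")  # U+279C
```

The result matches what Python's own decoder reports: `ord(arrow.decode("utf-8"))` is also 0x279C.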
Hope this document proves useful in some way and helps you understand Unicode a little more clearly. Feel free to point out any mistakes; all comments are welcome.
#ascii, #character-set, #linux, #unicode, #utf, #utf-8