ASCII is a character encoding standard in which every character (letter, digit, or symbol) in a piece of text is assigned a numeric value from the ASCII character set (the set of characters that the computer can store, process, and understand). Encoding is a simple process: look up the character’s value in the ASCII table and store its binary equivalent.
The standard ASCII character set includes 128 characters with numeric values from 0-127. Some of these are special characters that are no longer in use[1]. Each character requires 7 bits of storage (which leaves the most significant bit available to be used as the parity bit for data transmission). There is also an extended character set which uses all eight bits available to give 256 characters, with numeric codes from 0-255. Part of the ASCII table is shown below (the SPACE character, which is number 32 in the ASCII table, and non-printing characters, are not shown):
| Code | Character | Code | Character | Code | Character | Code | Character | Code | Character |
|------|-----------|------|-----------|------|-----------|------|-----------|------|-----------|
| 33 | ! | 53 | 5 | 73 | I | 93 | ] | 113 | q |
| 34 | " | 54 | 6 | 74 | J | 94 | ^ | 114 | r |
| 35 | # | 55 | 7 | 75 | K | 95 | _ | 115 | s |
| 36 | $ | 56 | 8 | 76 | L | 96 | ` | 116 | t |
| 37 | % | 57 | 9 | 77 | M | 97 | a | 117 | u |
| 38 | & | 58 | : | 78 | N | 98 | b | 118 | v |
| 39 | ' | 59 | ; | 79 | O | 99 | c | 119 | w |
| 40 | ( | 60 | < | 80 | P | 100 | d | 120 | x |
| 41 | ) | 61 | = | 81 | Q | 101 | e | 121 | y |
| 42 | * | 62 | > | 82 | R | 102 | f | 122 | z |
| 43 | + | 63 | ? | 83 | S | 103 | g | 123 | { |
| 44 | , | 64 | @ | 84 | T | 104 | h | 124 | \| |
| 45 | - | 65 | A | 85 | U | 105 | i | 125 | } |
| 46 | . | 66 | B | 86 | V | 106 | j | 126 | ~ |
| 47 | / | 67 | C | 87 | W | 107 | k | | |
| 48 | 0 | 68 | D | 88 | X | 108 | l | | |
| 49 | 1 | 69 | E | 89 | Y | 109 | m | | |
| 50 | 2 | 70 | F | 90 | Z | 110 | n | | |
| 51 | 3 | 71 | G | 91 | [ | 111 | o | | |
| 52 | 4 | 72 | H | 92 | \ | 112 | p | | |
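The lookups in this table correspond to built-in operations in most programming languages. As a purely illustrative sketch, Python exposes them as ord() and chr(), which can be used to check a few of the entries above:

```python
# Each printable character maps to the numeric code shown in the table above.
print(ord("!"))   # 33
print(ord("A"))   # 65
print(ord("z"))   # 122
print(chr(64))    # @  (looking the table up in the other direction)
```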
Encoding text into binary using the ASCII table is a very simple process:
- Split the text into characters
- Find the denary equivalent of every character by looking it up in the ASCII table
- Convert each denary value into its binary equivalent
For example, the text “Hello” would be encoded like so (the binary values are shown as full 8-bit bytes, with the spare most significant bit set to 0):

| Character | Denary | Binary |
|-----------|--------|--------|
| H | 72 | 01001000 |
| e | 101 | 01100101 |
| l | 108 | 01101100 |
| l | 108 | 01101100 |
| o | 111 | 01101111 |
To decode the binary back into plain text, you would reverse the process.
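These steps (and their reversal) map directly onto a few lines of code. The following Python sketch is purely illustrative (the function names encode_ascii and decode_ascii are made up for this example); it uses the built-in ord() and chr() functions as the ASCII table lookup:

```python
def encode_ascii(text):
    """Encode text as a list of 8-bit binary strings using the ASCII table."""
    return [format(ord(character), "08b") for character in text]

def decode_ascii(binary_values):
    """Reverse the process: turn binary strings back into plain text."""
    return "".join(chr(int(bits, 2)) for bits in binary_values)

encoded = encode_ascii("Hello")
print(encoded)                # ['01001000', '01100101', '01101100', '01101100', '01101111']
print(decode_ascii(encoded))  # Hello
```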
ASCII is very simple, but has some limitations:
- Only 128 characters can be represented in ASCII (or 256 if using the extended character set)
- The extended character set is not always encoded the same way on all systems (for example, Microsoft’s extended character sets differ from those used by other vendors, which can lead to text being displayed incorrectly when files move between systems); see the short example after this list.
- 128 characters is plenty for the Latin alphabet, but nowhere near enough to include characters for languages that utilise different alphabets
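The second limitation is easy to demonstrate: decoding the same extended byte value under two different 8-bit character sets gives two different characters. This sketch uses Python’s built-in codecs for two common extended sets; the byte value 0x94 is just an arbitrary example:

```python
# The same extended byte value represents different characters in different
# 8-bit character sets (here: Windows-1252 and the original IBM PC set, CP437).
value = bytes([0x94])

print(value.decode("cp1252"))  # ” (a right double quotation mark)
print(value.decode("cp437"))   # ö (Latin small letter o with diaeresis)
```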
As the cost of computer hardware fell and international support became increasingly important, Unicode was developed to allow a much wider range of characters to be encoded in binary.
The first significant difference between Unicode and ASCII is that Unicode uses at least 16 bits to encode each character, which allows a character set containing up to 65,536 symbols. The Unicode standard also supports multiple character sets called planes. There are 17 planes: the Basic Multilingual Plane, which contains the characters used for most languages and purposes, and 16 supplementary planes for additional characters, such as the less common symbols of languages like Chinese (which has around 50,000 characters in total).
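The relationship between a character, its numeric code point, and the plane it lives in can be seen directly in code. A minimal Python sketch (the characters chosen are arbitrary examples) prints the code point and plane for a few characters; the plane number is simply the code point divided by 65,536:

```python
# Each character has a numeric code point; each plane holds 65,536 code points,
# so integer-dividing the code point by 0x10000 (65,536) gives the plane number.
for character in ["A", "é", "中", "😀"]:
    code_point = ord(character)
    plane = code_point // 0x10000
    print(f"{character}: code point U+{code_point:04X}, plane {plane}")
```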
Even though Unicode exists to support the massive number of characters used in alphabets and other applications worldwide, it has also been designed to be as efficient as possible by employing some tricks:
- Unicode is a superset of ASCII: This means that Unicode supports every ASCII character, and much more besides. The standard ASCII character set runs from 0 to 127, requiring only 7 bits per character; as a result, the MSB of an ASCII character is always 0. Unicode’s encoding is designed so that no character outside the standard ASCII set is stored starting with a ‘0’ bit. This makes it possible to recognise ASCII characters and encode each of them in only one byte (rather than the 16-bit minimum for other Unicode characters).
- Basic Multilingual Plane characters require 16 bits: The developers of Unicode knew that most characters would either be in the ASCII character set or come from the Basic Multilingual Plane. Characters from the latter require only 16 bits each.
- Characters from other planes require 24 or 32 bits: Uncommon characters and those from less common languages require 24 or 32 bits each. This maximises support for many alphabets while keeping storage and transmission requirements as low as possible.
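In practice these tricks correspond to the variable-width encodings used with Unicode, of which UTF-8 is the most common. The following Python sketch shows how the number of bytes grows with the rarity of the character, and that only the ASCII character’s byte begins with a 0 bit (under UTF-8 the exact counts differ slightly from the simplified figures above: most Basic Multilingual Plane characters take two or three bytes, and characters from other planes take four):

```python
# Number of bytes each character occupies when encoded with UTF-8,
# together with the actual bit patterns of those bytes.
for character in ["A", "é", "中", "😀"]:
    encoded = character.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"{character}: {len(encoded)} byte(s): {bits}")
```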
Unicode is also completely standardised, meaning that every character is encoded in the same way on every computer that uses the standard, no matter where it is or which manufacturer made it.
[1] For example, the LF or Line Feed character was used to move the paper in a teleprinter (an electric typewriter-like terminal) up by one line. That physical function is no longer needed, of course (although LF itself survives as the newline character), but at the time ASCII was created it was necessary, as almost all output from a computer was sent to a teleprinter rather than a screen.