While we view text documents as lines of text, computers actually see them as binary data, or a series of ones and zeros. Therefore, the characters within a text document must be represented by numeric codes. In order to accomplish this, the text is saved using one of several types of character encoding.
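As a small illustration of the idea, the following Python sketch shows the numeric codes and the bits behind a short string. The sample string and the variable names are chosen here only for demonstration.

```python
# A short sample string (an arbitrary choice for illustration).
text = "Hi!"

# Each character maps to a numeric code.
code_points = [ord(ch) for ch in text]

# Encoding turns the characters into bytes; ASCII uses one byte per character.
encoded = text.encode("ascii")

# Show each byte as eight binary digits.
bits = " ".join(f"{b:08b}" for b in encoded)

print(code_points)  # [72, 105, 33]
print(bits)         # 01001000 01101001 00100001
```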
The most popular types of character encoding are ASCII and Unicode. While ASCII is still supported by nearly all text editors, Unicode is more commonly used because it supports a far larger character set. Unicode text is usually stored as UTF-8, UTF-16, or UTF-32, which are different encoding forms of the same Unicode standard. UTF stands for "Unicode Transformation Format" and the number indicates the size in bits of the basic unit (the "code unit") used to build each character, not a fixed size per character. UTF-8 uses one to four 8-bit units per character, UTF-16 uses one or two 16-bit units, and UTF-32 always uses a single 32-bit unit. From the early days of computing, characters have typically been stored in whole bytes (8 bits), which is why the Unicode encoding forms use units that are multiples of 8 bits.
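A quick way to see how many bytes each UTF form actually uses is to encode a few sample characters and count the results. This is a minimal sketch: the helper name `encoded_sizes` and the sample characters are assumptions made for illustration, and the little-endian codec variants are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed for one character in each UTF form."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

# Sample characters: Latin letter, accented letter, Chinese character, emoji.
for ch in ["A", "é", "中", "😀"]:
    print(ch, encoded_sizes(ch))
```

In this sketch, "A" takes 1, 2, and 4 bytes respectively, while the emoji takes 4 bytes in all three forms, which shows that only UTF-32 uses the same number of bytes for every character.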
While ASCII and Unicode are the most common types of character encoding, other encoding standards may also be used to encode text files. For example, several types of language-specific character encoding standards exist, such as Western, Latin-US, Japanese, Korean, and Chinese. While Western languages use similar characters, Eastern languages require a completely different character set. Therefore, a Latin encoding would not support the symbols needed to represent a text string in Chinese. Fortunately, Unicode encodings such as UTF-8 and UTF-16 support a large enough character set to represent both Western and Eastern letters and symbols.
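The limitation of a Latin encoding can be checked directly. The sketch below assumes Python's built-in `latin-1` codec as a stand-in for a Western encoding; the helper name `can_encode` and the sample strings are hypothetical and used only for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("café", "latin-1"))  # True: Western characters are covered
print(can_encode("中文", "latin-1"))   # False: no Chinese characters in Latin-1
print(can_encode("中文", "utf-8"))     # True: Unicode covers both
```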
Updated: September 24, 2010