Understanding Character Encoding

File format

A format for encoding information in a file. Each different type of file has a different file format. The file format specifies first whether the file is a binary or ASCII file, and second, how the information is organized.

ASCII

Acronym for the American Standard Code for Information Interchange. Pronounced ask-ee, ASCII is a code for representing English characters as numbers, with each letter assigned a number from 0 to 127. For example, the ASCII code for uppercase M is 77. Most computers use ASCII codes to represent text, which makes it possible to transfer data from one computer to another.
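As a quick illustration of the character-to-number mapping described above, here is a small Python sketch. It only uses the built-in ord() and chr() functions; the sample word "Mail" is just something I picked for the example.

```python
# Look up ASCII code numbers with Python's built-in ord() and chr().
code_for_m = ord("M")      # character -> number
char_for_77 = chr(77)      # number -> character

# Encoding a pure-English string as ASCII yields one byte per character.
ascii_bytes = "Mail".encode("ascii")

print(code_for_m)          # 77
print(char_for_77)         # M
print(list(ascii_bytes))   # [77, 97, 105, 108]
```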

Text files stored in ASCII format are sometimes called ASCII files. Text editors and word processors are usually capable of storing data in ASCII format, although ASCII format is not always the default storage format. Most data files, particularly if they contain numeric data, are not stored in ASCII format. Executable programs are never stored in ASCII format.

The standard ASCII character set uses just 7 bits for each character. There are several larger character sets that use 8 bits, which gives them 128 additional characters. The extra characters are used to represent non-English characters, graphics symbols, and mathematical symbols. Several companies and organizations have proposed extensions for these 128 characters. The DOS operating system uses a superset of ASCII called extended ASCII or high ASCII. A more universal standard is the ISO Latin 1 set of characters, which is used by many operating systems, as well as Web browsers.
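To make the 7-bit versus 8-bit difference concrete, the sketch below encodes a word containing an accented letter. The word "café" is my own example; "latin-1" is Python's name for ISO Latin 1 (ISO 8859-1).

```python
text = "café"

# ISO Latin 1 stores é as the single byte 233 (0xE9), which is above 127.
latin1_bytes = text.encode("latin-1")
print(list(latin1_bytes))  # [99, 97, 102, 233]

# 7-bit ASCII has no code for é, so a strict ASCII encode fails.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(fits_in_ascii)  # False
```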

Binary Format

A format for representing data used by some applications. The other main formats for storing data are text formats (such as ASCII and EBCDIC), in which each character of data is assigned a specific code number.

Binary formats are used for executable programs and numeric data, whereas text formats are used for textual data. Many files contain a combination of binary and text formats. Such files are usually considered to be binary files even though they contain some data in a text format.
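The difference between a text format and a binary format for numeric data can be sketched as follows. This is only an illustration: the number and the 4-byte little-endian layout ("<i") are assumptions chosen for the example, not something a particular file format prescribes.

```python
import struct

number = 1234567

# Text format: each digit is stored as its own character code.
as_text = str(number).encode("ascii")

# Binary format: the whole number is packed into a fixed 4-byte integer.
as_binary = struct.pack("<i", number)

print(as_text, len(as_text))              # b'1234567' 7
print(len(as_binary))                     # 4
print(struct.unpack("<i", as_binary)[0])  # 1234567
```

The text version can be read in any text editor; the binary version is more compact but only makes sense to a program that knows the layout.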

A Character Encoding

A character encoding is the way that letters, digits and other symbols are expressed as numeric values that a computer can understand.

A character encoding is also referred to as a character set, character code, charset, or code page. It defines how the bits in a stream of text are mapped to the characters they represent. ASCII is the basis of most code pages; for example, the character “C” is represented by the value 67 in ASCII, and by the same value in the code pages built on it.
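Here is a rough sketch of what "ASCII is the basis of most code pages" means in practice. The list of code pages is my own selection from the codecs that ship with Python; any ASCII-based code page should behave the same way for values below 128.

```python
code_pages = ["ascii", "latin-1", "cp1252", "cp437", "iso-8859-7"]

# The value 67 decodes to "C" in ASCII and in every ASCII-based code page here.
c_decodings = {name: bytes([67]).decode(name) for name in code_pages}
print(c_decodings)

# A value above 127 is where the code pages disagree.
# (ASCII is left out because it has no such value at all.)
high_byte = bytes([0xE9])
high_decodings = {
    name: high_byte.decode(name) for name in code_pages if name != "ascii"
}
print(high_decodings)
```

The second dictionary shows the same byte turning into different characters depending on which code page is used to read it.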

A file — an HTML document, for instance — is saved with a particular character encoding. Information about the form of encoding that the file uses is sent to browsers and other user agents, so that they can interpret the bits and bytes properly. If the declared encoding doesn’t match the encoding that has actually been used, browsers may render your precious web page as gobbledygook. And of course search engines can’t make head nor tail of it, either.
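The gobbledygook effect is easy to reproduce. In this sketch the text is saved as UTF-8 but then read as if it had been declared as ISO 8859-1; the sample phrase is my own.

```python
original = "naïve café"

# The file is actually saved as UTF-8...
saved_bytes = original.encode("utf-8")

# ...but the reader trusts a declaration that says ISO 8859-1.
garbled = saved_bytes.decode("latin-1")

# When the declaration matches, the text comes back intact.
correct = saved_bytes.decode("utf-8")

print(garbled)  # naÃ¯ve cafÃ©
print(correct)  # naïve café
```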

The choice of character encoding affects the range of literal characters we can use in a web page. Regular Latin letters are rarely a problem, but some languages need more letters than others, and some languages need various diacritical marks above or below the letters. Then, of course, some languages don’t use Latin letters at all. If we want proper — as in typographically correct — punctuation and special symbols, the choice of encoding also becomes more critical.

What if we need a character that cannot be represented with the encoding we’ve chosen? We have to resort to entities or numeric character references (NCR). An entity reference is a symbolic name for a particular character, such as &copy; for the © symbol. It starts with an ampersand (&) and should end with a semicolon (;). An NCR references a character by its code position. The NCR for the copyright symbol is &#169; (decimal) or &#xA9; (hexadecimal).
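To see how the three ways of writing the copyright symbol relate, the sketch below uses html.unescape() from Python's standard library to resolve them, and builds the two NCR forms from the character's code position.

```python
import html

# The named entity and both NCR forms resolve to the same character.
references = ["&copy;", "&#169;", "&#xA9;"]
resolved = [html.unescape(ref) for ref in references]
print(resolved)  # ['©', '©', '©']

# An NCR is just the character's code position written in decimal or hex.
code_position = ord("©")
decimal_ncr = f"&#{code_position};"
hex_ncr = f"&#x{code_position:X};"
print(decimal_ncr, hex_ncr)  # &#169; &#xA9;
```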

Entities or NCRs work just as well as literal characters, but they use more bytes and make the markup more difficult to read. They are also prone to typing errors.

Unicode / ISO 10646

Unicode is a character repertoire that contains most of the characters used in the languages of the world. It has room for more than a million characters (1,114,112 code positions), and already contains well over a hundred thousand. A version of Unicode that has been standardised by ISO is called ISO 10646.

A standard for representing characters as integers. Unlike ASCII, which uses 7 bits for each character, Unicode was originally designed around 16 bits, enough for about 65,000 characters. It has since grown: code positions now run from 0 to 0x10FFFF, so a single character can need up to 21 bits. This is a bit of overkill for English and Western-European languages, but it is necessary for some other languages, such as Greek, Chinese and Japanese. Many analysts believe that as the software industry becomes increasingly global, Unicode will eventually supplant ASCII as the standard character-coding format.
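A small sketch of "characters as integers": Python's ord() returns the Unicode code position of a character. The four sample characters are my own picks; note that the last one does not fit in 16 bits.

```python
import sys

samples = ["M", "é", "中", "😀"]

# Format each code position in the usual U+XXXX notation.
code_points = {ch: f"U+{ord(ch):04X}" for ch in samples}
print(code_points)
# {'M': 'U+004D', 'é': 'U+00E9', '中': 'U+4E2D', '😀': 'U+1F600'}

# The highest code position Unicode allows is 0x10FFFF.
print(hex(sys.maxunicode))    # 0x10ffff
print(ord("😀") > 0xFFFF)     # True
```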

UTF

Short for Unicode Transformation Format, a method of converting Unicode characters into sequences of 7- or 8-bit units. UTF-7 converts Unicode into ASCII characters for transmission over 7-bit mail systems, and UTF-8 converts each Unicode character into between one and four 8-bit bytes. ASCII characters take a single byte in UTF-8, so a plain ASCII file is already valid UTF-8.
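The following sketch shows what these transformation formats produce for a handful of characters, using the "utf-8" and "utf-7" codecs that come with Python. The sample characters are arbitrary choices of mine.

```python
samples = ["M", "é", "中", "😀"]

# UTF-8 uses more bytes the higher the character's code position.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)  # {'M': 1, 'é': 2, '中': 3, '😀': 4}

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("Mail".encode("utf-8") == "Mail".encode("ascii"))  # True

# UTF-7 output only ever contains 7-bit values, and it round-trips.
utf7_bytes = "café".encode("utf-7")
only_7_bit = all(byte < 128 for byte in utf7_bytes)
round_trip = utf7_bytes.decode("utf-7")
print(only_7_bit, round_trip)
```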

I would recommend using UTF-8 wherever possible, since it can represent any character in the ISO 10646 repertoire. Even if you only write in English, UTF-8 gives you direct access to typographically correct quotation marks, several dashes, ellipses, and more. And if you need to write in Greek or Japanese, you can do so without having to muck about with entities or NCRs.

Summary

Choosing the right character encoding is important. If you choose an encoding that’s unsuitable for your site (e.g. using ISO 8859-1 for a Chinese site), you’ll need to use lots of entities or NCRs, which will bloat file sizes unnecessarily.
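To get a feel for the bloat, the sketch below stores two Chinese characters once as NCRs in an ISO 8859-1 file and once directly as UTF-8. Python's "xmlcharrefreplace" error handler does the NCR substitution; the two-character sample is my own.

```python
text = "中文"

# ISO 8859-1 cannot hold these characters, so each becomes a decimal NCR.
as_latin1_with_ncrs = text.encode("latin-1", errors="xmlcharrefreplace")

# UTF-8 stores them directly, three bytes each.
as_utf8 = text.encode("utf-8")

print(as_latin1_with_ncrs)       # b'&#20013;&#25991;'
print(len(as_latin1_with_ncrs))  # 16
print(len(as_utf8))              # 6
```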

Unfortunately, choosing an encoding isn’t always easy. Lack of support within the various components in the publishing chain can prevent you from using the encoding that would best suit your content.

Use UTF-8 (without a BOM) if at all possible, especially for multilingual sites.
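For anyone wondering what the BOM actually is: it is three extra bytes at the very start of the file. In this sketch, Python's "utf-8-sig" codec stands in for an editor that saves UTF-8 with a BOM, and plain "utf-8" for one that saves without.

```python
import codecs

text = "hi"

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

print(without_bom)  # b'hi'
print(with_bom)     # b'\xef\xbb\xbfhi'
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Software that does not expect a BOM sees a stray U+FEFF character.
decoded_naively = with_bom.decode("utf-8")
print(decoded_naively[0] == "\ufeff")  # True

# Decoding with "utf-8-sig" strips the BOM again.
print(with_bom.decode("utf-8-sig") == text)  # True
```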

And perhaps the most important thing of all: the encoding you declare must match the encoding you used when saving your files!
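One cheap sanity check is to try decoding the saved bytes with the encoding you intend to declare. The helper below is a hypothetical function I wrote for illustration, not part of any library, and it is only a rough test: some encodings (ISO 8859-1, for instance) accept any byte sequence, so a pass does not prove the declaration is right.

```python
def decodes_as_declared(raw: bytes, declared: str) -> bool:
    """Return True if the bytes decode strictly with the declared encoding."""
    try:
        raw.decode(declared)
        return True
    except UnicodeDecodeError:
        return False


text = "café"
saved_as_utf8 = text.encode("utf-8")
saved_as_latin1 = text.encode("latin-1")

# Saved as UTF-8 and declared as UTF-8: fine.
print(decodes_as_declared(saved_as_utf8, "utf-8"))    # True

# Saved as ISO 8859-1 but declared as UTF-8: the mismatch is caught.
print(decodes_as_declared(saved_as_latin1, "utf-8"))  # False
```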

kisstopher
