
Code Sets for Character Data

A character set is one or more natural-language alphabets together with additional symbols for digits, punctuation, and diacritical marks. Each character set has at least one code set, which maps its characters to unique bit patterns. These bit patterns are called code points. ASCII, ISO8859-1, Windows Code Page 1252, and EBCDIC are examples of code sets that support the English language.
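As a rough illustration of how different code sets assign code points, the sketch below encodes the same character with Python's built-in codecs. The codec names (`ascii`, `latin-1`, `cp1252`, `cp037`) are Python's names, not Informix locale names, and `cp037` is assumed here as one representative EBCDIC code page.

```python
# Map the code sets named above to Python codec names.
# These are Python's identifiers, not Informix code-set names;
# cp037 is used as one representative EBCDIC code page (an assumption).
CODE_SETS = {
    "ASCII": "ascii",
    "ISO8859-1": "latin-1",
    "Windows Code Page 1252": "cp1252",
    "EBCDIC (cp037)": "cp037",
}


def code_points(ch):
    """Return the single-byte code point of ch in each code set."""
    return {name: ch.encode(codec)[0] for name, codec in CODE_SETS.items()}


points = code_points("A")
for name, value in points.items():
    print(f"{name}: 0x{value:02X}")
```

The three ASCII-based code sets agree on 0x41 for the letter A, while the EBCDIC code page assigns it 0xC1: the character is the same, but the bit pattern depends on the code set.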

The number of unique characters in the language determines the amount of storage that each character requires in a code set. Because a single byte can store values in the range 0 to 255, it can uniquely identify 256 characters. Most Western languages have fewer than 256 characters and therefore have code sets made up of single-byte characters. When an application handles data in such code sets, it can assume that 1 byte stores 1 character.
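The 1-byte-per-character assumption can be checked with a short sketch. It uses Python's `latin-1` codec as a stand-in for the ISO8859-1 code set; the sample string is only an illustration.

```python
# Every value a byte can hold (0 to 255) maps to exactly one character
# in a single-byte code set such as ISO8859-1 (Python codec "latin-1").
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
distinct_characters = len(set(decoded))

# In a single-byte code set, byte length equals character count.
sample = "Z\u00fcrich"  # "Zurich" spelled with u-umlaut
byte_length = len(sample.encode("latin-1"))
char_count = len(sample)

print(distinct_characters, byte_length, char_count)
```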

The ASCII code set contains 128 characters, so the code point for each character fits in 7 bits of a byte. These single-byte characters, with code points in the range 0 to 127, are sometimes called ASCII or 7-bit characters. The ASCII code set is a single-byte code set and is a subset of all code sets that IBM Informix products support.
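A 7-bit character never sets the eighth (high-order) bit of its byte, so a simple bit test can tell whether data is pure ASCII. The helper name `is_seven_bit` is hypothetical, and the sketch assumes the data has already been encoded in a single-byte code set.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its eighth bit (0x80) set."""
    return all((byte & 0x80) == 0 for byte in data)


plain = "GLS locale".encode("latin-1")
accented = "caf\u00e9".encode("latin-1")  # e-acute has code point 0xE9

print(is_seven_bit(plain))
print(is_seven_bit(accented))
```

The first call reports pure 7-bit data; the second does not, because the code point 0xE9 sets the eighth bit.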

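If a code set contains more than 128 characters, some of its characters have code points that must set the eighth bit of the byte. These non-ASCII characters might be either of the following types of characters:

8-bit characters: single-byte characters whose code points are in the range 128 to 255.

Multibyte characters: characters that require more than 1 byte of storage. A code set needs them when its character set contains more than 256 characters.

Some characters in the Japanese SJIS code set, for example, occupy 2 or 3 bytes. Applications that handle data in multibyte code sets cannot assume that 1 character takes only 1 byte of storage.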
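The following sketch shows why byte counts and character counts diverge in a multibyte code set. It assumes Python's `shift_jis` codec as an approximation of the SJIS code set; the sample string and variable names are illustrative only.

```python
# Three ASCII letters followed by two kanji, written as escapes.
text = "ABC\u65e5\u672c"
encoded = text.encode("shift_jis")

char_count = len(text)     # characters
byte_count = len(encoded)  # bytes: 3 single-byte + 2 double-byte characters
print(char_count, byte_count)

# Cutting the byte string at an arbitrary offset can split a character.
try:
    encoded[:4].decode("shift_jis")
    split_failed = False
except UnicodeDecodeError:
    split_failed = True
print(split_failed)
```

Truncating or indexing such data by byte offset can therefore cut through the middle of a character, which is why lengths and positions need to be tracked in characters rather than bytes.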

Tip:
In this manual, the term "non-ASCII characters" applies to all characters with a code point greater than 127. Non-ASCII characters include both 8-bit and multibyte characters.
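The terminology in this tip can be expressed as a small classifier. This is a minimal sketch: the function name `classify` is hypothetical, and the `latin-1` and `shift_jis` codecs are Python stand-ins for the ISO8859-1 and SJIS code sets.

```python
def classify(ch: str, codec: str) -> str:
    """Label one character as ASCII, 8-bit, or multibyte in a code set."""
    encoded = ch.encode(codec)
    if len(encoded) > 1:
        return "multibyte"  # needs more than 1 byte of storage
    if encoded[0] > 127:
        return "8-bit"      # single byte with the eighth bit set
    return "ASCII"          # code point in the range 0 to 127


def is_non_ascii(ch: str, codec: str) -> bool:
    """Non-ASCII covers both 8-bit and multibyte characters."""
    return classify(ch, codec) != "ASCII"


print(classify("A", "latin-1"))
print(classify("\u00e9", "latin-1"))
print(classify("\u65e5", "shift_jis"))
```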

IBM Informix products can support single-byte or multibyte code sets. For some examples of GLS locales that support non-ASCII characters, see Supporting Non-ASCII Characters.

Tip:
Throughout this manual, examples show how single-byte and multibyte characters appear. Because multibyte characters are usually ideographic (such as Japanese or Chinese characters), this manual does not use the actual multibyte characters. Instead, it uses ASCII characters to represent both single-byte and multibyte characters. For more information, see Typographical Conventions in the Introduction.