INFORMIX
INFORMIX-GLS Programmer's Guide
Chapter 2: Character Processing

Types of Characters

A GLS locale supports a particular code set, which maps characters to unique bit patterns called code points. A particular code set can contain the following types of characters:

    Single-byte characters

    Multibyte characters

Tip: For a general introduction to code sets, single-byte characters, and multibyte characters, see the "Informix Guide to GLS Functionality."

Use of the INFORMIX-GLS library helps to remove most assumptions about the type of character that your application handles.

Single-Byte Characters

A single-byte character can hold code-point values from 0 to 255. It uses either 7 or 8 bits of a byte to represent a character, as follows:

    7-bit characters: These characters contain code points in the range 0 to 127.

    8-bit characters: These characters use all 8 bits of the byte, so their code points can range up to 255. Only software that is 8-bit clean can correctly interpret 8-bit characters.

Code sets for English, European, and Middle Eastern languages need at most 256 characters. Therefore, code sets that support these languages consist of single-byte characters.

When your application processes only single-byte characters, it can perform string-processing tasks based on the assumption that the number of bytes in a buffer equals the number of characters that the buffer can hold. For single-byte code sets, you can rely on the built-in scaling for array allocation and access that the C compiler provides.
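As a minimal sketch of this assumption, ordinary C string handling is sufficient for a single-byte code set. The helper names sb_char_count and sb_char_at are hypothetical and are not part of the INFORMIX-GLS library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: in a single-byte code set, the character count
 * of a null-terminated string equals its byte count. */
size_t sb_char_count(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical helper: the i-th character is simply the i-th byte, so
 * ordinary array indexing works. The result is returned as unsigned
 * char so that 8-bit code points (128 to 255) are not sign-extended. */
unsigned char sb_char_at(const char *s, size_t i)
{
    assert(s != NULL);
    return (unsigned char)s[i];
}
```

Both helpers stop being correct as soon as the buffer can hold a multibyte character, which is the case that the next section describes.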

The INFORMIX-GLS functions and macros that handle multibyte characters are optimized for single-byte characters. Use of single-byte characters with these functions does not involve the full algorithms that multibyte-processing involves. (For more information, see "INFORMIX-GLS Optimization".)

Multibyte Characters

A multibyte character can hold code-point values greater than 255. One multibyte character can range from two to four bytes in length. Asian code sets are multibyte code sets; they contain both single-byte and multibyte characters.

If your application processes multibyte characters, it can no longer make the same assumption as for single-byte characters. The number of bytes in a buffer no longer equals the number of characters in the buffer. Because each character can occupy a varying number of bytes, you can no longer rely on the built-in scaling of the C compiler to perform array allocation and access correctly for character strings.

Although your application cannot exploit the built-in scaling of the C compiler for multibyte-character strings, it can use the macros and functions of the INFORMIX-GLS library to perform these operations on multibyte-character strings. To process a multibyte character, you cannot pass the entire character to a function. You must pass a pointer to the beginning of the character so that the called function can access the remaining bytes of the character.
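The following sketch illustrates why a function needs a pointer to the beginning of a character. It does not use the INFORMIX-GLS library: UTF-8 is assumed purely as an example multibyte encoding, and demo_mblen, demo_mbnext, and demo_mbslen are hypothetical stand-ins for library routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a library routine: returns the byte length
 * of the character that starts at p. UTF-8 is assumed here only as an
 * example encoding; p must point at the first byte of a character in
 * well-formed text. */
size_t demo_mblen(const unsigned char *p)
{
    assert(p != NULL);
    if (*p < 0x80) return 1;            /* single-byte character */
    if ((*p & 0xE0) == 0xC0) return 2;  /* two-byte character */
    if ((*p & 0xF0) == 0xE0) return 3;  /* three-byte character */
    return 4;                           /* four-byte character */
}

/* Advance past one character without stepping over the null terminator. */
const unsigned char *demo_mbnext(const unsigned char *p)
{
    size_t len = demo_mblen(p);
    while (len > 0 && *p != '\0') {
        p++;
        len--;
    }
    return p;
}

/* Count characters (not bytes) in a null-terminated string. */
size_t demo_mbslen(const unsigned char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        s = demo_mbnext(s);
        count++;
    }
    return count;
}
```

For a string that holds one single-byte character, one three-byte character, and another single-byte character, the buffer contains five bytes but demo_mbslen reports three characters. Indexing the buffer with s[1] would yield only the first byte of the second character.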

Tip: For a list of operations that the functions of the INFORMIX-GLS library can perform on multibyte characters and multibyte-character strings, see "Character Operations" and "String Operations", respectively.

One single-byte assumption can still be applied to multibyte-character strings: no multibyte character has the null byte (0x00) as its second, third, or fourth byte. Therefore, if code is checking for only the single-byte ASCII null character, that code does not need to change to handle multibyte characters. This null character is also the null terminator in a multibyte-character string.
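As a small sketch of code that relies only on this assumption, the following loop finds the null terminator by examining one byte at a time. The helper name byte_length is hypothetical. Note that the result is a byte count, not a character count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of a null-terminated string. Because
 * no multibyte character contains a null byte after its first byte, the
 * same loop is valid for single-byte and multibyte strings. */
size_t byte_length(const unsigned char *s)
{
    const unsigned char *p = s;
    assert(s != NULL);
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```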

Tip: Throughout this manual, examples show how single-byte and multibyte characters appear. Because multibyte characters are usually ideographic (such as Japanese or Chinese characters), this manual does not use the actual multibyte characters. Instead, it uses ASCII characters to represent both single-byte and multibyte characters. For more information about how this manual represents multibyte and single-byte characters abstractly, see "Character-Representation Conventions" of the Introduction.

The names of most INFORMIX-GLS functions that handle multibyte characters start with one of the following strings.

    Function String    What the Function Handles

    ifx_gl_mb          A multibyte character

    ifx_gl_mbs         A multibyte-character string

For example, the function ifx_gl_mblen() determines the length of a multibyte character, and ifx_gl_mbslen() determines the length of a multibyte-character string.

The gl_mchar_t Data Type

The INFORMIX-GLS library represents a multibyte character with the gl_mchar_t data type. The gls.h header file defines the gl_mchar_t data type and the ifxgls.h header file includes gls.h. Therefore, you must include ifxgls.h in any file that uses the gl_mchar_t data type (or any INFORMIX-GLS function).

Important: The gl_mchar_t data type is an opaque data type. Do not access the individual bytes of a multibyte character directly.

Because any character in a multibyte-character string might contain several bytes, and gl_mchar_t refers to only one of those bytes, you usually declare a multibyte variable as a pointer to a gl_mchar_t data type. For example, the following declaration creates the mb_string variable as a pointer to a multibyte-character string:

    gl_mchar_t *mb_string;

The preceding declaration assumes that the application allocates memory for the mb_string multibyte string elsewhere. For information on how to allocate memory for multibyte-character strings, see "Multibyte-Character-String Allocation".

You can also cast a C string to a multibyte-character string with a (gl_mchar_t *) cast.
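A minimal sketch of such a cast might look like the following. So that the fragment compiles on its own, it supplies a stand-in typedef for gl_mchar_t; this stand-in is an assumption for illustration, and real code would include ifxgls.h instead. The helper name as_mb_string is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: stand-in definition so that this fragment compiles on its
 * own. In real code, remove this typedef and include ifxgls.h. */
typedef unsigned char gl_mchar_t;

/* Hypothetical helper: view an existing C string as a
 * multibyte-character string. No bytes are copied or converted; the
 * result points at the same memory as c_string. */
gl_mchar_t *as_mb_string(char *c_string)
{
    assert(c_string != NULL);
    return (gl_mchar_t *)c_string;
}
```

A caller could then write gl_mchar_t *mb_string = as_mb_string(buffer); where buffer is a char array that the application has already filled.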

Wide Characters

The INFORMIX-GLS functions and macros that handle multibyte characters use special multibyte-processing algorithms to determine the size of multibyte characters. However, the overhead of these full multibyte-processing algorithms can be significant. Therefore, the INFORMIX-GLS library provides support for wide characters as an alternative form for the processing of multibyte characters. Wide characters allow you to rely on the built-in scaling of the C compiler instead of the multibyte-processing algorithms.

A wide-character form of a code set involves the normalization of the size of each multibyte character so that each character is the same size. This size must be equal to or greater than the largest character that an operating system can support, and it must match the size of an integer data type that the C compiler can scale (such as short int, int, and long int).
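The following sketch shows what the built-in scaling provides once every character has the same size. It assumes a 32-bit unsigned integer as a stand-in wide-character type; the names demo_wchar_t, demo_wcs_alloc_size, demo_wcslen, and demo_wc_at are hypothetical, and the actual size of gl_wchar_t is whatever ifxgls.h defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: a 32-bit unsigned integer stands in for a wide-character
 * type. Real code would use gl_wchar_t from ifxgls.h. */
typedef uint32_t demo_wchar_t;

/* Bytes needed for nchars wide characters plus a null terminator. */
size_t demo_wcs_alloc_size(size_t nchars)
{
    return (nchars + 1) * sizeof(demo_wchar_t);
}

/* Number of characters before the terminating zero element. */
size_t demo_wcslen(const demo_wchar_t *ws)
{
    size_t n = 0;
    assert(ws != NULL);
    while (ws[n] != 0)
        n++;
    return n;
}

/* The i-th character is the i-th array element; the compiler scales
 * the index by sizeof(demo_wchar_t). */
demo_wchar_t demo_wc_at(const demo_wchar_t *ws, size_t i)
{
    assert(ws != NULL);
    return ws[i];
}
```

With this representation, the number of array elements equals the number of characters, so allocation and indexing work as they do for single-byte strings, at the cost of more memory per character.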

The names of most INFORMIX-GLS functions that handle wide characters start with one of the following strings.

    Function String    What the Function Handles

    ifx_gl_wc          A wide character

    ifx_gl_wcs         A wide-character string

For example, the function ifx_gl_wctomb() converts a wide character to a multibyte character, and ifx_gl_wcslen() determines the length of a wide-character string.

The gl_wchar_t Data Type

The INFORMIX-GLS library represents a wide character with the gl_wchar_t data type. The gls.h header file defines the gl_wchar_t data type, and the ifxgls.h header file includes gls.h. Therefore, you must include ifxgls.h in any file that uses the gl_wchar_t data type (or any INFORMIX-GLS function).

Important: The gl_wchar_t data type is an opaque data type. Do not access the individual bytes of a wide character directly.

The gl_wchar_t data type has a fixed length. Therefore, you can declare a variable as a pointer to a gl_wchar_t data type or as a gl_wchar_t data type directly. For example, the following declarations create the wc_string variable as a pointer to a wide-character string and wc_string2 as a single wide character:

    gl_wchar_t *wc_string;
    gl_wchar_t wc_string2;

The declaration of wc_string assumes that the application allocates memory for this wide-character string elsewhere. The declaration of wc_string2 allocates one wide character. For information on how to allocate memory for wide-character strings, see "Wide-Character String Allocation".

You can compare a single wide character with a single-byte ASCII character or character constant, and you can assign such a character to a single wide character.
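A sketch of such a comparison and assignment follows. The typedef is a stand-in that lets the fragment compile without ifxgls.h; it is an assumption for illustration only, and the helper names is_capital_a and set_to_capital_b are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: stand-in definition so that this fragment compiles on its
 * own. In real code, remove this typedef and include ifxgls.h. */
typedef unsigned int gl_wchar_t;

/* Compare a wide character with an ASCII character constant. */
int is_capital_a(gl_wchar_t wc)
{
    return wc == 'A';
}

/* Assign an ASCII character constant to a wide character. */
void set_to_capital_b(gl_wchar_t *wc)
{
    assert(wc != NULL);
    *wc = 'B';
}
```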

Conversion Between Multibyte and Wide Characters

To use wide characters, you convert multibyte characters to their wide-character equivalents, process the characters, and convert the wide characters back to their multibyte equivalents. The INFORMIX-GLS library supports conversion between a multibyte form of a code set and its wide-character form. Unlike code-set conversion, the actual integral value of each character does not change in this conversion.
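One way to picture a conversion that preserves the integral value is to pack the bytes of a multibyte character into a fixed-size integer and to unpack them again. The sketch below is an illustration under that assumption only; it is not the library's implementation, and demo_wchar_t, demo_mbtowc, and demo_wctomb are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: a 32-bit unsigned integer stands in for gl_wchar_t. */
typedef uint32_t demo_wchar_t;

/* Pack the len bytes (1 to 4) of one multibyte character into a wide
 * character, with the first byte as the most significant. */
demo_wchar_t demo_mbtowc(const unsigned char *mb, size_t len)
{
    demo_wchar_t wc = 0;
    size_t i;
    assert(mb != NULL && len >= 1 && len <= 4);
    for (i = 0; i < len; i++)
        wc = (wc << 8) | mb[i];
    return wc;
}

/* Unpack a wide character into out, which must hold at least 4 bytes.
 * Returns the number of bytes written. The length is recovered from the
 * value itself, which works because the first byte of a multibyte
 * character is never the null byte. */
size_t demo_wctomb(demo_wchar_t wc, unsigned char *out)
{
    size_t len = 1;
    size_t i;
    assert(out != NULL);
    if (wc > 0xFFFFFFu)
        len = 4;
    else if (wc > 0xFFFFu)
        len = 3;
    else if (wc > 0xFFu)
        len = 2;
    for (i = 0; i < len; i++)
        out[i] = (unsigned char)(wc >> (8 * (len - 1 - i)));
    return len;
}
```

In this sketch, a two-byte character with the bytes 0x82 and 0xA0 becomes the wide value 0x82A0, and a single-byte ASCII character keeps its ASCII value, so a round trip returns the original bytes.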

To change all character data to wide characters, you must first locate the character data and then find all the places where it is assigned and passed to functions. INFORMIX-GLS functions convert between multibyte and wide characters in both directions; for example, ifx_gl_wctomb() converts a wide character to a multibyte character.

For more information on how to decide whether wide characters are appropriate for your application, see "Wide-Character Processing".




INFORMIX-GLS Programmer's Guide, version 9.1
Copyright © 1998, Informix Software, Inc. All rights reserved.