Performing Code-Set Conversion


In a client/server environment, character data might need to be converted from one code set to another if the client and server computers use different code sets to represent the same characters. The conversion of character data from one code set (the source code set) to another (the target code set) is called code-set conversion. When the two computers use different code sets and no code-set conversion is performed, one computer cannot correctly process or display character data that originates on the other.

IBM Informix products use GLS locales to perform code-set conversion. Both an IBM Informix client application and a database server might perform code-set conversion. For details, see Database Server Code-Set Conversion and Client Application Code-Set Conversion.

You specify a code set as part of the GLS locale. At runtime, IBM Informix products adhere to the following rules to determine which code sets to use:

The client application uses the client code set, which the CLIENT_LOCALE environment variable specifies, to write all files on the client computer and to interact with all client I/O devices.
The database server uses the database code set, which the DB_LOCALE environment variable specifies, to transfer data to and from the database.
The database server uses the server code set, which the SERVER_LOCALE environment variable specifies, to write files (such as debug and warning files).
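
As a rough illustration of the three rules above, the following Python sketch reads the three environment variables and pulls the code-set part out of each locale name. It assumes locale names of the form language_territory.codeset with an optional @modifier (for example, en_us.8859-1). The helper names code_set_of and code_sets_in_use are hypothetical and are not part of any IBM Informix API, and the sketch does not model the defaults that the products apply when a variable is unset.

```python
import os

# Hypothetical helper: not an Informix API.
def code_set_of(locale_name):
    """Return the code-set part of a locale name such as 'en_us.8859-1'."""
    # Drop an optional '@modifier' suffix, then keep what follows the first '.'.
    base = locale_name.split("@", 1)[0]
    _, sep, code_set = base.partition(".")
    return code_set if sep else None

def code_sets_in_use(env=os.environ):
    """Map each role to the code set named by its environment variable."""
    roles = {
        "client": "CLIENT_LOCALE",    # client files and client I/O devices
        "database": "DB_LOCALE",      # data moved to and from the database
        "server": "SERVER_LOCALE",    # server files such as debug and warning files
    }
    result = {}
    for role, variable in roles.items():
        value = env.get(variable)
        # None means the variable is unset; product defaults are not modeled here.
        result[role] = code_set_of(value) if value else None
    return result
```

Calling code_sets_in_use() with no argument inspects the current process environment; passing a plain dictionary makes it easy to try different settings.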

Code-set conversion does not provide either of the following capabilities:

Code-set conversion is not a semantic translation.
It does not convert between words in different languages. For example, it does not convert from the English word yes to the French word oui. It only ensures that each character retains its meaning when it is processed or written, regardless of how it is encoded.
Code-set conversion does not create a character in the target code set if it exists only in the source code set.
For example, if the character â is passed to a target computer whose code set does not contain that character, the target computer cannot process or print the character exactly.

For each character in the source code set, a corresponding character in the target code set should exist. However, if the source code set contains characters that are not in the target code set, the conversion must then define how to map these mismatched characters to the target code set. (Absence of a mapping between a character in the source and target code sets is often called a lossy error.) If all characters in the source code set exist in the target code set, mismatch handling does not apply.

A code-set conversion uses one of the following four methods to handle mismatched characters:

Round-trip conversion
This method maps each mismatched character to a unique character in the target code set so that the return mapping maps the original character back to itself. This method guarantees that a two-way conversion results in no loss of information; however, data that is converted just one way might prevent correct processing or printing on the target computer.
Substitution conversion
This method maps all mismatched characters to one character in the target code set that highlights mismatched characters. This method guarantees that a one-way conversion clearly shows the mismatched characters; however, a two-way conversion results in loss of information if mismatched characters are present.
Graphical-replacement conversion
This method maps each mismatched character to a character in the target code set that looks similar to the source character. It includes the mapping of one-character ligatures to their two-character equivalents and vice versa. The goal is to make printing of mismatched data more accurate on the target computer, but the replacement most likely confuses the processing of this data on the target computer.
A hybrid of two or three of the preceding conversion methods

Tip:

Each code-set-conversion source file (.cv) indicates how the associated conversion handles mismatched characters. For information on code-set-conversion files, see Appendix A. Managing GLS Files.
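
The difference between the first three methods can be sketched in Python. This is only an illustration built on Python's standard codecs and Unicode normalization; it is not how the .cv conversion files work. All function names, the spare table, and the choice of '?' as the substitution character are assumptions made for the sketch, and the round-trip helpers handle single-byte target code sets only.

```python
import unicodedata

def substitution(text, target):
    """Map every mismatched character to one marker character ('?')."""
    return text.encode(target, errors="replace")

def graphical_replacement(text, target):
    """Map each mismatched character to a similar-looking target sequence."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            # Compatibility decomposition splits ligatures (U+FB01 -> 'fi')
            # and separates base letters from accents; whatever still cannot
            # be encoded is dropped.
            similar = unicodedata.normalize("NFKD", ch)
            out += similar.encode(target, errors="ignore")
    return bytes(out)

def round_trip_encode(text, target, spare):
    """Map each mismatched character to its own reserved byte from `spare`."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            out.append(spare[ch])  # unique byte that the target does not use
    return bytes(out)

def round_trip_decode(data, target, spare):
    """Invert round_trip_encode so that the original text is recovered."""
    reverse = {byte: ch for ch, byte in spare.items()}
    return "".join(
        reverse[b] if b in reverse else bytes([b]).decode(target) for b in data
    )
```

With a 7-bit ASCII target, substitution turns "bâr" into b"b?r" and cannot be undone, graphical replacement yields b"bar", and the round-trip pair with spare = {"â": 0x80} yields b"b\x80r", which decodes back to "bâr" but is not meaningful to a program that expects plain ASCII.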

When Code-Set Conversion Is Performed

An application needs code-set conversion only if the two code sets involved (client and server-processing locale, or server-processing locale and server) are different. Code sets can differ for the following reasons:

Different operating systems might encode the same characters in different ways.
For example, the code for the character â (a-circumflex) in Windows Code Page 1252 is hexadecimal 0xE2. In IBM Coded Character Set Identifier (CCSID) 437 (a common IBM UNIX code set), the code is hexadecimal 0x83. If the code for â on the client is sent unchanged to the IBM UNIX computer, it prints as the Greek character Γ (gamma). This action occurs because the code for Γ is hexadecimal 0xE2 on the IBM UNIX computer.

Tip:

IBM Informix products support IBM CCSID code-set numbers, a system of 16-bit numbers that uniquely identify the coded graphic character representations. For more information, see Appendix A. Managing GLS Files.

One language can have several code sets. Each might represent a subset of the language.
For example, the code sets ccdc and big5 are both internal representations of a subset of the Chinese language. These subsets, however, include different numbers of Chinese characters.

Important:

GLS fully supports the unified Chinese GB18030-2000 code set, including all characters in the Unicode Basic Multilingual Plane (BMP) and in the extended planes.
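
The â example given earlier in this section can be reproduced with a short Python sketch. It assumes that Python's cp1252 and cp437 codecs stand in for Windows Code Page 1252 and CCSID 437; the convert helper is a hypothetical name, and the sketch uses Python's codec tables rather than the GLS conversion files.

```python
def convert(data, source, target):
    """Convert bytes from the source code set to the target code set."""
    return data.decode(source).encode(target)

# The byte that a Code Page 1252 client holds for a-circumflex (0xE2).
client_bytes = "\u00e2".encode("cp1252")

# Without conversion, a CCSID 437 computer reads that byte as capital gamma.
unconverted = client_bytes.decode("cp437")

# With conversion, the byte becomes 0x83, which is a-circumflex in CCSID 437.
converted = convert(client_bytes, "cp1252", "cp437")

# Data that travels back to the client needs the reverse conversion.
returned = convert(converted, "cp437", "cp1252")
```

After these statements, client_bytes and returned are both the single byte 0xE2, converted is the single byte 0x83, and unconverted is the character Γ (U+0393).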

If a code-set conversion is required for data transfer from computer A to computer B, then it is also required for data transfer from computer B to computer A. In the client/server environment, the following situations might require code-set conversion:

If the client locale and database locale specify different code sets, the client application performs code-set conversion so that the server computer is not loaded with this type of processing. For more information, see Client Application Code-Set Conversion.
If the server locale and server-processing locale specify different code sets, the database server performs code-set conversion when it writes to and reads from operating-system files such as log files. For more information, see Database Server Code-Set Conversion.
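
The two situations above amount to two independent comparisons, which the following Python sketch spells out. The function conversion_points and its parameter names are hypothetical, and the sketch assumes that the four code-set names have already been determined and are spelled consistently, so that equal names mean equal code sets.

```python
def conversion_points(client_cs, database_cs, server_cs, server_processing_cs):
    """Report where code-set conversion might be needed."""
    return {
        # Client and database code sets differ: the client application converts.
        "client application": client_cs != database_cs,
        # Server and server-processing code sets differ: the database server
        # converts when it reads and writes operating-system files.
        "database server": server_cs != server_processing_cs,
    }
```

For example, conversion_points("1252", "8859-1", "8859-1", "8859-1") reports True for the client application and False for the database server.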

In Figure 4, the black dots indicate the two points in a client/server environment at which code-set conversion might occur.

Figure 4. Points of GLS Code-Set Conversion


In the example connection that Figure 4 shows, the ESQL/C client application performs code-set conversion on the data that it sends to and receives from the database server if the client and database code sets are convertible. The Informix database server also performs code-set conversion when it writes to a message-log file if the code sets of the server locale and server-processing locale are convertible.