INFORMIX-GLS Programmer's Guide
Chapter 2: Character Processing

Character Operations

The INFORMIX-GLS library supports character operations on multibyte and wide characters. This section describes character classification, case conversion, and code-set conversion.

For more information about any of the INFORMIX-GLS functions that this section describes, see the function descriptions in Chapter 4, "INFORMIX-GLS Function Descriptions."

Character Classification

A GLS locale groups the characters of a code set into character classes. Each class contains characters that have a related purpose. The contents of a character class can be language specific. For example, the lower class contains all alphabetic lowercase characters in a code set. In the default locale, the default code set groups the English characters a through z into the lower class, but it also includes lowercase characters such as á, è, î, õ, and ü.

UNIX
The default code set on UNIX platforms is ISO8859-1.

WIN NT/95
The default code set for Windows environments is Microsoft 1252.

Tip: For more information on the default locale and the default code set, see the "Informix Guide to GLS Functionality."

The LC_CTYPE category of a GLS locale file defines the following character classes.

Character Class   Contains
alpha             Alphabetic characters
lower             Lowercase alphabetic characters
upper             Uppercase alphabetic characters
digit             Single-byte decimal digits 0 through 9
xdigit            Hexadecimal digits
alnum             All characters in both the alpha and digit classes
blank             Horizontal white space
space             Horizontal and vertical white space
cntrl             Control characters
graph             Graphical characters, that is, all characters that have visual representation. This class includes characters in the alpha, lower, upper, digit, xdigit, and punct classes.
punct             Punctuation
print             All printable characters. This class includes characters in the alpha, lower, upper, digit, xdigit, graph, and punct classes.

To be internationalized, your application must not assume which characters belong in a particular character class. For example, it must not contain code such as the following to determine whether a character is lowercase:
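A hard-coded range test is the typical form of such code. The sketch below is an illustration only (the function name ascii_only_is_lower is hypothetical). It classifies a through z correctly but rejects á, which ISO8859-1 encodes as the byte 0xE1 and which the default locale places in the lower class.

```c
/* Non-internationalized: hard-codes the ASCII range a-z as "lowercase". */
int ascii_only_is_lower(unsigned char c)
{
    return c >= 'a' && c <= 'z';
}
```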

Instead, use functions in the INFORMIX-GLS library to identify the class of a particular character. Figure 2-1 lists the GLS character classes and the INFORMIX-GLS functions that test for these classes, for both multibyte and wide characters.

Figure 2-1
INFORMIX-GLS Character-Class Functions

Character Class          Multibyte-Character Function   Wide-Character Function
alnum (alpha or digit)   ifx_gl_ismalnum()              ifx_gl_iswalnum()
alpha                    ifx_gl_ismalpha()              ifx_gl_iswalpha()
lower                    ifx_gl_ismlower()              ifx_gl_iswlower()
upper                    ifx_gl_ismupper()              ifx_gl_iswupper()
blank                    ifx_gl_ismblank()              ifx_gl_iswblank()
space                    ifx_gl_ismspace()              ifx_gl_iswspace()
digit                    ifx_gl_ismdigit()              ifx_gl_iswdigit()
xdigit                   ifx_gl_ismxdigit()             ifx_gl_iswxdigit()
cntrl                    ifx_gl_ismcntrl()              ifx_gl_iswcntrl()
graph                    ifx_gl_ismgraph()              ifx_gl_iswgraph()
punct                    ifx_gl_ismpunct()              ifx_gl_iswpunct()
print                    ifx_gl_ismprint()              ifx_gl_iswprint()

These INFORMIX-GLS functions check the LC_CTYPE category of the current locale to determine whether a specified character belongs to the respective character classification. The following code fragment uses the ifx_gl_ismlower() function to perform the internationalized determination of whether a multibyte character is lowercase:
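The same locale-driven test can be sketched in portable ISO C, where iswlower() from <wctype.h> plays the role that the GLS classification functions play in this library. The sketch is a stand-in under that assumption, not a call into INFORMIX-GLS, and count_lower is a hypothetical helper name.

```c
#include <stddef.h>
#include <wctype.h>

/* Count the lowercase characters in a wide-character string by asking
   the current locale, not by comparing against a hard-coded range. */
size_t count_lower(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != L'\0'; s++)
        if (iswlower((wint_t)*s))
            n++;
    return n;
}
```

The caller never names a range of code points, so the result follows whatever the active locale defines as lowercase.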

The INFORMIX-GLS functions in Figure 2-1 do not return a unique value if they encounter an error. To detect an error condition, initialize the ifx_gl_lc_errno() error number to zero before you call one of these functions, and then call ifx_gl_lc_errno() immediately after you call the function. For example, the following code fragment performs error checking for the ifx_gl_ismlower() function:
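The clear, call, then check idiom can be sketched without the library as follows. Every name that starts with demo_ or checked_ is a hypothetical stand-in and not part of INFORMIX-GLS; in real code the error number would be read through ifx_gl_lc_errno() and the classifier would be ifx_gl_ismlower().

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the library's error number. */
int demo_errno = 0;

/* Hypothetical classifier: returns 1 if the single-byte character at mb
   is lowercase, else 0. On bad arguments it also returns 0, so the
   return value alone cannot signal an error. */
int demo_is_lower(const unsigned char *mb, int byte_limit)
{
    if (mb == NULL || byte_limit < 1) {
        demo_errno = 1;              /* hypothetical error code */
        return 0;
    }
    return islower(*mb) != 0;
}

/* Caller-side idiom: clear the error number, call, then check it.
   Returns 1 (lowercase), 0 (not lowercase), or -1 (error). */
int checked_is_lower(const unsigned char *mb, int byte_limit)
{
    int result;

    demo_errno = 0;
    result = demo_is_lower(mb, byte_limit);
    if (demo_errno != 0)
        return -1;
    return result;
}
```

Clearing the error number first matters because a stale value from an earlier call would otherwise be mistaken for a failure of this one.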

Case Conversion

In many languages, alphabetic characters have an uppercase and lowercase representation. To be internationalized, your application must not assume the case equivalent for a particular character. For example, it must not contain code such as the following to obtain the uppercase equivalent of the character in lower_char:
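An arithmetic conversion is the typical form of such code. The sketch below is an illustration only (the function name naive_to_upper is hypothetical). It assumes a fixed distance between lowercase and uppercase letters, so it silently transforms characters that have no case at all: in ISO8859-1 it turns the division sign (0xF7) into the multiplication sign (0xD7).

```c
/* Non-internationalized: assumes lowercase and uppercase letters are a
   fixed distance apart, which is only guaranteed for ASCII a-z. */
unsigned char naive_to_upper(unsigned char lower_char)
{
    return (unsigned char)(lower_char - 'a' + 'A');
}
```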

The preceding line works for the English characters of the ASCII code set. However, it does not work for 8-bit characters, such as à and À.

To handle case conversion in your application, use functions in the INFORMIX-GLS library to obtain the case equivalent of a particular character. Figure 2-2 lists the case-conversion operations and the INFORMIX-GLS functions that perform them, both the multibyte functions and their wide-character equivalents.

Figure 2-2
INFORMIX-GLS Case-Conversion Functions

Case-Conversion Operation                                 Multibyte-Character Function   Wide-Character Function
Obtain the lowercase equivalent of the source character   ifx_gl_tomlower()              ifx_gl_towlower()
Obtain the uppercase equivalent of the source character   ifx_gl_tomupper()              ifx_gl_towupper()

The INFORMIX-GLS case-conversion functions check the LC_CTYPE category of the current locale to determine the case equivalent of a source character. If the desired case equivalent exists, the functions return an integer value that is the alphabetic case equivalent of the source character. If no case equivalent exists, these functions return the source character.

The following code fragment uses the ifx_gl_tomupper() function to perform the internationalized case conversion of a multibyte character:

Case Conversion for Multibyte Characters

The ifx_gl_tomlower() and ifx_gl_tomupper() functions require three arguments:

For a multibyte-character string, the size of the case-converted string might not equal the size of the unconverted string. Therefore, case conversion of multibyte characters requires the special processing steps that the following sections describe.

Determining When to Allocate a Destination Buffer
Whether you can perform case conversion of multibyte characters in place depends on whether the number of bytes written to the destination buffer is the same as the number of bytes read from the source. If the two numbers are equal, you can perform the conversion in place. If they might differ, you cannot.

If you cannot perform case conversion in place, you must allocate a separate destination buffer. To allocate this buffer, you need to have an estimate of the number of bytes that it needs to hold. Use any of the following methods to determine the number of bytes that might be written to the destination buffer:

The ifx_gl_case_conv_outbuflen() function

    This function applies to both uppercase and lowercase conversions. The second argument to ifx_gl_case_conv_outbuflen() is the number of bytes in the character source.

The ifx_gl_mb_loc_max() function

    This value is always greater than or equal to (>=) the value that the ifx_gl_case_conv_outbuflen() function returns.

The IFX_GL_MB_MAX macro

    This value is always greater than or equal to the value that the ifx_gl_mb_loc_max() function returns.

Of the preceding options, the macro IFX_GL_MB_MAX is the fastest and the only method that can initialize static buffers. The function ifx_gl_case_conv_outbuflen() is the slowest but the most precise.
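ISO C offers an analogous pair of bounds: MB_LEN_MAX is a compile-time constant, and MB_CUR_MAX is a run-time value for the current locale. The sketch below uses them as stand-ins to show why only a macro can size a static buffer. STATIC_BOUND, static_buf, and locale_bound are hypothetical names, and nothing here calls INFORMIX-GLS.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX, size_t */

/* Compile-time bound: a constant expression, so it can size a static array. */
#define STATIC_BOUND(n_chars) ((n_chars) * MB_LEN_MAX)

char static_buf[STATIC_BOUND(16)];

/* Run-time bound for the current locale. It is never larger than the
   compile-time bound, but it cannot be used where a constant is required. */
size_t locale_bound(size_t n_chars)
{
    return n_chars * (size_t)MB_CUR_MAX;
}
```

The trade-off is the one described above: the macro is cheapest and works for static storage, while the run-time value is tighter for the locale actually in use.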

The following code fragment uses the ifx_gl_mblen() function to determine the size of the source character and the ifx_gl_case_conv_outbuflen() function to determine the estimated size of the case-converted value:

For more information on the IFX_GL_NO_LIMIT constant, see "Multibyte-Character Termination".

Determining Number of Bytes Read and Written
The ifx_gl_tomupper() and ifx_gl_tomlower() functions return an unsigned short integer that encodes the number of bytes that the function has read and the number of bytes that it has written. The INFORMIX-GLS library provides the following macros to obtain this information from the return value.

INFORMIX-GLS Macro             Information Obtained
IFX_GL_CASE_CONV_SRC_BYTES()   The number of bytes read from the source string
IFX_GL_CASE_CONV_DST_BYTES()   The number of bytes written to the destination buffer
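To illustrate how one unsigned short can carry two byte counts, the sketch below packs the source count into the high byte and the destination count into the low byte. This layout is an assumption made for illustration; the library's actual encoding is not documented here, which is why real code should rely only on the two macros above. All DEMO_ and demo_ names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical encoding, for illustration only: source bytes in the
   high byte, destination bytes in the low byte. */
#define DEMO_PACK(src, dst)  ((unsigned short)((((src) & 0xFF) << 8) | ((dst) & 0xFF)))
#define DEMO_SRC_BYTES(rv)   (((rv) >> 8) & 0xFF)
#define DEMO_DST_BYTES(rv)   ((rv) & 0xFF)

/* Typical use of the decoded counts: advance the source and destination
   offsets independently after one conversion call. */
void demo_advance(size_t *src_off, size_t *dst_off, unsigned short rv)
{
    *src_off += DEMO_SRC_BYTES(rv);
    *dst_off += DEMO_DST_BYTES(rv);
}
```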

The following code fragment uses the ifx_gl_tomlower() function to convert a multibyte character to its lowercase equivalent. It uses the case-conversion macros to obtain the number of bytes read and written during the case-conversion operation:

The memory-management rules for case conversion of a single multibyte character also apply to converting a string of one or more multibyte characters. For example, the following code fragment converts a multibyte-character string to its uppercase equivalent:

Case Conversion for Wide Characters

Because a wide character has a fixed size, the ifx_gl_towlower() and ifx_gl_towupper() functions require only one argument: the wide character to convert. These functions return an integer value of the case-equivalent character for this wide character. Therefore, you can always perform case conversion of wide characters in place. For example, you can assign the case equivalent of src_wc back to src_wc with an assignment of the form src_wc = ifx_gl_towupper(src_wc).

You can also perform case conversion of wide characters into a destination buffer, with an assignment of the form dst_wc = ifx_gl_towupper(src_wc), where dst_wc is a separate wide-character variable.
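Both forms can be sketched over a whole string in portable ISO C, with towupper() from <wctype.h> standing in for the GLS wide-character function. The helper names upper_in_place and upper_copy are hypothetical, and the sketch does not call INFORMIX-GLS.

```c
#include <wchar.h>
#include <wctype.h>

/* In place: each wide character is overwritten by its case equivalent.
   This is always safe because every wide character has the same size. */
void upper_in_place(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}

/* Into a destination buffer; dst must have room for wcslen(src) + 1
   wide characters. */
void upper_copy(wchar_t *dst, const wchar_t *src)
{
    for (; *src != L'\0'; src++, dst++)
        *dst = (wchar_t)towupper((wint_t)*src);
    *dst = L'\0';
}
```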

Exception Handling

These case-conversion functions do not return a special value if they encounter an error. To detect an error condition, initialize the ifx_gl_lc_errno() error number to zero before you call one of these functions and check ifx_gl_lc_errno() immediately after you call it. The following code fragment performs exception handling in the conversion of a wide character to its lowercase equivalent:

Performance Issues

The INFORMIX-GLS case-conversion functions assign the destination character regardless of whether the source character has a case-equivalent character. If no case equivalent for a particular source character exists, the functions return the source character unchanged. Therefore, the following two algorithms perform the same task:

However, the first approach is usually faster.
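The equivalence can be sketched in portable ISO C, with towupper() and iswlower() standing in for the GLS wide-character functions. The sketch assumes that the faster approach is the unconditional call, since it makes one library call per character where the guarded version makes up to two. Both function names are hypothetical.

```c
#include <wctype.h>

/* Unconditional conversion: one call per character. Characters with no
   uppercase equivalent come back unchanged. */
wint_t upper_unconditional(wint_t src_wc)
{
    return towupper(src_wc);
}

/* Guarded conversion: classify first, convert only lowercase characters.
   Same result, but up to two calls per character. */
wint_t upper_checked(wint_t src_wc)
{
    if (iswlower(src_wc))
        return towupper(src_wc);
    return src_wc;
}
```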

Code-Set Conversion

A character might be encoded differently on two different operating systems. Therefore, the appropriate communication layer must be prepared to convert between the two encodings. This process of conversion between two code sets is called code-set conversion. Code-set conversion translates code points from a source code set into a destination (or target) code set.

E/C
INFORMIX-ESQL/C applications automatically perform any needed code-set conversion when they send and receive database data.

DB API
DataBlade client applications automatically perform any needed code-set conversion when they send and receive database data. However, DataBlade UDRs do not automatically perform code-set conversion.

Tip: For an introduction to code-set conversion, see the "Informix Guide to GLS Functionality."

If your application might need to perform code-set conversion, it must determine whether code-set conversion is needed and, if it is, perform the conversion.

Determining If Code-Set Conversion Is Needed

The ifx_gl_conv_needed() function determines whether characters encoded in a source code set require conversion to a destination code set. Use this function rather than a comparison of the code-set names, because the names alone do not provide enough information to determine whether conversion is necessary. In the ifx_gl_conv_needed() function, you can specify the source and destination code sets in any of the forms that "Specifying Code-Set Names" describes.

Performing Code-Set Conversion

The ifx_gl_cv_mconv() function performs code-set conversion on multibyte-character strings. You can specify the source and destination code sets in any of the forms that "Specifying Code-Set Names" describes.

For a multibyte-character string, the size of the converted string might not equal the size of the unconverted string. Therefore, code-set conversion of multibyte characters requires the special processing steps that the following sections describe.

Determining When to Allocate a Destination Buffer
Whether you can perform code-set conversion on multibyte characters in place depends on whether the number of bytes written to the destination buffer is the same as the number of bytes read from the source. If the two numbers are equal, you can perform the conversion in place. If they might differ, you cannot.

If you cannot perform code-set conversion in place, you must allocate a separate destination buffer. To allocate a destination buffer, you need to have an estimate of the number of bytes that it needs to hold. You can use any of the following methods to determine the number of bytes that might be written to the destination buffer:

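The ifx_gl_cv_outbuflen() function

    The third argument to ifx_gl_cv_outbuflen() is the number of bytes in the character source.

An expression that uses the ifx_gl_mb_loc_max() function

An expression that uses the IFX_GL_MB_MAX macro

    The src_bytesleft value references the number of bytes to convert. This expression value is always greater than or equal to the expression value that uses the ifx_gl_mb_loc_max() function.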

Of the preceding options, the expression that uses the macro IFX_GL_MB_MAX is the fastest and the only one that can be used to initialize static buffers. The ifx_gl_cv_outbuflen() function is the slowest but the most precise.

The following code fragment uses the ifx_gl_cv_outbuflen() function to determine the estimated size of a code-set-conversion destination buffer:

For more information on the conv_state_t structure, see "Preserving State Information".

Specifying Code-Set Names
You can specify the names of the source and destination (target) code sets with any of the following methods:

    You can find the names of code sets in the code-set name registry.

WIN NT/95

%INFORMIXDIR%\gls\cmZ

UNIX

$INFORMIXDIR/gls/cmZ

The IFX_GL_PROC_CS macro

    This macro specifies use of the code set of the current processing locale. Depending on the context, the value of IFX_GL_PROC_CS is based on either the client environment or the database that the database server is currently accessing.

The preceding formats are valid as code-set names in the INFORMIX-GLS code-set conversion functions, such as ifx_gl_conv_needed() and ifx_gl_cv_mconv().

Preserving State Information
Most code sets are not state dependent; that is, the characters of these code sets can be decoded with only one algorithm, and each byte sequence represents a unique character. In contrast, byte sequences in state-dependent code sets can represent more than one character. Which character a sequence represents depends on the current state. State-dependent code sets occur primarily on IBM mainframe computers, and they only affect code-set conversion.

When you fragment a complete source string into two or more nonadjacent source buffers, you must call the ifx_gl_cv_mconv() function multiple times, to perform code-set conversion on each fragment of the string. Because of the nature of state-dependent code sets (and because the caller of this function cannot know whether either the source or destination code set is a state-dependent code set), you must preserve state information across the multiple calls of ifx_gl_cv_mconv(). The ifx_gl_cv_mconv() argument state is used for this purpose.

The state argument is a pointer to a conv_state_t structure. This structure contains two fields that you must set to indicate that you are performing code-set conversion on fragmented strings: first_frag and last_frag. The following table lists the different fragments of a string and the corresponding values to which you must set these two conv_state_t fields.
String Fragment                                      Value of first_frag Field   Value of last_frag Field
String is the first of n fragments.                  1                           0
String is the 2nd, ..., (n-1)th fragment.            0                           0
String is the last (nth) fragment.                   0                           1
String is not fragmented; it is a complete string.   1                           1

Important: The conv_state_t structure contains other fields that are for internal use only. Informix does not guarantee that these other internal fields of conv_state_t will not change in future releases. Therefore, to create portable code, set only the first_frag and last_frag fields of the conv_state_t structure.

Pass the fragments to the ifx_gl_cv_mconv() function in the same order in which they appear in the complete string. Use the same conv_state_t structure for all of the fragments of the same complete string.
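The rule in the preceding table reduces to two comparisons on the position of a fragment. The sketch below computes the two flag values for fragment number index (counted from zero) out of count fragments. The struct frag_flags type is a hypothetical mirror of the two documented fields, not the real conv_state_t, and flags_for_fragment is a hypothetical helper.

```c
/* Hypothetical mirror of the two documented conv_state_t fields. */
struct frag_flags {
    int first_frag;
    int last_frag;
};

/* Flag values for fragment number index (0-based) out of count fragments.
   An unfragmented string is the case count == 1, where both flags are 1. */
struct frag_flags flags_for_fragment(int index, int count)
{
    struct frag_flags f;

    f.first_frag = (index == 0);
    f.last_frag  = (index == count - 1);
    return f;
}
```

In a loop over the fragments, a caller would copy these two values into the first_frag and last_frag fields of one shared conv_state_t structure before each conversion call.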

The following code performs code-set conversion on a complete character string that is not fragmented:

This code assigns both the first_frag and last_frag fields a value of one (1) to indicate that the multibyte string is not fragmented.

Suppose that you have a complete multibyte-character string that is fragmented into four different buffers. The following code performs code-set conversion on this fragmented string:

For an additional issue in the processing of fragmented multibyte character strings, see "Fragmenting Multibyte Strings".

Performance Issues

Most performance overhead in code-set conversion is a result of either memory management or multibyte-string traversal. However, code-set conversion requires this overhead only if one of the code sets is a multibyte code set. If the code-set conversion is between two single-byte code sets, you can obtain a code-set conversion table and avoid this overhead.

The following sample code uses the ifx_gl_cv_sb2sb_table() function to obtain a code-set conversion table for two single-byte code sets:
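Once a table is available, applying it is one lookup per byte. The sketch below shows that step only, assuming the table has one entry for each possible source byte value. It builds a hypothetical table by hand and does not call ifx_gl_cv_sb2sb_table(); the remapping of 0xA4 to 0x80 is arbitrary and does not describe any real pair of code sets.

```c
#include <stddef.h>

/* Build a hypothetical 256-entry table: identity except for one byte. */
void build_demo_table(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    table[0xA4] = 0x80;   /* illustrative remapping only */
}

/* Convert a single-byte string in place: one lookup per byte, with no
   allocation and no multibyte traversal. */
void convert_sb2sb(unsigned char *buf, size_t len,
                   const unsigned char table[256])
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

Because every source byte maps to exactly one destination byte, the converted string has the same length as the original and the conversion can always be done in place.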




INFORMIX-GLS Programmer's Guide, version 9.1
Copyright © 1998, Informix Software, Inc. All rights reserved.