INFORMIX-GLS Programmer's Guide

Chapter 2: Character Processing

Character Operations

The INFORMIX-GLS library supports the following character operations on multibyte and wide characters:

Character classification
Case conversion
Code-set conversion

For more information about any of the INFORMIX-GLS functions that this section describes, see the function descriptions in Chapter 4, "INFORMIX-GLS Function Descriptions."

Character Classification

A GLS locale groups the characters of a code set into character classes. Each class contains characters that have a related purpose. The contents of a character class can be language specific. For example, the lower class contains all alphabetic lowercase characters in a code set. In the default locale, the default code set groups the English characters a through z into the lower class, but it also includes lowercase characters such as á, è, î, õ, and ü.

UNIX

The default code set on UNIX platforms is ISO8859-1.

WIN NT/95

The default code set for Windows environments is Microsoft 1252.

To be internationalized, your application must not assume which characters belong in a particular character class. For example, it must not contain code such as the following to determine whether a character is lowercase:

if ( one_char >= 'a' && one_char <= 'z' ) /* NOT internationalized! */

Instead, use functions in the INFORMIX-GLS library to identify the class of a particular character. Figure 2-1 lists the GLS character classes and the INFORMIX-GLS functions that test for these classes, for both multibyte and wide characters.

Figure 2-1
INFORMIX-GLS Character-Class Functions

Character Class	Multibyte-Character Function	Wide-Character Function
alnum (alpha or digit)	ifx_gl_ismalnum()	ifx_gl_iswalnum()
alpha	ifx_gl_ismalpha()	ifx_gl_iswalpha()
lower	ifx_gl_ismlower()	ifx_gl_iswlower()
upper	ifx_gl_ismupper()	ifx_gl_iswupper()
blank	ifx_gl_ismblank()	ifx_gl_iswblank()
space	ifx_gl_ismspace()	ifx_gl_iswspace()
digit	ifx_gl_ismdigit()	ifx_gl_iswdigit()
xdigit	ifx_gl_ismxdigit()	ifx_gl_iswxdigit()
cntrl	ifx_gl_ismcntrl()	ifx_gl_iswcntrl()
graph	ifx_gl_ismgraph()	ifx_gl_iswgraph()
punct	ifx_gl_ismpunct()	ifx_gl_iswpunct()
print	ifx_gl_ismprint()	ifx_gl_iswprint()

These INFORMIX-GLS functions check the LC_CTYPE category of the current locale to determine whether a specified character belongs to the respective character classification. The following code fragment uses the ifx_gl_ismlower() function to perform the internationalized determination of whether a multibyte character is lowercase:

if ( ifx_gl_ismlower(one_char, char_size) ) /* IS internationalized! */
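The same locale-driven approach can be tried without the INFORMIX-GLS library by using the ISO C wide-character classification functions. The following sketch is only an analogue: it uses iswlower() from <wctype.h>, not ifx_gl_iswlower(), and the function name count_lower() is an assumption made for illustration. It counts the characters that the current locale places in the lower class and never compares against a range of code points.

```c
#include <stddef.h>
#include <wctype.h>

/* Count the characters of a wide string that the current locale
 * places in the "lower" class. No code-point range is assumed. */
size_t count_lower(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++) {
        if (iswlower((wint_t)*ws))
            n++;
    }
    return n;
}
```

Which characters count as lowercase depends on the locale that is in effect when the function runs, which is the behavior that the INFORMIX-GLS classification functions provide through the LC_CTYPE category.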

The INFORMIX-GLS functions in Figure 2-1 do not return a unique value if they encounter an error. To detect an error condition, initialize the ifx_gl_lc_errno() error number to zero before you call one of these functions, and then call ifx_gl_lc_errno() immediately after you call the function. For example, the following code fragment performs error checking for the ifx_gl_ismlower() function:

/* Initialize the error number */
ifx_gl_lc_errno() = 0;

/* Determine if 'mb' character is lowercase */
value = ifx_gl_ismlower(mb, mb_size);

/* If the error number has changed, ifx_gl_ismlower() has
  * set it to indicate the cause of an error */
if ( ifx_gl_lc_errno() != 0 )
	/* Handle error */
else if ( value != 0 )
	/* Character 'mb' is in lower class */
else if ( value == 0 )
	/* Character 'mb' is NOT in lower class */
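
The clear, call, then check sequence is the same one that standard C uses for functions whose return value cannot signal an error by itself. The following sketch shows the pattern with errno and strtol() rather than ifx_gl_lc_errno(); the helper name parse_long() is an assumption made for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Convert decimal text to a long. Returns 0 and stores the value in
 * *out on success; returns -1 if strtol() reported an error.
 * LONG_MAX is a legal result, so only errno can reveal an overflow. */
int parse_long(const char *text, long *out)
{
    long value;

    errno = 0;                       /* initialize the error number */
    value = strtol(text, NULL, 10);  /* call the function */
    if (errno != 0)                  /* check immediately afterward */
        return -1;

    *out = value;
    return 0;
}
```

Resetting the error number first matters because a successful call does not clear a value left behind by an earlier failure.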

Case Conversion

In many languages, alphabetic characters have an uppercase and lowercase representation. To be internationalized, your application must not assume the case equivalent for a particular character. For example, it must not contain code such as the following to obtain the uppercase equivalent of the character in lower_char:

upper_char = lower_char - 'a' + 'A'; /* NOT internationalized! */

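The preceding line works for the English characters of the ASCII code set. However, it is not guaranteed to work for 8-bit characters, such as à and À, because the offset between the lowercase and uppercase forms of a character depends on the code set.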

To handle case conversion in your application, use functions in the INFORMIX-GLS library to obtain the case equivalent of a particular character. Figure 2-2 lists the case-conversion operations and the INFORMIX-GLS functions that perform them, both the multibyte functions and their wide-character equivalents.

Figure 2-2
INFORMIX-GLS Case-Conversion Functions

Case-Conversion Operation	Multibyte-Character Function	Wide-Character Function
Obtain the lowercase equivalent of the source character	ifx_gl_tomlower()	ifx_gl_towlower()
Obtain the uppercase equivalent of the source character	ifx_gl_tomupper()	ifx_gl_towupper()

The INFORMIX-GLS case-conversion functions check the LC_CTYPE category of the current locale to determine the case equivalent of a source character. If the desired case equivalent exists, the functions return an integer value that is the alphabetic case equivalent of the source character. If no case equivalent exists, these functions return the source character.

The following code fragment uses the ifx_gl_tomupper() function to perform the internationalized case conversion of a multibyte character:

ret = ifx_gl_tomupper(upper_char, lower_char, char_sz); /* IS internationalized! */

Case Conversion for Multibyte Characters

The ifx_gl_tomlower() and ifx_gl_tomupper() functions require three arguments, in the following order:

The destination buffer for the converted multibyte character or string
The multibyte character or string to convert
The number of bytes to read to obtain a single multibyte character

For a multibyte-character string, the size of the case-converted string might not equal the size of the unconverted string. Therefore, to perform case conversion on multibyte characters, you must take the following special processing steps:

Determine whether you need to allocate a separate destination buffer; if a destination buffer is needed, determine its size.
Determine the number of bytes that have been read and written by the case-conversion process.

Determining When to Allocate a Destination Buffer

Whether you can perform case conversion of multibyte characters in place depends on whether the number of bytes written to the destination buffer is the same as the number of bytes read from the source, as follows:

If the ifx_gl_case_conv_outbuflen() function determines that the size of the source string and its case-converted value are exactly equal, you can perform case conversion in place.
If the size of the case-converted value of the source string is not the same as the size of the source string itself, you cannot perform case conversion in place.

If you cannot perform case conversion in place, you must allocate a separate destination buffer. To allocate this buffer, you need to have an estimate of the number of bytes that it needs to hold. Use any of the following methods to determine the number of bytes that might be written to the destination buffer:

The ifx_gl_case_conv_outbuflen() function calculates either exactly the number of bytes that will be written to the destination buffer or a close over-approximation of the number.
The ifx_gl_mb_loc_max() function calculates the maximum number of bytes that can be written to the destination buffer for any source value in the current locale.
The macro IFX_GL_MB_MAX returns the maximum number of bytes that can be written to the destination buffer for any source value in any locale.

Of the preceding options, the macro IFX_GL_MB_MAX is the fastest and the only method that can initialize static buffers. The function ifx_gl_case_conv_outbuflen() is the slowest but the most precise.
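
Standard C has the same trade-off between a compile-time bound and a run-time, locale-specific bound. The following sketch is an analogue only: MB_LEN_MAX plays the role of IFX_GL_MB_MAX and MB_CUR_MAX plays the role of ifx_gl_mb_loc_max(). The names one_char_buf, static_capacity(), and alloc_for_chars() are assumptions made for illustration.

```c
#include <limits.h>
#include <stdlib.h>

/* Compile-time bound: a constant, so it can size a static buffer
 * that holds one multibyte character in any locale. */
static char one_char_buf[MB_LEN_MAX];

size_t static_capacity(void)
{
    return sizeof one_char_buf;
}

/* Run-time bound: MB_CUR_MAX depends on the current locale, so the
 * buffer must be allocated dynamically. Returns NULL on overflow or
 * allocation failure; otherwise stores the size in *capacity. */
char *alloc_for_chars(size_t nchars, size_t *capacity)
{
    size_t per_char = MB_CUR_MAX;

    if (nchars > ((size_t)-1 - 1) / per_char)
        return NULL;

    *capacity = nchars * per_char + 1;   /* +1 for the terminator */
    return malloc(*capacity);
}
```

The constant is the cheapest option but usually reserves more space than the current locale needs; the run-time value is tighter but cannot appear in a static array declaration.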

The following code fragment uses the ifx_gl_mblen() function to determine the size of the source character and the ifx_gl_case_conv_outbuflen() function to determine the estimated size of the case-converted value:

/* Obtain the sizes of the source and destination strings */
src_mb_bytes = ifx_gl_mblen(src_mb, ...);
dst_mb_bytes = ifx_gl_case_conv_outbuflen(src_mb_bytes);

if ( dst_mb_bytes == src_mb_bytes )
	{
	/* Sizes of source and case-equivalent characters are the
	 * same. Perform the case conversion in place */
	retval = ifx_gl_tomupper(src_mb, src_mb, IFX_GL_NO_LIMIT);
	}
else
	{
	/* Sizes of source and destination characters are NOT the
	 * same. Allocate a destination buffer and perform case
	 * conversion into this buffer */
	dst_mb = (gl_mchar_t *) malloc(dst_mb_bytes);
	retval = ifx_gl_tomupper(dst_mb, src_mb, IFX_GL_NO_LIMIT);
	}

For more information on the IFX_GL_NO_LIMIT constant, see "Multibyte-Character Termination".

Determining Number of Bytes Read and Written

The ifx_gl_tomupper() and ifx_gl_tomlower() functions return an unsigned short integer that encodes both the number of bytes that the function has read and the number of bytes that it has written. The INFORMIX-GLS library provides the following macros to obtain this information from the return value.


INFORMIX-GLS Macro	Information Obtained
`IFX_GL_CASE_CONV_SRC_BYTES()`	The number of bytes read from the source string
`IFX_GL_CASE_CONV_DST_BYTES()`	The number of bytes written to the destination buffer
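
To see how a single unsigned short can carry two byte counts, consider the following sketch. The bit layout shown here is purely hypothetical; the actual encoding that INFORMIX-GLS uses is not described in this section, which is why applications should rely on the library macros. The DEMO_ names are assumptions made for illustration.

```c
/* Hypothetical layout, for illustration only: the source byte count
 * occupies the high byte and the destination byte count the low byte. */
#define DEMO_PACK(src, dst) \
    ((unsigned short)((((src) & 0xFFu) << 8) | ((dst) & 0xFFu)))

/* Extract each count again by shifting and masking. */
#define DEMO_SRC_BYTES(v)   ((unsigned)(((v) >> 8) & 0xFFu))
#define DEMO_DST_BYTES(v)   ((unsigned)((v) & 0xFFu))
```

Because the caller sees only the macros, the library could change the packing without affecting application code.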

The following code fragment uses the ifx_gl_tomlower() function to convert a multibyte character to its lowercase equivalent. It uses the case-conversion macros to obtain the number of bytes read and written during the case-conversion operation:

/* Initialize source pointer, 'src_mb', to beginning of the
 * multibyte string. Initialize destination pointer to
 * beginning of destination buffer */
src_mb = src_mbs;
dst_mb = dst_mbs;

/* Traverse source string until the null terminator is
 * reached */
while ( *src_mb != '\0' )
	{
	/* Convert source multibyte character, 'src_mb', to
	 * lowercase and put in the destination buffer */
	unsigned short retval =
		ifx_gl_tomlower(dst_mb, src_mb, src_mbs_bytes);
	...
	/* Increment the source pointer by the number of bytes that
	 * have been read and the destination pointer by the number
	 * of bytes that have been written */
	src_mb += IFX_GL_CASE_CONV_SRC_BYTES(retval);
	dst_mb += IFX_GL_CASE_CONV_DST_BYTES(retval);
	src_mbs_bytes -= IFX_GL_CASE_CONV_SRC_BYTES(retval);
	}

The memory-management rules for case conversion of a single multibyte character also apply to converting a string of one or more multibyte characters. For example, the following code fragment converts a multibyte-character string to its uppercase equivalent:

/* Assume src_mbs is null terminated */
src_mbs_bytes = strlen(src_mbs);
dst_mbs_bytes = ifx_gl_case_conv_outbuflen(src_mbs_bytes);

if ( dst_mbs_bytes == src_mbs_bytes )
	{
	/* If two strings have the same size, overwrite each
	 * multibyte character in the 'src_mbs' multibyte string
	 * with its uppercase equivalent */
	src_mb = src_mbs;

	while ( *src_mb != '\0' )
		{
		retval = ifx_gl_tomupper(src_mb, src_mb, IFX_GL_NO_LIMIT);
		src_mb += IFX_GL_CASE_CONV_SRC_BYTES(retval);
		}
	}
else
	{
	/* Two strings are not the same size, so must allocate a
	 * destination buffer whose size is determined by the
	 * ifx_gl_case_conv_outbuflen() function */
	dst_mbs = (gl_mchar_t *) malloc(dst_mbs_bytes + 1);

	src_mb = src_mbs;
	dst_mb = dst_mbs;

	while ( *src_mb != '\0' )
		{
		retval = ifx_gl_tomupper(dst_mb, src_mb, IFX_GL_NO_LIMIT);
		src_mb += IFX_GL_CASE_CONV_SRC_BYTES(retval);
		dst_mb += IFX_GL_CASE_CONV_DST_BYTES(retval);
		}

	*dst_mb = '\0';
	}
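
The bookkeeping in the preceding fragments (advance the source by the bytes read, advance the destination by the bytes written) can be exercised with standard C alone. The following sketch is an analogue, not INFORMIX-GLS code: it uses mbrtowc(), towupper(), and wcrtomb() in place of ifx_gl_tomupper(), and the function name upcase_mbs() is an assumption made for illustration.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Uppercase the multibyte string 'src' into 'dst', which can hold
 * 'dst_size' bytes. Returns the number of bytes written (excluding
 * the terminator), or (size_t)-1 on invalid input or lack of space. */
size_t upcase_mbs(char *dst, size_t dst_size, const char *src)
{
    mbstate_t in_state, out_state;
    size_t src_left = strlen(src);
    size_t written = 0;

    if (dst_size == 0)
        return (size_t)-1;
    memset(&in_state, 0, sizeof in_state);
    memset(&out_state, 0, sizeof out_state);

    while (src_left > 0) {
        wchar_t wc;
        char tmp[MB_LEN_MAX];
        size_t nread, nwritten;

        /* Bytes read from the source for one character */
        nread = mbrtowc(&wc, src, src_left, &in_state);
        if (nread == (size_t)-1 || nread == (size_t)-2 || nread == 0)
            return (size_t)-1;

        /* Bytes written for its uppercase form; may differ from nread */
        nwritten = wcrtomb(tmp, (wchar_t)towupper((wint_t)wc), &out_state);
        if (nwritten == (size_t)-1 || written + nwritten >= dst_size)
            return (size_t)-1;
        memcpy(dst + written, tmp, nwritten);

        src += nread;
        src_left -= nread;
        written += nwritten;
    }
    dst[written] = '\0';
    return written;
}
```

The source and destination positions advance independently, so the loop stays correct even when a character and its case equivalent occupy different numbers of bytes.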

Case Conversion for Wide Characters

Because a wide character has a fixed size, the ifx_gl_towlower() and ifx_gl_towupper() functions require only one argument: the wide character to convert. These functions return an integer value of the case-equivalent character for this wide character. Therefore, you can always perform case conversion of wide characters in place. For example, you can assign the case equivalent of src_wc back to src_wc, as follows:

src_wc = ifx_gl_towupper(src_wc);

You can also perform case conversion of wide characters into a destination buffer. The previous line could also be written as follows:

dst_wc = ifx_gl_towupper(src_wc);
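
Because each wide character occupies one fixed-size element, converting a whole wide-character string in place needs no size calculation. The following sketch shows this with the ISO C function towupper() from <wctype.h> as a stand-in for ifx_gl_towupper(); the function name upcase_wide() is an assumption made for illustration.

```c
#include <stddef.h>
#include <wctype.h>

/* Replace every character of a wide string with its uppercase
 * equivalent in the current locale. Characters that have no case
 * equivalent are returned unchanged by towupper(). */
void upcase_wide(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```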

Exception Handling

These case-conversion functions do not return a special value if they encounter an error. To detect an error condition, initialize the ifx_gl_lc_errno() error number to zero before you call one of these functions and check ifx_gl_lc_errno() immediately after you call it. The following code fragment performs exception handling in the conversion of a wide character to its lowercase equivalent:

/* Initialize the error number */
ifx_gl_lc_errno() = 0;

/* Perform conversion of 'src_wc' to lowercase */
dst_wc = ifx_gl_towlower(src_wc);

/* If the error number has changed, ifx_gl_towlower() has set
 * it to indicate the cause of an error */
if ( ifx_gl_lc_errno() != 0 )
	/* Handle error */
else
	...

Code-Set Conversion

int dstbytes;
gl_mchar_t *dstmbs;
conv_state_t state;

dstbytes = ifx_gl_cv_outbuflen("ujis", "sjis", srcbytes);
dstmbs = (gl_mchar_t *) malloc(dstbytes);

state.first_frag = 1;
state.last_frag = 1;
if ( ifx_gl_cv_mconv(&state, &dstmbs, &dstbytes, "ujis",
		&srcmbs, &srcbytes, "sjis") == -1 )
	/* Handle error */

For more information on the conv_state_t structure, see "Preserving State Information".

Specifying Code-Set Names

You can specify the names of the source and destination (target) code sets with any of the following methods:

Locale names

For example, you can use de_de.8859-1 for the German locale or ja_jp.ujis for the Japanese UJIS locale. For more information on locale names, see the Informix Guide to GLS Functionality.

Code-set names

WIN NT/95

The code-set name registry has the following location:

%INFORMIXDIR%\gls\cmZ

UNIX

The code-set name registry has the following location:

$INFORMIXDIR/gls/cmZ

In the preceding pathnames, INFORMIXDIR is the environment variable that specifies the directory in which you install the Informix product, and Z represents the version number for the code-set object-file format.

The IFX_GL_PROC_CS macro

This macro specifies the code set of the current processing locale.

The preceding formats are valid as code-set names in any of the following INFORMIX-GLS functions:

ifx_gl_conv_needed()
ifx_gl_cv_mconv()
ifx_gl_cv_outbuflen()
ifx_gl_cv_sb2sb_table()

Preserving State Information

Most code sets are not state dependent; that is, the characters of these code sets can be decoded with only one algorithm, and each byte sequence represents a unique character. In contrast, byte sequences in state-dependent code sets can represent more than one character. Which character a sequence represents depends on the current state. State-dependent code sets occur primarily on IBM mainframe computers, and they only affect code-set conversion.

When you fragment a complete source string into two or more nonadjacent source buffers, you must call the ifx_gl_cv_mconv() function multiple times, to perform code-set conversion on each fragment of the string. Because of the nature of state-dependent code sets (and because the caller of this function cannot know whether either the source or destination code set is a state-dependent code set), you must preserve state information across the multiple calls of ifx_gl_cv_mconv(). The ifx_gl_cv_mconv() argument state is used for this purpose.
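
The need to carry state from one fragment to the next can be seen with a toy state-dependent encoding. Everything in the following sketch is an assumption made for illustration: the shift bytes, the struct toy_state type, and the toy_decode() function belong to no real code set or library. A shift-out byte changes the meaning of the bytes that follow it until a shift-in byte appears.

```c
#include <stddef.h>
#include <string.h>

/* Toy shift bytes: after TOY_SHIFT_OUT, the bytes '1', '2', '3'
 * stand for '!', '@', '#' until TOY_SHIFT_IN is seen. */
enum { TOY_SHIFT_OUT = 0x0E, TOY_SHIFT_IN = 0x0F };

struct toy_state { int shifted; };

/* Decode one fragment into 'out' and return the number of characters
 * produced. The shift state in '*st' survives across calls. */
size_t toy_decode(struct toy_state *st, const unsigned char *in,
                  size_t len, char *out)
{
    static const char plain[] = "123";
    static const char shifted[] = "!@#";
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        const char *hit;

        if (in[i] == TOY_SHIFT_OUT) { st->shifted = 1; continue; }
        if (in[i] == TOY_SHIFT_IN)  { st->shifted = 0; continue; }

        hit = (in[i] != '\0') ? strchr(plain, in[i]) : NULL;
        out[n++] = (st->shifted && hit != NULL)
                       ? shifted[hit - plain]
                       : (char)in[i];
    }
    return n;
}
```

If the bytes '1', shift-out, '2' arrive in one fragment and '3', shift-in, '1' in the next, decoding both with the same state yields 1@#1. Decoding the second fragment with a fresh state would misread its first byte as 3, which is why the state must be preserved across calls.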

The state argument is a pointer to a conv_state_t structure. This structure contains two fields that you must set to indicate that you are performing code-set conversion on fragmented strings: first_frag and last_frag. The following table lists the different fragments of a string and the corresponding values to which you must set these two conv_state_t fields.

String Fragment	Value of first_frag Field	Value of last_frag Field
String is the first of n fragments.	1	0
String is the 2nd, ..., (n-1)th fragment.	0	0
String is the last (nth) fragment.	0	1
String is not fragmented; it is a complete string.	1	1

Important: The conv_state_t structure contains other fields that are for internal use only. Informix does not guarantee that these other internal fields of conv_state_t will not change in future releases. Therefore, to create portable code, set only the first_frag and last_frag fields of the conv_state_t structure.

Pass the fragments to the ifx_gl_cv_mconv() function in the same order in which they appear in the complete string. Use the same conv_state_t structure for all of the fragments of the same complete string.
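
When the number of fragments is not fixed, the two field values can be derived from the position of the fragment. The following sketch assumes a stand-in structure, struct demo_frag_flags, that mirrors only the two documented fields; it is not the real conv_state_t, and the function name set_frag_flags() is likewise an assumption made for illustration.

```c
/* Stand-in for the two documented fields of conv_state_t */
struct demo_frag_flags {
    int first_frag;
    int last_frag;
};

/* Set the flags for fragment number 'index' (counting from 0) of
 * 'count' fragments. A complete, unfragmented string is the case
 * count == 1, which sets both flags to 1. */
void set_frag_flags(struct demo_frag_flags *flags, int index, int count)
{
    flags->first_frag = (index == 0);
    flags->last_frag = (index == count - 1);
}
```

In a loop over the fragments, a call of this kind before each conversion would reproduce the four rows of the preceding table.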

The following code performs code-set conversion on a complete character string that is not fragmented:

int unfrag_strng(out_str, out_len, out_cs, in_str, in_len, in_cs)
	gl_mchar_t *out_str;
	int out_len;
	char *out_cs;
	gl_mchar_t *in_str;
	int in_len;
	char *in_cs;
{
	conv_state_t state;
	int ret;

	state.first_frag = 1;
	state.last_frag = 1;
	ret = ifx_gl_cv_mconv(&state, &out_str, &out_len, out_cs,
		&in_str, &in_len, in_cs);
	...
}

This code assigns both the first_frag and last_frag fields a value of one (1) to indicate that the multibyte string is not fragmented.

Suppose that you have a complete multibyte-character string that is fragmented into four different buffers. The following code performs code-set conversion on this fragmented string:

int frag_strng(out_str, out_len, out_cs, in_str, in_len, in_cs)
	gl_mchar_t *out_str;
	int out_len;
	char *out_cs;
	gl_mchar_t *in_str[];
	int in_len;
	char *in_cs;
{
	conv_state_t state;
	int ret;

	/* Perform code-set conversion on the FIRST fragment:
	 * first_frag = 1; last_frag = 0 */
	state.first_frag = 1;
	state.last_frag = 0;
	ret = ifx_gl_cv_mconv(&state, &out_str, &out_len, out_cs,
		&in_str[0], &in_len, in_cs);
	...
	/* Perform code-set conversion on the SECOND fragment:
	 * first_frag = 0; last_frag = 0 */
	state.first_frag = 0;
	state.last_frag = 0;
	ret = ifx_gl_cv_mconv(&state, &out_str, &out_len, out_cs,
		&in_str[1], &in_len, in_cs);
	...
	/* Perform code-set conversion on the THIRD fragment.
	 * No need to set the first_frag and last_frag fields again,
	 * because they are already 0 */
	ret = ifx_gl_cv_mconv(&state, &out_str, &out_len, out_cs,
		&in_str[2], &in_len, in_cs);
	...
	/* Perform code-set conversion on the FOURTH (last)
	 * fragment: first_frag = 0; last_frag = 1 */
	state.first_frag = 0;
	state.last_frag = 1;
	ret = ifx_gl_cv_mconv(&state, &out_str, &out_len, out_cs,
		&in_str[3], &in_len, in_cs);
	...
}

For an additional issue in the processing of fragmented multibyte character strings, see "Fragmenting Multibyte Strings".

Performance Issues

Most performance overhead in code-set conversion results from either memory management or multibyte-string traversal. This overhead is required for a correct conversion only when at least one of the code sets is a multibyte code set. If the code-set conversion is between two single-byte code sets, you can obtain a code-set conversion table and avoid this overhead.

The following sample code uses the ifx_gl_cv_sb2sb_table() function to obtain a code-set conversion table for two single-byte code sets:

void do_codeset_conversion(src, src_codeset, dst, dst_codeset)
	unsigned char *src;
	char *src_codeset;
	unsigned char *dst;
	char *dst_codeset;
{
	unsigned char *table;
	unsigned char *p;

	if ( ifx_gl_cv_sb2sb_table(dst_codeset,
			src_codeset, &table) == -1 )
		{
		/* Handle error */
		return;
		}

	if ( table != NULL )
		{
		/* Convert in place: each source byte is the index of
		 * its replacement in the conversion table */
		for ( p = src; *p != '\0'; p++ )
			*p = table[*p];
		/* The converted string now occupies the source buffer */
		dst = src;
		}
	else
		{
		/* Full GLS code-set conversion, which handles
		 * multibyte conversions and complex conversions
		 * between single-byte code sets */
		...
		}
}
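
The table lookup itself can be tried in isolation. In the following sketch the table is built by hand, whereas in an INFORMIX-GLS application ifx_gl_cv_sb2sb_table() would supply it. The function names build_demo_table() and convert_in_place() are assumptions made for illustration, and the table maps only one character (é, which is 0xE9 in ISO8859-1 and 0x82 in IBM code page 437); every other byte maps to itself.

```c
#include <stddef.h>

/* Fill a 256-entry table: identity for every byte except the single
 * demonstration mapping of e-acute from ISO8859-1 to code page 437. */
void build_demo_table(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    table[0xE9] = 0x82;
}

/* Convert a null-terminated single-byte string in place. One array
 * lookup per byte: no allocation and no multibyte traversal. */
void convert_in_place(unsigned char *s, const unsigned char table[256])
{
    for (; *s != '\0'; s++)
        *s = table[*s];
}
```

Because a single-byte conversion never changes the length of the string, the result always fits in the source buffer, which is where the performance gain comes from.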
