Informix Guide to GLS Functionality

Informix Guide to GLS Functionality
Chapter 3: SQL Features

Home Contents Index Master Index New Book

Handling Character Data

The GLS feature allows you to put non-ASCII characters (including multibyte characters) in the following parts of an SQL statement:

Quoted strings
Comments
Column substrings
TRIM function arguments

Specifying Quoted Strings

You use quoted strings in a variety of SQL statements, particularly data manipulation statements such as SELECT and INSERT. A quoted string is a sequence of characters that are delimited by quotation marks. The quotation marks can be single quotes or double quotes. However, if the DELIMIDENT environment variable is set, the database server interprets a sequence of characters in double quotes as a delimited identifier rather than as a string. (For more information about delimited identifiers, see page 3-10.)

When you use a nondefault locale, you can use any characters in the code set of your locale within a quoted string. If the locale supports a code set with non-ASCII characters, you can use these characters in a quoted string. In the following example, the user inserts column values that include multibyte characters into the table mytable:

INSERT INTO mytable

	VALUES ('A1A2B1B2abcd', '123X1X2Y1Y2', 'efgh')

In this example, the first quoted string includes the multibyte characters A1A2 and B1B2. The second quoted string includes the multibyte characters X1X2 and Y1Y2. The third quoted string contains only single-byte characters. This example assumes that the locale supports a multibyte code set with the A1A2, B1B2, X1X2, and Y1Y2 characters.

For complete information on quoted strings, see the Quoted String segment in the Informix Guide to SQL: Syntax.

Specifying Comments

To use comments after SQL statements, introduce the comment text with one of the following comment symbols:

The double dash (--) complies with the ANSI SQL standard.
Curly brackets ({}) are an Informix extension to the standard.

When you use a nondefault locale, you can use any characters in the code set of your locale within a comment. If the locale supports a code set with non-ASCII characters, you can use these characters in an SQL comment. In the following example, the user inserts a column value that includes multibyte characters into the table mytable:

EXEC SQL insert into mytable

	values ('A1A2B1B2abcd', '123')      -- A1A2 and B1B2 are multibyte characters.

In this example, the SQL comment includes the multibyte characters A1A2 and B1B2. This example assumes that the locale supports a multibyte code set with the A1A2 and B1B2 characters.

For complete information on SQL comments and comment symbols, see the Introduction to the Informix Guide to SQL: Syntax.

Specifying Column Substrings

When you specify a column expression with a character data type in a SELECT statement (or in any other SQL statement that includes an embedded SELECT statement), you can specify that a subset of the data in the column is to be retrieved. A column expression that includes brackets to signify a subset of the data in the column is known as a column substring. The syntax of a column substring is as follows:

column_name[first, last]

In this format, first is the position of the first byte in the substring, and last is the position of the last byte in the substring. For the complete syntax of column expressions, see the Expression segment in the Informix Guide to SQL: Syntax.

Column Substrings in Single-Byte Code Sets

Suppose that you want to retrieve the customer_num column and the seventh through ninth bytes of the lname column from the customer table. To perform this query, use a column substring for the lname column in your SELECT statement, as follows:

SELECT customer_num, lname[7,9] as lname_subset

	FROM customer

	WHERE lname = 'Albertson'

If the value of the lname column is Albertson, the following sample output shows the result of the query.


customer_num	lname_subset
114	son

Because the locale supports a single-byte code set, the preceding query seems to return the seventh through ninth characters of the name Albertson. However, column substrings are byte based, and the query returns the seventh through ninth bytes of the name Albertson. Because one byte is equal to one character in single-byte code sets, the distinction between characters and bytes in column substrings is not apparent in these code sets.

Column Substrings in Multibyte Code Sets

For multibyte code sets, column substrings return the specified number of bytes, not number of characters. If a character column multi_col contains a string that consists of three 2-byte characters, this 6-byte string can be represented as follows:

A1A2B1B2C1C2

Suppose that you specified the following column substring for the multi_col column in a query:

multi_col[1,2]

The query returns the following result:

A1A2

The substring that the query returns consists of 2 bytes (1 character), not two characters.

To retrieve the first two characters from the multi_col column, specify a column substring in which first is the byte position of the first byte in the first character and last is the byte position of the last byte in the second character. For the 6-byte string A1A2B1B2C1C2, you specify this column substring as follows in your query:

multi_col[1,4]

The following result is returned:

A1A2B1B2

The substring that the query returns consists of the first 4 bytes of the column value as you specified. These 4 bytes represent the first two characters in the column.

Partial Characters in Column Substrings

A multibyte character might consist of 2, 3, or 4 bytes. A multibyte character that has lost one or more of its bytes so that the original intended meaning of the character is lost is called a partial character.

Unless prevented, a column substring might truncate a multibyte character or split it up in such a manner that it no longer retains the original sequence of bytes. A partial character might be generated when you use column subscript operators on columns that contain multibyte characters. Suppose that a user specifies the following column substring for the multi_col column where the value of the string in multi_col is A1A2B1B2C1C2:

multi_col[2,5]

The user requests the following bytes in the query: A2B1B2C1. However, if the database server returned this column substring to the user, the first and third characters in the column would be truncated.

Avoidance in a Multibyte Code Set

Informix database servers do not allow partial characters to occur. The GLS feature prevents the database server from returning the specified range of bytes literally when this range contains partial characters. If your database locale supports a multibyte code set and you specify a particular column substring in a query, the database server replaces any truncated multibyte characters with single-byte white spaces.

For example, suppose the multi_col column contains the string A1A2A3A4B1B2B3B4 and you execute the following SELECT statement:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A1A2B1B2'

The query indicates that no matching rows were found because the database server converts the substring multi_col[2,4], the string A2A3A4, to three single-byte spaces (sss). The WHERE clause of the query specifies the following condition for the search:

WHERE 'sss' = 'A1A2A3'

Because this condition is never true, the query retrieves no matching rows.

Informix database servers replace partial characters in each individual substring operation, even when they are concatenated. For example, suppose the multi_col column contains A1A2B1B2C1C2D1D2, and the WHERE clause contains the following condition:

multi_col[2,4] | multi_col[6,8]

The query does not return any rows because the result of the concatenation (A2B1B2C2D1D2) contains two partial characters, A2 and C2. The Informix database server converts these partial characters to single-byte spaces, and creates the following WHERE clause condition:

WHERE 'sB1B2sD1D2' = 'A1A2B1B2

This condition is also never true, so the query retrieves no matching rows.

Misinterpreting Partial Characters

Partial characters present a problem if the substrings strings can be processed or presented to users in any way that makes their concatenation not reconstruct the original logical string. Possible problem areas include when a substring of one multibyte character is actually a valid character by itself. For example, suppose a multibyte code set contains a 4-byte character, A1A2A3A4, that represents the digit 1 and a 3-byte character, A2A3A4, that represents the digit 6. Suppose also that you use the locale that contains this multibyte code set when you execute the following query:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A2A3A4'

The database server interprets multi_col[2,4] as the valid 3-byte character (a multibyte 6) instead of a substring of the valid 4-byte character ('sss'). Therefore, the WHERE clause contains the following condition:

WHERE '6' = '6'

The problem of partial characters does not occur in single-byte code sets because each character is stored in a single byte. When your database locale supports a single-byte code set, and you specify a particular column substring in a query, the database server returns exactly the subset of data that you requested and does not replace any characters with white spaces.

Partial Characters in an ORDER BY Clause

Partial characters might also create a problem when you specify column substrings in an ORDER BY clause of a SELECT statement. The general format for specifying column substrings in the ORDER BY clause is as follows:

ORDER BY column-name [first, last]

In this format, first represents the first byte of the first character in the column substring, and last represents the last byte of the last character in the column substring.

If the locale supports a multibyte code set whose characters are all of the same length, you can use column substrings in an ORDER BY clause. However, the more likely scenario is that your multibyte code set contains characters with varying lengths. In this case, you might not find it useful to specify column substrings in the ORDER BY clause.

For example, suppose that you want to retrieve all rows from the multi_data table, and that you want to use the multi_chars column with a column subscript to collate the query results. The following SELECT statement attempts to collate the query results according to the portion of the multi_chars column that is contained in the fourth to sixth characters of the column:

SELECT * FROM multi_data

	ORDER BY multi_chars[7,12]

If the locale supports a multibyte code set whose characters are all 2 bytes in length, then you know that the fourth character in the column begins in byte position 7, and the sixth character in the column ends in byte position 12. The preceding SELECT statement does not generate partial characters.

However, if your multibyte code set contains a mixture of single-byte characters, 2-byte characters, and 3-byte characters, the column substring multi_chars[7,12] might create partial characters from the multi_chars data. In this case, you might get unexpected results when you specify a column substring in the ORDER BY clause.

For information on the collation order of different types of character data in the ORDER BY clause, see "The ORDER BY Clause". For the complete syntax and usage of the ORDER BY clause, see the SELECT statement in the Informix Guide to SQL: Syntax.

Tip: A partial character might also be generated when an ESQL program copies multibyte data from one buffer to another. For more information, see "Avoiding Partial Characters".

Avoidance in BYTE and TEXT Columns

Partial characters are not a problem when you specify a column substring for a column with the BYTE or TEXT data type. The database server avoids partial characters in BYTE and TEXT columns in the following way:

Because the database server interprets a BYTE column as a series of bytes, not characters, the splitting of multibyte characters as a result of the byte range that a column substring specifies is not an issue.

BYTE

The database server interprets a TEXT column as a series of characters.

A column substring for a TEXT column returns the exact range of bytes that is specified. Attempts to resolve partial characters in TEXT data are resource intensive. Therefore, the database server does not replace any bytes with white spaces. For more information on the TEXT data type, see

page 3-18

Warning: Because your application handles the interpretation of BYTE and TEXT data, it must handle possible occurrences of partial characters in multibyte blob data.

Specifying Arguments to the TRIM Function

The TRIM function is an SQL function that removes leading or trailing pad characters from a character string. By default, this pad character is an ASCII space. If your locale supports a code set that defines a different character as a space, TRIM does not remove this locale-specific space from the front or back of a string. If you specify the LEADING, TRAILING, or BOTH keywords for TRIM, you can define a different pad character. However, you cannot specify a non-ASCII character as a pad character, even if your locale supports a code set that defines the non-ASCII character.

Informix Guide to GLS FunctionalityChapter 3: SQL Features Home Contents Index Master Index New Book

Handling Character Data

Specifying Quoted Strings

Informix Guide to GLS Functionality
Chapter 3: SQL Features

Home Contents Index Master Index New Book