Handling Character Data

Informix Guide to GLS Functionality
SQL Features

Handling Character Data

The GLS feature allows you to put non-ASCII characters (including multibyte characters) in the following parts of an SQL statement:

Quoted strings

Comments

Column substrings

TRIM function arguments

UPPER, LOWER, and INITCAP function strings

Specifying Quoted Strings

You use quoted strings in a variety of SQL statements, particularly data manipulation statements such as SELECT and INSERT. A quoted string is a sequence of characters that is delimited by quotation marks. The quotation marks can be single quotes or double quotes. However, if the DELIMIDENT environment variable is set, the database server interprets a sequence of characters in double quotes as a delimited identifier rather than as a string. For more information about delimited identifiers, see Non-ASCII Characters in Identifiers.

When you use a nondefault locale, you can use any characters in the code set of your locale within a quoted string. If the locale supports a code set with non-ASCII characters, you can use these characters in a quoted string. In the following example, the user inserts column values that include multibyte characters in the table mytable:

INSERT INTO mytable

	VALUES ('A1A2B1B2abcd', '123X1X2Y1Y2', 'efgh')

In this example, the first quoted string includes the multibyte characters A1A2 and B1B2. The second quoted string includes the multibyte characters X1X2 and Y1Y2. The third quoted string contains only single-byte characters. This example assumes that the locale supports a multibyte code set with the A1A2, B1B2, X1X2, and Y1Y2 characters.

For complete information on quoted strings, see the Quoted String segment in the Informix Guide to SQL: Syntax.

Specifying Comments

To use comments after SQL statements, introduce the comment text with one of the following comment symbols:

The double-hyphen (--) complies with the ANSI SQL standard.

Braces ({}) are an Informix extension to the standard.

When you use a nondefault locale, you can use any characters in the code set of your locale within a comment. If the locale supports a code set with non-ASCII characters, you can use these characters in an SQL comment. In the following example, the user inserts a column value that includes multibyte characters in the table mytable:

EXEC SQL insert into mytable values ('A1A2B1B2abcd', '123') -- A1A2 and B1B2 are multibyte characters.

In this example, the SQL comment includes the multibyte characters A1A2 and B1B2. This example assumes that the locale supports a multibyte code set with the A1A2 and B1B2 characters.

For complete information on SQL comments and comment symbols, see the Informix Guide to SQL: Syntax.

Specifying Column Substrings

When you specify a column expression with a character data type in a SELECT statement (or in any other SQL statement that includes an embedded SELECT statement), you can specify that a subset of the data in the column is to be retrieved. A column expression that includes brackets to signify a subset of the data in the column is known as a column substring. The syntax of a column substring is as follows.


Element	Purpose
first	Position of the first byte in the substring
last	Position of the last byte in the substring

Column Substrings in Single-Byte Code Sets

Suppose that you want to retrieve the customer_num column and the seventh through ninth bytes of the lname column from the customer table. To perform this query, use a column substring for the lname column in your SELECT statement, as follows:

SELECT customer_num, lname[7,9] as lname_subset

	FROM customer

	WHERE lname = 'Albertson'

If the value of the lname column is Albertson, the following sample output shows the result of the query.


customer_num	lname_subset
114	son

Because the locale supports a single-byte code set, the preceding query seems to return the seventh through ninth characters of the name Albertson. However, column substrings are byte based, and the query returns the seventh through ninth bytes of the name. Because one byte is equal to one character in single-byte code sets, the distinction between characters and bytes in column substrings is not apparent in these code sets.

Column Substrings in Multibyte Code Sets

For multibyte code sets, column substrings return the specified number of bytes, not number of characters. If a character column multi_col contains a string that consists of three 2-byte characters, this 6-byte string can be represented as follows:

A1A2B1B2C1C2

Suppose that you specified the following column substring for the multi_col column in a query:

multi_col[1,2]

The query returns the following result:

A1A2

The substring that the query returns consists of 2 bytes (1 character), not 2 characters.

To retrieve the first two characters from the multi_col column, specify a column substring in which first is the byte position of the first byte in the first character and last is the byte position of the last byte in the second character. For the 6-byte string A¹A²B¹B²C¹C², you specify this column substring as follows in your query:

multi_col[1,4]

The following result is returned:

A1A2B1B2

The substring that the query returns consists of the first 4 bytes of the column value as you specified. These 4 bytes represent the first two characters in the column.

Partial Characters in Column Substrings

A multibyte character might consist of 2, 3, or 4 bytes. A multibyte character that has lost one or more of its bytes so that the original intended meaning of the character is lost is called a partial character.

Unless prevented, a column substring might truncate a multibyte character or split it up in such a manner that it no longer retains the original sequence of bytes. A partial character might be generated when you use column subscript operators on columns that contain multibyte characters. Suppose that a user specifies the following column substring for the multi_col column where the value of the string in multi_col is A1A2B1B2C1C2:

multi_col[2,5]

The user requests the following bytes in the query: A2B1B2C1. However, if the database server returned this column substring to the user, the first and third characters in the column would be truncated.

Avoidance in a Multibyte Code Set

Informix database servers do not allow partial characters to occur. The GLS feature prevents the database server from returning the specified range of bytes literally when this range contains partial characters. If your database locale supports a multibyte code set and you specify a particular column substring in a query, the database server replaces any truncated multibyte characters with single-byte white spaces.

For example, suppose the multi_col column contains the string A¹A²A³A⁴B¹B²B³B⁴, and you execute the following SELECT statement:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A1A2B1B2'

The query indicates that no matching rows were found because the database server converts the substring multi_col[2,4], the string A²A³A⁴, to three single-byte spaces (sss). The WHERE clause of the query specifies the following condition for the search:

WHERE 'sss' = 'A1A2A3'

Because this condition is never true, the query retrieves no matching rows.

Informix database servers replace partial characters in each individual substring operation, even when they are concatenated. For example, suppose the multi_col column contains A¹A²B¹B²C¹C²D¹D², and the WHERE clause contains the following condition:

multi_col[2,4] | multi_col[6,8]

The query does not return any rows because the result of the concatenation (A²B¹B²C²D¹D²) contains two partial characters, A² and C². The Informix database server converts these partial characters to single-byte spaces and creates the following WHERE clause condition:

WHERE 'sB1B2sD1D2' = 'A1A2B1B2'

This condition is also never true, so the query retrieves no matching rows.

Misinterpreting Partial Characters

Partial characters present a problem if the substrings strings can be processed or presented to users in any way that makes their concatenation not reconstruct the original logical string. Possible problem areas include when a substring of one multibyte character is actually a valid character by itself. For example, suppose a multibyte code set contains a 4-byte character, A1A2A3A4, that represents the digit 1 and a 3-byte character, A2A3A4, that represents the digit 6. Suppose also that you use the locale that contains this multibyte code set when you execute the following query:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A2A3A4'

The database server interprets multi_col[2,4] as the valid 3-byte character (a multibyte 6) instead of a substring of the valid 4-byte character ('sss'). Therefore, the WHERE clause contains the following condition:

WHERE '6' = '6'

The problem of partial characters does not occur in single-byte code sets because each character is stored in a single byte. When your database locale supports a single-byte code set, and you specify a particular column substring in a query, the database server returns exactly the subset of data that you requested and does not replace any characters with white spaces.

Partial Characters in an ORDER BY Clause

Partial characters might also create a problem when you specify column substrings in an ORDER BY clause of a SELECT statement. The syntax for specifying column substrings in the ORDER BY clause is as follows.


Element	Purpose
column	Name of a column in the specified table or view The query results are sorted by the values contained in this column. A column specified in the `ORDER BY` clause must be listed explicitly or implicitly in the select list of the `SELECT` clause.
first	First byte of the first character in the column substring
last	Last byte of the last character in the column substring

If the locale supports a multibyte code set whose characters are all of the same length, you can use column substrings in an ORDER BY clause. However, the more likely scenario is that your multibyte code set contains characters with varying lengths. In this case, you might not find it useful to specify column substrings in the ORDER BY clause.

For example, suppose that you want to retrieve all rows from the multi_data table, and you want to use the multi_chars column with a column subscript to collate the query results. The following SELECT statement attempts to collate the query results according to the portion of the multi_chars column that is contained in the fourth to sixth characters of the column:

SELECT * FROM multi_data

	ORDER BY multi_chars[7,12]

If the locale supports a multibyte code set whose characters are all 2 bytes in length, you know that the fourth character in the column begins in byte position 7, and the sixth character in the column ends in byte position 12. The preceding SELECT statement does not generate partial characters.

However, if the multibyte code set contains a mixture of single-byte characters, 2-byte characters, and 3-byte characters, the column substring multi_chars[7,12] might create partial characters from the multi_chars data. In this case, you might get unexpected results when you specify a column substring in the ORDER BY clause.

For information on the collation order of different types of character data in the ORDER BY clause, see The ORDER BY Clause. For the complete syntax and usage of the ORDER BY clause, see the SELECT statement in the Informix Guide to SQL: Syntax.

Tip: A partial character might also be generated when a SQL API copies multibyte data from one buffer to another. For more information, see Generating Non-ASCII Filenames. Avoidance in TEXT and BYTE Columns

Partial characters are not a problem when you specify a column substring for a column with the TEXT or BYTE data type. The database server avoids partial characters in TEXT and BYTE columns in the following way:

Because the database server interprets a BYTE column as a series of bytes, not characters, the splitting of multibyte characters as a result of the byte range that a column substring specifies is not an issue.

A column substring for a BYTE column returns the exact range of bytes that is specified and does not replace any bytes with white spaces.

The database server interprets a TEXT column as a series of characters.

A column substring for a TEXT column returns the exact range of bytes that is specified. Attempts to resolve partial characters in TEXT data are resource intensive. Therefore, the database server does not replace any bytes with white spaces. For more information, see The TEXT Data Type.

Warning: The processing and interpretation of TEXT and BYTE data are the responsibility of the client application, which must handle the possibility of partial characters in these operations.

Specifying Arguments to the TRIM Function

The TRIM function is an SQL function that removes leading or trailing pad characters from a character string. By default, this pad character is an ASCII space. If your locale supports a code set that defines a different character as a space, TRIM does not remove this locale-specific space from the front or back of a string. If you specify the LEADING, TRAILING, or BOTH keywords for TRIM, you can define a different pad character. However, you cannot specify a non-ASCII character as a pad character, even if your locale supports a code set that defines the non-ASCII character.

Using Case-Insensitive Search Functions

The SQL search functions UPPER, LOWER, and INITCAP support GLS. They accept multibyte characters in character-type source strings and operate on them. The return type is the same as the type of the source string:

UPPER converts every letter in a string to uppercase.

LOWER converts every letter in a string to lowercase.

INITCAP changes the first letter of a word or series of words to uppercase.

For complete information about these search functions, see the Informix Guide to SQL: Syntax.

Collating Character Data

Collation involves the sorting of the data values in columns that have character data types. For an explanation of collation order and a discussion of the two methods of sorting character data (code-set order and localized order), see Character Classes of the Code Set.

The type of collation order that the database server uses affects the following SQL statements:

CREATE INDEX

SELECT

Collation Order in CREATE INDEX

The CREATE INDEX statement creates an index on one or more columns of a table. The ASC and DESC keywords in the CREATE INDEX statement control whether the index keys are stored in ascending or descending order.

When you use a nondefault locale, the following locale-specific considerations apply to the CREATE INDEX statement:

The index keys are stored in code-set order when you create an index on columns of these data types:

CHAR

LVARCHAR

VARCHAR

For example, if the database stores its database locale as the Japanese SJIS locale (ja_jp.sjis), index keys for a CHAR column in any table of the database are stored in Japanese SJIS code-set order.

When you create an index on an NCHAR or NVARCHAR column, the index keys are stored in localized order.

For example, if the database uses the Japanese SJIS locale, index keys for an NCHAR column in any table of the database are stored in the localized order that the ja_jp.sjis locale defines.

If you use the default locale (U.S. English), the index keys are stored in the code-set order (in ascending or descending sequence) of the default code set regardless of the data type of the character column. Because the default locale does not define a localized order, the database server sorts columns of the following data types in code-set order:

CHAR

LVARCHAR

NCHAR

NVARCHAR

VARCHAR

Collation Order in SELECT Statements

The SELECT statement performs a query on the specified table and retrieves data from the specified columns and rows. Collation order affects the following parts of the SELECT statement:

The ORDER BY clause

The relational operator, BETWEEN, and IN conditions of the WHERE clause

The MATCHES and LIKE conditions of the WHERE clause

The ORDER BY Clause

The ORDER BY clause sorts the retrieved rows by the values that are contained in a column or set of columns. When this clause sorts character columns, the results of the sort depend on the data type of the column, as follows:

Columns that are sorted in code-set order:

CHAR

LVARCHAR

VARCHAR

NCHAR and NVARCHAR columns are sorted in localized order.

Assume that you use a nondefault locale for the client and database locale, and you make a query against the table called abonnés. This SELECT statement specifies three columns of CHAR data type in the select list: numéro (employee number), nom (last name), and prénom (first name).

SELECT numéro,nom,prénom 

	FROM abonnés

	ORDER BY nom;

The statement sorts the query results by the values that are contained in the nom column. Because the nom column that is specified in the ORDER BY clause is a CHAR column, the database server sorts the query results in the code-set order. As the following table shows, names that begin with uppercase letters come before names that begin with lowercase letters, and names that start with an accented letter (Ålesund, Étaix, Ötker, and Øverst) come at the end of the list.

Figure 3-2
Data Set for Code-Set Order of the abonnés Table

numéro nom prénom

13612 Azevedo Edouardo Freire

13606 Dupré Michèle Françoise

13607 Hammer Gerhard

13602 Hämmerle Greta

13604 LaForêt Jean-Noël

13610 LeMaître Héloïse

13613 Llanero Gloria Dolores

13603 Montaña José Antonio

13611 Oatfield Emily

13609 Tiramisù Paolo Alfredo

13600 da Sousa João Lourenço Antunes

13615 di Girolamo Giuseppe

13601 Ålesund Sverre

13608 Étaix Émile

13605 Ötker Hans-Jürgen

13614 Øverst Per-Anders

However, the result of the query is different if the numéro, nom, and prénom columns of the abonnés table are defined as NCHAR rather than CHAR.

Suppose the nondefault locale defines a localized order that collates the data as the following table shows. This localized order defines equivalence classes for uppercase and lowercase letters and for unaccented and accented versions of the same letter.

Figure 3-3
Data Set for Localized Order of the abonnés Table

numéro nom prénom

13612 Azevedo Edouardo Freire

13601 Ålesund Sverre

13600 da Sousa João Lourenço Antunes

13615 di Girolamo Giuseppe

13606 Dupré Michèle Françoise

13608 Étaix Émile

13607 Hammer Gerhard

13602 Hämmerle Greta

13604 LaForêt Jean-Noël

13610 LeMaître Héloïse

13613 Llanero Gloria Dolores

13603 Montaña José Antonio

13611 Oatfield Emily

13605 Ötker Hans-Jürgen

13614 Øverst Per-Anders

13609 Tiramisù Paolo Alfredo

The same SELECT statement now returns the query results in localized order because the nom column that the ORDER BY clause specifies is an NCHAR column.

The SELECT statement supports use of a column substring in an ORDER BY clause. However, you need to ensure that this use for column substrings works with the code set that your locale supports. For more information, see Partial Characters in Column Substrings.

Logical Predicates in a WHERE Clause

The WHERE clause specifies search criteria and join conditions on the data that you want to select. Collation rules affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following logical predicates:

Relational-operator condition

BETWEEN condition

IN condition

EXISTS and ANY conditions

Relational-Operator Conditions

The following SELECT statement assumes a nondefault locale. It uses the less than (<) relational operator to specify that the only rows are to be retrieved from the abonnés table are those in which the value of the nom column is less than Hammer.

SELECT numéro,nom,prénom 

	FROM abonnés

	WHERE nom < 'Hammer';

If nom is a CHAR column, the database server uses code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves only two rows.


numéro	nom	prénom
13612	Azevedo	Edouardo Freire
13606	Dupré	Michèle Françoise

These two rows are those less than Hammer in the code-set-ordered data set shown in Figure 3-2 on page 3-32.

However, if nom is an NCHAR column, the database server uses localized order to sort the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves six rows.


numéro	nom	prénom
13612	Azevedo	Edouardo Freire
13601	Ålesund	Sverre
13600	da Sousa	João Lourenço Antunes
13615	di Girolamo	Giuseppe
13606	Dupré	Michèle Françoise
13608	Étaix	Émile

These six rows are those less than Hammer in the localized-order data set shown in Figure 3-3 on page 3-33.

BETWEEN Conditions

The following SELECT statement assumes a nondefault locale and uses a BETWEEN condition to retrieve only those rows in which the values of the nom column are in the inclusive range of the values of the two expressions that follow the BETWEEN keyword:

SELECT numéro,nom,prénom 

	FROM abonnés

	WHERE nom BETWEEN 'A' AND 'Z';

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses the code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample output shows the query results.


numéro	nom	prénom
13612	Azevedo	Edouardo Freire
13606	Dupré	Michèle Françoise
13607	Hammer	Gerhard
13602	Hämmerle	Greta
13604	LaForêt	Jean-Noël
13610	LeMaître	Héloïse
13613	Llanero	Gloria Dolores
13603	Montaña	José Antonio
13611	Oatfield	Emily
13609	Tiramisù	Paolo Alfredo

Because the database server uses the code-set order for the nom values, as Figure 3-2 on page 3-32 shows, these query results do not include the following rows:

Rows in which the value of nom begins with a lowercase letter: da Sousa and di Girolamo

Rows with an accented letter: Ålesund, Étaix, Ötker, and Øverst

However, if nom is an NCHAR column, the database server uses localized order to sort the rows. The following sample output shows the query results.

numéro nom prénom

13612 Azevedo Edouardo Freire

13601 Ålesund Sverre

13600 da Sousa João Lourenço Antunes

13615 di Girolamo Giuseppe

13606 Dupré Michèle Françoise

13608 Étaix Émile

13607 Hammer Gerhard

13602 Hämmerle Greta

13604 LaForêt Jean-Noël

13610 LeMaître Héloïse

13613 Llanero Gloria Dolores

13603 Montaña José Antonio

13611 Oatfield Emily

13605 Ötker Hans-Jürgen

13614 Øverst Per-Anders

13609 Tiramisù Paolo Alfredo

Because the database server uses localized order for the nom values, these query results include rows in which the value of nom begins with a lowercase letter or accented letter.

IN Conditions

An IN condition is satisfied when the expression to the left of the IN keyword is included in the parenthetical list of values to the right of the keyword. This SELECT statement assumes a nondefault locale and uses an IN condition to retrieve only those rows in which the value of the nom column is any of the following: Azevedo, Llanero, or Oatfield.

SELECT numéro,nom,prénom 

	FROM abonnés

	WHERE nom IN ('Azevedo', 'Llanero', 'Oatfield');

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses code-set order, as Figure 3-2 on page 3-32 shows. The database server retrieves rows in which the value of nom is Azevedo, but not rows in which the value of nom is azevedo or Åzevedo because the characters A, a, and Å are not equivalent in the code-set order. The query also returns rows with the nom values of Llanero and Oatfield.

However, if nom is an NCHAR column, the database server uses localized order, as Figure 3-3 on page 3-33 shows, to sort the rows. If the locale defines A, a, and Å as equivalent characters in the localized order, the query returns rows in which the value of nom is Azevedo, azevedo, or Åzevedo. The same selection rule applies to the other names in the parenthetical list that follows the IN keyword.

Comparisons with MATCHES and LIKE Conditions

Collation rules also affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following conditions:

MATCHES condition

LIKE condition

MATCHES Condition

A MATCHES condition tests for matching character strings. The condition is true, or satisfied, when the value of the column to the left of the MATCHES keyword matches the pattern that a quoted string specifies to the right of the MATCHES keyword.You can use wildcard characters in the string. For example, you can use brackets to specify a range of characters. For more information about MATCHES, see the Informix Guide to SQL: Syntax.

When the MATCHES condition does not list a range of characters in the string, it specifies a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

For CHAR and VARCHAR columns, no collation considerations come into play.

For NCHAR and NVARCHAR columns, collation considerations might come into play because these data types use localized order and the locale might define equivalence classes of collation.

For example, the localized order might specify that a and A are an equivalent class. That is, they have equal weight in the collation sequence. For more information about localized order, see Localized Order.

The examples in the following table illustrate the different results that CHAR and NCHAR columns produce when a user specifies the MATCHES keyword without a range in a SELECT statement. These examples assume use of a nondefault locale that defines A and a in an equivalence class. It also assumes that col1 is a CHAR column and col2 is an NCHAR column in table mytable.


Query	Data Type	Query Results
`SELECT * FROM mytable WHERE col1 MATCHES 'art'`	`CHAR`	All rows in which column col1 contains the value '`art`' with a lowercase a
`SELECT * FROM mytable WHERE col2 MATCHES 'art'`	`NCHAR`	All rows in which column col2 contains the value `'art'` or `'Art'`

When you use the MATCHES keyword to specify a range, collation considerations come into play for all columns with character data types. When the column to the left of the MATCHES keyword is an NCHAR, NVARCHAR, CHAR, or VARCHAR column, and the quoted string to the right of the MATCHES keyword includes brackets to specify a range, the database server uses localized order.

Important: When the database server determines the characters that fall within a range, it always uses the localized order that is specified for the database, even for CHAR and VARCHAR columns. This behavior is an exception to the rule that the database server uses code-set order for all operations on CHAR and VARCHAR columns and localized order for all operations on NCHAR and NVARCHAR columns.

Some simple examples show how the database server treats NCHAR, NVARCHAR, CHAR, and VARCHAR columns when you use the MATCHES keyword with a range in a SELECT statement. Suppose that you want to retrieve from the abonnés table the employee number, first name, and last name for all employees whose last name nom begins in the range of characters E through P. Also assume that the nom column is an NCHAR column. The following SELECT statement uses a MATCHES condition in the WHERE clause to pose this query:

SELECT numéro,nom,prénom 

	FROM abonnés

	WHERE nom MATCHES '[E-P]*'

	ORDER BY nom;

The rows for Étaix, Ötker, and Øverst appear in the query result because, in the localized order, as Figure 3-3 on page 3-33 shows, the accented first letter of each name falls within the E through P MATCHES range for the nom column.


numéro	nom	prénom
13608	Étaix	Émile
13607	Hammer	Gerhard
13602	Hämmerle	Greta
13604	LaForêt	Jean-Noël
13610	LeMaître	Héloïse
13613	Llanero	Gloria Dolores
13603	Montaña	José Antonio
13611	Oatfield	Emily
13605	Ötker	Hans-Jürgen
13614	Øverst	Per-Anders

If nom is a CHAR column, the query result is exactly the same as when nom was an NCHAR column. The database server always uses localized order to determine what characters fall within a range, regardless of whether the column is CHAR or NCHAR.

LIKE Condition

A LIKE condition also tests for matching character strings. As with the MATCHES condition, the LIKE condition is true, or satisfied, when the value of the column to the left of the LIKE keyword matches the pattern that the quoted string specifies to the right of the LIKE keyword. You can use only certain symbols as wildcards in the quoted string. For more information about LIKE, see the Informix Guide to SQL: Syntax.

The LIKE condition can specify only a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

For CHAR and VARCHAR columns, no collation considerations come into play.

For NCHAR and NVARCHAR columns, collation considerations might come into play because these data types use localized order, and the locale might define equivalence classes of collation. For example, the localized order might specify that a and A are an equivalent class.

The LIKE keyword does not support matches with a range. That is, you cannot use bracketed wildcard characters in LIKE conditions.

Wildcard Characters in LIKE and MATCHES Conditions

Informix products support the following ASCII characters as wildcard characters in the MATCHES and LIKE conditions.


Condition	Wildcard Characters
`LIKE`	_ %
MATCHES	* ? [ ] ^ -

For CHAR and VARCHAR data, the database server performs byte-by-byte comparison for pattern matching in the LIKE and MATCHES conditions. For NCHAR and NVARCHAR data, the database server performs pattern matching in the LIKE and MATCHES conditions based on logical characters, not bytes. Therefore, the _ (underscore) wildcard of the LIKE clause and the ? (question mark) wildcard of the MATCHES clause match any one single-byte or multibyte character, as the following table shows.


Condition	Quoted String	Column Value	Result
LIKE	'ab_d'	`'abcd'`	True
`LIKE`	'ab_d'	`'abA`1`A`2`d'`	True
`MATCHES`	'ab?d'	`'abcd'`	True
`MATCHES`	'ab?d'	`'abA`1`A`2`d'`	True

The database server treats any multibyte character as a literal character. To tell the database server to interpret a wildcard character as its literal meaning, you must precede the character with an escape character. You must use single-byte characters as escape characters; the database server does not recognize use of multibyte characters for this purpose. The default escape character is the backslash (\).

The following use of the MATCHES condition gives a true result for the column value that is shown.


Condition	Quoted String	Column Value	Result
`MATCHES`	'ab\?d'	`'ab?d'`	True

Using SQL Length Functions

You can use SQL length functions in the SELECT statement and other data manipulation statements. Length functions return the length of a column, string, or variable in bytes or characters.

The choice of locale affects the following three SQL length functions:

The LENGTH function

The OCTET_LENGTH function

The CHAR_LENGTH (or CHARACTER_LENGTH) function

For the complete syntax of these functions, see the Expression segment in the Informix Guide to SQL: Syntax.

The LENGTH Function

The LENGTH function returns the number of bytes of data in character data. However, the behavior of the LENGTH function varies with the type of argument that the user specifies. The argument can be a quoted string, a character-type column other than the TEXT data type, a TEXT column, a host variable, or an SPL routine variable.

The following table shows how the LENGTH function operates on each of these argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. This table also assumes that the sample strings consist of single-byte characters.


LENGTH Argument	Behavior	Example
Quoted string	Returns number of bytes in string, minus any trailing white spaces as defined in the locale.	If the string is `'Ludwig'`, the result is `6`. If the string is `'Ludwigssss'`, the result is still `6`.
`CHAR`, `VARCHAR`, `NCHAR`, or `NVARCHAR` column	Returns number of bytes in a column, minus any trailing white spaces, regardless of defined length of the column.	If the fname column of the customer table is a `CHAR`(15) column, and this column contains the string `'Ludwig'`, the result is `6`. If the fname column contains the string `'Ludwigssss'`, the result is still `6`.
`TEXT` column	Returns number of bytes in a column, including trailing white spaces.	If the cat_descr column in the catalog table is a `TEXT` column, and this column contains the string `'Ludwig'`, the result is `6`. If the cat_descr column contains the string `'Ludwigssss'`, the result is `10`.
Host or procedure variable	Returns number of bytes that the variable contains, minus any trailing white spaces, regardless of defined length of the variable.	If the procedure variable f_name is defined as `CHAR`(15), and this variable contains the string `'Ludwig'`, the result is `6`. If the f_name variable contains the string `'Ludwigssss'`, the result is still `6`.

With Single-Byte Code Sets

When you use the default locale or any locale with a single-byte code set, the LENGTH function seems to return the number of characters in the column. In the following example, the stores_demo database, which contains the customer table, uses the default code set for the U.S. English locale. Suppose a user enters a SELECT statement with the LENGTH function to display the last name, length of the last name, and customer number for rows where the customer number is less than 106.

SELECT lname AS cust_name, length (fname) AS length, customer_num AS cust_num FROM customer WHERE customer_num < 106

The following sample of output shows the result of the query. For each row that is retrieved, the length column seems to show the number of characters in the lname (cust_name) column. However, the length column actually displays the number of bytes in the lname column. In the default code set, one byte stores one character. For more information about the default code set, see The Default Locale.


cust_name	length	cust_num
Ludwig	6	101
Carole	6	102
Philip	6	103
Anthony	7	104
Raymond	7	105

With Multibyte Code Sets

When you use the LENGTH function in a locale that supports a multibyte code set, such as the Japanese SJIS code set, the distinction between characters and bytes is meaningful. The LENGTH function returns the number of bytes in the column or quoted string, and this result might be different from the number of characters in the string.

The following example assumes that the database that contains the customer_multi table has a database locale that supports a multibyte code set. Suppose that the user enters a SELECT statement with the LENGTH function to display the last name, the length of the last name, and the customer number for the customer whose customer number is 199.

SELECT lname AS cust_name, length (fname) AS length, customer_num AS cust_num FROM customer_multi WHERE customer_num = 199

Assume that the last name (lname) for customer 199 consists of four characters, represented as follows:

aA1A2bB1B2

In this representation, the first character (the symbol a) is a single-byte character. The second character (the symbol A1A2) is a 2-byte character. The third character (the symbol b) is a single-byte character. The fourth character (the symbol B1B2) is a 2-byte character.

The following sample of output shows the result of the query. Although the customer first name consists of 4 characters, the length column shows that the total number of bytes in this name is 6.


cust_name	length	cust_num
aA1A2bB1B2	6	199

The OCTET_LENGTH Function

The OCTET_LENGTH function returns the number of bytes and generally includes trailing white spaces in the byte count. This SQL length function uses the definition of white space that the locale defines. OCTET_LENGTH returns the number of bytes in a character column, quoted string, host variable, or procedure variable. However, the actual behavior of the OCTET_LENGTH function varies with the type of argument that the user specifies.

The following table shows how the OCTET_LENGTH function operates on each of the argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. For simplicity, the Example column also assumes that the sample strings consist of single-byte characters.


OCTET_LENGTH Argument	Behavior	Example
Quoted string	Returns number of bytes in string, including any trailing white spaces.	If the string is `'Ludwig'`, the result is `6`. If the string is `'Ludwigssss'`, the result is `10`.
`CHAR` or `NCHAR` column	Returns number of bytes in string, including trailing white spaces. This value is the defined length, in bytes, of the column.	If the fname column of the customer table is a `CHAR`(15) column, and this column contains the string `'Ludwig'`, the result is `15`. If the fname column contains the string '`Ludwigsss`', the result is still `15`.
`VARCHAR` or `NVARCHAR` column	Returns number of bytes in string, including trailing white spaces. This value is the actual length, in bytes, of the character string. It is not the defined maximum column size.	If the cat_advert column of the catalog table is a `VARCHAR`(255, 65) column, and this column contains the string "`Ludwig`", the result is `6`. If the column contains the string '`Ludwigssss`', the result is `10`.
`TEXT` column	Returns number of bytes in column, including trailing white spaces.	If the cat_descr column in the catalog table is a `TEXT` column, and this column contains the string `'Ludwig'`, the result is `6`. If the cat_descr column contains the string `'Ludwigssss'`, the result is `10`.
Host or procedure variable	Returns number of bytes that the variable contains, including any trailing white spaces, regardless of defined length of variable.	If the procedure variable f_name is defined as `CHAR`(15), and this variable contains the string `'Ludwig'`, the result is `6`. If the f_name variable contains the string `'Ludwigssss'`, the result is `10`.

The difference between the LENGTH and OCTET_LENGTH functions is that OCTET_LENGTH generally includes trailing white spaces in the byte count, whereas LENGTH generally excludes trailing white spaces from the byte count.

The advantage of the OCTET_LENGTH function over the LENGTH function is that the OCTET_LENGTH function provides the actual column size whereas the LENGTH function trims the column values and returns the length of the trimmed string. This advantage of the OCTET_LENGTH function applies both to single-byte code sets such as ISO8859-1 and multibyte code sets such as the Japanese SJIS code set.

The following table shows some results that the OCTET_LENGTH function might generate.


OCTET_LENGTH Input String	Description	Result
'abc '	A quoted string with four single-byte characters (the characters `abc` and one trailing space)	4
'A¹A²B¹B²'	A quoted string with two multibyte characters	4
'aA¹A²bB¹B²'	A quoted string with two single-byte and two multibyte characters	6

The CHAR_LENGTH Function

The CHAR_LENGTH function (also known as the CHARACTER_LENGTH function) returns the number of characters in a quoted string, column with a character data type, host variable, or procedure variable. However, the actual behavior of this function varies with the type of argument that the user specifies.

The following table shows how the CHAR_LENGTH function operates on each of the argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. For simplicity, the Example column also assumes that the sample strings consist of single-byte characters.


CHAR_LENGTH Argument	Behavior	Example
Quoted string	Returns number of characters in string, including any trailing white spaces as defined in the locale.	If the string is `'Ludwig'`, the result is `6`. If the string is `'Ludwigssss'`, the result is `10`.
`CHAR` or `NCHAR` column	Returns number of characters in string, including trailing white spaces. This value is the defined length, in bytes, of the column.	If the fname column of the customer table is a `CHAR`(15) column, and this column contains the string `'Ludwig'`, the result is `15`. If the fname column contains the string `'Ludwigssss'`, the result is `15`.
`VARCHAR` or `NVARCHAR` column	Returns number of characters in string, including white spaces. This value is the actual length, in bytes, of the character string. It is not the defined maximum column size.	If the cat_advert column of the catalog table is a `VARCHAR`(255, 65), and this column contains the string "`Ludwig`", the result is `6`. If the column contains the string '`Ludwigssss`', the result is `10`.
`TEXT` column	Returns number of characters in column, including trailing white spaces.	If the cat_descr column in the catalog table is a `TEXT` column, and this column contains the string `'Ludwig'`, the result is `6`. If the cat_descr column contains the string `'Ludwigssss'`, the result is `10`.
Host or procedure variable	Returns number of characters that the variable contains, including any trailing white spaces, regardless of defined length of variable.	If the procedure variable f_name is defined as `CHAR`(15), and this variable contains the string `'Ludwig'`, the result is `6`. If the f_name variable contains the string `'Ludwigssss'`, the result is `10`.

The CHAR_LENGTH function is especially useful with multibyte code sets. If a quoted string of characters contains any multibyte characters, the number of characters in the string differs from the number of bytes in the string. You can use the CHAR_LENGTH function to determine the number of characters in the quoted string.

However, the CHAR_LENGTH function can also be useful in single-byte code sets. In these code sets, the number of bytes in a column is equal to the number of characters in the column. If you use the LENGTH function to determine the number of bytes in a column (which is equal to the number of characters in this case), LENGTH trims the column values and returns the length of the trimmed string. In contrast, CHAR_LENGTH does not trim the column values but returns the declared size of the column.

The following table shows some results that the CHAR_LENGTH function might generate for quoted strings.


CHAR_LENGTH Input String	Description	Result
'`abc` '	A quoted string with 4 single-byte characters (the characters `abc` and 1 trailing space)	4
'`A`1`A`2`B`1`B`2'	A quoted string with 2 multibyte characters	2
'`aA`1`A`2`B`1`B`2'	A quoted string with 2 single-byte and 2 multibyte characters	4

Informix Guide to GLS FunctionalitySQL Features