INFORMIX
Informix Guide to GLS Functionality
Chapter 3: SQL Features
Home Contents Index Master Index New Book

Collating Character Data

Collation involves the sorting of the data values in columns that have character data types. For an explanation of collation order and a discussion of the two methods of sorting character data (code-set order and localized order), see "The Collation Order".

The type of collation order that the database server uses affects the following SQL statements:

Collation Order in CREATE INDEX

The CREATE INDEX statement creates an index on one or more columns of a table. The ASC and DESC keywords in the CREATE INDEX statement control whether the index keys are stored in ascending or descending order.

When you use a nondefault locale, the following locale-specific considerations apply to the CREATE INDEX statement:

    For example, if the database stores its database locale as the Japanese SJIS locale (ja_jp.sjis), index keys for a CHAR column in any table of the database are stored in Japanese SJIS code-set order.

    For example, if the database is branded with the Japanese SJIS locale, index keys for an NCHAR column in any table of the database are stored in the localized order that the ja_jp.sjis locale defines.

If you use the default locale (U.S. English), the index keys are stored in the code-set order (in ascending or descending sequence) of the default code set regardless of the data type of the character column. Because the default locale does not define a localized order, the database server sorts NCHAR and NVARCHAR columns, as well as CHAR and VARCHAR columns, in code-set order.

For a complete description of the CREATE INDEX statement, see the Informix Guide to SQL: Syntax.

Collation Order in SELECT Statements

The SELECT statement performs a query on the specified table and retrieves data from the specified columns and rows. Collation order affects the following parts of the SELECT statement:

For a complete description of the SELECT statement, see the Informix Guide to SQL: Syntax.

The ORDER BY Clause

The ORDER BY clause sorts the retrieved rows by the values that are contained in a column or set of columns. When this clause sorts character columns, the results of the sort depend on the data type of the column, as follows:

Assume that you use a nondefault locale for the client and database locale, and you make a query against the table called abonnés. The following SELECT statement specifies three columns of CHAR data type in the select list: numéro (employee number), nom (last name), and prénom (first name):

The statement sorts the query results by the values that are contained in the nom column. Because the nom column that is specified in the ORDER BY clause is a CHAR column, the database server sorts the query results in the code-set order. As Figure 3-2 shows, names that begin with uppercase letters come before names that begin with lowercase letters, and names that start with an accented letter (Ålesund, Étaix, Ötker, and Øverst) come at the end of the list.

Figure 3-2
Data Set for Code-Set Order of the abonné Table
numéro nom prénom

13612

Azevedo

Edouardo Freire

13606

Dupré

Michèle Françoise

13607

Hammer

Gerhard

13602

Hämmerle

Greta

13604

LaForêt

Jean-Noël

13610

LeMaître

Héloïse

13613

Llanero

Gloria Dolores

13603

Montaña

José Antonio

13611

Oatfield

Emily

13609

Tiramisù

Paolo Alfredo

13600

da Sousa

João Lourenço Antunes

13615

di Girolamo

Giuseppe

13601

Ålesund

Sverre

13608

Étaix

Émile

13605

Ötker

Hans-Jürgen

13614

Øverst

Per-Anders

However, the result of the query is different if the numéro, nom, and prénom columns of the abonnés table are defined as NCHAR rather than CHAR.

Suppose the nondefault locale defines a localized order that collates the data as Figure 3-3 shows. This localized order defines equivalence classes for uppercase and lowercase letters and for unaccented and accented versions of the same letter.

Figure 3-3
Data Set for Localized Order of the abonné Table
numéro nom prénom

13612

Azevedo

Edouardo Freire

13601

Ålesund

Sverre

13600

da Sousa

João Lourenço Antunes

13615

di Girolamo

Giuseppe

13606

Dupré

Michèle Françoise

13608

Étaix

Émile

13607

Hammer

Gerhard

13602

Hämmerle

Greta

13604

LaForêt

Jean-Noël

13610

LeMaître

Héloïse

13613

Llanero

Gloria Dolores

13603

Montaña

José Antonio

13611

Oatfield

Emily

13605

Ötker

Hans-Jürgen

13614

Øverst

Per-Anders

13609

Tiramisù

Paolo Alfredo

The same SELECT statement now returns the query results in localized order because the nom column that the ORDER BY clause specifies is an NCHAR column.

The SELECT statement supports use of a column substring in an ORDER BY clause. However, you need to ensure that this use for column substrings works with the code set that your locale supports. For more information, see "Partial Characters in Column Substrings".

Logical Predicates in a WHERE Clause

The WHERE clause specifies search criteria and join conditions on the data that you want to select. Collation rules affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following logical predicates:

Relational-Operator Conditions
The following SELECT statement assumes a nondefault locale. It uses the less than (<) relational operator to specify that only those rows are to be retrieved from the abonnés table in which the value of the nom column is less than Hammer.

If nom is a CHAR column, the database server uses code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves only two rows.
numéro nom prénom

13612

Azevedo

Edouardo Freire

13606

Dupré

Michèle Françoise

These two rows are those less than Hammer in the code-set-ordered data set shown in Figure 3-2.

However, if nom is an NCHAR column, the database server uses localized order to sort the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves six rows.
numéro nom prénom

13612

Azevedo

Edouardo Freire

13601

Ålesund

Sverre

13600

da Sousa

João Lourenço Antunes

13615

di Girolamo

Giuseppe

13606

Dupré

Michèle Françoise

13608

Étaix

Émile

These six rows are those less than Hammer in the localized-order data set shown in Figure 3-3.

BETWEEN Conditions
The following SELECT statement assumes a nondefault locale and uses a BETWEEN condition to retrieve only those rows in which the values of the nom column are in the inclusive range of the values of the two expressions that follow the BETWEEN keyword:

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses the code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample output shows the query results.

(1 of 2)

numéro nom prénom

13612

Azevedo

Edouardo Freire

13606

Dupré

Michèle Françoise

13607

Hammer

Gerhard

13602

Hämmerle

Greta

13604

LaForêt

Jean-Noël

13610

LeMaître

Héloïse

13613

Llanero

Gloria Dolores

13603

Montaña

José Antonio

13611

Oatfield

Emily

13609

Tiramisù

Paolo Alfredo

Because the database server uses the code-set order for the nom values (see Figure 3-2), these query results do not include the following rows:

However, if nom is an NCHAR column, the database server uses localized order to sort the rows. The following sample output shows the query results.

(1 of 2)

numéro nom prénom

13612

Azevedo

Edouardo Freire

13601

Ålesund

Sverre

13600

da Sousa

João Lourenço Antunes

13615

di Girolamo

Giuseppe

13606

Dupré

Michèle Françoise

13608

Étaix

Émile

13607

Hammer

Gerhard

13602

Hämmerle

Greta

13604

LaForêt

Jean-Noël

13610

LeMaître

Héloïse

13613

Llanero

Gloria Dolores

13603

Montaña

José Antonio

13611

Oatfield

Emily

13605

Ötker

Hans-Jürgen

13614

Øverst

Per-Anders

13609

Tiramisù

Paolo Alfredo

Because the database server uses localized order for the nom values, these query results include rows in which the value of nom begins with a lowercase letter or accented letter.

IN Conditions
An IN condition is satisfied when the expression to the left of the IN keyword is included in the parenthetical list of values to the right of the keyword. The following SELECT statement assumes a nondefault locale and uses an IN condition to retrieve only those rows in which the value of the nom column is any of the following: Azevedo, Llanero, or Oatfield.

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses code-set order (see Figure 3-2). The database server retrieves rows in which the value of nom is Azevedo, but not rows in which the value of nom is azevedo or Åzevedo because the characters A, a, and Å are not equivalent in the code-set order (see Figure 3-2). The query also returns rows with the nom values of Llanero and Oatfield.

However, if nom is an NCHAR column, the database server uses localized order (see Figure 3-3) to sort the rows. If the locale defines A, a, and Å as equivalent characters in the localized order, the query returns rows in which the value of nom is Azevedo, azevedo, or Åzevedo. The same selection rule applies to the other names in the parenthetical list that follows the IN keyword.

Comparisons with MATCHES and LIKE Conditions

Collation rules also affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following conditions:

MATCHES Condition
A MATCHES condition tests for matching character strings. The condition is true, or satisfied, when the value of the column to the left of the MATCHES keyword matches the pattern that a quoted string specifies to the right of the MATCHES keyword.You can use wildcard characters in the string. For example, you can use brackets to specify a range of characters. (For more information on MATCHES, see the Condition Segment in Chapter 1 of the Informix Guide to SQL: Syntax.)

When the MATCHES condition does not list a range of characters in the string, it specifies a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

The examples in the following table illustrate the different results that CHAR and NCHAR columns produce when a user specifies the MATCHES keyword without a range in a SELECT statement. These examples assume use of a nondefault locale that defines A and a in an equivalence class. It also assumes that col1 is a CHAR column and col2 is an NCHAR column in table mytable.
Query Data Type Query Results

SELECT * FROM mytable WHERE col1 MATCHES 'art'

CHAR

All rows in which column col1 contains the value 'art' with a lowercase a

SELECT * FROM mytable WHERE col2 MATCHES 'art'

NCHAR

All rows in which column col2 contains the value 'art' or 'Art'

When you use the MATCHES keyword to specify a range, collation considerations come into play for all columns with character data types. When the column to the left of the MATCHES keyword is an NCHAR, NVARCHAR, CHAR, or VARCHAR column, and the quoted string to the right of the MATCHES keyword includes brackets to specify a range, the database server uses localized order.

Important: When the database server determines the characters that fall within a range, it always uses the localized order that is specified for the database, even for CHAR and VARCHAR columns. This behavior is an exception to the rule that the database server uses code-set order for all operations on CHAR and VARCHAR columns and uses localized order for all operations on NCHAR and NVARCHAR columns.
Some simple examples show how the database server treats NCHAR, NVARCHAR, CHAR, and VARCHAR columns when you use the MATCHES keyword with a range in a SELECT statement. Suppose that you want to retrieve from the abonnés table the employee number, first name, and last name for all employees whose last name nom begins in the range of characters E through P. Also assume that the nom column is an NCHAR column. The following SELECT statement uses a MATCHES condition in the WHERE clause to pose this query:

The rows for Étaix, Ötker, and Øverst appear in the query result because, in the localized order (Figure 3-3), the accented first letter of each name falls within the E through P MATCHES range for the nom column.
numéro nom prénom

13608

Étaix

Émile

13607

Hammer

Gerhard

13602

Hämmerle

Greta

13604

LaForêt

Jean-Noël

13610

LeMaître

Héloïse

13613

Llanero

Gloria Dolores

13603

Montaña

José Antonio

13611

Oatfield

Emily

13605

Ötker

Hans-Jürgen

13614

Øverst

Per-Anders

If nom is a CHAR column, the query result is exactly the same as when nom was an NCHAR column. The database server always uses localized order to determine what characters fall within a range, regardless of whether the column is CHAR or NCHAR.

LIKE Condition
A LIKE condition also tests for matching character strings. As with the MATCHES condition, the LIKE condition is true, or satisfied, when the value of the column to the left of the LIKE keyword matches the pattern that the quoted string specifies to the right of the LIKE keyword. You can only use certain symbols as wildcards in the quoted string. (For more information on LIKE, see the Condition Segment in Chapter 1 of the Informix Guide to SQL: Syntax.)

The LIKE condition can only specify a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

The LIKE keyword does not support matches with a range. That is, you cannot use bracketed wildcard characters in LIKE conditions.

Wildcard Characters in LIKE and MATCHES Conditions
Informix products support the following ASCII characters as wildcard characters in the MATCHES and LIKE conditions.
Clause Wildcard Characters

LIKE

_ %

MATCHES

* ? [ ] ^ -

For CHAR and VARCHAR data, the database server performs byte-by-byte comparison for pattern matching in the LIKE and MATCHES conditions. For NCHAR and NVARCHAR data, the database server performs pattern matching in the LIKE and MATCHES conditions based on characters, not bytes. Therefore, the _ (underscore) wildcard of the LIKE clause and the ? wildcard of the MATCHES clause match any one single-byte or multibyte character, as the following table shows.
Condition Quoted String Column Value Result

LIKE

'ab_d'

'abcd'

True

LIKE

'ab_d'

'abA1A2d'

True

MATCHES

'ab?d'

'abcd'

True

MATCHES

'ab?d'

'abA1A2d'

True

The database server treats any multibyte character as a literal character. To tell the database server to interpret a wildcard character as its literal meaning, you must precede the character with an escape character. You must use single-byte characters as escape characters; the database server does not recognize use of multibyte characters for this purpose. The default escape character is the backslash (\).

The following use of the MATCHES clause gives a true result for the column value that is shown.
Clause Quoted String Column Value Result

MATCHES

'ab\?d'

'ab?d'

True




Informix Guide to GLS Functionality, version 9.1
Copyright © 1998, Informix Software, Inc. All rights reserved.