Home | Previous Page | Next Page   SQL Features > Handling Character Data >

Collating Character Data

Collation is the process of sorting data values in columns that have character data types. For an explanation of collation order and a discussion of the two methods of sorting character data (code-set order and localized order), see Character Classes of the Code Set.

By default, the database server sorts strings according to the collation that the DB_LOCALE setting implies, and client applications sort according to the CLIENT_LOCALE setting, if this is different from the DB_LOCALE setting.

Dynamic Server

The SET COLLATION statement of Dynamic Server can specify a localized collation different from the DB_LOCALE setting for the current session.

See the IBM Informix: Guide to SQL Syntax for the syntax of this statement. Database objects that sort strings, such as indexes or triggers, use the collation that was in effect at the time of their creation when they sort NCHAR or NVARCHAR values, if this is different from the DB_LOCALE setting.

End of Dynamic Server

The collation order of the database server affects SQL statement that perform sorting operations, including CREATE INDEX and SELECT statements.

Collation Order in CREATE INDEX

The CREATE INDEX statement creates an index on one or more columns of a table. The ASC and DESC keywords in the CREATE INDEX statement control whether the index keys are stored in ascending or descending order.

When you use a nondefault locale, the following locale-specific considerations apply to the CREATE INDEX statement:

Dynamic Server

If the SET COLLATION statement specifies a database locale with localized collation that is different from the DB_LOCALE setting, any indexes (and any check constraints) that you subsequently create in the same session always use that localized collation for sorting NCHAR or NVARCHAR strings.

End of Dynamic Server

If you use the default locale (U.S. English), the index keys are stored in the code-set order (in ascending or descending order) of the default code set regardless of the data type of the character column. Because the default locale does not define a localized order, the database server that uses this locale (or any other locale that does not define a localized collating order) sorts strings from columns of the following data types in code-set order:

Collation Order in SELECT Statements

The SELECT statement performs a queries. Collation order affects the following parts of the SELECT statement:

The ORDER BY Clause

The ORDER BY clause sorts retrieved rows by the values that are contained in a column or set of columns. When this clause sorts character columns, the results of the sort depend on the data type of the column, as follows:

Assume that you use a nondefault locale for the client and database locale, and you make a query against the table called abonnés. This SELECT statement specifies three columns of CHAR data type in the select list: numéro (employee number), nom (last name), and prénom (first name).

SELECT numéro,nom,prénom 
   FROM abonnés
   ORDER BY nom;

The statement sorts the query results by the values that are contained in the nom column. Because the nom column that is specified in the ORDER BY clause is a CHAR column, the database server sorts the query results in the code-set order.

As this table shows, names that begin with uppercase letters come before names beginning with lowercase letters, and names that begin with an accented letter (Ålesund, Étaix, Ötker, and Øverst) are at the end of the list.

Table 5. Data Set for Code-Set Order of the abonnés Table
numéro nom prénom
13612 Azevedo Edouardo Freire
13606 Dupré Michèle Françoise
13607 Hammer Gerhard
13602 Hämmerle Greta
13604 LaForêt Jean-Noël
13610 LeMaître Héloïse
13613 Llanero Gloria Dolores
13603 Montaña José Antonio
13611 Oatfield Emily
13609 Tiramisù Paolo Alfredo
13600 da Sousa João Lourenço Antunes
13615 di Girolamo Giuseppe
13601 Ålesund Sverre
13608 Étaix Émile
13605 Ötker Hans-Jürgen
13614 Øverst Per-Anders

Results of the query is different, however, if the numéro, nom, and prénom columns of the abonnés table are defined as NCHAR rather than CHAR.

Suppose the nondefault locale defines a localized order that collates the data as the following table shows. This localized order defines equivalence classes for uppercase and lowercase letters and for unaccented and accented versions of the same letter.

Table 6. Data Set for Localized Order of the abonnés Table
numéro nom prénom
13612 Azevedo Edouardo Freire
13601 Ålesund Sverre
13600 da Sousa João Lourenço Antunes
13615 di Girolamo Giuseppe
13606 Dupré Michèle Françoise
13608 Étaix Émile
13607 Hammer Gerhard
13602 Hämmerle Greta
13604 LaForêt Jean-Noël
13610 LeMaître Héloïse
13613 Llanero Gloria Dolores
13603 Montaña José Antonio
13611 Oatfield Emily
13605 Ötker Hans-Jürgen
13614 Øverst Per-Anders
13609 Tiramisù Paolo Alfredo

The same SELECT statement now returns the query results in localized order because the nom column that the ORDER BY clause specifies is an NCHAR column.

The SELECT statement supports use of a column substring in an ORDER BY clause. However, you need to ensure that this use for column substrings works with the code set that your locale supports. For more information, see Partial Characters in Column Substrings.

Logical Predicates in a WHERE Clause

The WHERE clause specifies search criteria and join conditions on the data that you want to select. Collation rules affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following logical predicates:

Relational-Operator Conditions

The following SELECT statement assumes a nondefault locale. It uses the less than (<) relational operator to specify that the only rows are to be retrieved from the abonnés table are those in which the value of the nom column is less than Hammer.

SELECT numéro,nom,prénom 
   FROM abonnés
   WHERE nom < 'Hammer';

If nom is a CHAR column, the database server uses code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The output shows that this SELECT statement retrieves only two rows.

numéro nom prénom
13612 Azevedo Edouardo Freire
13606 Dupré Michèle Françoise

These two rows are those less than Hammer in the code-set-ordered data set shown in Table 5.

However, if nom is an NCHAR column, the database server uses localized order to sort the rows that the WHERE clause specifies. The following example of output shows that this SELECT statement retrieves six rows.

numéro nom prénom
13612 Azevedo Edouardo Freire
13601 Ålesund Sverre
13600 da Sousa João Lourenço Antunes
13615 di Girolamo Giuseppe
13606 Dupré Michèle Françoise
13608 Étaix Émile

These six rows are those less than Hammer in the localized-order data set shown in Table 6.

BETWEEN Conditions

The following SELECT statement assumes a nondefault locale and uses a BETWEEN condition to retrieve only those rows in which the values of the nom column are in the inclusive range of the values of the two expressions that follow the BETWEEN keyword:

SELECT numéro,nom,prénom 
   FROM abonnés
   WHERE nom BETWEEN 'A' AND 'Z';

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses the code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following example of output shows the query results.

numéro nom prénom
13612 Azevedo Edouardo Freire
13606 Dupré Michèle Françoise
13607 Hammer Gerhard
13602 Hämmerle Greta
13604 LaForêt Jean-Noël
13610 LeMaître Héloïse
13613 Llanero Gloria Dolores
13603 Montaña José Antonio
13611 Oatfield Emily
13609 Tiramisù Paolo Alfredo

Because the database server uses the code-set order for the nom values, as Table 5 shows, these query results do not include the following rows:

However, if nom is an NCHAR column, the database server uses localized order to sort the rows. The following output shows the query results.

numéro nom prénom
13612 Azevedo Edouardo Freire
13601 Ålesund Sverre
13600 da Sousa João Lourenço Antunes
13615 di Girolamo Giuseppe
13606 Dupré Michèle Françoise
13608 Étaix Émile
13607 Hammer Gerhard
13602 Hämmerle Greta
13604 LaForêt Jean-Noël
13610 LeMaître Héloïse
13613 Llanero Gloria Dolores
13603 Montaña José Antonio
13611 Oatfield Emily
13605 Ötker Hans-Jürgen
13614 Øverst Per-Anders
13609 Tiramisù Paolo Alfredo

Because the database server uses localized order for the nom values, these query results include rows in which the value of nom begins with a lowercase letter or accented letter.

IN Conditions

An IN condition is satisfied when the expression to the left of the IN keyword is included in the parenthetical list of values to the right of the keyword. This SELECT statement assumes a nondefault locale and uses an IN condition to retrieve only those rows in which the value of the nom column is any of the following: Azevedo, Llanero, or Oatfield.

SELECT numéro,nom,prénom 
   FROM abonnés
   WHERE nom IN ('Azevedo', 'Llanero', 'Oatfield');

The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses code-set order, as Table 5 shows. The database server retrieves rows in which the value of nom is Azevedo, but not rows in which the value of nom is azevedo or Åzevedo because the characters A, a, and Å are not equivalent in the code-set order. The query also returns rows with the nom values of Llanero and Oatfield.

However, if nom is an NCHAR column, the database server uses localized order, as Table 6 shows, to sort the rows. If the locale defines A, a, and Å as equivalent characters in the localized order, the query returns rows in which the value of nom is Azevedo, azevedo, or Åzevedo. The same selection rule applies to the other names in the parenthetical list that follows the IN keyword.

Comparisons with MATCHES and LIKE Conditions

Collation rules also affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following conditions:

MATCHES Condition

A MATCHES condition tests for matching character strings. The condition is true, or satisfied, when the value of the column to the left of the MATCHES keyword matches the pattern that a quoted string specifies to the right of the MATCHES keyword. You can use wildcard characters in the string. For example, you can use brackets to specify a range of characters. For more information about MATCHES, see the IBM Informix: Guide to SQL Syntax.

When a MATCHES expression does not list a range of characters in the string, it specifies a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

The examples in the following table illustrate the different results that CHAR and NCHAR columns produce when a user specifies the MATCHES keyword without a range in a SELECT statement. These examples assume use of a nondefault locale that defines A and a in an equivalence class. It also assumes that col1 is a CHAR column and col2 is an NCHAR column in table mytable.

Query Data Type Query Results
SELECT * FROM mytable WHERE col1 MATCHES 'art' CHAR All rows in which column col1 contains the value 'art' with a lowercase a
SELECT * FROM mytable WHERE col2 MATCHES 'art' NCHAR All rows in which column col2 contains the value 'art' or 'Art'

When you use the MATCHES keyword to specify a range, collation considerations come into play for all columns with character data types. When the column to the left of the MATCHES keyword is an NCHAR, NVARCHAR, CHAR, VARCHAR, or (for Dynamic Server only) LVARCHAR data type, and the string operand of the MATCHES keyword includes brackets ( [ ] ) to specify a range, sorting follows a localized order, if the locale defines one.

Important:
When the database server determines the characters that fall within a range with the MATCHES operator, it uses the localized order, if DB_LOCALE or SET COLLATION has specified one, even for CHAR, LVARCHAR, and VARCHAR columns. This behavior is an exception to the rule that the database server uses code-set order for all operations on CHAR, LVARCHAR and VARCHAR columns, and localized order (if one is defined) for sorting operations on NCHAR and NVARCHAR columns.

Some simple examples show how the database server treats NCHAR, NVARCHAR, LVARCHAR, CHAR, and VARCHAR columns when you use the MATCHES keyword with a range in a SELECT statement. Suppose that you want to retrieve from the abonnés table the employee number, first name, and last name for all employees whose last name nom begins in the range of characters E through P. Also assume that the nom column is an NCHAR column. The following SELECT statement uses a MATCHES condition in the WHERE clause to pose this query:

SELECT numéro,nom,prénom 
   FROM abonnés
   WHERE nom MATCHES '[E-P]*'
   ORDER BY nom;

The rows for Étaix, Ötker, and Øverst appear in the query result because, in the localized order, as Table 6 shows, the accented first letter of each name falls within the E through P MATCHES range for the nom column.

numéro nom prénom
13608 Étaix Émile
13607 Hammer Gerhard
13602 Hämmerle Greta
13604 LaForêt Jean-Noël
13610 LeMaître Héloïse
13613 Llanero Gloria Dolores
13603 Montaña José Antonio
13611 Oatfield Emily
13605 Ötker Hans-Jürgen
13614 Øverst Per-Anders

If nom is a CHAR column, the query result is exactly the same as when nom was an NCHAR column. The database server always uses localized order to determine what characters fall within a range, regardless of whether the column is CHAR or NCHAR.

LIKE Condition

A LIKE condition also tests for matching character strings. As with the MATCHES condition, the LIKE condition is true, or satisfied, when the value of the column to the left of the LIKE keyword matches the pattern that the quoted string specifies to the right of the LIKE keyword. You can use only certain symbols as wildcards in the quoted string. For more information about LIKE, see the IBM Informix: Guide to SQL Syntax.

The LIKE condition can specify only a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:

The LIKE keyword does not support ranges of characters. That is, you cannot use bracketed characters to specify a range in LIKE conditions.

Wildcard Characters in LIKE and MATCHES Conditions

IBM Informix products support the following ASCII characters as wildcard characters in the MATCHES and LIKE conditions.

Condition Wildcard Characters
LIKE _ %
MATCHES * ? [ ] ^ -

For CHAR and VARCHAR data, the database server performs byte-by-byte comparison for pattern matching in the LIKE and MATCHES conditions. For NCHAR and NVARCHAR data, the database server performs pattern matching in the LIKE and MATCHES conditions based on logical characters, not bytes. Therefore, the underscore ( _ ) wildcard of the LIKE clause and the ? (question mark) wildcard of the MATCHES clause match any one single-byte or multibyte character, as the following table shows.

Condition Quoted String Column Value Result
LIKE 'ab_d' 'abcd' True
LIKE 'ab_d' 'abA1A2d' True
MATCHES 'ab?d' 'abcd' True
MATCHES 'ab?d' 'abA1A2d' True

The database server treats any multibyte character as a literal character. To tell the database server to interpret a wildcard character as its literal meaning, you must precede the character with an escape character. You must use single-byte characters as escape characters; the database server does not recognize use of multibyte characters for this purpose. The default escape character is the backslash (\) symbol.

The following MATCHES condition returns a TRUE result for the column value that is shown.

Condition Quoted String Column Value Result
MATCHES 'ab\?d' 'ab?d' True
Home | [ Top of Page | Previous Page | Next Page | Contents | Index ]