![]() |
|
The GLS feature allows you to put non-ASCII characters (including multibyte characters) in the following parts of an SQL statement:
You use quoted strings in a variety of SQL statements, particularly data manipulation statements such as SELECT and INSERT. A quoted string is a sequence of characters that is delimited by quotation marks. The quotation marks can be single quotes or double quotes. However, if the DELIMIDENT environment variable is set, the database server interprets a sequence of characters in double quotes as a delimited identifier rather than as a string. For more information about delimited identifiers, see Non-ASCII Characters in Identifiers.
When you use a nondefault locale, you can use any characters in the code set of your locale within a quoted string. If the locale supports a code set with non-ASCII characters, you can use these characters in a quoted string. In the following example, the user inserts column values that include multibyte characters in the table mytable:
In this example, the first quoted string includes the multibyte characters A1A2 and B1B2. The second quoted string includes the multibyte characters X1X2 and Y1Y2. The third quoted string contains only single-byte characters. This example assumes that the locale supports a multibyte code set with the A1A2, B1B2, X1X2, and Y1Y2 characters.
For complete information on quoted strings, see the Quoted String segment in the Informix Guide to SQL: Syntax.
To use comments after SQL statements, introduce the comment text with one of the following comment symbols:
When you use a nondefault locale, you can use any characters in the code set of your locale within a comment. If the locale supports a code set with non-ASCII characters, you can use these characters in an SQL comment. In the following example, the user inserts a column value that includes multibyte characters in the table mytable:
In this example, the SQL comment includes the multibyte characters A1A2 and B1B2. This example assumes that the locale supports a multibyte code set with the A1A2 and B1B2 characters.
For complete information on SQL comments and comment symbols, see the Informix Guide to SQL: Syntax.
When you specify a column expression with a character data type in a SELECT statement (or in any other SQL statement that includes an embedded SELECT statement), you can specify that a subset of the data in the column is to be retrieved. A column expression that includes brackets to signify a subset of the data in the column is known as a column substring. The syntax of a column substring is as follows.
Element | Purpose |
---|---|
first | Position of the first byte in the substring |
last | Position of the last byte in the substring |
Suppose that you want to retrieve the customer_num column and the seventh through ninth bytes of the lname column from the customer table. To perform this query, use a column substring for the lname column in your SELECT statement, as follows:
If the value of the lname column is Albertson, the following sample output shows the result of the query.
Because the locale supports a single-byte code set, the preceding query seems to return the seventh through ninth characters of the name Albertson. However, column substrings are byte based, and the query returns the seventh through ninth bytes of the name. Because one byte is equal to one character in single-byte code sets, the distinction between characters and bytes in column substrings is not apparent in these code sets.
For multibyte code sets, column substrings return the specified number of bytes, not number of characters. If a character column multi_col contains a string that consists of three 2-byte characters, this 6-byte string can be represented as follows:
Suppose that you specified the following column substring for the multi_col column in a query:
The query returns the following result:
The substring that the query returns consists of 2 bytes (1 character), not 2 characters.
To retrieve the first two characters from the multi_col column, specify a column substring in which first is the byte position of the first byte in the first character and last is the byte position of the last byte in the second character. For the 6-byte string A1A2B1B2C1C2, you specify this column substring as follows in your query:
The following result is returned:
The substring that the query returns consists of the first 4 bytes of the column value as you specified. These 4 bytes represent the first two characters in the column.
A multibyte character might consist of 2, 3, or 4 bytes. A multibyte character that has lost one or more of its bytes so that the original intended meaning of the character is lost is called a partial character.
Unless prevented, a column substring might truncate a multibyte character or split it up in such a manner that it no longer retains the original sequence of bytes. A partial character might be generated when you use column subscript operators on columns that contain multibyte characters. Suppose that a user specifies the following column substring for the multi_col column where the value of the string in multi_col is A1A2B1B2C1C2:
The user requests the following bytes in the query: A2B1B2C1. However, if the database server returned this column substring to the user, the first and third characters in the column would be truncated.
Avoidance in a Multibyte Code SetInformix database servers do not allow partial characters to occur. The GLS feature prevents the database server from returning the specified range of bytes literally when this range contains partial characters. If your database locale supports a multibyte code set and you specify a particular column substring in a query, the database server replaces any truncated multibyte characters with single-byte white spaces.
For example, suppose the multi_col column contains the string A1A2A3A4B1B2B3B4, and you execute the following SELECT statement:
The query indicates that no matching rows were found because the database server converts the substring multi_col[2,4], the string A2A3A4, to three single-byte spaces (sss). The WHERE clause of the query specifies the following condition for the search:
Because this condition is never true, the query retrieves no matching rows.
Informix database servers replace partial characters in each individual substring operation, even when they are concatenated. For example, suppose the multi_col column contains A1A2B1B2C1C2D1D2, and the WHERE clause contains the following condition:
The query does not return any rows because the result of the concatenation (A2B1B2C2D1D2) contains two partial characters, A2 and C2. The Informix database server converts these partial characters to single-byte spaces and creates the following WHERE clause condition:
This condition is also never true, so the query retrieves no matching rows.
Partial characters present a problem if the substrings strings can be processed or presented to users in any way that makes their concatenation not reconstruct the original logical string. Possible problem areas include when a substring of one multibyte character is actually a valid character by itself. For example, suppose a multibyte code set contains a 4-byte character, A1A2A3A4, that represents the digit 1 and a 3-byte character, A2A3A4, that represents the digit 6. Suppose also that you use the locale that contains this multibyte code set when you execute the following query:
The database server interprets multi_col[2,4] as the valid 3-byte character (a multibyte 6) instead of a substring of the valid 4-byte character ('sss'). Therefore, the WHERE clause contains the following condition:
The problem of partial characters does not occur in single-byte code sets because each character is stored in a single byte. When your database locale supports a single-byte code set, and you specify a particular column substring in a query, the database server returns exactly the subset of data that you requested and does not replace any characters with white spaces.
Partial characters might also create a problem when you specify column substrings in an ORDER BY clause of a SELECT statement. The syntax for specifying column substrings in the ORDER BY clause is as follows.
If the locale supports a multibyte code set whose characters are all of the same length, you can use column substrings in an ORDER BY clause. However, the more likely scenario is that your multibyte code set contains characters with varying lengths. In this case, you might not find it useful to specify column substrings in the ORDER BY clause.
For example, suppose that you want to retrieve all rows from the multi_data table, and you want to use the multi_chars column with a column subscript to collate the query results. The following SELECT statement attempts to collate the query results according to the portion of the multi_chars column that is contained in the fourth to sixth characters of the column:
If the locale supports a multibyte code set whose characters are all 2 bytes in length, you know that the fourth character in the column begins in byte position 7, and the sixth character in the column ends in byte position 12. The preceding SELECT statement does not generate partial characters.
However, if the multibyte code set contains a mixture of single-byte characters, 2-byte characters, and 3-byte characters, the column substring multi_chars[7,12] might create partial characters from the multi_chars data. In this case, you might get unexpected results when you specify a column substring in the ORDER BY clause.
For information on the collation order of different types of character data in the ORDER BY clause, see The ORDER BY Clause. For the complete syntax and usage of the ORDER BY clause, see the SELECT statement in the Informix Guide to SQL: Syntax.
Tip: A partial character might also be generated when a SQL API copies multibyte data from one buffer to another. For more information, see Generating Non-ASCII Filenames.
Avoidance in TEXT and BYTE Columns
Partial characters are not a problem when you specify a column substring for a column with the TEXT or BYTE data type. The database server avoids partial characters in TEXT and BYTE columns in the following way:
Warning: The processing and interpretation of TEXT and BYTE data are the responsibility of the client application, which must handle the possibility of partial characters in these operations.
The TRIM function is an SQL function that removes leading or trailing pad characters from a character string. By default, this pad character is an ASCII space. If your locale supports a code set that defines a different character as a space, TRIM does not remove this locale-specific space from the front or back of a string. If you specify the LEADING, TRAILING, or BOTH keywords for TRIM, you can define a different pad character. However, you cannot specify a non-ASCII character as a pad character, even if your locale supports a code set that defines the non-ASCII character.
The SQL search functions UPPER, LOWER, and INITCAP support GLS. They accept multibyte characters in character-type source strings and operate on them. The return type is the same as the type of the source string:
For complete information about these search functions, see the Informix Guide to SQL: Syntax.
Collation involves the sorting of the data values in columns that have character data types. For an explanation of collation order and a discussion of the two methods of sorting character data (code-set order and localized order), see Character Classes of the Code Set.
The type of collation order that the database server uses affects the following SQL statements:
The CREATE INDEX statement creates an index on one or more columns of a table. The ASC and DESC keywords in the CREATE INDEX statement control whether the index keys are stored in ascending or descending order.
When you use a nondefault locale, the following locale-specific considerations apply to the CREATE INDEX statement:
If you use the default locale (U.S. English), the index keys are stored in the code-set order (in ascending or descending sequence) of the default code set regardless of the data type of the character column. Because the default locale does not define a localized order, the database server sorts columns of the following data types in code-set order:
The SELECT statement performs a query on the specified table and retrieves data from the specified columns and rows. Collation order affects the following parts of the SELECT statement:
The ORDER BY clause sorts the retrieved rows by the values that are contained in a column or set of columns. When this clause sorts character columns, the results of the sort depend on the data type of the column, as follows:
Assume that you use a nondefault locale for the client and database locale, and you make a query against the table called abonnés. This SELECT statement specifies three columns of CHAR data type in the select list: numéro (employee number), nom (last name), and prénom (first name).
The statement sorts the query results by the values that are contained in the nom column. Because the nom column that is specified in the ORDER BY clause is a CHAR column, the database server sorts the query results in the code-set order. As the following table shows, names that begin with uppercase letters come before names that begin with lowercase letters, and names that start with an accented letter (Ålesund, Étaix, Ötker, and Øverst) come at the end of the list.
Figure 3-2
However, the result of the query is different if the numéro, nom, and prénom columns of the abonnés table are defined as NCHAR rather than CHAR.
Suppose the nondefault locale defines a localized order that collates the data as the following table shows. This localized order defines equivalence classes for uppercase and lowercase letters and for unaccented and accented versions of the same letter.
Figure 3-3
The same SELECT statement now returns the query results in localized order because the nom column that the ORDER BY clause specifies is an NCHAR column.
The SELECT statement supports use of a column substring in an ORDER BY clause. However, you need to ensure that this use for column substrings works with the code set that your locale supports. For more information, see Partial Characters in Column Substrings.
Logical Predicates in a WHERE ClauseThe WHERE clause specifies search criteria and join conditions on the data that you want to select. Collation rules affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following logical predicates:
The following SELECT statement assumes a nondefault locale. It uses the less than (<) relational operator to specify that the only rows are to be retrieved from the abonnés table are those in which the value of the nom column is less than Hammer.
If nom is a CHAR column, the database server uses code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves only two rows.
These two rows are those less than Hammer in the code-set-ordered data set shown in Figure 3-2 on page 3-32.
However, if nom is an NCHAR column, the database server uses localized order to sort the rows that the WHERE clause specifies. The following sample of output shows that this SELECT statement retrieves six rows.
numéro | nom | prénom |
---|---|---|
13612 | Azevedo | Edouardo Freire |
13601 | Ålesund | Sverre |
13600 | da Sousa | João Lourenço Antunes |
13615 | di Girolamo | Giuseppe |
13606 | Dupré | Michèle Françoise |
13608 | Étaix | Émile |
These six rows are those less than Hammer in the localized-order data set shown in Figure 3-3 on page 3-33.
BETWEEN ConditionsThe following SELECT statement assumes a nondefault locale and uses a BETWEEN condition to retrieve only those rows in which the values of the nom column are in the inclusive range of the values of the two expressions that follow the BETWEEN keyword:
The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses the code-set order of the default code set to retrieve the rows that the WHERE clause specifies. The following sample output shows the query results.
Because the database server uses the code-set order for the nom values, as Figure 3-2 on page 3-32 shows, these query results do not include the following rows:
However, if nom is an NCHAR column, the database server uses localized order to sort the rows. The following sample output shows the query results.
Because the database server uses localized order for the nom values, these query results include rows in which the value of nom begins with a lowercase letter or accented letter.
IN ConditionsAn IN condition is satisfied when the expression to the left of the IN keyword is included in the parenthetical list of values to the right of the keyword. This SELECT statement assumes a nondefault locale and uses an IN condition to retrieve only those rows in which the value of the nom column is any of the following: Azevedo, Llanero, or Oatfield.
The query result depends on whether nom is a CHAR or NCHAR column. If nom is a CHAR column, the database server uses code-set order, as Figure 3-2 on page 3-32 shows. The database server retrieves rows in which the value of nom is Azevedo, but not rows in which the value of nom is azevedo or Åzevedo because the characters A, a, and Å are not equivalent in the code-set order. The query also returns rows with the nom values of Llanero and Oatfield.
However, if nom is an NCHAR column, the database server uses localized order, as Figure 3-3 on page 3-33 shows, to sort the rows. If the locale defines A, a, and Å as equivalent characters in the localized order, the query returns rows in which the value of nom is Azevedo, azevedo, or Åzevedo. The same selection rule applies to the other names in the parenthetical list that follows the IN keyword.
Collation rules also affect the WHERE clause when the expressions in the condition are column expressions with character data types and the search condition is one of the following conditions:
A MATCHES condition tests for matching character strings. The condition is true, or satisfied, when the value of the column to the left of the MATCHES keyword matches the pattern that a quoted string specifies to the right of the MATCHES keyword.You can use wildcard characters in the string. For example, you can use brackets to specify a range of characters. For more information about MATCHES, see the Informix Guide to SQL: Syntax.
When the MATCHES condition does not list a range of characters in the string, it specifies a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:
The examples in the following table illustrate the different results that CHAR and NCHAR columns produce when a user specifies the MATCHES keyword without a range in a SELECT statement. These examples assume use of a nondefault locale that defines A and a in an equivalence class. It also assumes that col1 is a CHAR column and col2 is an NCHAR column in table mytable.
When you use the MATCHES keyword to specify a range, collation considerations come into play for all columns with character data types. When the column to the left of the MATCHES keyword is an NCHAR, NVARCHAR, CHAR, or VARCHAR column, and the quoted string to the right of the MATCHES keyword includes brackets to specify a range, the database server uses localized order.
Important: When the database server determines the characters that fall within a range, it always uses the localized order that is specified for the database, even for CHAR and VARCHAR columns. This behavior is an exception to the rule that the database server uses code-set order for all operations on CHAR and VARCHAR columns and localized order for all operations on NCHAR and NVARCHAR columns.
Some simple examples show how the database server treats NCHAR, NVARCHAR, CHAR, and VARCHAR columns when you use the MATCHES keyword with a range in a SELECT statement. Suppose that you want to retrieve from the abonnés table the employee number, first name, and last name for all employees whose last name nom begins in the range of characters E through P. Also assume that the nom column is an NCHAR column. The following SELECT statement uses a MATCHES condition in the WHERE clause to pose this query:
The rows for Étaix, Ötker, and Øverst appear in the query result because, in the localized order, as Figure 3-3 on page 3-33 shows, the accented first letter of each name falls within the E through P MATCHES range for the nom column.
If nom is a CHAR column, the query result is exactly the same as when nom was an NCHAR column. The database server always uses localized order to determine what characters fall within a range, regardless of whether the column is CHAR or NCHAR.
LIKE ConditionA LIKE condition also tests for matching character strings. As with the MATCHES condition, the LIKE condition is true, or satisfied, when the value of the column to the left of the LIKE keyword matches the pattern that the quoted string specifies to the right of the LIKE keyword. You can use only certain symbols as wildcards in the quoted string. For more information about LIKE, see the Informix Guide to SQL: Syntax.
The LIKE condition can specify only a literal match. For literal matches, the data type of the column determines whether collation considerations come into play, as follows:
The LIKE keyword does not support matches with a range. That is, you cannot use bracketed wildcard characters in LIKE conditions.
Wildcard Characters in LIKE and MATCHES ConditionsInformix products support the following ASCII characters as wildcard characters in the MATCHES and LIKE conditions.
For CHAR and VARCHAR data, the database server performs byte-by-byte comparison for pattern matching in the LIKE and MATCHES conditions. For NCHAR and NVARCHAR data, the database server performs pattern matching in the LIKE and MATCHES conditions based on logical characters, not bytes. Therefore, the _ (underscore) wildcard of the LIKE clause and the ? (question mark) wildcard of the MATCHES clause match any one single-byte or multibyte character, as the following table shows.
Condition | Quoted String | Column Value | Result |
---|---|---|---|
LIKE | 'ab_d' | 'abcd' | True |
LIKE | 'ab_d' | 'abA1A2d' | True |
MATCHES | 'ab?d' | 'abcd' | True |
MATCHES | 'ab?d' | 'abA1A2d' | True |
The database server treats any multibyte character as a literal character. To tell the database server to interpret a wildcard character as its literal meaning, you must precede the character with an escape character. You must use single-byte characters as escape characters; the database server does not recognize use of multibyte characters for this purpose. The default escape character is the backslash (\).
The following use of the MATCHES condition gives a true result for the column value that is shown.
You can use SQL length functions in the SELECT statement and other data manipulation statements. Length functions return the length of a column, string, or variable in bytes or characters.
The choice of locale affects the following three SQL length functions:
For the complete syntax of these functions, see the Expression segment in the Informix Guide to SQL: Syntax.
The LENGTH function returns the number of bytes of data in character data. However, the behavior of the LENGTH function varies with the type of argument that the user specifies. The argument can be a quoted string, a character-type column other than the TEXT data type, a TEXT column, a host variable, or an SPL routine variable.
The following table shows how the LENGTH function operates on each of these argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. This table also assumes that the sample strings consist of single-byte characters.
When you use the default locale or any locale with a single-byte code set, the LENGTH function seems to return the number of characters in the column. In the following example, the stores_demo database, which contains the customer table, uses the default code set for the U.S. English locale. Suppose a user enters a SELECT statement with the LENGTH function to display the last name, length of the last name, and customer number for rows where the customer number is less than 106.
The following sample of output shows the result of the query. For each row that is retrieved, the length column seems to show the number of characters in the lname (cust_name) column. However, the length column actually displays the number of bytes in the lname column. In the default code set, one byte stores one character. For more information about the default code set, see The Default Locale.
With Multibyte Code Sets
When you use the LENGTH function in a locale that supports a multibyte code set, such as the Japanese SJIS code set, the distinction between characters and bytes is meaningful. The LENGTH function returns the number of bytes in the column or quoted string, and this result might be different from the number of characters in the string.
The following example assumes that the database that contains the customer_multi table has a database locale that supports a multibyte code set. Suppose that the user enters a SELECT statement with the LENGTH function to display the last name, the length of the last name, and the customer number for the customer whose customer number is 199.
Assume that the last name (lname) for customer 199 consists of four characters, represented as follows:
In this representation, the first character (the symbol a) is a single-byte character. The second character (the symbol A1A2) is a 2-byte character. The third character (the symbol b) is a single-byte character. The fourth character (the symbol B1B2) is a 2-byte character.
The following sample of output shows the result of the query. Although the customer first name consists of 4 characters, the length column shows that the total number of bytes in this name is 6.
The OCTET_LENGTH function returns the number of bytes and generally includes trailing white spaces in the byte count. This SQL length function uses the definition of white space that the locale defines. OCTET_LENGTH returns the number of bytes in a character column, quoted string, host variable, or procedure variable. However, the actual behavior of the OCTET_LENGTH function varies with the type of argument that the user specifies.
The following table shows how the OCTET_LENGTH function operates on each of the argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. For simplicity, the Example column also assumes that the sample strings consist of single-byte characters.
The difference between the LENGTH and OCTET_LENGTH functions is that OCTET_LENGTH generally includes trailing white spaces in the byte count, whereas LENGTH generally excludes trailing white spaces from the byte count.
The advantage of the OCTET_LENGTH function over the LENGTH function is that the OCTET_LENGTH function provides the actual column size whereas the LENGTH function trims the column values and returns the length of the trimmed string. This advantage of the OCTET_LENGTH function applies both to single-byte code sets such as ISO8859-1 and multibyte code sets such as the Japanese SJIS code set.
The following table shows some results that the OCTET_LENGTH function might generate.
The CHAR_LENGTH function (also known as the CHARACTER_LENGTH function) returns the number of characters in a quoted string, column with a character data type, host variable, or procedure variable. However, the actual behavior of this function varies with the type of argument that the user specifies.
The following table shows how the CHAR_LENGTH function operates on each of the argument types. The Example column in this table uses the symbol s to represent a single-byte trailing white space. For simplicity, the Example column also assumes that the sample strings consist of single-byte characters.
The CHAR_LENGTH function is especially useful with multibyte code sets. If a quoted string of characters contains any multibyte characters, the number of characters in the string differs from the number of bytes in the string. You can use the CHAR_LENGTH function to determine the number of characters in the quoted string.
However, the CHAR_LENGTH function can also be useful in single-byte code sets. In these code sets, the number of bytes in a column is equal to the number of characters in the column. If you use the LENGTH function to determine the number of bytes in a column (which is equal to the number of characters in this case), LENGTH trims the column values and returns the length of the trimmed string. In contrast, CHAR_LENGTH does not trim the column values but returns the declared size of the column.
The following table shows some results that the CHAR_LENGTH function might generate for quoted strings.