Home | Previous Page | Next Page   SQL Features > Handling Character Data >

Specifying Column Substrings

In a query (or in any SQL statement containing an embedded SELECT statement), you can use bracket ( [ ] ) symbols to specify that only a subset of the data in a column of a character data type is to be retrieved. A column expression that includes brackets to signify a subset of the data in the column is known as a column substring. The syntax of a column substring is as follows.

Read syntax diagramSkip visual syntax diagram                          (1)
>>-| Expression segment |--------------------------------------->
 
>--column--+----------------------+----------------------------><
           '-[--first--,--last--]-'
 

Notes:
  1. See IBM Informix: Guide to SQL Syntax.
Element
Purpose
column
Identifier of a column within a database table or view
first. last
Positions of the first and the last byte (respectively) of the retrieved substring

Column Substrings in Single-Byte Code Sets

Suppose that you want to retrieve the customer_num column and the seventh through ninth bytes of the lname column from the customer table. To perform this query, use a column substring for the lname column in your SELECT statement, as follows:

SELECT customer_num, lname[7,9] as lname_subset
   FROM customer WHERE lname = 'Albertson'

If the lname column value is Albertson, the query returns these results.

customer_num lname_subset
114 son

Because the locale supports a single-byte code set, the preceding query seems to return the seventh through ninth characters of the name Albertson. Column substrings, however, are byte based, and the query returns the seventh through ninth bytes of the name. Because one byte is equal to one character in single-byte code sets, the distinction between characters and bytes in column substrings is not apparent in these code sets.

Column Substrings in Multibyte Code Sets

For multibyte code sets, column substrings return the specified number of bytes, not the number of characters. If a character column multi_col contains a string of three 2-byte characters, this 6-byte string can be represented as follows:

A1A2B1B2C1C2

Suppose that a query specified this substring from the multi_col column:

multi_col[1,2]

The query returns the following result:

A1A2

The returned substring consists of 2 bytes (1 character), not 2 characters.

To retrieve the first two characters from the multi_col column, specify a substring in which first is the position of the first byte in the first character and last is the position of the last byte in the second character. For the 6-byte string A1A2B1B2C1C2, this g expression specifies the substring in your query:

multi_col[1,4]

The following result is returned:

A1A2B1B2

The substring that the query returns consists of the first 4 bytes of the column value, representing the first two logical characters in the column.

Partial Characters in Column Substrings

A multibyte character might consist of 2, 3, or 4 bytes. A multibyte character that has lost one or more of its bytes so that the original intended meaning of the character is lost is called a partial character.

Unless prevented, a column substring might truncate a multibyte character or split it up in such a manner that it no longer retains the original sequence of bytes. A partial character might be generated when you use column subscript operators on columns that contain multibyte characters. Suppose that a user specifies the following column substring for the multi_col column where the value of the string in multi_col is A1A2B1B2C1C2:

multi_col[2,5]

The user requests the following bytes in the query: A2B1B2C1. If the database server returned this column substring to the user, however, the first and third logical characters in the column would be truncated.

Avoidance in a Multibyte Code Set

Informix database servers do not allow partial characters to occur. The GLS feature prevents the database server from returning the specified range of bytes literally when this range contains partial characters. If your database locale supports a multibyte code set and you specify a particular column substring in a query, the database server replaces any truncated multibyte characters with single-byte whitespace characters.

For example, suppose the multi_col column contains the string A1A2A3A4B1B2B3B4, and you execute the following SELECT statement:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A1A2B1B2'

The query returns no rows because the database server converts the substring multi_col[2,4], namely the string A2A3A4, to three single-byte blank spaces (sss). The WHERE clause specifies this search condition:

WHERE 'sss' = 'A1A2A3'

Because this condition is never true, the query retrieves no matching rows.

Informix database servers replace partial characters in each individual substring operation, even when they are concatenated.

For example, suppose the multi_col column contains A1A2B1B2C1C2D1D2, and the WHERE clause contains the following condition:

multi_col[2,4] | multi_col[6,8]

The query does not return any rows because the result of the concatenation (A2B1B2C2D1D2) contains two partial characters, A2 and C2. The Informix database server converts these partial characters to single-byte blank spaces and creates the following WHERE clause condition:

WHERE 'sB1B2sD1D2' = 'A1A2B1B2'

This condition is also never true, so the query retrieves no matching rows.

Errors Involving Partial Characters

Partial characters violate the relational model if the substrings strings can be processed or presented to users in any way that can prevent the concatenation of the substrings from reconstructing the original logical string.

This can occur when a multibyte character has a substring that is a valid character by itself. For example, suppose a multibyte code set contains a 4-byte character, A1A2A3A4, that represents the digit 1 and a 3-byte character, A2A3A4, that represents the digit 6. Suppose also that your locale is using this multibyte code set when you execute the following query:

SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A2A3A4'

The database server interprets multi_col[2,4] as the valid 3-byte character (a multibyte 6) instead of a substring of the valid 4-byte character ('sss').

Therefore, the WHERE clause contains the following condition:

WHERE '6' = '6'

Partial characters do not occur in single-byte code sets because each character is stored in a single byte. If the database locale supports a single-byte code set, and you specify a column substring in a query, the query returns exactly the requested subset of data; no characters are replaced with whitespace.

Partial Characters in an ORDER BY Clause

Partial characters might also create a problem when you specify column substrings in an ORDER BY clause of a SELECT statement.

The syntax for specifying column substrings in the ORDER BY clause is as follows.

Read syntax diagramSkip visual syntax diagram                        (1)
>>-| SELECT statement |----------------------------------------->
 
>--ORDER BY--column--+----------------------+------------------><
                     '-[--first--,--last--]-'
 

Notes:
  1. See IBM Informix: Guide to SQL Syntax.
Element
Purpose
column
Name of a column in the specified table or view.
first. last
Positions of the first and last byte (respectively) of the substring

The query results are sorted by the values contained in this column.

Extended Parallel Server

Any column or expression specified in the ORDER BY clause must be listed explicitly or implicitly in the SELECT list of the Projection clause.

End of Extended Parallel Server

If the locale supports a multibyte code set whose characters are all of the same length, you can use column substrings in an ORDER BY clause. The more typical scenario, however, is that your multibyte code set contains characters with varying lengths. In this case, you might not find it useful to specify column substrings in the ORDER BY clause.

For example, suppose that you wish to retrieve all the rows of the multi_data table, and sort the results according to a substring defined as the fourth through sixth characters of the multi_chars column, using this query:

SELECT * FROM multi_data ORDER BY multi_chars[7,12]

If the locale supports a multibyte code set whose characters are all 2 bytes in length, you know that the fourth character in the column begins in byte position 7, and the sixth character in the column ends in byte position 12. The preceding SELECT statement does not generate partial characters.

If the multibyte code set contains a mixture of single-byte characters, 2-byte characters, and 3-byte characters, however, the substring multi_chars[7,12] might create partial characters. In this case, you might get unexpected results when you specify a column substring in the ORDER BY clause.

For information on the collation of different types of character data in the ORDER BY clause, see The ORDER BY Clause. For the complete syntax and usage of the ORDER BY clause, see the SELECT statement in the IBM Informix: Guide to SQL Syntax.

Tip:
A partial character might also be generated when a SQL API copies multibyte data from one buffer to another. For more information, see Generating Non-ASCII Filenames.
Avoidance in TEXT and BYTE Columns

Partial characters are not a problem when you specify a column substring for a column of the TEXT or BYTE data type. The database server avoids partial characters in TEXT and BYTE columns in the following way:

Warning:
The processing and interpretation of TEXT and BYTE data are the responsibility of the client application, which must handle the possibility of partial characters in these operations.
Home | [ Top of Page | Previous Page | Next Page | Contents | Index ]