In a query (or in any SQL statement containing an embedded SELECT statement), you can use bracket ( [ ] ) symbols to specify that only a subset of the data in a column of a character data type is to be retrieved. A column expression that includes brackets to signify a subset of the data in the column is known as a column substring. The syntax of a column substring is as follows.
(1) >>-| Expression segment |---------------------------------------> >--column--+----------------------+---------------------------->< '-[--first--,--last--]-'
Suppose that you want to retrieve the customer_num column and the seventh through ninth bytes of the lname column from the customer table. To perform this query, use a column substring for the lname column in your SELECT statement, as follows:
SELECT customer_num, lname[7,9] as lname_subset FROM customer WHERE lname = 'Albertson'
If the lname column value is Albertson, the query returns these results.
customer_num | lname_subset |
---|---|
114 | son |
Because the locale supports a single-byte code set, the preceding query seems to return the seventh through ninth characters of the name Albertson. Column substrings, however, are byte based, and the query returns the seventh through ninth bytes of the name. Because one byte is equal to one character in single-byte code sets, the distinction between characters and bytes in column substrings is not apparent in these code sets.
For multibyte code sets, column substrings return the specified number of bytes, not the number of characters. If a character column multi_col contains a string of three 2-byte characters, this 6-byte string can be represented as follows:
A1A2B1B2C1C2
Suppose that a query specified this substring from the multi_col column:
multi_col[1,2]
The query returns the following result:
A1A2
The returned substring consists of 2 bytes (1 character), not 2 characters.
To retrieve the first two characters from the multi_col column, specify a substring in which first is the position of the first byte in the first character and last is the position of the last byte in the second character. For the 6-byte string A1A2B1B2C1C2, this g expression specifies the substring in your query:
multi_col[1,4]
The following result is returned:
A1A2B1B2
The substring that the query returns consists of the first 4 bytes of the column value, representing the first two logical characters in the column.
A multibyte character might consist of 2, 3, or 4 bytes. A multibyte character that has lost one or more of its bytes so that the original intended meaning of the character is lost is called a partial character.
Unless prevented, a column substring might truncate a multibyte character or split it up in such a manner that it no longer retains the original sequence of bytes. A partial character might be generated when you use column subscript operators on columns that contain multibyte characters. Suppose that a user specifies the following column substring for the multi_col column where the value of the string in multi_col is A1A2B1B2C1C2:
multi_col[2,5]
The user requests the following bytes in the query: A2B1B2C1. If the database server returned this column substring to the user, however, the first and third logical characters in the column would be truncated.
Informix database servers do not allow partial characters to occur. The GLS feature prevents the database server from returning the specified range of bytes literally when this range contains partial characters. If your database locale supports a multibyte code set and you specify a particular column substring in a query, the database server replaces any truncated multibyte characters with single-byte whitespace characters.
For example, suppose the multi_col column contains the string A1A2A3A4B1B2B3B4, and you execute the following SELECT statement:
SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A1A2B1B2'
The query returns no rows because the database server converts the substring multi_col[2,4], namely the string A2A3A4, to three single-byte blank spaces (sss). The WHERE clause specifies this search condition:
WHERE 'sss' = 'A1A2A3'
Because this condition is never true, the query retrieves no matching rows.
Informix database servers replace partial characters in each individual substring operation, even when they are concatenated.
For example, suppose the multi_col column contains A1A2B1B2C1C2D1D2, and the WHERE clause contains the following condition:
multi_col[2,4] | multi_col[6,8]
The query does not return any rows because the result of the concatenation (A2B1B2C2D1D2) contains two partial characters, A2 and C2. The Informix database server converts these partial characters to single-byte blank spaces and creates the following WHERE clause condition:
WHERE 'sB1B2sD1D2' = 'A1A2B1B2'
This condition is also never true, so the query retrieves no matching rows.
Partial characters violate the relational model if the substrings strings can be processed or presented to users in any way that can prevent the concatenation of the substrings from reconstructing the original logical string.
This can occur when a multibyte character has a substring that is a valid character by itself. For example, suppose a multibyte code set contains a 4-byte character, A1A2A3A4, that represents the digit 1 and a 3-byte character, A2A3A4, that represents the digit 6. Suppose also that your locale is using this multibyte code set when you execute the following query:
SELECT multi_col FROM tablename WHERE multi_col[2,4] = 'A2A3A4'
The database server interprets multi_col[2,4] as the valid 3-byte character (a multibyte 6) instead of a substring of the valid 4-byte character ('sss').
Therefore, the WHERE clause contains the following condition:
WHERE '6' = '6'
Partial characters do not occur in single-byte code sets because each character is stored in a single byte. If the database locale supports a single-byte code set, and you specify a column substring in a query, the query returns exactly the requested subset of data; no characters are replaced with whitespace.
Partial characters might also create a problem when you specify column substrings in an ORDER BY clause of a SELECT statement.
The syntax for specifying column substrings in the ORDER BY clause is as follows.
(1) >>-| SELECT statement |-----------------------------------------> >--ORDER BY--column--+----------------------+------------------>< '-[--first--,--last--]-'
The query results are sorted by the values contained in this column.
Any column or expression specified in the ORDER BY clause must be listed explicitly or implicitly in the SELECT list of the Projection clause.
If the locale supports a multibyte code set whose characters are all of the same length, you can use column substrings in an ORDER BY clause. The more typical scenario, however, is that your multibyte code set contains characters with varying lengths. In this case, you might not find it useful to specify column substrings in the ORDER BY clause.
For example, suppose that you wish to retrieve all the rows of the multi_data table, and sort the results according to a substring defined as the fourth through sixth characters of the multi_chars column, using this query:
SELECT * FROM multi_data ORDER BY multi_chars[7,12]
If the locale supports a multibyte code set whose characters are all 2 bytes in length, you know that the fourth character in the column begins in byte position 7, and the sixth character in the column ends in byte position 12. The preceding SELECT statement does not generate partial characters.
If the multibyte code set contains a mixture of single-byte characters, 2-byte characters, and 3-byte characters, however, the substring multi_chars[7,12] might create partial characters. In this case, you might get unexpected results when you specify a column substring in the ORDER BY clause.
For information on the collation of different types of character data in the ORDER BY clause, see The ORDER BY Clause. For the complete syntax and usage of the ORDER BY clause, see the SELECT statement in the IBM Informix: Guide to SQL Syntax.
Partial characters are not a problem when you specify a column substring for a column of the TEXT or BYTE data type. The database server avoids partial characters in TEXT and BYTE columns in the following way:
A substring of a BYTE column returns the exact range of bytes that is specified and does not replace any bytes with whitespace characters.
A substring from a TEXT column returns the exact range of bytes that is specified. Attempts to resolve partial characters in TEXT data are resource intensive, but the database server does not replace any bytes with whitespace. For more information, see The TEXT Data Type.