The UPDATE STATISTICS statement collects statistics about the tables in your database. It automatically collects statistics for all columns with built-in data types (except TEXT and BYTE). However, it cannot automatically collect statistics for columns with user-defined data types because it does not know the structure of these data types.
For UPDATE STATISTICS to collect statistics for a column with a user-defined data type, you must write a user-defined function named statcollect( ) that collects statistics data for your user-defined data type. The UPDATE STATISTICS statement takes the following steps for columns of user-defined data types:
This statcollect( ) function gathers the statistics data for the column and stores it as the stat opaque data type.
UPDATE STATISTICS stores the following information in the row of the sysdistrib table that corresponds to the user-defined-type column:
To have the UPDATE STATISTICS statement collect statistics for your user-defined data type, you must:
If a statcollect( ) function does not exist for your user-defined data type, UPDATE STATISTICS does not collect statistics data for any column of that type.
Before you begin to code a statcollect( ) function for a particular user-defined data type, you need to decide what it means to collect statistics on this data type. For example, consider the following issues:
To be able to group the values into bins of related values, the data must have some kind of implied sequence. A common use of statistics information is within a selectivity function for a query filter such as "less than" or "greater than". If the values of the user-defined data type do not have ordering, they would not logically be used in such filters. For more info, see Query Selectivity.
For example, the distribution can ignore NULL values or it could aggregate them. However, the handling of the NULL values should make sense to the user-defined data type.
When you declare your statistics-collection function, it must have the following C signature:
mi_statret *statcollect(udt_arg, num_rows, resolution, fparam_ptr) udt_type *udt_arg; mi_double_precision *num_rows; mi_double_precision *resolution; MI_FPARAM *fparam_ptr;
Figure 141 shows a C declaration of the statistics-collection function for the longlong opaque type.
mi_statret *statcollect_ll(ll_arg, num_rows, resolution, fparam_ptr) longlong_t *ll_arg; mi_double_precision *num_rows; mi_double_precision *resolution; MI_FPARAM *fparam_ptr;
BladeSmith automatically generates an OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type) with the following declaration:
mi_lvarchar *OpaqueStatCollect(Gen_pColValue, Gen_Resolution, Gen_RowCount, Gen_fparam) Opaque *Gen_pColValue; mi_double_precision *Gen_Resolution; mi_double_precision *Gen_RowCount; MI_FPARAM *Gen_fparam;
If this declaration is not appropriate for your opaque type, you must customize the OpaqueStatCollect( ) function.
The statcollect( ) user-defined function is an iterator function; that is, the database server calls statcollect( ) for each of the rows on whose column of a user-defined data type UPDATE STATISTICS is collecting statistics. As with other iterator functions, the database server uses an iterator-status constant to indicate when the statistics-collection function is called.
The following table summarizes the values of the iterator-status constant for the statcollect( ) function.
To obtain the iterator-status constant in each iteration, your statcollect( ) function can use a switch statement on the return value of the mi_fp_request( ) function, as follows:
switch ( mi_fp_request(fparam_ptr) )
{
case SET_INIT:
...
case SET_RETONE:
...
case SET_END:
...
}
If statcollect( ) raises an error, UPDATE STATISTICS terminates the statistics collection for that column.
The following sections summarize the steps that statcollect( ) must take for each of these iterator-status constants. For general information about iterator-status constants, see Table 102.
When the iterator-status constant is SET_INIT, the database server has invoked the initial call to statcollect( ). Usually, in this initial call, your statcollect( ) function allocates and initializes an internal C structure, called a statistics-collection structure. The statistics-collection structure is a holding area for the statistics data that statcollect( ) gathers on a row-by-row basis.
BladeSmith generates the OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type), which allocates a statistics-collection structure Opaque_stat_t (declared in a file with the .h extension). This structure contains the following information.
Element of Statistics-Collection Structure | Description | Data Type |
---|---|---|
count | Current number of rows | mi_integer |
max | Maximum value | mi_integer |
min | Minimum value | mi_integer |
distribution[] | An array to hold the "in-progress" statistics data | An array of mi_integer values whose size is the number of elements that can fit into the text distribution area (usually 256 bytes) |
BladeSmith generates statistics code under the assumption that the minimum, maximum, and distribution of values are appropriate for your opaque data type. The SET_INIT case in the OpaqueStatCollect( ) function calls the Opaque_SetMaxValue( ) and Opaque_SetMaxValue( ) functions (which you must implement) to initialize maximum and minimum values, respectively. It initializes the current row count and the elements of the distribution array to zero (0).
If this statistics data is not appropriate for your opaque type, take the following actions:
Your statcollect( ) function can use the MI_FPARAM structure to store this statistics-collection structure (and any other state information) between iterations of statcollect( ). Allocate any memory used across multiple iterations of statcollect( ) from the PER_COMMAND pool and free it as soon as possible. Allocate any memory not used across multiple invocations of statcollect( ) from the PER_ROUTINE memory pool.
Use the mi_fp_setfuncstate( ) function to save a pointer to the user-state memory in the MI_FPARAM structure of your statcollect( ) function. For more information, see Saving a User State.
For each row of a table, the statcollect( ) function collects the statistics data for the column that has the user-defined data type. When the iterator-status constant is SET_RETONE, the database server has invoked the statcollect( ) function on a single row of the table on which statistics is being gathered. At this point, statcollect( ) reads the column value from the first argument and places it into the statistics-collection structure.
The statcollect( ) function processes the statistics on a row-by-row basis; that is, for each iterator status of SET_RETONE, statcollect( ) merges the current column value into the statistics data in the internal statistics-collection structure. Therefore, the statcollect( ) function must perform the following tasks:
The SET_RETONE case in the OpaqueStatCollect( ) function (where Opaque is the name of your opaque data type) that BladeSmith generates automatically calls the Opaque_SetMinValue( ) and Opaque_SetMinValue( ) functions to compare the current column value with the existing minimum and maximum. It then calls the Opaque_Histogram( ) function to merge the column value into the distribution array of the Opaque_stat_t statistics-collection structure. However, you must provide this code within the Opaque_SetMinValue( ) and Opaque_Histogram( ) functions to perform the actual comparisons and distribution for your Opaque data type.
After all rows are processed, statcollect( ) must transfer the statistics data from its statistics-collection structure into the predefined opaque type, stat. It is stat data that the UPDATE STATISTICS statement stores in the encdat column of the sysdistrib system catalog table.
The stat data type is a multirepresentational opaque data type; that is, it holds statistics data within its internal structure until the data reaches a predefined threshold. If the statistics data exceeds this threshold, the stat data type stores the data in a smart large object. In support of the multirepresentational data, the stat data type provides the following functions:
If the data exceeds the predefined threshold, this assign( ) function creates the smart large object and increments its reference count. The database server calls this assign( ) function just before it inserts the mi_statret structure into the encdat column of the sysdistrib table.
The database server calls this destroy( ) function just before it deletes a row from the sysdistrib system catalog table in response to the DROP DISTRIBUTION clause of the UPDATE STATISTICS statement.
For UPDATE STATISTICS to be able to store the distribution data in the encdat column, the statcollect( ) function must copy its statistics-collection structure into the stat data type.
The internal structure of the stat opaque type is a C language structure named mi_statret. The stat support functions handle most of the interaction with the mi_statret structure; however, your statcollect( ) function must fill in the mi_statret multirepresentational fields.
For an exact declaration of mi_statret, see the milo.h header file. This header file also provides the following useful declarations.
The assign( ) and destroy( ) support functions of the stat opaque type determine whether to store the distribution data directly in the encdat column or in a smart large object. In the latter case, the encdat column stores the LO handle of the smart large object. Your statcollect( ) function can use the MI_STATMAXLEN constant to determine whether it needs to handle multirepresentational data.
The MI_STATMAXLEN constant is the maximum size that the encdat column of sysdistrib can hold. Therefore, it is the maximum size of the statdata.buffer array. If your distribution data has a size less than MI_STATMAXLEN, you can take the following actions:
The assign( ) and destroy( ) support functions of the stat opaque type take care of determining whether to store the distribution data directly in the encdat column or in a smart large object whose LO handle is stored in the encdat column.
If your distribution data exceeds MI_STATMAXLEN, your statcollect( ) function must handle the multirepresentational data itself, with the following steps:
As with any user-defined function, you register the statistics-collection function with the CREATE FUNCTION statement. The registration of this function has the following requirements:
The database server handles routine resolution based on the data type of the first argument to statcollect( ). If the name of your C statistics-collection function is not statcollect( ), specify the C function name in the EXTERNAL NAME clause.
Your statistics-collection function can choose whether to include the NULL value in the statistics data that it generates.
The following CREATE FUNCTION statement registers the statistics-collection function that Figure 141 declares:
CREATE FUNCTION statcollect(ll_arg longlong, num_rows FLOAT, resolution FLOAT) RETURNING stat WITH (HANDLESNULLS) EXTERNAL NAME '/usr/udrs/bin/longlong.so(stat_collect_ll)' LANGUAGE C;
After you register the statcollect( ) function, make sure those with the DBA privilege or the table owner has the Execute privilege on the function.
To collect user-defined statistics, run the UPDATE STATISTICS statement in HIGH or MEDIUM mode. The syntax of UPDATE STATISTICS is the same for user-defined data types as for built-in data types. However, when UPDATE STATISTICS collects statistics for a user-defined type, it does not automatically determine the minimum and maximum column values (stored in the colmin and colmax columns of the syscolumns system catalog table). Your statcollect( ) function can explicitly calculate these values if desired.
The statcollect( ) function executes once for every row that the database server scans during UPDATE STATISTICS. Therefore, a database table must contain more than one row before the database server calls any statcollect( ) functions.
The number of rows that the database server scans depends on the mode and the confidence level. Executing UPDATE STATISTICS in HIGH mode causes the database server to scan all rows in the table. In MEDIUM mode the database server chooses the number of rows to scan based on the confidence level. The higher the confidence level, the higher the number of rows that the database server scans. For general information about UPDATE STATISTICS, see the IBM Informix: Guide to SQL Syntax.
For example, if the mytable table contains a column of type Box, the following UPDATE STATISTICS statement collects user-defined statistics for all columns of mytable, including any columns with user-defined statistics defined:
UPDATE STATISTICS HIGH FOR TABLE mytable;
If the mytable column contains columns with any data types that require user-defined statistics and you do not define this statistics collection, the UPDATE STATISTICS statement does not collect statistics for the column.