Collecting Statistics Data

Home | Previous Page | Next Page Creating User-Defined Routines > Extending Data Types > Providing Statistics Data for a Column >

Collecting Statistics Data

The UPDATE STATISTICS statement collects statistics about the tables in your database. It automatically collects statistics for all columns with built-in data types (except TEXT and BYTE). However, it cannot automatically collect statistics for columns with user-defined data types because it does not know the structure of these data types.

For UPDATE STATISTICS to collect statistics for a column with a user-defined data type, you must write a user-defined function named statcollect( ) that collects statistics data for your user-defined data type. The UPDATE STATISTICS statement takes the following steps for columns of user-defined data types:

Calls the statcollect( ) function that handles the user-defined data type
This statcollect( ) function gathers the statistics data for the column and stores it as the stat opaque data type.
Stores this stat data type in the sysdistrib system catalog table, where the statistics data can be accessed by the query optimizer
UPDATE STATISTICS stores the following information in the row of the sysdistrib table that corresponds to the user-defined-type column:
- In the encdat column of the sysdistrib row: the stat data type that statcollect( ) returns
- In the type column of the sysdistrib row: an 'S' to indicate that the encdat column contains user-defined statistics

To have the UPDATE STATISTICS statement collect statistics for your user-defined data type, you must:

Design the statistics information that is appropriate for your user-defined data type.
Define a C statistics-collection function to implement the statistics collection.
Collect the statistics for the column within this statistics-collection function.
Register this C function as a statcollect( ) user-defined function.

If a statcollect( ) function does not exist for your user-defined data type, UPDATE STATISTICS does not collect statistics data for any column of that type.

Designing the User-Defined Statistics

Before you begin to code a statcollect( ) function for a particular user-defined data type, you need to decide what it means to collect statistics on this data type. For example, consider the following issues:

Do the values of the user-defined type have some ordering?
To be able to group the values into bins of related values, the data must have some kind of implied sequence. A common use of statistics information is within a selectivity function for a query filter such as "less than" or "greater than". If the values of the user-defined data type do not have ordering, they would not logically be used in such filters. For more info, see Query Selectivity.
How does the distribution handle SQL NULL values?
For example, the distribution can ignore NULL values or it could aggregate them. However, the handling of the NULL values should make sense to the user-defined data type.

Defining the Statistics-Collection Function

When you declare your statistics-collection function, it must have the following C signature:

mi_statret *statcollect(udt_arg, num_rows, resolution,
      fparam_ptr)
   udt_type *udt_arg;
   mi_double_precision *num_rows;
   mi_double_precision *resolution;
   MI_FPARAM *fparam_ptr;

udt_arg: is a pointer to the internal structure of the user-defined data type. The database server uses this argument to resolve the function and to pass in column values.
num_rows: is a pointer to a floating-point value that indicates the number of rows that the database server must scan to gather the statistics.
resolution: is a pointer to a floating-point value that is the resolution specified by the UPDATE STATISTICS statement. The resolution value specifies the bucket size for the distribution. However, you might choose to ignore this parameter if it does not make sense for your user-defined data type.
fparam_ptr: is a pointer to the MI_FPARAM structure that holds the iterator-status constant for each iteration of the statcollect( ) function.
Tip:

The statistics-collection function can have any name. It does not have to be named statcollect( ). It is recommended that you include the name of your user-defined data type in the name of the statistics-collection function to help distinguish the function from the statistics-collection functions of other user-defined data type.

Figure 141 shows a C declaration of the statistics-collection function for the longlong opaque type.

Figure 141. Sample Declaration of a Statistics-Collection Function

mi_statret *statcollect_ll(ll_arg, num_rows, resolution, fparam_ptr)
   longlong_t *ll_arg;
   mi_double_precision *num_rows;
   mi_double_precision *resolution;
   MI_FPARAM *fparam_ptr;

DBDK

BladeSmith automatically generates an OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type) with the following declaration:

mi_lvarchar *OpaqueStatCollect(Gen_pColValue,
      Gen_Resolution, Gen_RowCount, Gen_fparam)
   Opaque *Gen_pColValue;
   mi_double_precision *Gen_Resolution;
   mi_double_precision *Gen_RowCount;
   MI_FPARAM *Gen_fparam;

If this declaration is not appropriate for your opaque type, you must customize the OpaqueStatCollect( ) function.

End of DBDK

Collecting the Statistics

The statcollect( ) user-defined function is an iterator function; that is, the database server calls statcollect( ) for each of the rows on whose column of a user-defined data type UPDATE STATISTICS is collecting statistics. As with other iterator functions, the database server uses an iterator-status constant to indicate when the statistics-collection function is called.

Important:

The database server passes the value of the iterator-status constant within the MI_FPARAM structure. Therefore, your statistics-collection function must declare an MI_FPARAM structure as its last parameter. Otherwise, it cannot access the value of the iterator-status constant with the mi_fp_request( ) function.

The following table summarizes the values of the iterator-status constant for the statcollect( ) function.

When Is the statcollect( ) Function Called?	What Does statcollect( ) Need To Do?	Iterator-Status Constant in MI_FPARAM
The first time that statcollect( ) is called	Perform any initialization operations, such as allocating memory for a statistics-collection structure and initializing values First argument (udt_arg) is a NULL value.	SET_INIT
Once for each row for which statistics are being collected	Return one item of the active set Read the column value from the first argument (udt_arg) and place it in your statistics-collection structure.	SET_RETONE
After all rows have been processed	Release iteration resources Put the statistics in the stat data type and perform any memory deallocation	SET_END

To obtain the iterator-status constant in each iteration, your statcollect( ) function can use a switch statement on the return value of the mi_fp_request( ) function, as follows:

switch ( mi_fp_request(fparam_ptr) )
{
   case SET_INIT:
      ...
   case SET_RETONE:
      ...
   case SET_END:
      ...
}

If statcollect( ) raises an error, UPDATE STATISTICS terminates the statistics collection for that column.

The following sections summarize the steps that statcollect( ) must take for each of these iterator-status constants. For general information about iterator-status constants, see Table 102.

SET_INIT in statcollect( )

When the iterator-status constant is SET_INIT, the database server has invoked the initial call to statcollect( ). Usually, in this initial call, your statcollect( ) function allocates and initializes an internal C structure, called a statistics-collection structure. The statistics-collection structure is a holding area for the statistics data that statcollect( ) gathers on a row-by-row basis.

DBDK

BladeSmith generates the OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type), which allocates a statistics-collection structure Opaque_stat_t (declared in a file with the .h extension). This structure contains the following information.

Element of Statistics-Collection Structure	Description	Data Type
count	Current number of rows	mi_integer
max	Maximum value	mi_integer
min	Minimum value	mi_integer
distribution[]	An array to hold the "in-progress" statistics data	An array of mi_integer values whose size is the number of elements that can fit into the text distribution area (usually 256 bytes)

BladeSmith generates statistics code under the assumption that the minimum, maximum, and distribution of values are appropriate for your opaque data type. The SET_INIT case in the OpaqueStatCollect( ) function calls the Opaque_SetMaxValue( ) and Opaque_SetMaxValue( ) functions (which you must implement) to initialize maximum and minimum values, respectively. It initializes the current row count and the elements of the distribution array to zero (0).

If this statistics data is not appropriate for your opaque type, take the following actions:

Define your own statistics-collection structure to hold statistics data.
Allocate and initialize this statistics-collection structure within the SET_INIT case of your statcollect( ) function.

End of DBDK

Your statcollect( ) function can use the MI_FPARAM structure to store this statistics-collection structure (and any other state information) between iterations of statcollect( ). Allocate any memory used across multiple iterations of statcollect( ) from the PER_COMMAND pool and free it as soon as possible. Allocate any memory not used across multiple invocations of statcollect( ) from the PER_ROUTINE memory pool.

Use the mi_fp_setfuncstate( ) function to save a pointer to the user-state memory in the MI_FPARAM structure of your statcollect( ) function. For more information, see Saving a User State.

SET_RETONE in statcollect( )

For each row of a table, the statcollect( ) function collects the statistics data for the column that has the user-defined data type. When the iterator-status constant is SET_RETONE, the database server has invoked the statcollect( ) function on a single row of the table on which statistics is being gathered. At this point, statcollect( ) reads the column value from the first argument and places it into the statistics-collection structure.

The statcollect( ) function processes the statistics on a row-by-row basis; that is, for each iterator status of SET_RETONE, statcollect( ) merges the current column value into the statistics data in the internal statistics-collection structure. Therefore, the statcollect( ) function must perform the following tasks:

Obtain the address of the statistics-collection structure from the user state of the MI_FPARAM structure with the mi_fp_funcstate( ) function.
Compare the current column value with the current maximum and minimum values (if maximum and minimum are desired).
Merge the current column value into the distribution data in the statistics-collection structure.
Handle a NULL column value as appropriate for your distribution.

DBDK

The SET_RETONE case in the OpaqueStatCollect( ) function (where Opaque is the name of your opaque data type) that BladeSmith generates automatically calls the Opaque_SetMinValue( ) and Opaque_SetMinValue( ) functions to compare the current column value with the existing minimum and maximum. It then calls the Opaque_Histogram( ) function to merge the column value into the distribution array of the Opaque_stat_t statistics-collection structure. However, you must provide this code within the Opaque_SetMinValue( ) and Opaque_Histogram( ) functions to perform the actual comparisons and distribution for your Opaque data type.

End of DBDK

SET_END in statcollect( )

After all rows are processed, statcollect( ) must transfer the statistics data from its statistics-collection structure into the predefined opaque type, stat. It is stat data that the UPDATE STATISTICS statement stores in the encdat column of the sysdistrib system catalog table.

The stat data type is a multirepresentational opaque data type; that is, it holds statistics data within its internal structure until the data reaches a predefined threshold. If the statistics data exceeds this threshold, the stat data type stores the data in a smart large object. In support of the multirepresentational data, the stat data type provides the following functions:

An assign( ) support function, which is responsible for determining whether or not the statistics data is to be stored in a smart large object
If the data exceeds the predefined threshold, this assign( ) function creates the smart large object and increments its reference count. The database server calls this assign( ) function just before it inserts the mi_statret structure into the encdat column of the sysdistrib table.
A destroy( ) support function, which is responsible for deleting any smart large object that might exist to hold the statistics data
The database server calls this destroy( ) function just before it deletes a row from the sysdistrib system catalog table in response to the DROP DISTRIBUTION clause of the UPDATE STATISTICS statement.

For UPDATE STATISTICS to be able to store the distribution data in the encdat column, the statcollect( ) function must copy its statistics-collection structure into the stat data type.

The internal structure of the stat opaque type is a C language structure named mi_statret. The stat support functions handle most of the interaction with the mi_statret structure; however, your statcollect( ) function must fill in the mi_statret multirepresentational fields.

For an exact declaration of mi_statret, see the milo.h header file. This header file also provides the following useful declarations.

Declaration: Purpose
mi_stat_buf: #define for the statdata.buffer field
mi_stat_mr: #define for the statdata.mr field
MI_STATMAXLEN: Constant for the size of the statdata.buffer field
mi_stat_hdrsize: Size of the information in the mi_statret structure that is not holding the statistics data (size of all fields except the statdata field)

The assign( ) and destroy( ) support functions of the stat opaque type determine whether to store the distribution data directly in the encdat column or in a smart large object. In the latter case, the encdat column stores the LO handle of the smart large object. Your statcollect( ) function can use the MI_STATMAXLEN constant to determine whether it needs to handle multirepresentational data.

The MI_STATMAXLEN constant is the maximum size that the encdat column of sysdistrib can hold. Therefore, it is the maximum size of the statdata.buffer array. If your distribution data has a size less than MI_STATMAXLEN, you can take the following actions:

Copy the data from the statistics-collection structure directly into the statdata.buffer field.
Set the statdata.szind field to MI_MULTIREP_SMALL to indicate that the multirepresentational data is not stored in a smart large object but is in the mi_statret structure.

The assign( ) and destroy( ) support functions of the stat opaque type take care of determining whether to store the distribution data directly in the encdat column or in a smart large object whose LO handle is stored in the encdat column.

If your distribution data exceeds MI_STATMAXLEN, your statcollect( ) function must handle the multirepresentational data itself, with the following steps:

Create a new smart large object.
Copy the data from the statistics-collection structure into the new smart large object.
Copy the LO handle of this smart large object into the statdata.mr.mr_lo_struct.mr_s_lo field.
Set the statdata.szind field to MI_MULTIREP_LARGE to indicate that the multirepresentational data is stored in a smart large object.

Registering the statcollect( ) Function

As with any user-defined function, you register the statistics-collection function with the CREATE FUNCTION statement. The registration of this function has the following requirements:

You must name this user-defined function statcollect.
The database server handles routine resolution based on the data type of the first argument to statcollect( ). If the name of your C statistics-collection function is not statcollect( ), specify the C function name in the EXTERNAL NAME clause.
You must declare the statcollect( ) function with HANDLESNULLS routine modifier.
Your statistics-collection function can choose whether to include the NULL value in the statistics data that it generates.
The data types of the parameters must be as follows.

Parameter Number

Parameter Data Type

1

SQL name for the user-defined data type

2

FLOAT

3

FLOAT

Do not include the declaration of the MI_FPARAM structure in the SQL registration.
The function must return a value of type stat.

The following CREATE FUNCTION statement registers the statistics-collection function that Figure 141 declares:

CREATE FUNCTION statcollect(ll_arg longlong, num_rows FLOAT,
   resolution FLOAT)
RETURNING stat
WITH (HANDLESNULLS)
EXTERNAL NAME '/usr/udrs/bin/longlong.so(stat_collect_ll)'
LANGUAGE C;

After you register the statcollect( ) function, make sure those with the DBA privilege or the table owner has the Execute privilege on the function.

Executing the UPDATE STATISTICS Statement

To collect user-defined statistics, run the UPDATE STATISTICS statement in HIGH or MEDIUM mode. The syntax of UPDATE STATISTICS is the same for user-defined data types as for built-in data types. However, when UPDATE STATISTICS collects statistics for a user-defined type, it does not automatically determine the minimum and maximum column values (stored in the colmin and colmax columns of the syscolumns system catalog table). Your statcollect( ) function can explicitly calculate these values if desired.

The statcollect( ) function executes once for every row that the database server scans during UPDATE STATISTICS. Therefore, a database table must contain more than one row before the database server calls any statcollect( ) functions.

The number of rows that the database server scans depends on the mode and the confidence level. Executing UPDATE STATISTICS in HIGH mode causes the database server to scan all rows in the table. In MEDIUM mode the database server chooses the number of rows to scan based on the confidence level. The higher the confidence level, the higher the number of rows that the database server scans. For general information about UPDATE STATISTICS, see the IBM Informix: Guide to SQL Syntax.

For example, if the mytable table contains a column of type Box, the following UPDATE STATISTICS statement collects user-defined statistics for all columns of mytable, including any columns with user-defined statistics defined:

UPDATE STATISTICS HIGH FOR TABLE mytable;

If the mytable column contains columns with any data types that require user-defined statistics and you do not define this statistics collection, the UPDATE STATISTICS statement does not collect statistics for the column.

Important:

The statistics that the database server collects might require a smart large object for storage. For the database server to use user-defined statistics, the configuration parameter SYSSBSPACENAME must be set in the ONCONFIG file before the database server is initialized. This configuration parameter must specify the name of an existing sbspace. If SBSSPACENAME is not set, the database server might not be able to collect the specified statistics.