Home | Previous Page | Next Page   Creating User-Defined Routines > Extending Data Types > Providing Statistics Data for a Column >

Collecting Statistics Data

The UPDATE STATISTICS statement collects statistics about the tables in your database. It automatically collects statistics for all columns with built-in data types (except TEXT and BYTE). However, it cannot automatically collect statistics for columns with user-defined data types because it does not know the structure of these data types.

For UPDATE STATISTICS to collect statistics for a column with a user-defined data type, you must write a user-defined function named statcollect( ) that collects statistics data for your user-defined data type. The UPDATE STATISTICS statement takes the following steps for columns of user-defined data types:

To have the UPDATE STATISTICS statement collect statistics for your user-defined data type, you must:

If a statcollect( ) function does not exist for your user-defined data type, UPDATE STATISTICS does not collect statistics data for any column of that type.

Designing the User-Defined Statistics

Before you begin to code a statcollect( ) function for a particular user-defined data type, you need to decide what it means to collect statistics on this data type. For example, consider the following issues:

Defining the Statistics-Collection Function

When you declare your statistics-collection function, it must have the following C signature:

mi_statret *statcollect(udt_arg, num_rows, resolution,
      fparam_ptr)
   udt_type *udt_arg;
   mi_double_precision *num_rows;
   mi_double_precision *resolution;
   MI_FPARAM *fparam_ptr;
udt_arg
is a pointer to the internal structure of the user-defined data type. The database server uses this argument to resolve the function and to pass in column values.
num_rows
is a pointer to a floating-point value that indicates the number of rows that the database server must scan to gather the statistics.
resolution
is a pointer to a floating-point value that is the resolution specified by the UPDATE STATISTICS statement. The resolution value specifies the bucket size for the distribution. However, you might choose to ignore this parameter if it does not make sense for your user-defined data type.
fparam_ptr
is a pointer to the MI_FPARAM structure that holds the iterator-status constant for each iteration of the statcollect( ) function.
Tip:
The statistics-collection function can have any name. It does not have to be named statcollect( ). It is recommended that you include the name of your user-defined data type in the name of the statistics-collection function to help distinguish the function from the statistics-collection functions of other user-defined data type.

Figure 141 shows a C declaration of the statistics-collection function for the longlong opaque type.

Figure 141. Sample Declaration of a Statistics-Collection Function
mi_statret *statcollect_ll(ll_arg, num_rows, resolution, fparam_ptr)
   longlong_t *ll_arg;
   mi_double_precision *num_rows;
   mi_double_precision *resolution;
   MI_FPARAM *fparam_ptr;
DBDK

BladeSmith automatically generates an OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type) with the following declaration:

mi_lvarchar *OpaqueStatCollect(Gen_pColValue,
      Gen_Resolution, Gen_RowCount, Gen_fparam)
   Opaque *Gen_pColValue;
   mi_double_precision *Gen_Resolution;
   mi_double_precision *Gen_RowCount;
   MI_FPARAM *Gen_fparam;

If this declaration is not appropriate for your opaque type, you must customize the OpaqueStatCollect( ) function.

End of DBDK

Collecting the Statistics

The statcollect( ) user-defined function is an iterator function; that is, the database server calls statcollect( ) for each of the rows on whose column of a user-defined data type UPDATE STATISTICS is collecting statistics. As with other iterator functions, the database server uses an iterator-status constant to indicate when the statistics-collection function is called.

Important:
The database server passes the value of the iterator-status constant within the MI_FPARAM structure. Therefore, your statistics-collection function must declare an MI_FPARAM structure as its last parameter. Otherwise, it cannot access the value of the iterator-status constant with the mi_fp_request( ) function.

The following table summarizes the values of the iterator-status constant for the statcollect( ) function.

When Is the statcollect( ) Function Called? What Does statcollect( ) Need To Do? Iterator-Status Constant in MI_FPARAM
The first time that statcollect( ) is called Perform any initialization operations, such as allocating memory for a statistics-collection structure and initializing values

First argument (udt_arg) is a NULL value.

SET_INIT
Once for each row for which statistics are being collected Return one item of the active set

Read the column value from the first argument (udt_arg) and place it in your statistics-collection structure.

SET_RETONE
After all rows have been processed Release iteration resources

Put the statistics in the stat data type and perform any memory deallocation

SET_END

To obtain the iterator-status constant in each iteration, your statcollect( ) function can use a switch statement on the return value of the mi_fp_request( ) function, as follows:

switch ( mi_fp_request(fparam_ptr) )
{
   case SET_INIT:
      ...
   case SET_RETONE:
      ...
   case SET_END:
      ...
}

If statcollect( ) raises an error, UPDATE STATISTICS terminates the statistics collection for that column.

The following sections summarize the steps that statcollect( ) must take for each of these iterator-status constants. For general information about iterator-status constants, see Table 102.

SET_INIT in statcollect( )

When the iterator-status constant is SET_INIT, the database server has invoked the initial call to statcollect( ). Usually, in this initial call, your statcollect( ) function allocates and initializes an internal C structure, called a statistics-collection structure. The statistics-collection structure is a holding area for the statistics data that statcollect( ) gathers on a row-by-row basis.

DBDK

BladeSmith generates the OpaqueStatCollect( ) function (in which Opaque is the name of your opaque data type), which allocates a statistics-collection structure Opaque_stat_t (declared in a file with the .h extension). This structure contains the following information.

Element of Statistics-Collection Structure Description Data Type
count Current number of rows mi_integer
max Maximum value mi_integer
min Minimum value mi_integer
distribution[] An array to hold the "in-progress" statistics data An array of mi_integer values whose size is the number of elements that can fit into the text distribution area (usually 256 bytes)

BladeSmith generates statistics code under the assumption that the minimum, maximum, and distribution of values are appropriate for your opaque data type. The SET_INIT case in the OpaqueStatCollect( ) function calls the Opaque_SetMaxValue( ) and Opaque_SetMaxValue( ) functions (which you must implement) to initialize maximum and minimum values, respectively. It initializes the current row count and the elements of the distribution array to zero (0).

If this statistics data is not appropriate for your opaque type, take the following actions:

End of DBDK

Your statcollect( ) function can use the MI_FPARAM structure to store this statistics-collection structure (and any other state information) between iterations of statcollect( ). Allocate any memory used across multiple iterations of statcollect( ) from the PER_COMMAND pool and free it as soon as possible. Allocate any memory not used across multiple invocations of statcollect( ) from the PER_ROUTINE memory pool.

Use the mi_fp_setfuncstate( ) function to save a pointer to the user-state memory in the MI_FPARAM structure of your statcollect( ) function. For more information, see Saving a User State.

SET_RETONE in statcollect( )

For each row of a table, the statcollect( ) function collects the statistics data for the column that has the user-defined data type. When the iterator-status constant is SET_RETONE, the database server has invoked the statcollect( ) function on a single row of the table on which statistics is being gathered. At this point, statcollect( ) reads the column value from the first argument and places it into the statistics-collection structure.

The statcollect( ) function processes the statistics on a row-by-row basis; that is, for each iterator status of SET_RETONE, statcollect( ) merges the current column value into the statistics data in the internal statistics-collection structure. Therefore, the statcollect( ) function must perform the following tasks:

DBDK

The SET_RETONE case in the OpaqueStatCollect( ) function (where Opaque is the name of your opaque data type) that BladeSmith generates automatically calls the Opaque_SetMinValue( ) and Opaque_SetMinValue( ) functions to compare the current column value with the existing minimum and maximum. It then calls the Opaque_Histogram( ) function to merge the column value into the distribution array of the Opaque_stat_t statistics-collection structure. However, you must provide this code within the Opaque_SetMinValue( ) and Opaque_Histogram( ) functions to perform the actual comparisons and distribution for your Opaque data type.

End of DBDK
SET_END in statcollect( )

After all rows are processed, statcollect( ) must transfer the statistics data from its statistics-collection structure into the predefined opaque type, stat. It is stat data that the UPDATE STATISTICS statement stores in the encdat column of the sysdistrib system catalog table.

The stat data type is a multirepresentational opaque data type; that is, it holds statistics data within its internal structure until the data reaches a predefined threshold. If the statistics data exceeds this threshold, the stat data type stores the data in a smart large object. In support of the multirepresentational data, the stat data type provides the following functions:

For UPDATE STATISTICS to be able to store the distribution data in the encdat column, the statcollect( ) function must copy its statistics-collection structure into the stat data type.

The internal structure of the stat opaque type is a C language structure named mi_statret. The stat support functions handle most of the interaction with the mi_statret structure; however, your statcollect( ) function must fill in the mi_statret multirepresentational fields.

For an exact declaration of mi_statret, see the milo.h header file. This header file also provides the following useful declarations.

Declaration
Purpose
mi_stat_buf
#define for the statdata.buffer field
mi_stat_mr
#define for the statdata.mr field
MI_STATMAXLEN
Constant for the size of the statdata.buffer field
mi_stat_hdrsize
Size of the information in the mi_statret structure that is not holding the statistics data (size of all fields except the statdata field)

The assign( ) and destroy( ) support functions of the stat opaque type determine whether to store the distribution data directly in the encdat column or in a smart large object. In the latter case, the encdat column stores the LO handle of the smart large object. Your statcollect( ) function can use the MI_STATMAXLEN constant to determine whether it needs to handle multirepresentational data.

The MI_STATMAXLEN constant is the maximum size that the encdat column of sysdistrib can hold. Therefore, it is the maximum size of the statdata.buffer array. If your distribution data has a size less than MI_STATMAXLEN, you can take the following actions:

The assign( ) and destroy( ) support functions of the stat opaque type take care of determining whether to store the distribution data directly in the encdat column or in a smart large object whose LO handle is stored in the encdat column.

If your distribution data exceeds MI_STATMAXLEN, your statcollect( ) function must handle the multirepresentational data itself, with the following steps:

  1. Create a new smart large object.
  2. Copy the data from the statistics-collection structure into the new smart large object.
  3. Copy the LO handle of this smart large object into the statdata.mr.mr_lo_struct.mr_s_lo field.
  4. Set the statdata.szind field to MI_MULTIREP_LARGE to indicate that the multirepresentational data is stored in a smart large object.

Registering the statcollect( ) Function

As with any user-defined function, you register the statistics-collection function with the CREATE FUNCTION statement. The registration of this function has the following requirements:

Parameter Number
Parameter Data Type
1
SQL name for the user-defined data type
2
FLOAT
3
FLOAT

The following CREATE FUNCTION statement registers the statistics-collection function that Figure 141 declares:

CREATE FUNCTION statcollect(ll_arg longlong, num_rows FLOAT,
   resolution FLOAT)
RETURNING stat
WITH (HANDLESNULLS)
EXTERNAL NAME '/usr/udrs/bin/longlong.so(stat_collect_ll)'
LANGUAGE C;

After you register the statcollect( ) function, make sure those with the DBA privilege or the table owner has the Execute privilege on the function.

Executing the UPDATE STATISTICS Statement

To collect user-defined statistics, run the UPDATE STATISTICS statement in HIGH or MEDIUM mode. The syntax of UPDATE STATISTICS is the same for user-defined data types as for built-in data types. However, when UPDATE STATISTICS collects statistics for a user-defined type, it does not automatically determine the minimum and maximum column values (stored in the colmin and colmax columns of the syscolumns system catalog table). Your statcollect( ) function can explicitly calculate these values if desired.

The statcollect( ) function executes once for every row that the database server scans during UPDATE STATISTICS. Therefore, a database table must contain more than one row before the database server calls any statcollect( ) functions.

The number of rows that the database server scans depends on the mode and the confidence level. Executing UPDATE STATISTICS in HIGH mode causes the database server to scan all rows in the table. In MEDIUM mode the database server chooses the number of rows to scan based on the confidence level. The higher the confidence level, the higher the number of rows that the database server scans. For general information about UPDATE STATISTICS, see the IBM Informix: Guide to SQL Syntax.

For example, if the mytable table contains a column of type Box, the following UPDATE STATISTICS statement collects user-defined statistics for all columns of mytable, including any columns with user-defined statistics defined:

UPDATE STATISTICS HIGH FOR TABLE mytable;

If the mytable column contains columns with any data types that require user-defined statistics and you do not define this statistics collection, the UPDATE STATISTICS statement does not collect statistics for the column.

Important:
The statistics that the database server collects might require a smart large object for storage. For the database server to use user-defined statistics, the configuration parameter SYSSBSPACENAME must be set in the ONCONFIG file before the database server is initialized. This configuration parameter must specify the name of an existing sbspace. If SBSSPACENAME is not set, the database server might not be able to collect the specified statistics.
Home | [ Top of Page | Previous Page | Next Page | Contents | Index ]