Optimizing Functions in SQL Statements

Extending Informix Dynamic Server 2000
Improving UDR Performance

Optimizing Functions in SQL Statements

The optimizer by itself cannot evaluate the cost of executing a function in an SQL statement because of the possibility of complex logic, user-defined types, and so on. Because some functions can be expensive to execute, the creator of the function should provide information about the cost and selectivity of the function to help in optimizing the SQL statement.

For example, the following SQL statement includes two functions:

SELECT * FROM T WHERE expensive(t1) and cheap(t2);

If the cheap() function is less expensive to execute than the expensive() function, the optimizer should place the cheap() function first in the execution plan.

The UDRs discussed in the following sections appear in the WHERE or HAVING clause of an SQL statement. These UDRs return a value of TRUE or FALSE.

Calculating the Query Plan

The optimizer computes the cost for all possible plans and then chooses the lowest-cost plan. Cost includes the number of disk accesses, the number of network accesses, and the amount of work in memory to access rows and sort data.

Selectivity is also a factor in the total cost. Selectivity is the percentage of rows that pass the filter. The optimizer expresses the selectivity as a number from 0 to 1, which represents the percentage of rows in the table that pass the filter.

The larger the selectivity value, the less likely that a row will disqualify the filter. Therefore, the database server generally evaluates a UDR with a smaller selectivity value before it evaluates a UDR with a larger selectivity value. Similarly, the database server generally evaluates a lower-cost UDR before a higher-cost one. The ultimate order of UDR filter evaluation depends on a combination of the cost and selectivity of the UDR.

For more information on how the optimizer calculates the query plan, refer to the Performance Guide.

Specifying Cost and Selectivity

You can provide the cost and selectivity of the function to the optimizer. The database server uses cost and selectivity together to determine the best path.

To provide the cost and selectivity for a function, include modifiers in the CREATE FUNCTION statement. You can include the cost and selectivity values in the CREATE FUNCTION statement or calculate the values with functions called during the optimization phase.

If you do not specify your own cost and selectivity values for a function, the database server uses a default selectivity of 0.1 and a default cost of 0. Because the default cost and selectivity are low, the database server considers a UDR with default cost and selectivity inexpensive.

to execute and will most likely execute that UDR before other UDRs in the WHERE clause.

The database server assigns a cost of 0 to all built-in functions, such as SIN and DATE.

Constant Cost and Selectivity Values

The following modifiers specify a cost or selectivity value when you execute the CREATE FUNCTION statement. The cost or selectivity value does not change for each invocation of the function:

percall_cost=integer

The percall_cost modifier specifies the cost of executing the function once. The integer value is a number.

selconst=float

The selconst modifier specifies the selectivity of a function. The float value is a floating-point number between 0 and 1 that represents the fraction of the rows for which you expect the routine to return TRUE.

For some functions, specifying the same cost for all invocations of the function is sufficient. For example, finding the absolute value of -68327 is no more time-consuming than finding the absolute value of -4. In these cases, a constant value that estimates the cost is appropriate.

Dynamic Cost and Selectivity Values

In some cases, the cost and selectivity of a function can vary significantly, depending upon the input to the function. If the input can change the optimization, use the following modifiers, which execute a function to compute the cost and selectivity at runtime:

costfunc=CostFunction

This modifier specifies the name of a function, CostFunction, that the optimizer executes to find the cost of executing your function one time.

selfunc=SelectivityFunction

This modifier specifies the name of a function, SelectivityFunction, that the optimizer executes to find the selectivity of your function.

You write these cost and selectivity functions to provide the optimizer with enough information about your function to create the best query plan.

The selectivity functions for a user-defined data type (UDT) might need statistics about the nature of the data in the UDT column. The database server does not generate distributions or maximum and minimum value statistics for a UDT. You need to write and register user-defined statistics functions to generate and store statistics for a UDT in the system catalog tables, in the same locations as statistics stored for built-in data types. For more information about user-defined statistics, refer to Extending UPDATE STATISTICS.

Calculating Cost

The cost you specify for a function must be compatible with the cost that the optimizer calculates for other parts of the SQL statement. The following formula is one method to approximate the costing algorithm that the optimizer uses:

Execute the following SQL statements from DB-Access, where bigtable is any large table:

SET EXPLAIN ON;

SELECT count(*) from bigtable;

Time the query.

Let secost be the cost the optimizer assigned for the scan. Read the sqexplain.out file to get secost.

For information about sqexplain.out, refer to the Performance Guide.

Let sacost be the time required to complete the SQL statement.

Execute and time your function. Let facost be the actual time required to execute the function once.

The cost of executing the function once can be approximated as follows: ((secost/sacost)*facost)
Truncate the calculated cost to an integer value.

Selectivity and Cost Examples

The following example creates a function that determines if a point is within a circle. When an SQL statement contains this function, the optimizer executes the function contains_sel() to determine the selectivity of the contains() function.

CREATE FUNCTION contains(c circle, p point)
RETURNING boolean WITH(selfunc=contains_sel)
EXTERNAL NAME "$USERFUNCDIR/circle.so" LANGUAGE C;

The following example creates two functions, each with cost and selectivity values:

CREATE FUNCTION expensive(cust int)
RETURNING boolean WITH(percall_cost=50,selconst=.1)
EXTERNAL NAME "/ix/9.2/exp_func.so" LANGUAGE c;

CREATE FUNCTION cheap(cust int)
RETURNING boolean WITH(percall_cost=1,selconst=.1)
EXTERNAL NAME "/ix/9.2/exp_func.so" LANGUAGE c;

When both of these functions are in one SQL statement, the optimizer executes the cheap() function first because of the lower cost. The following SET EXPLAIN output, which lists cheap() first in the Filters: line, shows that indeed the optimizer did execute cheap() first:

QUERY:
------
select * from customer
where expensive(customer_num)
and cheap(customer_num)
Estimated Cost: 8
Estimated # of Rows Returned: 1
  1) informix.customer: SEQUENTIAL SCAN
        Filters: (lsuto.cheap(informix.customer.customer_num 
)AND lsuto.expensive(informix.customer.customer_num ))

For an example of a C function that calculates a cost dynamically, refer to the \%INFORMIXDIR\dbdk\examples\Types\dapi\Statistics\Box\src\c directory after you install the DataBlade Developers Kit.

Extending Informix Dynamic Server 2000Improving UDR Performance