
Predicate Evaluation

An expression in the qualification of a query is a predicate. The WHERE clause in the following query is a predicate:

SELECT name, boundary FROM cities
   WHERE Overlaps(boundary, '(1,1), (5,5), (7,7), (8,9)');

The database server evaluates the predicate for every row in the table and returns the rows that satisfy the predicate to the client. Each predicate in a qualification eliminates rows from the candidate set. After the server determines that a row does not satisfy a predicate, it moves on to the next candidate row.

Most predicate evaluation is straightforward: values of interest are extracted from a candidate row and evaluated against the predicate. However, in some cases predicates in the qualification behave in a unique way. These cases are described in the sections that follow: Expensive Routines, User-Defined Statistics, Aggregates, and Sorting Results.

Expensive Routines

Expensive routines are routines that either take a long time or require a great deal of disk space to run. Conventional relational database systems do not account for expensive routines; any predicate that appears in a query is assumed to be as expensive as any other. For example, comparing two floating-point numbers is not more difficult than comparing two integers. For relational databases, this is the right approach.

However, an object-relational database system must evaluate relative function costs. Some routines are very difficult to compute or require a great deal of intermediate space. For example, it can take many thousands of machine instructions to determine whether two polygons overlap.

Because an object-relational database stops evaluating predicates as soon as it determines that a row does not satisfy the criteria, the database server chooses an optimum order to evaluate the predicates in a query. If it evaluates all the expensive predicates first, the query runs slower than if it considers the inexpensive predicates first.

The strategy for choosing the best order to evaluate predicates is complex and beyond the scope of this discussion. However, the database server must evaluate the cost of invoking user-defined routines to run queries efficiently.

Most DataBlade module routines are at least as complex as a routine that compares floating-point numbers. For DataBlade module routines that are more expensive, you must describe the relative expense to the Informix server.

A good formula for estimating the expense of a routine is as follows:

<lines of code> + (<number of I/Os> * 100)

For example, if a routine has 100 lines of code and performs 5 disk I/Os or SQL queries, the cost is 100 + (5 * 100), or 600. You can enter the cost in the BladeSmith Routine wizard (see Cost of Routine).
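
BladeSmith places the cost you enter into the SQL registration script that it generates for the routine. The following sketch shows how a cost of 600 might be attached to a function through the PERCALL_COST routine modifier; the argument types, shared-object path, and entry point shown here are hypothetical:

-- Hypothetical registration; BladeSmith normally generates this SQL.
CREATE FUNCTION Overlaps(polygon, polygon)
   RETURNS BOOLEAN
   WITH (NOT VARIANT, PERCALL_COST = 600)
   EXTERNAL NAME '$INFORMIXDIR/extend/spatial/spatial.bld(Overlaps)'
   LANGUAGE C;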

When you estimate the cost of executing a routine, consider the following questions:

User-Defined Statistics

User-defined statistics provide a way to improve performance for queries that compare opaque data type values. They compile information about the values in an opaque data type column that the optimizer can use when it creates a query plan involving routines that compare values of that type.

Statistics typically consist of the minimum value, the maximum value, and the distribution of values in the specified column; however, you can collect more information if it is appropriate for your opaque data type.

When your statistics-gathering function calculates the distribution of column values, it can assign each value to a "bin." Each bin contains a range of values. For example, suppose the column values range from 1 through 10. You could have five bins: the first bin would hold values from 1 through 2, the second bin would hold values greater than 2 through 4, and so on. The database server generates statistics by calling your statistics-gathering function when you run the UPDATE STATISTICS statement in medium or high mode (see the IBM Informix: Guide to SQL Syntax).
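
For example, assuming the boundary column of the cities table holds an opaque type that has a registered statistics-gathering function, a statement such as the following sketch causes the server to call that function and record the resulting distribution:

-- Medium mode samples the data; high mode reads every row.
UPDATE STATISTICS MEDIUM FOR TABLE cities (boundary);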

Important:
You must understand your data and how users will query it to create meaningful statistics.

The minimum, maximum, and distribution of values can be used to compute the selectivity of a value. The optimizer can then use the selectivity of values when it determines query cost estimates. For example, suppose you want to join two tables. Normally, a join compares all values in one table to all values in the other table. However, if the optimizer knows that a filter on one of the tables has a low selectivity (that is, it passes few rows), it can order the join more efficiently.

Selectivity is an estimate of the fraction of rows that a filter in a query passes. Selectivity values range from 0.0 to 1.0, where a value near 0.0 indicates a very selective filter that passes very few rows and a value near 1.0 indicates a filter that passes almost all rows. The optimizer uses selectivity information to reorder the filters in the query predicate so that expensive filters are evaluated after inexpensive, highly selective filters have eliminated rows. In this way the optimizer reduces the number of comparisons and improves performance. To determine the selectivity of a routine, the database server calls the associated selectivity routine.

For example, suppose you have an opaque data type that represents a circle and you have created a distribution for the circle type based on the radius. Assume that the values of the radius range from 5 to 15. If you query for all circles with a radius of less than 4, the selectivity of the LessThan() function that handles the circle data type is 0 because no values qualify. Consequently, the optimizer would evaluate the LessThan() filter before the other operations in the query predicate. Alternatively, if you query for all circles with a radius of greater than 4, the selectivity of the GreaterThan() function that handles the circle data type is 1.0 because all values qualify. Consequently, the optimizer would execute the GreaterThan() function after all other operations in the query predicate.

You can define selectivity routines for user-defined functions with the following characteristics:

For example, you can define selectivity functions for the Equal(), LessThan(), and GreaterThan() functions that are overloaded for an opaque data type. You can also define a selectivity function for a function like Contains() that compares two opaque data types.
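
You associate a selectivity function with a user-defined function when you register it. The following sketch assumes a circle opaque type and hypothetical routine names; it names GreaterThanSel() as the selectivity function for GreaterThan() through the SELFUNC routine modifier:

-- Hypothetical names and path; SELFUNC names the selectivity routine
-- that the optimizer calls to estimate how many rows the filter passes.
CREATE FUNCTION GreaterThan(circle, circle)
   RETURNS BOOLEAN
   WITH (NOT VARIANT, SELFUNC = GreaterThanSel)
   EXTERNAL NAME '$INFORMIXDIR/extend/circle/circle.bld(GreaterThan)'
   LANGUAGE C;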

To implement user-defined statistics, you must supply the following routines:

After you define the routines in BladeSmith, you must add code to them to provide the required functionality. See Editing Statistics Routines in statistics.c and Selectivity Functions for instructions.

To determine whether your opaque data type needs user-defined statistics, consider the following questions:

Aggregates

Most DataBlade module functions operate on single values and return one result each time they are called. Aggregates, however, are functions that are called repeatedly with different values; they accumulate information across those calls and return a single result after all the values have been supplied.

An example of an aggregate is the built-in AVG aggregate. This aggregate computes the average value of all its arguments. For example, an SQL user could issue the following query:

SELECT AVG(population) FROM cities;

The query processing engine calls the supporting function for AVG repeatedly with population values from the cities table. After all the populations have been passed to AVG one at a time, it returns the average population. You can extend the aggregates that are built into the database server by overloading their operator functions for an extended data type. For more information, see the IBM Informix: DataBlade API Programmer's Guide.
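
For example, AVG over an extended type generally requires overloading the plus() and divide() operator functions for that type. The following sketch assumes a hypothetical currency opaque type with external C implementations; the routine names, the divisor argument type, and the shared-object path are all assumptions:

-- Sketch for a hypothetical 'currency' opaque type; AVG uses plus()
-- to accumulate values and divide() to divide the sum by the row count.
CREATE FUNCTION plus(currency, currency)
   RETURNS currency
   EXTERNAL NAME '$INFORMIXDIR/extend/currency/currency.bld(currency_plus)'
   LANGUAGE C;

CREATE FUNCTION divide(currency, INTEGER)
   RETURNS currency
   EXTERNAL NAME '$INFORMIXDIR/extend/currency/currency.bld(currency_divide)'
   LANGUAGE C;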

You can also define new aggregates that are implemented with user-defined functions. For example, one common spatial operation is to compute a minimum bounding rectangle that contains a collection of other rectangles. A user might write the following query using a user-defined aggregate called BOUNDING:

SELECT BOUNDING(boundary) FROM cities;

The BOUNDING aggregate takes all the polygons, one at a time, from the cities table and returns the smallest rectangle that contains them all. The query processing engine supplies records to the aggregate for computation; the aggregate only collects information over the arguments it is passed. For more information, see Creating Aggregates.
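
A user-defined aggregate of this kind is registered with the CREATE AGGREGATE statement, which names a support function for each phase of the computation. The following sketch uses hypothetical support function names for BOUNDING:

-- Hypothetical support functions: ITER folds each polygon into the
-- running rectangle, COMBINE merges partial results computed in
-- parallel, and FINAL returns the finished bounding rectangle.
CREATE AGGREGATE BOUNDING
   WITH (INIT = bounding_init,
         ITER = bounding_iter,
         COMBINE = bounding_combine,
         FINAL = bounding_final);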

Like ordinary functions, aggregates may appear anywhere in the query, including in the target list and the qualification. Aggregates in the qualification are most useful in queries that also do grouping. See Grouping for more information on how aggregates work in grouping queries.

If you have a data type over which summary or statistical analyses are valuable, consider defining an aggregate.

When you design a DataBlade module, ask yourself the following question: Is it useful to compute a summary over values that the DataBlade module supports?

Sorting Results

SQL allows you to sort result rows when you express your queries. Sorted results are useful when you need to see records in some particular order.

The following query sorts a list of cities and their populations in descending order by population:

SELECT name, population FROM cities ORDER BY population desc;

If a DataBlade module defines a data type that can be sorted in a meaningful way, you must supply a comparison routine for the type. This routine allows the user to sort query results on that type.
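
In Informix, the comparison routine is a compare() function overloaded for the type; it returns a negative value, zero, or a positive value to indicate the ordering of its two arguments. The following sketch assumes a polygon opaque type and a hypothetical shared-object path:

-- Sketch: a compare() overload that lets the server sort values of the
-- hypothetical 'polygon' opaque type by returning a negative value,
-- zero, or a positive value.
CREATE FUNCTION compare(polygon, polygon)
   RETURNS INTEGER
   WITH (NOT VARIANT)
   EXTERNAL NAME '$INFORMIXDIR/extend/spatial/spatial.bld(polygon_compare)'
   LANGUAGE C;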

In addition, you can use the results of routines that appear in the target list to sort the results of a query. For example, the following query returns a list of cities in descending order by population density:

SELECT name, population, 
   population / Area(boundary) AS density
   FROM cities 
   ORDER BY density desc;

The density expression, on which the query results are sorted, is a complex calculation. The expression includes a DataBlade module routine and a division operation. Because your Informix database server allows sorting by floating-point numbers, the preceding query requires no special sorting support from the DataBlade module.

To determine whether to provide sorted results, ask the following questions:
