DataBlade Developers Kit User's Guide
Chapter 2: Designing DataBlade Modules
Home
Contents
Index
Master Index
New Book
Query Processing
To develop a DataBlade module, you need a general understanding of query processing and
SQL
in Informix Dynamic Server. You must also understand the execution environment inside Informix Dynamic Server-the multithreading model, the collection of processes in which DataBlade module routines can execute, and concurrent access to database objects, transactions, and so on. This section describes query processing.
Predicate Evaluation
An expression in the qualification of a query is a
predicate
. The
WHERE
clause in the following query is a predicate:
select name, boundary from cities
where Overlaps(boundary, '(1,1), (5,5), (7,7), (8,9)');
The database server evaluates the predicate for every row in the table and returns the rows that satisfy the predicate to the client. Each predicate in a qualification eliminates rows from the candidate set. After the server determines that a row does not satisfy a predicate, it moves on to the next candidate row.
Most predicate evaluation is straightforward-values of interest are extracted from a candidate row and evaluated against the predicate. However, there are some cases where predicates in the qualification behave in a unique way. These cases are described in the following sections.
Expensive Routines
Expensive routines
are routines that either take a long time or require a great deal of disk space to run. Conventional relational database systems do not account for expensive routines; any predicate that appears in a query is assumed to be as expensive as any other. For example, comparing two floating-point numbers is not more difficult than comparing two integers. For relational databases, this is the right approach.
However, an object-relational database system must evaluate relative function costs. Some routines are very difficult to compute or require a great deal of intermediate space. For example, it can take many thousands of machine instructions to determine whether two polygons overlap.
Because an object-relational database stops evaluating predicates as soon as it determines that a row does not satisfy the criteria, the database server chooses an optimum order to evaluate the predicates in a query. If it evaluates all the expensive predicates first, the query runs slower than if it considers the inexpensive predicates first.
The strategy for choosing the best order to evaluate predicates is complex and beyond the scope of this discussion. However, the database server must evaluate the cost of invoking user-defined routines to run queries efficiently.
Most DataBlade module routines are at least as complex as a routine that compares floating-point numbers. For DataBlade module routines that are more expensive, you must describe the relative expense to the Informix server.
A good formula for estimating the expense of a routine is as follows:
<lines of code> + (<number of I/Os> * 100)
For example, if a routine has 100 lines of code and performs 5 disk
I/O
s or SQL queries, the cost is 100 + (5 * 100), or 600.
When you estimate the cost of executing a routine, consider the following questions:
Which DataBlade module routines take a long time to run?
Which DataBlade module routines consume large amounts of memory or disk space?
How expensive are the DataBlade module routines relative to one another?
How expensive are the DataBlade module routines relative to expensive routines defined by other DataBlade modules?
Sorting
SQL allows you to sort result rows when you express your queries. Sorted results are useful when you need to see records in some particular order.
The following query sorts a list of cities and their populations in descending order by population:
select name, population from cities order by population desc;
If a DataBlade module defines a data type that can be sorted in a meaningful way, you must supply a comparison routine for the type. This routine allows the user to sort query results on that type.
In addition, you can use the results of routines that appear in the target list to sort the results of a query. For example, the following query returns a list of cities in descending order by population density:
select name, population,
population / Area(boundary) as density
from cities
order by density desc;
The density expression, on which the query results are sorted, is a complex calculation. The expression includes a DataBlade module routine and a division operation. Because Informix Dynamic Server includes support for sorting by floating-point numbers, the preceding query requires no special sorting support from the DataBlade module.
Ask the following questions:
Can my DataBlade module data types be sorted?
Will users want to sort this data type?
Grouping
SQL allows you to write queries that group results. Grouping is a powerful facility for summarizing data, particularly in combination with aggregates such as
COUNT
or
SUM
. The following query uses grouping and aggregates:
select COUNT(name), population from cities group by population;
This query returns the number of cities that have the same population for each distinct population value that appears in the table. The
GROUP BY
clause breaks the set of result rows into groups with equal populations; then the target list is evaluated for each group separately. The
COUNT
aggregate counts the number of city names in the group.
Consider whether any of the types you define are candidates for grouping. In the SimpleMap DataBlade module, for example, polygons are a poor candidate; users seldom want to group geographic data that contains identical polygons.
You can group results using complex expressions. For example, the following query divides cities into groups that are within 10 units of the same area and then adds the population for the group:
select Area(boundary) / 10 as dimensions, SUM(population)
from cities group by dimensions;
To determine whether your DataBlade module requires support routines for grouping, ask the following questions:
For each type in the DataBlade module, can the values sensibly be broken into groups that are equivalent?
What is the meaning of each of these groups?
Do users want to group values in that way?
Casts
If your DataBlade module defines types that are similar or comparable, consider defining casts between the types. You can also define casts from DataBlade module types to built-in types, and from data types in one DataBlade module to data types in another DataBlade module.
Casting values allows the query processing engine, implicitly or explicitly, to change the type of a value and use it as an argument to routines that require the destination type.
In an inheritance hierarchy, casting can provide another mechanism for type conversion. In general, subclasses can be implicitly cast to superclass types. However, downward casts (that is, from supertype to subtype) are not automatically supported because subclasses typically add instance variables not present in the supertype.
Similarly, distinct types can often be cast to their source type. For example, a distinct type called
LIRA
(representing the currency unit of Italy), based on
MONEY
data type, might allow casting to
MONEY
to allow simple math operations on it. However, you probably do not want to cast
MONEY
to
LIRA
; if
LIRA
has only the properties of
MONEY
, it is not a required type.
Casts can be confusing if overused. Implicit casts hide an important fact from users-that of data can be lost during type conversion. Explicit casts, which users must specify in queries, do not have this problem.
Use casts only where necessary. Be sparing in the casts you supply to users, and be sure you understand the circumstances under which you expect casts to be used.
To determine whether to provide casts, ask the following questions:
Are any of the types in the DataBlade module comparable? Do they really need to be different types? If so, is there a need to support explicit or implicit casts between those types?
Will users want to convert between values of one type and some other type, either an Informix built-in type or one defined by the DataBlade module?
Which direction should the conversion go (in the example earlier in this section, from
LIRA
to
MONEY
, or from
MONEY
to
LIRA
)? In general, casts should only go one way, unless you intend them to be explicit.
Access Path Selection
During query processing, Informix Dynamic Server takes a
nonprocedural
query and produces a
procedural
plan for satisfying it; this process is called
making an access path selection
. Queries are nonprocedural because they describe only the records of interest and what operations to perform on them. They do not prescribe an algorithm (procedure) for locating records on disk or the order in which to process them.
Informix Dynamic Server evaluates a collection of possible
query plans
that can execute the query correctly. The server estimates the cost of running each plan and chooses the one with the smallest cost estimate. Cost estimates are a combination of the number of expected disk
I/O
s, the expected number of records that must be processed, and the cost of invoking each of the routines in the query on each candidate row.
Unordered Row Processing
When you design a DataBlade module, you cannot control how queries are executed. There is no guarantee that the routines in a query are called in any particular order. DataBlade module routines are called during query processing to compute answers to queries. Do not hard-code query execution strategies. For example, an attempt to force an index scan or a sequential table scan reduces the number of choices available to the query optimizer and results in poor performance.
To ensure that your DataBlade module does not conflict with the query processing engine, ask the following questions:
Do any routines require values to be delivered in some particular order? If so, the routines break a fundamental rule of relational database systems and must be changed.
Is it important for routines in a query to execute in some particular sequence? Again, the routines must be changed.
Secondary Access Methods
A
secondary access method
is an index that allows queries to be evaluated more efficiently. When you create a table in SQL, you can choose to create a B-tree index on one or more columns in the table. The query processing engine can choose to use the index. For example, if there is an index on the population column of the
cities
table, the query processing engine has at least two choices for evaluating the following query:
select name from cities where population > 1000000;
The query processing engine can scan the
cities
table sequentially, examining each record in turn and comparing the population to one million, or it can use the B-tree index to quickly find only those records with populations of more than one million. When it chooses to use the B-tree index, the engine does not consider records with smaller populations and does not read them from the disk.
The B-tree index stores the key value (for example, the population) and a pointer to the record in the base table. The base table is the primary store, and the index is a secondary access method.
You can define many types of indexes. For example, most text search engines use a textual index to run searches quickly, while spatial data can be indexed in a number of ways, including grid files and R-trees.
You can allow the creation of other indexes on your data types. For example, a DataBlade module that defines a new type that can be sorted can allow users to create B-tree indexes on that type. To do so, you create an
operator class
for the type. An operator class is a collection of routines that allows the type to be used in a given access method. For example, the operator class for B-trees includes the routines
LessThan
,
LessThanOrEqual
,
Equal
,
GreaterThanOrEqual
, and
GreaterThan
. When you define those routines on a new data type, users can create B-tree indexes on the type.
Planning for Transaction Semantics
DataBlade module code runs in
SQL
transactions
. A transaction is a single, atomic, independent sequence of client/server interactions. For example, inside a transaction a user can search a table for all the cities that overlap Los Angeles. In a separate transaction, some other user can change the boundaries of Los Angeles, as outlying areas are incorporated into the city. The two operations are independent of one another. Each user is isolated from the changes made by the other until the next transaction begins.
You must define DataBlade module code to be stateless and not based on the assumption that any particular value persists across user transactions. For example, a DataBlade module that does text matching might provide two services: one to find all documents that contain a particular set of keywords; another to highlight the matching keywords in the documents.
If a user first runs a query to find matching documents, and then runs a separate query to highlight the matches, the second operation cannot rely on any cached results from the first. This is because some document contents might have changed. In addition, because Informix Dynamic Server is a multiuser system, different users can run the same routines at the same time. There is no way to guarantee that "saved" results belong to a particular query.
To ensure that you design systems that operate correctly in this environment, ask the following questions:
Are any of the routines based on an assumption that results from previous user actions are still valid?
Do I try to cache results for reuse?
What happens if two users run the same routine on the same table simultaneously?
DataBlade Developers Kit User's Guide
, version 3.6
Copyright © 1998, Informix Software, Inc. All rights reserved.