This page last changed on Mar 07, 2005 by dblasby.

Introduction

Please see DerivedFeatureType for an overview.

This describes an implementation for a datastore that wraps another datastore and produces a new FeatureType computed from the FeatureType produced by the wrapped datastore.

This "virtual" datastore needs to be able to transform an input FeatureType into a new FeatureType, plus be able to handle queries.

Basic Implementation (reading/conversion)

The simple idea is to setup a simple set of associations: (a) new attribute name (b) new attribute definition (based on the OGC Filter Function clause) and (c) new attribute type.

For example, consider the query:

SELECT buffer(geom, river_width) as geom, 
       (num_fed_studies + num_state_studies) as nStudies,
       length(geom) as river_len
       width as width
FROM myRiverTable;

With the following input and output:

geom|num_fed_studies|num_state_studies|width             geom|nStudies|river_len|width
--------------------------------------------             -----------------------------
LINE|       2       |        1        |  30              POLY|   3    |   600   | 30
LINE|       0       |        2        |  28              POLY|   2    |   500   | 28
LINE|       1       |        0        |  31              POLY|   1    |   400   | 31

This defines an FeatureType with four attributes, defined according to the FeatureType they are derived from:

"geom" (of type Geometry) with the definition (see "Functions" below):

<Function name="JTS_BUFFER">
    <PropertyName>geom</PropertyName>
    <PropertyName>width</PropertyName>
</Function>

"nStudies" (of type integer) with the definition:

<Add>
    <PropertyName>num_fed_studies</PropertyName>
    <PropertyName>num_state_studies</PropertyName>
</Add>

"river_len" (of type double) with the definition:

<Function name="JTS_GETLENGTH">
    <PropertyName>geom</PropertyName>
</Function>

"width" (of type double) with the definition:

<PropertyName>width</PropertyName>

NOTE: I havent given any FID information - this should be derived from the wrapped datastore's FID. For example, converting "myRiverTable:3587" to "computedRiverTable:myRiverTable:3587" using namespace information.

The attribute type not be required because it can be determined by looking at the definition, see discussion on functions (below).

See the section on Functions, below, but reading from the "virtual" datastore should be as simple as reading a single feature from the wrapped datastore and creating a new feature by executing the above function calls on the wrapped feature. The abstract "filter evaluator" (for non-sql datstores) should be able to handle this type of evaluation.

Handling Queries (query rewriting)

Handling queries is actually quite simple. We just re-write the input query so it applies to the wrapped datastore. This should be as simple as replacing any "<PropertyName>...</PropertyName>" in the input query with the function-based definition given above.

For example, consider the following query on the wrapped datastore defined above:

<Filter>
   <PropertyIsBetween>
       <PropertyName>nStudies</PropertyName>
       <LowerBoundary>2</LowerBoundary>
       <UpperBoundary>10</UpperBoundary>
   </PropertyIsBetween>
</Filter>

by replacing the "<PropertyName>nStudies</PropertyName>" with its definition we get:

<Filter>
   <PropertyIsBetween> 
            <!-- replaced section -->
           <Add>
                <PropertyName>num_fed_studies</PropertyName>
                <PropertyName>num_state_studies</PropertyName>
            </Add> 
       <LowerBoundary>2</LowerBoundary>
       <UpperBoundary>10</UpperBoundary>
   </PropertyIsBetween>
</Filter>

This is then executed on the wrapped dataset.

Efficiency

Unfortunately, the re-written queries can be inefficient. This inefficiency can be reduced by taking advantage of the wrapped datastore's advanced indexing abilities.

Special case for Bounding Box queries

Here's a example of the problem (in the spatial context):

In this example, we see that the derived (buffered) dataset intersects the query bounding box, but the original (wrapped) dataset does not. If you were to execute a query like this on a PostGIS datastore, you would see a query along the lines of this:

-- NOTE: "buffer(geom, river_width) as geom" is done by the virtual datastore inside GeoTools
--       not by the datastore.  The datastore's job is to supply the underlying data
--       and perform correct filtering. 
SELECT geom, ...  
FROM myRiverTable
WHERE      buffer(geom, river_width) && <query bounding box> 
      AND  intersects(buffer(geom, river_width, <query bounding envelope> ));

NOTE:

  • for PostGIS, the buffer(geom, river_width) will only actually be computed once because the database will cache the result. Most datastores will calculate the buffer twice! Geotools will also calculate buffer(geom,river_width).
  • if there were a spatial index on "geom", it would not be used because PostGIS (and most other datastore) will not be able to relate the index on "geom" to "buffer(geom, river_width)". This means that the datastore will perform a sequencial scan of the database; it will compute buffer(geom,river_width) for every row in your table!

This is clearly too inefficient. I am, therefore, recommending that when the datastore is first configured that the user can specify special behavior for a bounding box search.

For example, if the user knows the widest river is 100m then they can "grow" the input bounding box by 100m and send that bbox to the wrapped datastore. NOTE: the 100m is a constant amount that is added to every query bbox.

-- NOTE: "buffer(geom, river_width) as geom" is done by the virtual datastore inside GeoTools
--       not by the datastore.  The datastore's job is to supply the underlying data
--       and perform correct filtering. 
SELECT geom, ...  
FROM myRiverTable
WHERE      buffer(geom, river_width) && <query bounding box> 
      AND  intersects(buffer(geom, river_width, <query bounding envelope> ))
       --- added clause
      AND  geom && <query bounding box expanded by 100m>;

This solution is not ideal because it could return too few rows (if the user's bbox expansion behavior is incorrect). It is up user to ensure that their specified behavior will produce the correct results. This solution is, however, very easy to implement and should work well for the majority of cases.

NOTE: if the user does not actually modify the wrapped geometries (ie. the new geometry is defined to be "<PropertyName>...<PropertyName>") in the "virtual" datastore, then there will be no problems. The bounding box will "pass through", unmodified, to the wrapped datastore and be indexable by whatever means the datastore does its spatial indexing.

Functional Index in Database-based Datastores

Most advanced databases (like postgresql and oracle) allow for an index to be built not only on actual columns, but on expressions involving data in the columns. The postgresql manual has more details here.

-- builds an index for queries of the form:
-- SELECT * FROM myRiverTable WHERE buffer(geom,width) && <bounding box>;
CREATE INDEX buffer_geom_indx ON myRiverTable ( buffer(geom,width) );

Once this index has been created, then the queries (given above) will actually use the index. Building indexes on expressions works to solve problems of this sort in general, but:

  • its only going to work on Datastores that are backed by advanced SQL databases. Other datastores will have to do full sequential scans.
  • some databases (like postgresql) have a difficult time indexing operators (for example "+", "-", and "*"). Both the index and actual query have to be made in terms of actual functions (like numeric_add() and int4pl()) - requiring changes to how the Geotools SQL generator actually generates queries. For example:
//postgresql specific.  numeric_add(a,b) --> a+b
SELECT numeric_add(num_fed_studies , num_state_studies) as nStudies, ...
FROM myRiverTable
WHERE  numeric_add(num_fed_studies, num_state_studies) > 10;

CREATE INDEX fed_plus_state_studies ON myRiverTable  numeric_add(num_fed_studies + num_state_studies);

I expect that most actual WFS/WMS queries will involve a bounding box - the simple bounding box solution given above is both extreamly simple, effective, and will capture most of the use-cases.

Read Only

Obviously, the derived datastore is read only. For example, if we update the derived attribute "nStudies" to 10, we have insufficient information to update the underlying datastore (with num_fed_studies and num_state_studies).

In general, one would allow modification to the wrapped datastore - the derived datastore will automatically "pick up" modifications.

OGC/Geotools Filter and Functions

The OGC specification allows for arbitrary functions to be called. It appears that there are only a few actually implemented in the Geotools Expression package.

Simple math functions ("+", "-", "/", and "*") and logic functions ("and", "or") are handled more directly in the Filter Expression.

It appears the only "extension" functions that Geotools supports is "Max(number,number)" and "Min(number,number)". I think this can be easily extended in the same manner as the Hypersonic SQL DB allows custom (static) functions to be called.

Spatial DB in a Box already sets up a StaticGeometry class that converts almost all the JTS functionality into "static" method. For example:

static public Geometry buffer(Geometry arg0,double arg1)
{
      Geometry _this = arg0;

      return _this.buffer(arg1);
}

The basic idea is to:
1. have a single class (or set of classes) like StaticGeometry that contain all the possible callable functions. (we can also do a FactoryFinder trick to allow users to automagically add new functions)
2. use reflection on these classes so that the system would "know" what functions are available (ie. for WFS getCapabilities)
3. use reflection to actually call the functions

NOTE: polymorphism (ie. functions with the same name, but different argument types or argument numbers) should not be allowed because they will make things much more difficult to code. Simply make all functions have unique names. For example, JTS supports a geometry.buffer(double) and a geometry.buffer(double,int) - these could be called "buffer", and "buffer_with_segmentCount").

For example, consider this inside a Filter:

<Function name="buffer">
    <PropertyName>geom</PropertyName>
    <PropertyName>width</PropertyName>
</Function>

Pseudo-code for evaluating this function (see MaxFunction.java in the Filter package):

   using reflection, get the argument types for the function org.openplans.StaticGeometry.buffer 
            -> first argument is Geometry type, second is Double type
   convert the first argument to a Geometry
           ->  ((Geometry) arg[0].getValue(feature))
   convert the second argument to a number:
           -> ((Number) arg[1].getValue(feature))
   use reflection and call "buffer" with these two arguments
   return the resulting Geometry

Looking at the code, it appears that Geotools currently supports expressions of numberic types, Geometry, and String. This is probably enough for most users, but they may want to do things like call functions on Date objects.

As I mentioned above, the return types of these functions should be explicit (and determined with reflection) - this will allow for automatic generation of AttributeTypes for the 'calculated' columns.

Other popular functions:

  • substring
  • catenate
  • switch statements (might have a messy representation)

Sample Uses

When I first used a WFS, I just assumed that you could perform functions on the returning columns - not just functions for the filters. This type of "virtual" datastore that makes derived features operates like a very simple SQL view and has many actual use-cases:

  • Stop wasting space (and maintanance) by keeping multiple copies of the data around - for example, a shapefile (or table) with the underlying data in it AND a shapefile with the computed data in it. Its much easier to attach the WFS to existing data sets without having to generate another dataset specifically for the webservice.
  • Computed fields will be auto-matainted by the function definition. Trying to do this with pre-computed columns (ie. "length" attribute vs "length(geom)" function) can be difficult to keep up-to-date across updates (ie. someone changes the geometry and forgets to update the length).
  • Creating new (interesting) geometries
  • Simple rename of columns names (or changing their type; for example a String to an Integer or a Double to an Integer)
  • Extreamly important for making "nice looking" maps with SLDs.

querybbox.gif (image/gif)
querybbox.gif (image/gif)
Document generated by Confluence on May 14, 2014 23:00