GeoServer : aggregatederivedfeaturetype
This page last changed on Mar 08, 2005 by dblasby.
Please see DerivedFeatureType for an overview.
This describes an implementation for a datastore that wraps another datastore and produces a new FeatureType (with fewer rows) computed from the FeatureType produced by the wrapped datastore.
In order to explain how a database typically handles GROUP BY clauses, I'm going to explain how a very simple database might execute a simple query (the actual operation of GROUP BY is much more complex than this).
Let's look at a simple query, the data it is executed against, and the result it would produce.
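As a minimal hypothetical illustration (the river_name and nStudies columns are borrowed from the examples later on this page; the exact query, table, and data are assumptions):

```python
# Hypothetical reconstruction of the kind of query discussed here:
#   SELECT river_name, sum(nStudies) FROM river_segments GROUP BY river_name
rows = [
    ("Fraser", 2),
    ("Fraser", 3),
    ("Nile", 1),
    ("Nile", 4),
]

result = {}
for river_name, n_studies in rows:
    # accumulate sum(nStudies) per distinct river_name
    result[river_name] = result.get(river_name, 0) + n_studies

print(result)  # {'Fraser': 5, 'Nile': 5}
```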
Aggregate functions start with an initial value. They are then called with two pieces of information - the results of the last call (or initial value for the first call), and the new piece of data from the database.
The basic operation of all of them is simple: start from the initial value, then fold each incoming datum into the running result. Most aggregate functions are streamable - they use a constant amount of memory, since all their state is carried in the running result.
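This fold can be sketched as follows; make_sum is a hypothetical streamable "sum" aggregate, not part of any real API:

```python
def make_sum():
    # A streamable "sum" aggregate: an initial value plus a step function
    # called with (result of the last call, new datum from the database).
    initial = 0
    def step(previous, value):
        return previous + value
    return initial, step

initial, step = make_sum()
acc = initial
for value in [2, 3, 1, 4]:
    acc = step(acc, value)  # constant memory: only the running result is kept
print(acc)  # 10
```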
The database will execute the query as normal, and then "group together" all the rows that have the same value as the GROUP BY clause's column values.
The database typically does this by first sorting the dataset on the GROUP BY column (this can be done in a constant amount of memory with complex techniques). It then streams the result set through the aggregate functions. When it sees the GROUP BY column's value change it gets the result from the aggregate functions, resets them, outputs a result row, and then continues streaming.
If the aggregate functions run in a constant amount of memory, then this entire process can be done in a constant amount of memory.
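A minimal sketch of this sort-then-stream technique, assuming rows already sorted on the GROUP BY key and a generic aggregate step function (all names here are illustrative):

```python
def group_by_sorted(sorted_rows, initial, step):
    # Stream rows (already sorted on the GROUP BY key) through an aggregate,
    # emitting a result row and resetting whenever the key changes.
    results = []
    current_key, acc = None, initial
    for key, value in sorted_rows:
        if current_key is not None and key != current_key:
            results.append((current_key, acc))  # key changed: emit and reset
            acc = initial
        current_key = key
        acc = step(acc, value)
    if current_key is not None:
        results.append((current_key, acc))  # flush the final group
    return results

rows = sorted([("Nile", 1), ("Fraser", 2), ("Nile", 4), ("Fraser", 3)])
grouped = group_by_sorted(rows, 0, lambda acc, v: acc + v)
print(grouped)  # [('Fraser', 5), ('Nile', 5)]
```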
If datastores could return result sets in sorted order, we could use the same technique as the database. Unfortunately, sorting in constant memory is difficult, so I'm proposing something much simpler - just use a hashtable keyed by the GROUP BY column, where each entry links to a Collection of Features.
You could then just stream each Collection of Features through the aggregation function.
NOTE: if there's no GROUP BY clause (i.e. there is a single group containing all the features returned by the datastore), then a hashtable is not required - you can stream directly to the aggregation function and produce a single result row.
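The hashtable approach might be sketched like this; aggregate_features and the dict-based features are hypothetical stand-ins for real Feature objects, not GeoTools API:

```python
from collections import defaultdict

def aggregate_features(features, group_key, initial, step):
    # Hash-table approach: bucket features by the GROUP BY column, then
    # stream each bucket through the aggregate function.
    buckets = defaultdict(list)
    for feature in features:
        buckets[group_key(feature)].append(feature)
    output = []
    for key, bucket in buckets.items():
        acc = initial
        for feature in bucket:
            acc = step(acc, feature)
        output.append((key, acc))
    return output

features = [
    {"river_name": "Fraser", "nStudies": 2},
    {"river_name": "Nile", "nStudies": 1},
    {"river_name": "Fraser", "nStudies": 3},
]
summary = aggregate_features(
    features,
    group_key=lambda f: f["river_name"],
    initial=0,
    step=lambda acc, f: acc + f["nStudies"],
)
print(summary)  # [('Fraser', 5), ('Nile', 1)]
```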
Let's look at two very similar queries - the first with a GROUP BY clause, the second without.
The resulting DataStore for the first query would basically store one summary feature per distinct river_name.
The implementation is simple - retrieve results from the wrapped query and build a hashtable linking each river_name to its set of features. For each key in the hashtable, the collection of features is then collapsed to a single output feature. This collapsing step is very simple.
The "null()" aggregate is very simple - it just remembers the first value passed to it (since the values will always be the same for every row in a group).
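The collapsing step and the null() aggregate could be sketched as follows; collapse and the column-mapping structure are hypothetical illustrations, not GeoTools API:

```python
UNSET = object()

def null_step(previous, value):
    # null() aggregate: remember the first value passed in; every row in a
    # group has the same GROUP BY value, so later values are ignored.
    return value if previous is UNSET else previous

def collapse(group, aggregates):
    # Collapse one group's features into a single output feature. The
    # aggregates mapping (output column -> (initial, step, source column))
    # is an assumed representation.
    out = {}
    for column, (initial, step, source) in aggregates.items():
        acc = initial
        for feature in group:
            acc = step(acc, feature[source])
        out[column] = acc
    return out

group = [
    {"river_name": "Fraser", "nStudies": 2},
    {"river_name": "Fraser", "nStudies": 3},
]
row = collapse(group, {
    "river_name": (UNSET, null_step, "river_name"),
    "total_studies": (0, lambda acc, v: acc + v, "nStudies"),
})
print(row)  # {'river_name': 'Fraser', 'total_studies': 5}
```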
The second query is evaluated in the same way, but is a bit simpler - no hash table is required, and we can stream directly from the wrapped datastore!
NOTE: it's possible for an aggregate to pull information from more than one column, but I think this is rare and support can be added at a later date.
NOTE: FID generation is undefined - the resulting rows don't really exist. You could use any FID from the input group of features, or just return a unique code.
Normally filtering will need to be completely performed by the Aggregate virtual DataStore, but there are two types of Filters that should be passed on to the wrapped datastore: Filters on the GROUP BY clause, and Bounding Box filters.
Here's an example of a query with a WHERE clause that references the GROUP BY column.
Since we're putting a filter on the GROUP BY column ("river_name"), we should pass it off to the wrapped DataStore to limit the number of rows that the virtual DataStore needs to process. If we do not do this, we would generate summary information for ALL rivers and then filter the results - a lot of unnecessary work!
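A sketch of this filter-splitting decision; split_filters and the (columns, predicate) pairs are an assumed representation of parsed filters, not GeoTools API:

```python
def split_filters(filters, group_by_columns):
    # Filters that reference only GROUP BY columns can be pushed down to the
    # wrapped datastore; everything else is applied by the aggregating
    # virtual datastore.
    push_down, keep = [], []
    for columns, predicate in filters:
        if set(columns) <= set(group_by_columns):
            push_down.append(predicate)
        else:
            keep.append(predicate)
    return push_down, keep

filters = [
    (["river_name"], "river_name = 'Fraser'"),  # references only the GROUP BY column
    (["nStudies"], "nStudies > 2"),             # handled by the virtual datastore
]
push_down, keep = split_filters(filters, ["river_name"])
print(push_down, keep)  # ["river_name = 'Fraser'"] ['nStudies > 2']
```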
Geometric Filtering is a bit more complex - so the user should have two options:
The first is more correct - but extremely inefficient. You would be computing the aggregation of the ENTIRE dataset (!!), then pulling out only the portions that overlap your bounding box.
The second isn't quite correct - but it is efficient. The wrapped datastore applies the bounding box filter FIRST, and the result is then passed in for aggregation. This could cause unexpected problems: in the queries above, only the segments in the underlying datastore that fall inside the bounding box would be sent to the aggregator, so segments outside the bounding box would not contribute to the sum(nStudies) aggregation!
Since the memory requirements for aggregation can be large, the aggregation datastore should be given a maximum number of features to pull from the underlying datastore. This limit would be user-configurable.
The non-geometric aggregates (such as count, sum, min, max, and avg) are very simple.
NOTE: the geometric aggregates can be quite complex because JTS does not handle GeometryCollections well for union and intersection. These problems can be worked around.
There are not as many use cases for this type of virtual datastore, but it is essential for producing good WMS maps.
|Document generated by Confluence on May 14, 2014 23:00|