This page last changed on Mar 08, 2005 by dblasby.

Introduction

Please see DerivedFeatureType for an overview.

This describes an implementation for a datastore that wraps another datastore and produces a new FeatureType (with fewer features), computed from the FeatureType produced by the wrapped datastore.

Basic Implementation in Databases

In order to explain how a database usually handles GROUP BY clauses, I'm going to walk through how a very simple database might execute a simple query (real GROUP BY processing is considerably more complex than this).

Let's look at this simple query:

SELECT city_name, sum(temperature), count(month) FROM temp_data GROUP BY city_name;

Being executed on this data:

city_name   |  month | temperature
----------------------------------
Los Angeles |    1   |    10      |
Los Angeles |    2   |    11      |
Los Angeles |    3   |    12      |
Los Angeles |    4   |    13      |
Los Angeles |    5   |    14      |
Los Angeles |    6   |    15      |  GROUP 1
Los Angeles |    7   |    16      |
Los Angeles |    8   |    17      |
Los Angeles |    9   |    18      |
Los Angeles |    10  |    19      |
Los Angeles |    11  |    15      |
Los Angeles |    12  |    10      |
New York    |    1   |    0       |
New York    |    2   |    1       |
New York    |    3   |    2       |
New York    |    4   |    3       |
New York    |    5   |    4       |
New York    |    6   |    5       |
New York    |    7   |    6       |  GROUP 2
New York    |    8   |    7       |
New York    |    9   |    8       |
New York    |    10  |    9       |
New York    |    11  |    5       |
New York    |    12  |    0       |

The result would be:

city_name   |  sum(temperature)|  count(month)
----------------------------------------------
Los Angeles |   170            |     12
New York    |    50            |     12
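This result can be reproduced in a few lines of Java using a hash map keyed on city_name. This is an illustrative sketch of the bookkeeping, not of database internals:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupByExample {

    // rows of the temp_data table: {city_name, month, temperature}
    static final Object[][] TEMP_DATA = {
        {"Los Angeles", 1, 10}, {"Los Angeles", 2, 11}, {"Los Angeles", 3, 12},
        {"Los Angeles", 4, 13}, {"Los Angeles", 5, 14}, {"Los Angeles", 6, 15},
        {"Los Angeles", 7, 16}, {"Los Angeles", 8, 17}, {"Los Angeles", 9, 18},
        {"Los Angeles", 10, 19}, {"Los Angeles", 11, 15}, {"Los Angeles", 12, 10},
        {"New York", 1, 0}, {"New York", 2, 1}, {"New York", 3, 2},
        {"New York", 4, 3}, {"New York", 5, 4}, {"New York", 6, 5},
        {"New York", 7, 6}, {"New York", 8, 7}, {"New York", 9, 8},
        {"New York", 10, 9}, {"New York", 11, 5}, {"New York", 12, 0},
    };

    // running totals per group: city_name -> {sum(temperature), count(month)}
    static Map<String, double[]> groupBy() {
        Map<String, double[]> groups = new LinkedHashMap<>();
        for (Object[] row : TEMP_DATA) {
            String city = (String) row[0];
            int temperature = (Integer) row[2];
            double[] totals = groups.computeIfAbsent(city, k -> new double[2]);
            totals[0] += temperature;  // sum(temperature)
            totals[1] += 1;            // count(month)
        }
        return groups;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> e : groupBy().entrySet()) {
            System.out.println(e.getKey() + " | " + e.getValue()[0]
                    + " | " + (int) e.getValue()[1]);
        }
    }
}
```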

The Aggregate functions

Aggregate functions start with an initial value. Each call then combines two pieces of information - the running result so far (the initial value on the first call) and the new piece of data from the database.

// simple class that does a running sum.
// starts off with sum=0
// each time a new value is added, the current sum is increased
public class aggregate_sum
{
   static double initial_value = 0;

   double current_value ;

  public aggregate_sum()
  {
       current_value = initial_value;
  }

  public void receiveDatum(Object fromDB)
  {
     current_value = current_value + ((Double) fromDB).doubleValue();
  }
   
   public double getResult()
   {
      return current_value;
   }
}

// simple class that does a running count
// it ignores the input data - and just computes the number of times the receiveDatum() function is called.
public class aggregate_count
{
   static int initial_value = 0;

   int current_value ;

  public aggregate_count()
  {
       current_value = initial_value;
  }

  public void receiveDatum(Object fromDB)
  {
     current_value = current_value + 1;
  }
   
   public int getResult()
   {
      return current_value;
   }
}

//simple implementation class for unioning polygons together.
// actual implementation needs to be more complex!
public class aggregate_geom_union
{
   ArrayList current_value ;

  public aggregate_geom_union()
  {
       // each instance starts with its own empty list; sharing a single
       // initial ArrayList across instances would let them interfere
       current_value = new ArrayList();
  }

  public void receiveDatum(Object fromDB)
  {
      current_value.add( (Geometry) fromDB);
  }
   
   //NOTE: we are doing this like com.vividsolutions.jtsexample.technique.UnionUsingBuffer
   //rather than just repeatedly calling union() because this method is more robust and faster. 
   // See the JTS javadocs.
   // http://www.jump-project.org/docs/jts/1.5/api/com/vividsolutions/jtsexample/technique/UnionUsingBuffer.html
   // NOTE:
   //        THIS ONLY WORKS FOR POLYGONS!  USE A DIFFERENT TECHNIQUE FOR LINES, POINTS, OR MIXED TYPES
   public Geometry getResult()
   {
      Geometry[] geoms = (Geometry[]) current_value.toArray( new Geometry[current_value.size()]);
      GeometryFactory fact = geoms[0].getFactory();
      Geometry geomColl = fact.createGeometryCollection(geoms);
      Geometry union = geomColl.buffer(0.0);
      return union;
   }
}

The basic operation of all these is very simple:
1. start with the initial value
2. pump a single piece of data (or row) from the database to the function
3. when the set is finished, call the getResult() method.
4. if there are more sets to process, return to step #1

Most aggregate functions are streamable - they take a constant amount of memory, carrying only a small running value from row to row.
The geometry union example is a bit different - it caches all the data values and then performs the actual union operation at the end. This can waste memory but, in general, the union result is "about" the same size as the inputs, so it is not a great waste.

Database operation

The database will execute the query as normal, and then "group together" all the rows that have the same values in the GROUP BY clause's columns.

The database typically does this by first sorting the dataset on the GROUP BY column (this can be done in a constant amount of memory with complex techniques). It then streams the result set through the aggregate functions. When it sees the GROUP BY column's value change it gets the result from the aggregate functions, resets them, outputs a result row, and then continues streaming.

If the aggregate functions can be done in a constant amount of memory, then this entire process can be done in a constant amount of memory.
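The sort-then-stream loop described above can be sketched in Java. This is a simplified model: rows are plain key/value pairs already sorted on the key, and the aggregates are a running sum and count rather than pluggable classes:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedGroupBy {

    // one input row: a group key plus the value to aggregate
    record Row(String key, double value) {}

    // one output row: the group key plus sum and count for that group
    record ResultRow(String key, double sum, long count) {}

    // Streams rows (already sorted on key) through running sum/count
    // aggregates, emitting a result row each time the key changes.
    static List<ResultRow> run(List<Row> sortedRows) {
        List<ResultRow> results = new ArrayList<>();
        String currentKey = null;
        double sum = 0;   // aggregate state: running sum
        long count = 0;   // aggregate state: running count
        for (Row row : sortedRows) {
            if (currentKey != null && !currentKey.equals(row.key())) {
                // key changed: emit a result row and reset the aggregates
                results.add(new ResultRow(currentKey, sum, count));
                sum = 0;
                count = 0;
            }
            currentKey = row.key();
            sum += row.value();
            count++;
        }
        if (currentKey != null) {
            results.add(new ResultRow(currentKey, sum, count)); // final group
        }
        return results;
    }
}
```

Only the current group's running values are held in memory, which is why the whole process stays constant-memory when the input arrives sorted.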

Simple Implementation for Geotools

If datastores could return result sets in sorted order, we could use the same technique as the database. Unfortunately, sorting in a constant amount of memory is difficult, so I'm proposing something much simpler - just use a hashtable, keyed by the GROUP BY column, that links each key to a Collection of Features.

You could then just stream each Collection of Features through the aggregation function.

NOTE: if there's no GROUP BY clause (i.e. one group containing all the features returned by the datastore), a hashtable is not required - you can stream directly to the aggregation function and produce a single result row.
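A minimal sketch of the hashtable approach, with a Feature stood in for by a plain Map of attribute names to values (this sketch deliberately avoids the real GeoTools Feature API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashGroupBy {

    // Partition the features into one Collection per distinct value of
    // the GROUP BY attribute; each collection can then be streamed
    // through the aggregate functions independently.
    static Map<Object, List<Map<String, Object>>> group(
            List<Map<String, Object>> features, String groupByAttribute) {
        Map<Object, List<Map<String, Object>>> groups = new HashMap<>();
        for (Map<String, Object> feature : features) {
            Object key = feature.get(groupByAttribute);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(feature);
        }
        return groups;
    }
}
```

Unlike the sorted approach, this holds every feature in memory at once, which is what motivates the feature-count limit discussed later.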

More details of implementation

Let's look at these two very similar queries:

// make a group for each river_name
SELECT aggregate_union(the_geom) as bigGeom, sum(nStudies) as total_studies, river_name as river_name
FROM myRiverPolygons
GROUP BY river_name;

// union ALL the rivers together
// this is streamable w/o a hash table!
SELECT aggregate_union(the_geom) as bigGeom, sum(nStudies) as total_studies
FROM myRiverPolygons;

The resulting DataStore for the first query would basically store this information:

  • link to the wrapped DataStore
  • column to group by ("river_name")
  • And the FeatureType definitions:
        * "bigGeom" (Geometry) - based on base column "the_geom" and the aggregate class called "aggregate_union"
        * "total_studies" (number) - based on base column "nStudies" and the aggregate class called "sum"
        * "river_name" (string) - based on the GROUP BY column "river_name"
    

The implementation is simple - retrieve results from the wrapped query and build a hashtable linking river_name to a set of features. For each key in the hashtable, the collection of features is then collapsed into a single output feature. This collapsing is very simple, as outlined in this pseudo-code:

ForEach group in the hashtable, grab the collection of features and:
{
    Feature  resultRow = <new feature>
    Iterator featureIterator = <iterator for a single GROUP>
    <initialize aggregate_union(), null() and sum()>
    while (featureIterator.hasNext())
    {
       Feature baseFeature = (Feature) featureIterator.next();
       <get "the_geom" from baseFeature and pass to aggregate_union()>
       <get "nStudies" from baseFeature and pass to sum()>
       <get "river_name" from baseFeature and pass to the null() aggregate>
    }
    set "bigGeom", "total_studies", and "river_name" in the resultRow
     by calling getResult() from the above 3 aggregates.
}
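The pseudo-code above can be made concrete in plain Java. Here a Feature is again stood in for by a Map, and the geometry union is left out so the sketch stays dependency-free; the attribute names match the first query:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollapseGroup {

    // The "null()" aggregate: remembers the first value it is given,
    // since every row in a group carries the same GROUP BY value.
    static class AggregateNull {
        private Object value;
        private boolean seen = false;

        void receiveDatum(Object fromDB) {
            if (!seen) { value = fromDB; seen = true; }
        }

        Object getResult() { return value; }
    }

    // Collapse one group of features into a single result row by
    // streaming each feature through the aggregates.
    static Map<String, Object> collapse(List<Map<String, Object>> group) {
        double totalStudies = 0;                       // stands in for sum(nStudies)
        AggregateNull riverName = new AggregateNull(); // null() on the GROUP BY column
        for (Map<String, Object> feature : group) {
            totalStudies += ((Number) feature.get("nStudies")).doubleValue();
            riverName.receiveDatum(feature.get("river_name"));
        }
        Map<String, Object> resultRow = new HashMap<>();
        resultRow.put("total_studies", totalStudies);
        resultRow.put("river_name", riverName.getResult());
        return resultRow;
    }
}
```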

The "null()" aggregate is very simple - it just remember the first value passed to it (since they will always be the same for rows in a group).

The second query is evaluated in the same way, but is a bit simpler - no hash table is required. We can stream directly from the wrapped datastore!

NOTE: it's possible for an aggregate to pull information from more than one column, but I think this is rare and can be added at a later date.

NOTE: FID generation is undefined - the resulting rows don't really exist. You could use any FID from the group's input features, or just return a unique code.

Query rewriting/Efficiency

Normally filtering will need to be completely performed by the Aggregate virtual DataStore, but there are two types of Filters that should be passed on to the wrapped datastore: Filters on the GROUP BY clause, and Bounding Box filters.

Here's an example of a query with a WHERE clause that references the GROUP_BY column:

// just return the summary information for 'Fraser River'
SELECT aggregate_union(the_geom) as bigGeom, sum(nStudies) as total_studies, river_name as river_name
FROM myRiverPolygons
WHERE river_name = 'Fraser River'
GROUP BY river_name;

Since we're putting a filter on the GROUP BY column ("river_name"), we should pass it off to the wrapped DataStore to limit the number of rows the virtual DataStore has to process. If we do not, we would generate summary information for ALL rivers and then filter the results - a lot of unnecessary work!
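The rewriting decision can be sketched as a simple test: a filter may be handed to the wrapped datastore only when every attribute it references is a GROUP BY column, because such a filter discards whole groups without changing any surviving group's result. The Filter type here is an illustrative stand-in, not the GeoTools Filter API:

```java
import java.util.Set;

public class FilterPushdown {

    // Stand-in for a filter: just the set of attribute names it references.
    record Filter(Set<String> referencedAttributes) {}

    // True when the filter touches only GROUP BY columns and can
    // therefore be evaluated by the wrapped datastore before grouping.
    static boolean canPushDown(Filter filter, Set<String> groupByColumns) {
        return groupByColumns.containsAll(filter.referencedAttributes());
    }
}
```

A real implementation would walk the filter expression tree to collect the referenced attributes, but the pushdown rule itself is this containment check.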

Geometric Filtering is a bit more complex - so the user should have two options:

  • Filter by Bounding Box AFTER aggregation
  • Filter by Bounding Box BEFORE aggregation (ie. pass to wrapped datastore).

The first is more correct - but extremely inefficient. You would compute the aggregation over the ENTIRE dataset (!!), then pull out just the portions that overlap your bounding box.

The second isn't quite correct - but it is efficient. The base datastore performs the bounding box filter FIRST, and the result is then passed in for aggregation. This can cause unexpected results. For example, in the queries above, only the segments in the underlying datastore that fall inside the bounding box are sent to the aggregator. This means that segments outside the bounding box will not be counted in the sum(nStudies) aggregation!

Since the memory requirements for aggregation can be large, the aggregation datastore should be given a maximum number of features to pull from the underlying datastore. This would be user-configurable.

Sample Aggregate Functions

Geometry Aggregates:

  • geometry union (union of all the geometries)
  • geometry intersection (where all the geometries overlap)
  • geometry convexhull (convex hull of all the input geometries)

The non-geometric aggregates are very simple:

  • sum (add all the input numbers)
  • count (number of rows in a set)
  • avg (average of the input numbers - implemented by computing sum and count at the same time)
  • catenate (concatenate all the input strings)
  • min (minimum value for a set of numbers)
  • max (maximum value for a set of numbers)
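These numeric aggregates follow the same receiveDatum()/getResult() shape as the classes above; for example, avg can be implemented by running a sum and a count at the same time:

```java
// simple class that does a running average
// it keeps a running sum and a running count, and divides them at the end
public class AggregateAvg {

    double sum = 0;   // running sum of the inputs
    long count = 0;   // running count of the inputs

    public void receiveDatum(Object fromDB) {
        sum += ((Number) fromDB).doubleValue();
        count++;
    }

    // avg = sum / count, computed from the two running aggregates
    public double getResult() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```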

NOTE: the geometric aggregates can be quite complex because JTS does not handle GeometryCollections well for union and intersection. These problems can be worked around.

Use Cases

There are not as many use cases for this type of virtual datastore, but it is essential for producing good WMS maps.

  • summaries of features
  • nice maps

See also

Oracle Spatial Aggregates

Document generated by Confluence on May 14, 2014 23:00