This page last changed on Mar 15, 2013 by jdeolive.

High level design discussion regarding extended integration of processing into the core of GeoServer.

Motivation

Processing in GeoServer is for the most part a standalone subsystem exposed through the WPS service. Rendering transformations provided an initial peek at what integrating processing with other services, in this case WMS, looks like. While rendering transformations provide a great way to visualize the result of a process, it is possible to go even further and integrate processing with other services such as WFS and WCS.

A deeper integration also makes it possible to expose processing functionality through a nicer user interface, for example by defining a rendering transformation directly from the GeoServer UI rather than editing an SLD file manually.

Example Use Cases

Tiling Global Transformations

Many types of processing involve working on a "global scale", requiring access to a complete dataset in order to function. This is at odds with tiled WMS, which breaks up an area into many small requests. As an example consider heat map generation: constraining the data input to data intersecting a single tile does not make sense. Another example is routing.

Compound Processes

Defining a new process can currently be done only programmatically (whether through straight Java or via script). However, once a large base of processes has been established, new processes could be created simply by composing a number of the base ones. That would ease the creation of new processes and would give GeoServer a good platform to respond to the processing needs of most users. With the processing part of GeoServer more integrated, and tools to easily define new processes, base processes would become more useful.

Design

Overview

In general the design boils down to adding a new type of resource to the catalog model, namely one that models a process. With that in place, existing OGC services such as WMS, WFS, etc. will be updated to handle the new resource type.

Catalog Integration

The general idea here is to create a new type of ResourceInfo, named ProcessInfo, that models a process execution. And because the catalog model requires any resource to be part of a store, we need a ProcessStoreInfo as well. The following simplified class diagram illustrates where the new classes fit in.

The ResourceInfo interface models a "typical" spatial dataset, that is, some spatial format containing data that is relatively static. The addition of ProcessInfo is interesting because it is a derived artifact. So the first question to answer is what exactly a ProcessInfo represents. It is really a "process execution" including the data, rather than a "process" as defined by GeoTools, which is strictly just the algorithm itself. So with that we come up with a ProcessInfo being:

  • The GeoTools process / the algorithm that does the processing
  • The input data the process works on
  • The output data the process produces
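The three parts listed above could be modeled roughly as follows. This is only a minimal sketch under stated assumptions: StoreInfo and ResourceInfo are reduced to simplified stand-ins (the real catalog interfaces carry many more members), and the accessors getProcessName, getInputs and getOutput, as well as the "data" input and the layer name, are hypothetical names chosen for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical stand-ins for the catalog types discussed above. */
final class CatalogSketch {

    /** Stand-in for the catalog's store concept: every resource belongs to a store. */
    interface StoreInfo {
        String getName();
    }

    /** Stand-in for ResourceInfo, reduced to the two attributes used here. */
    interface ResourceInfo {
        String getName();
        StoreInfo getStore();
    }

    /** Proposed store type; what it should represent is still an open question. */
    static final class ProcessStoreInfo implements StoreInfo {
        private final String name;

        ProcessStoreInfo(String name) {
            this.name = name;
        }

        @Override
        public String getName() {
            return name;
        }
    }

    /** A "process execution": the algorithm plus its input and output data. */
    static final class ProcessInfo implements ResourceInfo {
        private final String name;
        private final ProcessStoreInfo store;
        private final String processName;         // the GeoTools process to run
        private final Map<String, Object> inputs; // input name -> value or layer reference
        private final String output;              // name of the output that gets published

        ProcessInfo(String name, ProcessStoreInfo store, String processName,
                    Map<String, Object> inputs, String output) {
            this.name = name;
            this.store = store;
            this.processName = processName;
            // Defensive copy so later changes to the caller's map do not leak in.
            this.inputs = Collections.unmodifiableMap(new LinkedHashMap<>(inputs));
            this.output = output;
        }

        @Override public String getName() { return name; }
        @Override public StoreInfo getStore() { return store; }
        String getProcessName() { return processName; }
        Map<String, Object> getInputs() { return inputs; }
        String getOutput() { return output; }
    }

    public static void main(String[] args) {
        Map<String, Object> inputs = new LinkedHashMap<>();
        inputs.put("data", "foo:observations"); // illustrative input name and layer
        ProcessInfo info = new ProcessInfo(
                "MyHeatmap", new ProcessStoreInfo("processes"), "gs:Heatmap", inputs, "result");
        System.out.println(info.getName() + " runs " + info.getProcessName()
                + " in store " + info.getStore().getName());
    }
}
```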

ResourceInfo Metadata

With that loose definition it is useful to examine the attributes of the ResourceInfo interface and analyze how they apply to the process idea.

Much of the interface is about publishing metadata, much of which is specified by the user when configuring the resource. This includes:

  • name
  • title
  • abstract
  • description
  • keywords

All of which is generally uninteresting and applies to ProcessInfo.

Moving on we reach the "native name". For other types of resources this attribute has referred to the name of the resource as known by the GeoTools object that encapsulates it. If we draw a parallel for a process, it would refer to the name of the process as known by GeoTools, which doesn't really make sense since a ProcessInfo includes the input and output data as well. We also don't want to rule out a ProcessInfo being a chain of individual GeoTools processes.

An alternative interpretation would be the native name of the resource that encapsulates the input. This also doesn't seem to fit especially since such a name may not exist depending on the process inputs. For now we table this.

The next set of ResourceInfo attributes deal with the spatial metadata. This includes:

  • native crs
  • native bounding box
  • declared crs
  • declared bounding box
  • projection policy

With other resource types this metadata is used to signify the crs/bounding box of the native data (which might be unknown) vs what we want to publish it as. One potential mapping to ProcessInfo could be to have the native crs/bbox correspond to that of the input data, and the declared crs/bbox correspond to the output data.

An alternative is to have both native and declared crs/bbox correspond to the output data, since that is what is published.

ProcessInfo Specifics

Given our process definition we examine what is needed specifically in the ProcessInfo interface. At a minimum we need to:

  • define the process chain, including process name(s) and inputs
  • provide information about how to publish results of the process
  • specify how the process should execute

With that in mind consider the following class diagram that describes the makeup of ProcessInfo:

Examining all the collaborators we start with the process property which is of type ExecuteType, defined in the GeoTools WPS object model. This property basically describes the process execution in exactly the same way it is described in a WPS request. The class is composed of basically the name of a process to run, and its inputs.

Second we consider the execMode property. The idea here is to provide different methods for how the process will execute. Here we present three options:

  1. On Demand - The process is executed on the fly each time it is accessed by a service
  2. Cached - The process is executed on demand only when the results of the previous run have become invalid (i.e. "dirty")
  3. Scheduled - The process runs at a defined interval and at any given time a service accessing the process receives the most recent result

These are only a few possibilities; there are many more, making the ProcessExec class a good candidate for an extension point.
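The three execution modes could be sketched as interchangeable strategies, roughly as below. The shape of ProcessExec is an assumption made for illustration (the page only names the class); the process itself is reduced to a Supplier, and the invalidate, refresh and start methods are hypothetical.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of the execution modes as pluggable strategies. */
final class ExecModeSketch {

    /** Assumed extension point: decides when the wrapped process actually runs. */
    interface ProcessExec<T> {
        T getResult();
    }

    /** On Demand: run the process on every access. */
    static final class OnDemand<T> implements ProcessExec<T> {
        private final Supplier<T> process;

        OnDemand(Supplier<T> process) {
            this.process = process;
        }

        @Override
        public T getResult() {
            return process.get();
        }
    }

    /** Cached: run only when there is no result yet or the last one is dirty. */
    static final class Cached<T> implements ProcessExec<T> {
        private final Supplier<T> process;
        private T result;
        private boolean dirty = true;

        Cached(Supplier<T> process) {
            this.process = process;
        }

        @Override
        public synchronized T getResult() {
            if (dirty) {
                result = process.get();
                dirty = false;
            }
            return result;
        }

        /** Marks the cached result dirty so the next access re-runs the process. */
        synchronized void invalidate() {
            dirty = true;
        }
    }

    /** Scheduled: a timer refreshes the result; readers get the most recent one. */
    static final class Scheduled<T> implements ProcessExec<T> {
        private final Supplier<T> process;
        private volatile T latest;

        Scheduled(Supplier<T> process) {
            this.process = process;
            refresh(); // compute an initial result before any reader asks for one
        }

        void refresh() {
            latest = process.get();
        }

        /** Re-runs the process at a fixed interval on the given timer. */
        void start(ScheduledExecutorService timer, long period, TimeUnit unit) {
            timer.scheduleAtFixedRate(this::refresh, period, period, unit);
        }

        @Override
        public T getResult() {
            return latest;
        }
    }

    public static void main(String[] args) {
        int[] runs = {0};
        Cached<Integer> cached = new Cached<>(() -> ++runs[0]);
        cached.getResult();
        cached.getResult();  // served from the cache, no second run
        cached.invalidate();
        cached.getResult();  // re-executed after invalidation
        System.out.println("process ran " + runs[0] + " times");
    }
}
```

Keeping the mode behind a single interface is what would make it a natural extension point: a service only ever calls getResult and never needs to know which strategy is configured.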

Service Integration

The second half of the effort is to integrate the new ProcessInfo interface into the existing OGC services, which isn't exactly a trivial effort. Here we describe a possible approach for all the existing OGC services.

The basic idea is to make the integration as seamless as possible for the WCS, WMS, and WFS services. To do this we make it possible for FeatureTypeInfo and CoverageInfo to reference a ProcessInfo.

When this link is present the system will know that, rather than returning the underlying data as is, which is normally the case, it must first execute the referenced process. This logic lives in the ResourcePool class, whose main job is to do the work of loading the underlying data from a configured GeoServer resource. The following simplified sequence diagram shows the basic interaction.

This approach follows the same pattern as "SQL Layers" in GeoServer. In that case, rather than just loading data from an underlying table, the system must first perform a configured query and load the data from that query. The processing case is inherently more complex, but the idea is the same.
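The interaction described above could look roughly like this. It is a sketch only: the classes are simplified stand-ins that borrow the names used on this page, features are reduced to plain strings, and the getFeatures and execute methods are hypothetical rather than actual GeoServer API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-ins; features are plain strings to keep the sketch short. */
final class ResourcePoolSketch {

    /** A configured process execution; running it yields the derived data. */
    static final class ProcessInfo {
        private final Supplier<List<String>> execution;

        ProcessInfo(Supplier<List<String>> execution) {
            this.execution = execution;
        }

        List<String> execute() {
            return execution.get();
        }
    }

    /** Stand-in for FeatureTypeInfo; the optional process link is the proposed addition. */
    static final class FeatureTypeInfo {
        private final List<String> nativeData; // stands in for the underlying store
        private final ProcessInfo process;     // null when the data is published as is

        FeatureTypeInfo(List<String> nativeData, ProcessInfo process) {
            this.nativeData = nativeData;
            this.process = process;
        }
    }

    /** Stand-in for ResourcePool: the one place that knows about the link. */
    static final class ResourcePool {
        List<String> getFeatures(FeatureTypeInfo info) {
            if (info.process != null) {
                // Linked layer: execute the process and serve its result.
                return info.process.execute();
            }
            // Plain layer: serve the underlying data unchanged.
            return info.nativeData;
        }
    }

    public static void main(String[] args) {
        List<String> roads = Arrays.asList("road-1", "road-2");
        ResourcePool pool = new ResourcePool();
        FeatureTypeInfo plain = new FeatureTypeInfo(roads, null);
        FeatureTypeInfo derived = new FeatureTypeInfo(roads,
                new ProcessInfo(() -> Arrays.asList("buffered-road-1", "buffered-road-2")));
        System.out.println(pool.getFeatures(plain));
        System.out.println(pool.getFeatures(derived));
    }
}
```

Because the branch sits in the pool, the services calling getFeatures would not need to change, which is the transparency the approach is after.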

WMS, WFS, WCS

In a perfect world, following the approach above, the system works completely transparently as far as the existing WMS, WFS, and WCS services are concerned. There are, however, some issues with this approach. For instance, depending on the inputs of a process, the structure of the outputs may change. One possible case is a process that returns a feature collection with different attributes. This may cause issues for WFS and WCS, which usually require a relatively stable dataset schema. In cases like these it will be the job of the user configuring the process layer to ensure that the schema remains stable enough for any WFS filtering or WMS styling requirements.

WPS

The WPS service is an interesting one because the thing we care about (the process) is not something we configure in the catalog (well, at least not until this effort is complete). Other than those processes made unavailable through filters, a client has access to all GeoTools processes.

A ProcessInfo object can be viewed as a "predefined execution" meaning that part of its configuration is the same information that a WPS request would specify: a process to execute and some inputs to it.

This is best explained by an example. Consider the "gs:Heatmap" process. Typically, to invoke it a client specifies that they want to execute "gs:Heatmap" and supplies all the necessary inputs for it. However, let's say the user has configured a ProcessInfo object named "foo:MyHeatmap" (foo being a workspace name). Since the process inputs are already defined, the user is able to execute the process "foo:MyHeatmap" without any inputs, or perhaps supplying only a subset of them.
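The "foo:MyHeatmap" example could be resolved along these lines. This is a sketch under assumptions: the input names ("data", "radiusPixels") and the layer name are illustrative, resolveInputs is a hypothetical helper, and the rule that inputs supplied in the request override the preconfigured ones is one possible policy rather than something this page settles.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a "predefined execution" being invoked through WPS. */
final class PredefinedExecutionSketch {

    /** Stand-in for a configured ProcessInfo: a raw process plus preset inputs. */
    static final class ProcessInfo {
        final String name;                      // e.g. "foo:MyHeatmap"
        final String processName;               // e.g. "gs:Heatmap"
        final Map<String, Object> presetInputs; // inputs fixed at configuration time

        ProcessInfo(String name, String processName, Map<String, Object> presetInputs) {
            this.name = name;
            this.processName = processName;
            this.presetInputs = Collections.unmodifiableMap(new LinkedHashMap<>(presetInputs));
        }
    }

    /**
     * Builds the inputs handed to the raw process: presets first, then whatever
     * the request supplies (assumed here to take precedence over the presets).
     */
    static Map<String, Object> resolveInputs(ProcessInfo info, Map<String, Object> requestInputs) {
        Map<String, Object> resolved = new LinkedHashMap<>(info.presetInputs);
        resolved.putAll(requestInputs);
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Object> presets = new LinkedHashMap<>();
        presets.put("data", "foo:observations"); // illustrative layer reference
        presets.put("radiusPixels", 10);         // illustrative input name
        ProcessInfo myHeatmap = new ProcessInfo("foo:MyHeatmap", "gs:Heatmap", presets);

        // Executed without inputs: everything comes from the configuration.
        Map<String, Object> noInputs = new LinkedHashMap<String, Object>();
        System.out.println(myHeatmap.processName + " " + resolveInputs(myHeatmap, noInputs));

        // Executed with a subset of inputs: the request fills in or overrides.
        Map<String, Object> request = new LinkedHashMap<>();
        request.put("radiusPixels", 25);
        System.out.println(myHeatmap.processName + " " + resolveInputs(myHeatmap, request));
    }
}
```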

So the work here would be to update the WPS operations to also consider ProcessInfo objects in addition to the "raw processes".

This would allow for some interesting possibilities, such as defining simplified processes. Many processes have a number of parameters that the user may not care about (at least not at first). By wrapping them with a ProcessInfo object containing default values for some inputs, the process is made more accessible to that type of user.

This kind of approach makes it possible to create several layers of processes, with the processing power and flexibility in the "raw" processes, and modified custom "versions" of them exposed instead. This is also easy to see with an example. The "gs:Contour" process has one input to define a set of contour levels and another to define an interval, and only one of them is used. This is error-prone, might be difficult for some users to understand, and makes it harder to create a robust UI, since automatic ones that create the input components on the fly (like the current WPS request builder) cannot take that into account. By wrapping this process in two different ones ("contour from levels" and "contour from intervals"), a more robust and user-friendly alternative can be exposed, while keeping the real processing in the same process and without having to do any coding. A similar approach can be found in desktop processing applications such as QGIS, which exposes GRASS modules this way, splitting them into several when the GRASS command-line syntax is too complex to be handled in the UI.
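The two contour wrappers could be expressed as restrictions on which inputs of the raw process are exposed, for instance as sketched below. The WrappedProcess class, the wrapper names and the input names ("data", "levels", "interval") are all hypothetical and only serve to illustrate the idea.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: wrappers that expose only part of a raw process's inputs. */
final class WrappedProcessSketch {

    static final class WrappedProcess {
        final String name;               // name the wrapper is published under
        final String rawProcess;         // process that does the real work
        final Set<String> exposedInputs; // the only inputs a client may supply

        WrappedProcess(String name, String rawProcess, String... exposedInputs) {
            this.name = name;
            this.rawProcess = rawProcess;
            this.exposedInputs = Collections.unmodifiableSet(
                    new LinkedHashSet<>(Arrays.asList(exposedInputs)));
        }

        /** Rejects inputs the wrapper does not expose, then passes the rest through. */
        Map<String, Object> toRawInputs(Map<String, Object> request) {
            for (String key : request.keySet()) {
                if (!exposedInputs.contains(key)) {
                    throw new IllegalArgumentException(
                            name + " does not accept input '" + key + "'");
                }
            }
            return new LinkedHashMap<>(request);
        }
    }

    // Two wrappers around the same raw process, each exposing one of the two
    // mutually exclusive inputs (names are illustrative).
    static final WrappedProcess FROM_LEVELS =
            new WrappedProcess("foo:ContourFromLevels", "gs:Contour", "data", "levels");
    static final WrappedProcess FROM_INTERVAL =
            new WrappedProcess("foo:ContourFromInterval", "gs:Contour", "data", "interval");

    public static void main(String[] args) {
        Map<String, Object> request = new LinkedHashMap<>();
        request.put("data", "foo:dem");
        request.put("levels", Arrays.asList(100, 200, 300));
        System.out.println(FROM_LEVELS.toRawInputs(request));
        try {
            FROM_INTERVAL.toRawInputs(request);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With each wrapper advertising a single, unambiguous set of inputs, an automatically generated UI would have nothing left to disambiguate.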

An alternative approach for compound and wrapped processes.

With the above definitions, two main cases can be found: a ProcessInfo object has all its elements completed (a process, its inputs, etc.), or some of them are missing (as in the case of a wrapped process explained previously). In this second case, the ProcessInfo does not have enough information to be used by the WMS, WCS and WFS services, since it lacks inputs and cannot produce any output to be served by them. In the first case, since all inputs are filled, the process itself is of little interest for the WPS service, as it cannot be reused with other input values.

For these reasons, it might be interesting to divide this effort into two parts:

  • Creating the ProcessInfo structure defined above, but limited to a single process with all its inputs filled. This defines a new data store that produces its data from a process, and is to be used with the WMS, WCS and WFS services.
  • Creating an interface (and the necessary elements underneath) for easy creation of compound or simplified processes that can later be published as new ones (basically, a way of generating new processes without coding, accessible to non-programmers) and used by a ProcessInfo. This is targeted at the WPS service, as it defines a way of extending its capabilities. This would use a different class than ProcessInfo, which should extend the Process class.

This approach should not involve more work, just a different organization and probably a better structure. Moreover, the second part above can be started when the first one is finished, as they are completely independent. WPS processes would already be integrated and process-based layers could be published. The second part would extend the possibilities of the WPS service, and enhance all the other services along the way, since they would now be linked with the WPS one through the ProcessInfo class.

User Interface

Configuring ProcessInfo objects will require a UI. The first inclination would be to poach from the WPS request builder but I think this is an opportunity to revisit that design.

The following are some ideas for improvement.

  • Improvements to the current UI for defining a single process call. The current interface has some issues that should be solved (lack of checking, inability to select multiple input values when maxOccurs > 1, etc.). Some work on this has already been done and is ready to be committed, but further polishing might be needed.
  • Removal of redundant or unnecessary elements. Output format is not needed if the process is part of a compound process and not a final output. This could also be the case even if it is final, if that choice should be left for each execution of the process and not be part of the ProcessInfo object.
  • Workflow definition in a "downstream" fashion rather than the "upstream" fashion of the current WPS request builder. This is a more natural way of defining compound processes, especially simple linear ones, which are the ones most likely to be used for rendering transformations, because processes are added in the same order as they are executed. A good starting point would be a simplified interface for those linear, unbranched workflows. Although more limited, an interface like that would make it easy to handle a large fraction of the most common cases.
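A "downstream" definition of a linear, unbranched workflow could be modeled as a list of steps appended in execution order, roughly as sketched here. The LinearWorkflow class, its then and execute methods and the step names are hypothetical, and the steps work on plain lists of numbers rather than real GeoTools processes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Hypothetical sketch of a linear workflow defined in execution order. */
final class LinearWorkflowSketch {

    static final class LinearWorkflow<T> {
        private final List<String> names = new ArrayList<>();
        private final List<UnaryOperator<T>> steps = new ArrayList<>();

        /** Appends a step; steps run in the order in which they were added. */
        LinearWorkflow<T> then(String name, UnaryOperator<T> step) {
            names.add(name);
            steps.add(step);
            return this;
        }

        /** Feeds the output of each step into the next one. */
        T execute(T input) {
            T current = input;
            for (UnaryOperator<T> step : steps) {
                current = step.apply(current);
            }
            return current;
        }

        List<String> stepNames() {
            return new ArrayList<>(names);
        }
    }

    public static void main(String[] args) {
        LinearWorkflow<List<Double>> workflow = new LinearWorkflow<List<Double>>()
                .then("dropNegative", values -> values.stream()
                        .filter(v -> v >= 0)
                        .collect(Collectors.toList()))
                .then("scale", values -> values.stream()
                        .map(v -> v * 10)
                        .collect(Collectors.toList()));
        System.out.println(workflow.stepNames());
        System.out.println(workflow.execute(Arrays.asList(-1.0, 0.5, 2.0)));
    }
}
```

An "upstream" definition would instead start from the final process and nest each earlier process as an input of the later one, which is why the order of definition and the order of execution diverge there.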

Open Questions

A list of questions and design decisions we are looking for feedback on.

ProcessStoreInfo - What would it represent?

JD: Would a process store map 1:1 to the idea of a GeoTools ProcessFactory? Or should it just serve as an arbitrary grouping mechanism?

Do we care about the additional dependencies?

JD: By integrating directly into the Catalog/Services we introduce dependencies on all of GeoTools processing modules.


Document generated by Confluence on May 14, 2014 23:00