GeoServer : Process Integration
This page last changed on Mar 15, 2013 by jdeolive.
High level design discussion regarding extended integration of processing into the core of GeoServer.
Processing in GeoServer is, for the most part, a standalone subsystem exposed through the WPS service. Rendering transformations provided an initial peek at what integrating processing with other services, in this case WMS, looks like. While rendering transformations provide a great way to visualize the result of a process, there is the possibility to go even further and integrate processing with other services such as WFS and WCS.
A deeper integration also offers the ability to provide a nicer user interface for exposing processing functionality, for example defining a rendering transformation directly from the GeoServer UI rather than having to edit an SLD file manually.
Many types of processing involve working on a "global scale", requiring access to a complete dataset in order to function. This conflicts with tiled WMS, which breaks an area up into many small requests. As an example consider heat map generation: constraining the data input to the data intersecting a single tile does not make sense. Another example is routing.
Defining a new process can currently be done only programmatically (whether in straight Java or via script). However, once a large base of processes has been established, new processes could be created simply by composing a number of the base ones. That would ease the creation of new processes and would give GeoServer a good platform to respond to the processing needs of most users. With the processing part of GeoServer more integrated, and tools to easily define new processes, the base processes would become more useful.
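The composition idea can be reduced to chaining: the output of one process feeds an input of the next, and the composed chain is itself a process. The following is a minimal sketch of that idea only; the process names, the use of `Function`, and the `ProcessChain` class are all illustrative and not GeoTools API.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.function.Function;

// Hypothetical sketch: a "process" reduced to a function from input to output,
// so that a composition of processes is itself a process.
public class ProcessChain {

    // A base "process": scale every value by a factor.
    static Function<List<Double>, List<Double>> scale(double factor) {
        return values -> values.stream()
                .map(v -> v * factor)
                .collect(Collectors.toList());
    }

    // Another base "process": keep only values at or above a threshold.
    static Function<List<Double>, List<Double>> threshold(double min) {
        return values -> values.stream()
                .filter(v -> v >= min)
                .collect(Collectors.toList());
    }

    // A "composed" process built purely from base ones, with no new code.
    static Function<List<Double>, List<Double>> scaleThenThreshold() {
        return scale(10.0).andThen(threshold(25.0));
    }

    public static void main(String[] args) {
        // 1.0 -> 10.0 (filtered), 2.0 -> 20.0 (filtered), 3.0 -> 30.0 (kept)
        System.out.println(scaleThenThreshold().apply(List.of(1.0, 2.0, 3.0)));
    }
}
```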
In general the design boils down to adding a new type of resource to the catalog model, namely one that models a process. With that in place, existing OGC services such as WMS, WFS, etc. will be updated to handle the new resource type.
The general idea here is to create a new type of ResourceInfo, named ProcessInfo, that models a process execution. And because the catalog model requires any resource to be part of a store, we need a ProcessStoreInfo as well. The following simplified class diagram illustrates where the new classes fit in.
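In code form, the relationship might look roughly like the sketch below. This is a hypothetical reduction: the parent interfaces are trimmed to a couple of methods, and everything beyond the `ProcessInfo`/`ProcessStoreInfo` names taken from the text (method names, the example values) is invented for illustration.

```java
// Hypothetical sketch of where ProcessInfo and ProcessStoreInfo could sit in
// the catalog model. Not the real GeoServer interfaces.
public class CatalogSketch {

    interface StoreInfo { String getName(); }

    interface ResourceInfo {
        String getName();
        StoreInfo getStore();
    }

    // Grouping mechanism for process executions, parallel to other store types.
    interface ProcessStoreInfo extends StoreInfo { }

    // A configured "process execution": the process to run plus its inputs.
    interface ProcessInfo extends ResourceInfo {
        String getProcessName();               // e.g. "gs:Heatmap"
        java.util.Map<String, Object> getInputs();
    }

    // Minimal anonymous implementation, for demonstration only.
    static ProcessInfo example() {
        ProcessStoreInfo store = () -> "processes";
        return new ProcessInfo() {
            public String getName() { return "MyHeatmap"; }
            public StoreInfo getStore() { return store; }
            public String getProcessName() { return "gs:Heatmap"; }
            public java.util.Map<String, Object> getInputs() {
                return java.util.Map.of("radius", 10);
            }
        };
    }

    public static void main(String[] args) {
        ProcessInfo p = example();
        System.out.println(p.getName() + " runs " + p.getProcessName());
    }
}
```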
The ResourceInfo interface models a "typical" spatial dataset: some spatial format containing data that is relatively static. The addition of ProcessInfo is interesting because it is a derived artifact. So the first question to answer is what exactly a ProcessInfo represents. It is really a "process execution", including the data, rather than a "process" as defined by GeoTools, which is strictly just the algorithm itself. With that we come up with a ProcessInfo being:
With that loose definition it is useful to examine the attributes of the ResourceInfo interface and analyze how they apply to the process idea.
Much of the interface is about publishing metadata, most of which is specified by the user when configuring the resource. This includes:
All of this is generally uninteresting and applies to ProcessInfo unchanged.
Moving on we reach the "native name". For other types of resources this attribute refers to the name of the resource as known by the GeoTools object that encapsulates it. Drawing the parallel for processes, it would refer to the name of the process as known by GeoTools. That doesn't really make sense, since a ProcessInfo includes the input and output data as well. We also don't want to rule out a ProcessInfo being a chain of individual GeoTools processes.
An alternative interpretation would be the native name of the resource that encapsulates the input. This also doesn't seem to fit, especially since such a name may not exist, depending on the process inputs. For now we table this.
The next set of ResourceInfo attributes deal with the spatial metadata. This includes:
With other resource types this metadata is used to signify the crs/bounding box of the native data (which might be unknown) vs what we want to publish it as. One potential mapping to ProcessInfo could be to have the native crs/bbox correspond to that of the input data, and the declared crs/bbox correspond to the output data.
An alternative is to have both native and declared crs/bbox correspond to the output data, since that is what is published.
Given our process definition, we examine what is needed specifically in the ProcessInfo interface. At a minimum we need to:
With that in mind, consider the following class diagram that describes the makeup of ProcessInfo:
Examining the collaborators, we start with the process property, which is of type ExecuteType, defined in the GeoTools WPS object model. This property describes the process execution in exactly the same way a WPS request does: essentially the name of a process to run and its inputs.
Second we consider the execMode property. The idea here is to provide different methods for how the process will execute. Here we present three options
These are only a few possibilities; there are many more, making the ProcessExec class a good candidate for an extension point.
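As an extension point, each execution mode could be a pluggable strategy behind a common interface. The sketch below is hypothetical: the `ProcessExec` name comes from the text, but the interface shape and the three example modes (inline, background, cached) are illustrative stand-ins, not the actual options proposed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch of ProcessExec as a strategy-style extension point.
public class ExecModeSketch {

    interface ProcessExec {
        <T> T run(Supplier<T> process) throws Exception;
    }

    // Run inline on the calling thread (e.g. per-request execution).
    static class SyncExec implements ProcessExec {
        public <T> T run(Supplier<T> process) { return process.get(); }
    }

    // Run on a background executor and wait for the result.
    static class AsyncExec implements ProcessExec {
        private final ExecutorService pool = Executors.newSingleThreadExecutor();
        public <T> T run(Supplier<T> process) throws Exception {
            return pool.submit(process::get).get();
        }
    }

    // Run once and reuse the result (placeholder for a cached/materialized mode).
    static class CachedExec implements ProcessExec {
        private Object cached;
        @SuppressWarnings("unchecked")
        public synchronized <T> T run(Supplier<T> process) {
            if (cached == null) cached = process.get();
            return (T) cached;
        }
    }

    public static void main(String[] args) throws Exception {
        ProcessExec exec = new CachedExec();
        System.out.println(exec.run(() -> "computed once"));
    }
}
```

New modes would plug in by implementing `ProcessExec`, which is what makes the class a natural extension point.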
The second half of the effort is to integrate the new ProcessInfo interface into the existing OGC services, which is not exactly a trivial effort. Here we describe a possible approach for each of the existing OGC services.
The basic idea is to make the integration as seamless as possible for the WCS, WMS, and WFS services. To do this we make it possible for FeatureTypeInfo and CoverageInfo to reference a ProcessInfo.
When this link is present the system will know that, rather than returning the underlying data as is, which is normally the case, the referenced process must first be executed. This logic lives in the ResourcePool class, whose main job is to load the underlying data for a configured GeoServer resource. The following simplified sequence diagram shows the basic interaction.
This approach follows the same pattern as "SQL Layers" in GeoServer. In that case, rather than just loading data from an underlying table, the system must first perform a configured query and load the data from that query. The processing case is inherently more complex, but the idea is the same.
Following the approach above, the system works completely transparently as far as the existing WMS, WFS, and WCS services are concerned. In a perfect world, that is; there are indeed some issues with this approach. For instance, depending on the inputs of a process, the structure of the outputs may change. One possible case is a process that returns a feature collection with different attributes. This may cause issues for WFS and WCS, which usually require a relatively stable dataset schema. In cases like these it will be the job of the user to configure the process layer so that the schema remains stable enough for any WFS filtering or WMS styling requirements.
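The ResourcePool interaction described above can be sketched as a single branch: if the layer configuration references a process, execute it; otherwise return the stored data. Everything here is a stand-in for the real catalog classes (`LayerConfig` for FeatureTypeInfo, a plain `Function` for the process), chosen only to make the control flow concrete.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of the ResourcePool branch for process-backed layers.
public class ResourcePoolSketch {

    // Stand-in for FeatureTypeInfo: raw data plus an optional process link.
    record LayerConfig(List<String> rawData,
                       Function<List<String>, List<String>> process) { }

    // Stand-in for ResourcePool loading the data of a configured resource.
    static List<String> getFeatures(LayerConfig cfg) {
        // Normal case: return the underlying data as is.
        if (cfg.process() == null) return cfg.rawData();
        // Process-backed case: execute the referenced process first.
        return cfg.process().apply(cfg.rawData());
    }

    public static void main(String[] args) {
        LayerConfig plain = new LayerConfig(List.of("a", "b"), null);
        LayerConfig derived = new LayerConfig(List.of("a", "b"),
                data -> data.stream()
                        .map(String::toUpperCase)
                        .collect(Collectors.toList()));
        System.out.println(getFeatures(plain));   // [a, b]
        System.out.println(getFeatures(derived)); // [A, B]
    }
}
```

The calling WMS/WFS/WCS code never sees the branch, which is what makes the integration transparent.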
The WPS service is an interesting one because the thing we care about (the process) is not something we configure in the catalog (at least not until this effort is complete). Other than those processes made unavailable through filters, a client has access to all GeoTools processes.
A ProcessInfo object can be viewed as a "predefined execution" meaning that part of its configuration is the same information that a WPS request would specify: a process to execute and some inputs to it.
This is best explained by an example. Consider the "gs:Heatmap" process. Typically, to invoke it, a client specifies that they want to execute "gs:Heatmap" and supplies all the necessary inputs for it. However, let's say the user has configured a ProcessInfo object named "foo:MyHeatmap" (foo being a workspace name). Since the process inputs are already defined, the user is able to execute the process "foo:MyHeatmap" without any inputs, or perhaps supplying only a subset of them.
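One plausible way to resolve the inputs of such a predefined execution is a simple merge: the values stored on the ProcessInfo act as defaults, and anything supplied in the WPS request overrides them. The sketch below is an assumption about that behavior; the method name and parameter values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of input resolution for a "predefined execution".
public class PredefinedExecution {

    static Map<String, Object> resolveInputs(Map<String, Object> stored,
                                             Map<String, Object> fromRequest) {
        Map<String, Object> merged = new HashMap<>(stored);
        merged.putAll(fromRequest); // request values win over stored defaults
        return merged;
    }

    public static void main(String[] args) {
        // foo:MyHeatmap stores its inputs at configuration time...
        Map<String, Object> stored = Map.of("radius", 10, "weightAttr", "pop");
        // ...so the client may supply none, or only a subset, of them.
        Map<String, Object> merged = resolveInputs(stored, Map.of("radius", 25));
        System.out.println(merged);
    }
}
```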
So the work here would be to update the WPS operations to also consider ProcessInfo objects in addition to the "raw processes".
This would allow for some interesting possibilities, such as defining simplified processes. Many processes have a number of parameters that the user may not care about (at least not at first). By wrapping them with a ProcessInfo object containing default values for some inputs, the process is made more accessible to that type of user.
This kind of approach makes it possible to create several layers of processes, with the processing power and flexibility in the "raw" processes, and modified custom "versions" of them exposed instead. This is also easy to see with an example. The "gs:Contour" process has one input to define a set of contour levels and another to define an interval; only one of them is used. This is error-prone, might be difficult for some users to understand, and makes it harder to create a robust UI, since automatic UIs that create the input components on the fly (like the current WPS request builder) cannot take that into account. By wrapping this process in two different ones ("contour from levels" and "contour from intervals"), a more robust and user-friendly alternative can be exposed, while keeping the real processing in the same process and without having to do any coding. A similar approach can be found in desktop processing applications such as QGIS, which exposes GRASS modules this way, splitting a module into several processes when the complexity of the GRASS command-line syntax is too high to be handled in the UI.
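The contour example can be sketched as two facades over one raw process, each fixing the choice between the mutually exclusive inputs so the exposed signature is unambiguous. The raw process here is a trivial stand-in (it just formats a string), not the real gs:Contour implementation.

```java
import java.util.List;

// Hypothetical sketch of splitting one dual-mode process into two facades.
public class ContourFacades {

    // Stand-in for the "raw" process: exactly one of levels/interval is used.
    static String rawContour(List<Double> levels, Double interval) {
        if (levels != null) return "contours at " + levels;
        return "contours every " + interval;
    }

    // Facade 1: "contour from levels" — no interval parameter exposed.
    static String contourFromLevels(List<Double> levels) {
        return rawContour(levels, null);
    }

    // Facade 2: "contour from intervals" — no levels parameter exposed.
    static String contourFromInterval(double interval) {
        return rawContour(null, interval);
    }

    public static void main(String[] args) {
        System.out.println(contourFromLevels(List.of(100.0, 200.0)));
        System.out.println(contourFromInterval(50.0));
    }
}
```

Each facade is just a ProcessInfo-style wrapper with one input pre-decided, so no new processing code is written.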
With the above definitions, two main cases can be found: a ProcessInfo object has all its elements completed (a process, its inputs, etc.), or some of them are missing (as in the case of a wrapped process explained previously). In the second case, the ProcessInfo does not have enough information to be used by the WMS, WCS, and WFS services, since it lacks inputs and cannot produce any output to be served by them. In the first case, since all inputs are filled in, the process itself is of little interest for the WPS service, as it cannot be reused with other input values.
For these reasons, it might be interesting to divide this effort in two parts:
This approach should not involve more work, just a different organization, and probably better structuring. Moreover, the second part above can be started when the first one is finished, as they are completely independent: WPS processes would already be integrated and process-based layers could be published. The second part would extend the possibilities of the WPS service, and enhance all the other services along the way, since they would now be linked with the WPS service through the ProcessInfo class.
Configuring ProcessInfo objects will require a UI. The first inclination would be to borrow from the WPS request builder, but I think this is an opportunity to revisit that design.
The following are some ideas for improvement.
A list of questions and design decisions we are looking for feedback on.
JD: Would a process store map 1:1 to the idea of a GeoTools ProcessFactory? Or should it just serve as an arbitrary grouping mechanism?
JD: By integrating directly into the Catalog/Services we introduce dependencies on all of the GeoTools processing modules.
|Document generated by Confluence on May 14, 2014 23:00|