This page last changed on Apr 23, 2012 by groldan.

Overview

The following use cases, identified in the GeoServer code base, should cover most situations where the main scalability or performance bottleneck lies in the Catalog's client code rather than in the Catalog's ability to serve large numbers of configuration objects.

  1. Secure Catalog Decorator: a full scan of Catalog resources is performed on each get*():List<T> request and a separate list is built for the objects accessible to the current user, even if the Catalog returns an immutable list, affecting both memory consumption and processing time.
  2. Wicket User Interface: the Home page gets the whole list of workspaces, stores, and layers only to get their size. Catalog resource list pages (e.g. LayerPage, StorePage, etc.) do so to a) return the iterator for the current page of data, b) obtain the full list of objects, c) obtain the filtered list of objects, d) obtain the total number of objects, and e) obtain the filtered number of objects.
  3. WMS GetCapabilities: generating a WMS Capabilities document implies fetching the full list of layers multiple times, in order to a) filter the layer list based on the request's NAMESPACE parameter, b) calculate the layer list's aggregated bounds, c) figure out a CRS common to all the layers, and d) build an in-memory layer tree in order to nest layers based on the LayerInfo's wms "path" attribute.

Use Cases

SecureCatalogImpl

org.geoserver.security.SecureCatalogImpl is a decorator around org.geoserver.catalog.Catalog that applies in-process filtering of restricted catalog resources based on the configured data security policies.

Whenever the current list of some concrete catalog resource is requested (e.g. getLayers():List<LayerInfo>), this in-process filtering consists of the following steps:

  1. Obtain the original list of catalog objects from the decorated Catalog.
  2. For each catalog object:
    1. Check whether the catalog object is accessible to the current user.
    2. If accessible, create a security decorator for the catalog object.
    3. Add the security decorator to the return list.
  3. Return the filtered list of catalog objects.

The problem with this approach is that a full scan of Catalog resources is performed on each request and a separate list is built for the accessible objects, even if the Catalog returns an immutable list, affecting both memory consumption and processing time.
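
The in-process filtering described above amounts to roughly the following loop. This is only an illustrative sketch and not the actual SecureCatalogImpl code: PlainLayer, WrappedLayer and EagerFilterSketch are hypothetical stand-ins, and the access check is reduced to a plain predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a catalog object such as LayerInfo. */
final class PlainLayer {
    final String name;
    PlainLayer(String name) { this.name = name; }
}

/** Hypothetical security decorator around a single catalog object. */
final class WrappedLayer {
    final PlainLayer delegate;
    WrappedLayer(PlainLayer delegate) { this.delegate = delegate; }
}

final class EagerFilterSketch {
    /** Scans every object and copies the accessible ones into a new list. */
    static List<WrappedLayer> filter(List<PlainLayer> all, Predicate<PlainLayer> accessible) {
        // A second list is allocated on every call
        List<WrappedLayer> result = new ArrayList<>();
        // Full scan, regardless of how few objects the user can see
        for (PlainLayer layer : all) {
            if (accessible.test(layer)) {
                // One decorator per accessible object, created up front
                result.add(new WrappedLayer(layer));
            }
        }
        return result;
    }
}
```

Both the scan and the allocations are repeated on every get*() call, so the cost grows with the total size of the catalog rather than with what the caller actually reads.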

A more scalable approach would be:

  1. Create a query predicate that matches catalog objects based on current user's credentials
  2. Query the decorated Catalog with that filter predicate
  3. Obtain the filtered and immutable list of Catalog objects
  4. Return a list decorator that applies a secured decorator to each returned object on demand.

In turn, the Catalog backend, besides returning only the list of matching objects, could translate the query predicate, or part of it, into the backend's native query language.
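
The four steps of the more scalable approach could look roughly like the sketch below. None of these types exist in GeoServer: Layer, SecuredLayer, QueryableCatalog, its listLayers(Predicate) method, and DecoratingList are hypothetical names introduced for illustration, and access control is simplified to a set of allowed workspace names.

```java
import java.util.AbstractList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a catalog object such as LayerInfo. */
final class Layer {
    final String name;
    final String workspace;
    Layer(String name, String workspace) { this.name = name; this.workspace = workspace; }
}

/** Hypothetical security decorator around a single catalog object. */
final class SecuredLayer {
    final Layer delegate;
    SecuredLayer(Layer delegate) { this.delegate = delegate; }
}

/** Hypothetical catalog that accepts a query predicate (steps 2 and 3). */
final class QueryableCatalog {
    private final List<Layer> layers;
    QueryableCatalog(List<Layer> layers) { this.layers = layers; }

    /** Filters in memory here; another backend could translate the predicate to its own query language. */
    List<Layer> listLayers(Predicate<Layer> filter) {
        return Collections.unmodifiableList(
                layers.stream().filter(filter).collect(Collectors.toList()));
    }
}

/** Read-only list view that decorates an element only when it is read (step 4). */
final class DecoratingList<S, T> extends AbstractList<T> {
    private final List<S> source;
    private final Function<S, T> decorator;

    DecoratingList(List<S> source, Function<S, T> decorator) {
        this.source = source;
        this.decorator = decorator;
    }

    @Override public T get(int index) { return decorator.apply(source.get(index)); }
    @Override public int size() { return source.size(); }
}

final class SecureCatalogSketch {
    static List<SecuredLayer> getLayers(QueryableCatalog catalog, Set<String> allowedWorkspaces) {
        // Step 1: build a predicate from the current user's permissions
        Predicate<Layer> accessible = layer -> allowedWorkspaces.contains(layer.workspace);
        // Steps 2 and 3: let the catalog return only the matching, immutable list
        List<Layer> matching = catalog.listLayers(accessible);
        // Step 4: wrap lazily instead of copying into a new list
        return new DecoratingList<Layer, SecuredLayer>(matching, SecuredLayer::new);
    }
}
```

With this shape, calling size() on the returned list creates no decorators at all, and a caller that reads only one page of results pays only for the elements it touches.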

Wicket User Interface

  • The GeoServer Home Page presents the number of available workspaces, stores, and layers. To do so, it asks the Catalog for the list of each of those resources and calls the list's size():int method (e.g. getCatalog().getLayers().size()). With a large number of resources (say, layers), this means going through the SecureCatalog back to the actual Catalog, each of which returns a safe copy of the actual catalog objects, just to finally get the number of objects in the list.
  • The Catalog resource list pages (e.g. org.geoserver.web.data.layer.LayerPage, org.geoserver.web.data.store.StorePage) present an even more challenging use case:

They present the full list of catalog objects of a given type in a paged list, allowing for sorting and filtering based on direct or computed properties.
They also display the total number of objects, as well as the number of objects that match the current filter, if any.

To do so, the GeoServer Wicket "framework" relies, through GeoServerDataProvider, on the following API and default behavior. Being a "template method" class, GeoServerDataProvider also provides the hooks needed to optimize this and avoid loading everything into memory:

abstract class GeoServerDataProvider<T> extends org.apache.wicket.extensions.markup.html.repeater.util.SortableDataProvider{

    /** @return iterator capable of iterating 
     *   over {first, first+count} items */
    @Override
    public Iterator<T> iterator(int first, int count) {
        List<T> items = getFilteredItems();

        // global sorting
        Comparator<T> comparator = getComparator(getSort());
        if (comparator != null) {
            Collections.sort(items, comparator);
        }

        // in memory paging
        int last = first + count;
        if (last > items.size())
            last = items.size();
        return items.subList(first, last).iterator();
    }

    /** @return the size of the filtered item collection */
    @Override
    public int size() {
        return getFilteredItems().size();
    }

    /** @return a non filtered list of all
     *  the items the provider must return 
     */
    protected abstract List<T> getItems();

    /** @return the global size of the collection, 
     *  without filtering it 
     */
    public int fullSize() {
        return getItems().size();
    }

    /** @return a filtered list of items. Subclasses can 
     *   override if they have a more efficient way of filtering
     *   than in memory keyword comparison 
     */
    protected List<T> getFilteredItems() {
        List<T> items = getItems();

        // if needed, filter
        if (keywords != null && keywords.length > 0) {
            return filterByKeywords(items);
        } else {
            // make a deep copy anyways, the catalog 
            // does not do that for us
            return new ArrayList<T>(items);
        }
    }
    .....
}

Then, concrete resource list pages (e.g. LayerPage) use specializations of GeoServerDataProvider to fill in a GeoServerTablePanel, which in turn relies only on GeoServerDataProvider's public API (size():int, fullSize():int, iterator(int, int):Iterator<T>).

The protected List<T> getItems() method is implemented by concrete data providers such as:

class LayerProvider extends GeoServerDataProvider<LayerInfo>{
    @Override
    protected List<LayerInfo> getItems() {
        return getCatalog().getLayers();
    }   
    ....
}

So in this case, a full scan and a defensive copy of the catalog resources are made for each of:

  1. Getting the total number of objects
  2. Getting the filtered number of objects
  3. Getting the filtered list of objects to return an Iterator

This makes a catalog object list page very resource intensive (in both memory and processing), to the point of being impractical as the number of resources grows.

An approach that scales better should allow for:

  • Getting the number of objects that match a query predicate directly from the Catalog, with no need to traverse a list of results.
  • Obtaining an iterator directly from the Catalog for the objects that match a query predicate.
  • Specifying a sort order so that the results come back already sorted from the Catalog.
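
As a rough illustration of these three capabilities, the sketch below shows a data provider that delegates counting, paging, and sorting to the catalog. Everything here is an assumption made for the example: PagedCatalog, its count and list methods, InMemoryPagedCatalog, and PagedProvider are hypothetical names, not existing GeoServer or Wicket API. The in-memory implementation exists only to make the sketch runnable; a real backend would be expected to push the filter, offset, limit, and sort order down to its storage.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Stream;

/** Hypothetical query-capable catalog facade. */
interface PagedCatalog<T> {
    /** Number of objects matching the filter, without returning them. */
    int count(Predicate<T> filter);

    /** Matching objects for one page, optionally sorted (sortOrder may be null). */
    Iterator<T> list(Predicate<T> filter, int offset, int limit, Comparator<T> sortOrder);
}

/** In-memory implementation used only to make the sketch runnable. */
final class InMemoryPagedCatalog<T> implements PagedCatalog<T> {
    private final List<T> items;

    InMemoryPagedCatalog(List<T> items) { this.items = new ArrayList<>(items); }

    @Override public int count(Predicate<T> filter) {
        int n = 0;
        for (T item : items) {
            if (filter.test(item)) n++;
        }
        return n;
    }

    @Override public Iterator<T> list(Predicate<T> filter, int offset, int limit,
                                      Comparator<T> sortOrder) {
        Stream<T> matching = items.stream().filter(filter);
        if (sortOrder != null) {
            matching = matching.sorted(sortOrder);
        }
        return matching.skip(offset).limit(limit).iterator();
    }
}

/** Provider exposing the same three operations a table panel needs. */
final class PagedProvider<T> {
    private final PagedCatalog<T> catalog;
    private final Predicate<T> filter;
    private final Comparator<T> sortOrder;

    PagedProvider(PagedCatalog<T> catalog, Predicate<T> filter, Comparator<T> sortOrder) {
        this.catalog = catalog;
        this.filter = filter;
        this.sortOrder = sortOrder;
    }

    /** Total number of objects, ignoring the filter. */
    int fullSize() { return catalog.count(item -> true); }

    /** Number of objects matching the current filter. */
    int size() { return catalog.count(filter); }

    /** Only the requested page is ever handed to the caller. */
    Iterator<T> iterator(int first, int count) {
        return catalog.list(filter, first, count, sortOrder);
    }
}
```

The provider never holds a list of catalog objects, so neither the home page counts nor a single page of a list view require a defensive copy of the whole catalog.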

WMS GetCapabilities

Performing a WMS GetCapabilities request when the number of layers is large enough (tested with 10,000 and 100,000) quickly becomes impractical, as the Capabilities_1_3_0_Translator fetches the full list of layers multiple times for some post-processing:

  1. To filter the layer list based on the request's NAMESPACE parameter, if present;
  2. To calculate the layer list's aggregated bounds;
  3. To build an in-memory layer tree in order to nest layers based on the LayerInfo's wms "path" attribute.

This means a capabilities request can potentially make GeoServer run out of memory, at least in the cases where the catalog storage is off heap. Even if the catalog is fully in memory, creating a separate list of LayerInfo when there is a namespace filter, and creating the in-memory LayerTree object that holds all layers, can also lead to problems.

Although an argument can be made that a GetCapabilities response with tens or hundreds of thousands of layers would be totally impractical for almost any client (except perhaps a crawler), such an operation should still not bring GeoServer down.

A possible solution does not seem to depend only on a better (or streaming) Catalog API, but also on improving the logic of the GetCapabilities translator itself:

  1. The namespace filter should be passed back to the catalog backend, so that the translator gets only the matching layers with no need for in-process filtering.
  2. Building an in-memory tree of all the layers for the rather rare case of layers configured with the wms path attribute is not practical. It would be better to encode all the layers that are not nested through the wms path attribute in a streamed way, and to build the in-memory tree only for those that do have the wms path set.
  3. Furthermore, it would be possible/desirable to:
    1. Do a single pass over the resulting list of layers, encoding the layers to a temporary resource while at the same time building the aggregated bounds.
    2. Cache the result of the whole operation so that the cached document can be returned as long as some change indicator, such as the updateSequence, has not changed. Some thought is needed to account for the cases where the configuration changes behind GeoServer's back, such as a third process modifying the backend storage directly (it could even be another GeoServer in a load-balanced setup; real clustering should take care of a shared updateSequence). This may become even more complex because authentication/authorization would need to be taken into account. If done at all, caching for the anonymous user would perhaps bring the highest benefit at the lowest cost.
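
The single-pass idea from points 2 and 3.1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the translator's actual logic: CapsLayer and SinglePassCapabilities are hypothetical types, the XML output is reduced to a bare Layer/Name element without escaping, and caching is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified layer: name, optional wms path, and a bounding box. */
final class CapsLayer {
    final String name;
    final String wmsPath; // null when the layer is not nested
    final double minX, minY, maxX, maxY;

    CapsLayer(String name, String wmsPath,
              double minX, double minY, double maxX, double maxY) {
        this.name = name;
        this.wmsPath = wmsPath;
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }
}

final class SinglePassCapabilities {
    // Aggregated bounds, expanded while the layers stream by
    double minX = Double.POSITIVE_INFINITY;
    double minY = Double.POSITIVE_INFINITY;
    double maxX = Double.NEGATIVE_INFINITY;
    double maxY = Double.NEGATIVE_INFINITY;

    // Only layers that have a wms path are kept in memory for later nesting
    final List<CapsLayer> nested = new ArrayList<>();

    /** Consumes the iterator once, writing non-nested layers immediately. */
    void encode(Iterator<CapsLayer> layers, Writer out) {
        try {
            while (layers.hasNext()) {
                CapsLayer layer = layers.next();

                // Accumulate the aggregated bounds in the same pass
                minX = Math.min(minX, layer.minX);
                minY = Math.min(minY, layer.minY);
                maxX = Math.max(maxX, layer.maxX);
                maxY = Math.max(maxY, layer.maxY);

                if (layer.wmsPath == null) {
                    // Streamed out right away (XML escaping omitted in this sketch)
                    out.write("<Layer><Name>" + layer.name + "</Name></Layer>\n");
                } else {
                    nested.add(layer);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the aggregated bounds are only known after the pass completes, the streamed layer elements would have to go to a temporary resource first, as point 3.1 suggests, and be copied into the final document once the bounds and the nested tree have been written.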


Document generated by Confluence on May 14, 2014 23:00