GeoServer : GeoGit approach
This page last changed on Aug 09, 2011 by cholmes.
Following on the core Versioning WFS work, in 2011 OpenGeo started experimenting with a new way to handle versioning, drawing on git, the widely used distributed version control system originally built for Linux kernel development. Two different paths were taken, both of which warrant further investigation:
The first experiment was to directly use git and github, storing files of geojson and reading from there. David Winslow managed to get this working in a limited way, reading in to GeoTools. A storage scheme was devised that implements spatial indexing without using any binary files, and keeping the size of any individual file within a limited range. This is accomplished by using a directory structure in the filesystem itself to implement the node structure of a quad-tree index. Feature data is stored in GeoJSON format in the bottom-most directories of the quad-tree, separated by newlines to increase the usefulness of line-oriented differences.
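A minimal sketch of how such a directory scheme might map a feature to a location, assuming a fixed world extent, a fixed depth, and a quadrant numbering of 0=SW, 1=SE, 2=NW, 3=NE. The function names and the numbering are hypothetical, not taken from the actual implementation.

```python
import json


def quad_path(x, y, bbox=(-180.0, -90.0, 180.0, 90.0), depth=4):
    """Return the quad-tree directory path (e.g. '2/1/2') containing a point.

    The quadrant numbering (0=SW, 1=SE, 2=NW, 3=NE) is an assumption made
    for this sketch only.
    """
    minx, miny, maxx, maxy = bbox
    parts = []
    for _ in range(depth):
        midx = (minx + maxx) / 2.0
        midy = (miny + maxy) / 2.0
        quadrant = 0
        if x >= midx:
            quadrant += 1      # eastern half
            minx = midx
        else:
            maxx = midx
        if y >= midy:
            quadrant += 2      # northern half
            miny = midy
        else:
            maxy = midy
        parts.append(str(quadrant))
    return "/".join(parts)


def feature_line(feature):
    """Serialize one GeoJSON feature on a single line, so that a
    line-oriented diff shows exactly one changed line per changed feature."""
    return json.dumps(feature, sort_keys=True, separators=(",", ":"))


feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-74.0, 40.7]},
    "properties": {"name": "example"},
}
x, y = feature["geometry"]["coordinates"]
print(quad_path(x, y, depth=3))
print(feature_line(feature))
```

Sorting the keys keeps the serialized line stable between writes, which is what makes line-oriented differences useful.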
All versioning operations supported by Git work well with this scheme, although there may be some issues with merging after differing changes from separate authors result in differing structures of the quad tree index. Git will keep track of these, but it will make it hard to easily figure out which feature actually changed and which ones were just rebalanced.
Git also provides the ability to retrieve individual directories from old revisions, so having a quad-tree structure reflected in the directories allows performing rough bounding-box queries against old revisions without first retrieving the other datasets.
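As a rough illustration of the bounding-box idea, the sketch below lists which quad-tree directories a query box touches. Only those paths would then need to be fetched from the old revision (for example with something like `git ls-tree -r <revision> -- <path>`). The extent, depth and quadrant numbering are assumptions for this example.

```python
def intersecting_nodes(query, bbox=(-180.0, -90.0, 180.0, 90.0), depth=2, prefix=""):
    """Return the leaf directory paths whose extent intersects the query box.

    Boxes are (minx, miny, maxx, maxy). Quadrants are numbered
    0=SW, 1=SE, 2=NW, 3=NE, which is an assumption of this sketch.
    """
    qminx, qminy, qmaxx, qmaxy = query
    minx, miny, maxx, maxy = bbox
    # Skip this node entirely if the two boxes are disjoint.
    if qmaxx < minx or qminx > maxx or qmaxy < miny or qminy > maxy:
        return []
    if depth == 0:
        return [prefix or "."]
    midx = (minx + maxx) / 2.0
    midy = (miny + maxy) / 2.0
    children = {
        "0": (minx, miny, midx, midy),
        "1": (midx, miny, maxx, midy),
        "2": (minx, midy, midx, maxy),
        "3": (midx, midy, maxx, maxy),
    }
    result = []
    for name, child_bbox in children.items():
        child_prefix = f"{prefix}/{name}" if prefix else name
        result.extend(intersecting_nodes(query, child_bbox, depth - 1, child_prefix))
    return result


print(intersecting_nodes((10.0, 10.0, 20.0, 20.0)))
```

The result is only a rough filter: features in the returned directories still have to be tested against the query box individually.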
Currently there is a rudimentary implementation of this storage format built on top of the GeoTools framework. It is capable of reading and writing, but has the following shortcomings:
Further work and investigation are needed to see whether it can scale up to the size of data that geospatial users expect.
In parallel with the first approach, Gabriel Roldan has been implementing a GeoGit backend, drawing on the core concepts of git. At this point this implementation is much more complete than the fully git-based work done by David Winslow, as it's had a much larger investment.
The code for the core repository can be found at https://github.com/opengeo/GeoGIT. This code backs both a GeoSynchronization Service module (a spec by the OGC to synchronize data) and the versioning constructs of WFS2. The plan is to eventually get both in to the standard distributions of GeoServer.
The best place to start with the core concepts of git is Git for computer scientists. I won't repeat all that it says, but basically Directed Acyclic Graphs are cool, and put to really good use. The three main types of things in git are contents (files, stored as blobs), commits, and trees. These are all objects in git's store, and together they represent everything.
The main difference between standard git and what's needed for geospatial is the structure of the tree. A typical set of work versioned by standard git usually has quite a few branches - every directory has lots more directories. With geospatial we get a very flat tree - not many branches, lots of leaves. For the direction of using git directly to store geospatial information we introduced more branches, grouping features by logical geospatial bounding box into single files that were put in a quad-tree of directories. So we took a bunch of flat features and stretched them out.
For the geogit implementation we instead code things to be optimized for the fact that a typical geospatial representation doesn't have much nesting - it's just a bunch of features. The spatial index is orthogonal; it doesn't need to be part of the tree structure. But the problem we had when we tried to use straight git with lots of leaves is that it wouldn't really work. It doesn't scale to a single directory with millions of files.
So to map to geospatial we redid the structures a bit, made the code optimized for the structure of geospatial data. So in the mapping of concepts instead of 'files' as the base content we have geospatial 'features'. And instead of directories we have 'featureTypes' to split up the tree a bit.
A 'commit' is a type of object that points to a tree: the tree for the full state of the project at that point in time. Each commit also points to its parent commit(s), so the commits together form the Directed Acyclic Graph of the history. What changed in a commit is found by comparing its tree with its parent's tree.
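A minimal sketch of how commits and trees relate in a git-style store, assuming plain strings as identifiers (real identifiers would be SHA-1 hashes). The `Commit` class and `history` function are hypothetical names for illustration, not GeoGit's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Commit:
    tree: str                 # id of the tree holding the full state
    parents: Tuple[str, ...]  # ids of parent commits (empty for the first one)
    message: str


def history(commits: Dict[str, Commit], head: str) -> List[str]:
    """Follow first-parent links from head back to the initial commit."""
    order = []
    current = head
    while current is not None:
        order.append(current)
        parents = commits[current].parents
        current = parents[0] if parents else None
    return order


commits = {
    "c1": Commit(tree="t1", parents=(), message="initial import"),
    "c2": Commit(tree="t2", parents=("c1",), message="moved a road"),
    "c3": Commit(tree="t3", parents=("c2",), message="added a park"),
}
print(history(commits, "c3"))
```

Each commit names one complete tree, so any past state can be read directly without replaying changes.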
In normal git the tree and commits track the contents of files (mostly text). We want to track the content of 'features', which have no canonical representation. In text there is a canonical representation - it has its own blob format with metadata about the file, like the charset. It knows its state.
So what we do in geogit is create a canonical representation of a 'feature'. We use BXML (binary xml). But this could really be anything; BXML just seemed to be a good candidate because it's nice and small, and Gabriel had been sitting on good readers and writers for it for a while. This is separate from the actual data - that stays in the database that is versioned. All that's really needed for a datastore to be versioned is for it to produce stable feature IDs. When you 'version' a datastore you build up the canonical representation of the features it holds, in BXML, and start building the DAG of changes that happen. These changes are made in the core datastore, but are also all held in the geogit repository. For that repository we use Berkeley DB Java Edition, which is a very solid, robust and small key-value store.
The BXML that holds the canonical representation doesn't hold any feature names - just the contents. The featureType, with its attributeTypes, is stored outside, in the tree. The BXML just stores the actual attribute data, and the geometry is represented as Well Known Binary, as there are good tools to read and diff it.
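To illustrate why a canonical representation matters, here is a small sketch that serializes a feature's attribute values and WKB geometry to deterministic bytes and hashes them. It does not use BXML: the length-prefixed encoding below is a stand-in invented for this example, and the function names are hypothetical.

```python
import hashlib
import struct


def point_wkb(x, y):
    """Well Known Binary for a 2D point: byte order (1 = little endian),
    geometry type (1 = Point), then the two coordinates as doubles."""
    return struct.pack("<BIdd", 1, 1, x, y)


def canonical_bytes(values, geometry_wkb):
    """Encode attribute values (in featureType order, without names)
    followed by the geometry. This is a stand-in for BXML, not BXML itself."""
    out = bytearray()
    for value in values:
        encoded = repr(value).encode("utf-8")
        out += struct.pack(">I", len(encoded)) + encoded
    out += struct.pack(">I", len(geometry_wkb)) + geometry_wkb
    return bytes(out)


def feature_hash(values, geometry_wkb):
    """SHA-1 of the canonical bytes: equal content always gives an equal hash."""
    return hashlib.sha1(canonical_bytes(values, geometry_wkb)).hexdigest()


a = feature_hash(["Main St", 2], point_wkb(-74.0, 40.7))
b = feature_hash(["Main St", 2], point_wkb(-74.0, 40.7))
c = feature_hash(["Main St", 3], point_wkb(-74.0, 40.7))
print(a == b, a == c)
```

Because attribute names live in the featureType rather than in the encoded feature, only the values and geometry contribute to the hash.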
The geospatial git tree holds a featureid instead of a filename, plus a pointer to the featureType. In time that featureType will be a blob itself, so that we can support the evolution of featureTypes over time with the same system. For now that isn't implemented; the featureType just comes from GeoServer's catalog.
A diff just takes two commits, each representing the state at a given time, then walks their two trees and finds the differences. This is easy because everything is keyed by SHA-1 hashes, just like in git. Hashes are aggregated, so a tree's hash covers its whole contents. Because everything is hashed, both containers and individual features, differences can be found very quickly: two subtrees with the same hash are identical and need not be walked.
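The hash-based diff can be sketched as follows, assuming a two-level layout of featureType trees holding feature hashes. The way the tree hash is aggregated here (sorted entries fed into SHA-1) is an assumption for illustration, not GeoGit's actual format.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash of one feature's canonical bytes."""
    return hashlib.sha1(content).hexdigest()


def tree_id(entries: dict) -> str:
    """Aggregate hash of a tree: covers every (feature id, hash) entry."""
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(name.encode("utf-8") + b"\0" + entries[name].encode("ascii") + b"\n")
    return h.hexdigest()


def diff_roots(old_root: dict, new_root: dict) -> dict:
    """Compare two states given as {featureType: {feature id: blob hash}}."""
    changes = {}
    for type_name in sorted(old_root.keys() | new_root.keys()):
        old_tree = old_root.get(type_name, {})
        new_tree = new_root.get(type_name, {})
        # Equal aggregate hashes mean identical contents: skip the whole tree.
        if tree_id(old_tree) == tree_id(new_tree):
            continue
        for fid in sorted(old_tree.keys() | new_tree.keys()):
            before, after = old_tree.get(fid), new_tree.get(fid)
            if before is None:
                changes[(type_name, fid)] = "added"
            elif after is None:
                changes[(type_name, fid)] = "removed"
            elif before != after:
                changes[(type_name, fid)] = "modified"
    return changes


old = {
    "roads": {"roads.1": blob_id(b"a"), "roads.2": blob_id(b"b")},
    "parks": {"parks.1": blob_id(b"p")},
}
new = {
    "roads": {"roads.1": blob_id(b"a2"), "roads.3": blob_id(b"c")},
    "parks": {"parks.1": blob_id(b"p")},
}
print(diff_roots(old, new))
```

In this example the `parks` tree is never walked, because its aggregate hash is unchanged.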
Though the current implementation uses Berkeley DB Java Edition, it could pretty easily use any key-value store, potentially something like Amazon S3 for cloud repositories.
The object store doesn't know what it holds; it just maps hashes (the keys) to contents (the values).
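A content-addressed object store of this kind can be sketched in a few lines. The `ObjectStore` class is hypothetical, and an in-memory dict stands in for Berkeley DB or any other key-value backend.

```python
import hashlib


class ObjectStore:
    """Maps SHA-1 hashes to opaque byte contents.

    The backend only needs dict-like get/set behaviour, so an in-memory dict
    (used here) could be swapped for another key-value store.
    """

    def __init__(self, backend=None):
        self._kv = {} if backend is None else backend

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._kv[key] = content   # storing the same content twice is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self._kv[key]

    def __contains__(self, key: str) -> bool:
        return key in self._kv

    def __len__(self) -> int:
        return len(self._kv)


store = ObjectStore()
k1 = store.put(b"feature bytes")
k2 = store.put(b"feature bytes")
print(k1 == k2, len(store))
```

Identical content always lands under the same key, so unchanged features cost no extra storage across versions.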
For serving it up, you get the data directly from PostGIS as long as you are asking for the head. Synchronization happens with every WFS transaction, which gets stored in the repository. So every WFS-T request coming in has to be synced both to the git repository and to the original datastore.
Editing the versioned PostGIS datastore with something like QGIS would screw everything up. In git terms this is like editing a remote repository without syncing with the local one. If the backend is unversioned then syncing will require a full scan. This is possible, but it is an expensive operation. With a versioned backend the sync could just ask for the diffs and put them into the git repository.
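A rough sketch of what a full scan against an unversioned backend involves, assuming the repository keeps an index of feature id to hash from the last sync. The function name and both data structures are hypothetical.

```python
import hashlib


def full_scan_sync(datastore_features, repo_index):
    """Detect out-of-band edits by re-hashing every feature in the datastore.

    datastore_features: {feature id: canonical bytes} as currently stored
    repo_index: {feature id: sha1 hex digest} as of the last sync
    Returns (changes, new_index).
    """
    changes = {}
    new_index = {}
    for fid, content in datastore_features.items():
        digest = hashlib.sha1(content).hexdigest()
        new_index[fid] = digest
        if fid not in repo_index:
            changes[fid] = "added"
        elif repo_index[fid] != digest:
            changes[fid] = "modified"
    for fid in repo_index:
        if fid not in datastore_features:
            changes[fid] = "removed"
    return changes, new_index


last_sync = {
    "roads.1": hashlib.sha1(b"old geometry").hexdigest(),
    "roads.2": hashlib.sha1(b"unchanged").hexdigest(),
}
current = {"roads.1": b"new geometry", "roads.2": b"unchanged", "roads.3": b"brand new"}
changes, index = full_scan_sync(current, last_sync)
print(changes)
```

Every feature is read and hashed on each run, which is why this is expensive compared with asking a versioned backend for its diffs.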
So to work well with mixed editing environments where not everything is going through WFS-T we need to find strategies to make sure the repository gets notified. This could be a PostGIS trigger or a QGIS plugin. Or in the proprietary world it could be a trigger or routine in a file geodatabase or a plugin for ArcMap. Of course doing this could allow for some cool workflows - ideally you would make plugins so they can work on their own local repository, getting true distributed versioning. It'd be your working tree: you'd make your changes there, but could then push them to a GeoServer, just like you push to Gitorious or GitHub. But short of building those tools we can also just devise strategies that make sure the repository gets updated, perhaps with a script.
(I think Gabriel may have already worked this next challenge out)
One big technical challenge is handling very large trees. The first approach is to make the tree itself a quad tree, though that may not work for the generic case, as we need quick access by ID, not just by geometry. The other approach is to hash the feature IDs themselves. The current state is flat feature types that take a lot of memory, because each feature ID needs to be held in memory for fast access. To truly scale we need to work out how to handle huge trees. The current implementation can handle more than standard git, but we need to make it even bigger, to handle any geospatial information that can be thrown at it.
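The second approach, hashing the feature IDs, might look something like the sketch below: the first characters of each ID's hash select a bucket, so a flat tree of millions of entries becomes many small subtrees that can still be found by ID alone. The bucket width and depth are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def bucket_path(feature_id, levels=2, width=2):
    """Derive a nested bucket path from the SHA-1 of the feature id.

    With width=2 hex characters each level has up to 256 buckets.
    """
    digest = hashlib.sha1(feature_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts)


def group_by_bucket(feature_ids, levels=1, width=2):
    """Group feature ids by bucket, as a flat tree split into subtrees."""
    buckets = defaultdict(list)
    for fid in feature_ids:
        buckets[bucket_path(fid, levels, width)].append(fid)
    return dict(buckets)


ids = [f"roads.{i}" for i in range(10000)]
buckets = group_by_bucket(ids)
print(len(buckets), max(len(v) for v in buckets.values()))
```

Looking up a feature only requires hashing its ID and loading one bucket, so not every feature ID has to be held in memory at once.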
Other things that need work include branching, though with the core structure it shouldn't be that hard. A much bigger task will be the visualization tools on the client side to figure out diffs and resolve conflicts. The implementation could also use more performance optimizations and testing against large datasets.
WFS-T is supported, using the 'handle' for commit messages. And with WFS2 you can request some different versions. Of interest also is ESRI's REST GeoServices spec, which was recently submitted to OGC. It has no versioning constructs yet, but they could potentially be added. Even without that it could be used by clients to edit with transparent versioning; things like commit messages just likely wouldn't be possible.
Currently the GeoGit implementation has only been tested against PostGIS. But it is designed to work against any datastore that can give out stable IDs, so any database backend, including Oracle, ArcSDE, SQL Server and DB2. Shapefiles should not be used, and we still need to make the user interface prevent users from trying to.
It should be noted that GeoGit is also totally compatible with versioning backends, like Oracle with Workspace Manager or ArcSDE with versioning turned on. There will obviously be a lot of details to work out, but conceptually it would interact with them the way that git interacts with subversion. The versioning database would be able to have edits and versions and push them to the GeoGit repository. So people could edit directly against the Oracle backend, and you'd see the same edits in the GeoGit view, just like you see SVN changes. It'd be able to be a lot more efficient with the syncing if edits happened abnormally, as it could just get the latest diffs, instead of having to do a whole scan. So in time one should be able to plug GeoGit on top of an existing Oracle Workspace Manager implementation and import its history and then make it so it can be distributed by those operating against GeoGit. Of course the devil is in the details and there is a lot here to tackle, but conceptually things should be quite compatible.
GeoGit has lots of potential for mobile applications, especially disconnected operations. There are two ways these could work - with a repository on board, or by syncing to a repository afterwards. The first should be possible with any laptop, netbook, or Windows or Android-based tablet. It'd also be possible on Android phones, as all could run Berkeley DB and Java, and thus run the geogit repository. Every edit in that case would be recorded, and all changes just pushed up when online. All could use OpenLayers mobile tools and HTML5 for the editing. For things like iPhone, iPad and other mobile phones we still could use it for offline editing, using local HTML5/SQLite storage. The only difference is that all the edits would come in at once, though one could easily build a system that would queue up the changes and let someone approve each one. Both would be radical improvements over how offline editing is currently done, giving a local store for offline use that is part of a system that tracks all changes.
Both approaches warrant further investigation. While the GeoGit Java code is much more complete, the git-based work potentially has a significant advantage - working with all the existing git tools, which are substantial. If it could be made to scale and people could just store their geospatial information in any git repository (including GitHub) then uptake could potentially be a lot faster. Others may argue, though, that making it work well with geospatial information will require new workflows and integration with GIS tools, and that GitHub is only accessible to coders. Our main feeling is that both are worth extensive investigation, and that there's a powerful core to distributed versioning that should be applied to geospatial information. It may be that neither ultimately survives, but we intend to fully document all we build and learn, so that others can build on top of it.