This page last changed on Jul 10, 2008 by dwinslow@openplans.org.

Summary

This conversation took place in #geoserver as David was handing off the KML regionating code to Andrea (at least for a while).  We talked a bit about regionating in general, a bit about how GeoServer behaves (ie how do you request regionated KML?) and a bit about how regionating is and should be implemented.  In particular we discussed Regionating Improvements for piecewise regionating of datasets so we don't need to preprocess the entire featureset to get a tile.

 Regionating

  • What happens if a feature pushed down to level N does not fit in any single tile at level N?
    • The current approach is to simply assign it to the tile that contains its centroid and not worry about it too much.  This is pretty braindead, but seems to work well in practice. 
      • Concern: This approach produces visual artifacts in the form of very large features that don't show up until you zoom in too far to see the entire feature.  You can work around this by using "geometry size" (ie, area or length for polygons or lines respectively) as the sorting criterion (which doesn't entirely preclude having big features at low zoom levels, but if you still have problems this way then your dataset is just pathological and probably needs human attention to regionate properly).
    • Andrea's suggestion was to bubble the feature up the hierarchy until it does fit into a tile. 
      • Implementation detail: Do we push other features down to compensate? That could involve modifying a lot of tiles for each oversized feature.
      • Implementation detail: Doing this effectively precludes lazy loading, since you can't know what features will end up in the root tile until you've checked that nothing will bubble up from any of the leaves.
      • Other concern: This technique means we're abandoning the requested order of precedence in order to make things 'look better.'  Aside from being philosophically questionable, you'll need to provide some other visual cue to signify relative values of the attribute that you want to visualize.
      • Other concern: This technique is susceptible to piling up many small features at the borders of tiles.
    • A third option is to split the geometry across as many tiles as it needs and keep it at the current zoomlevel.
      • Implementation detail: How does a partial geometry count toward the tile feature limit?
      • Implementation detail: We need to ensure that only one placemark description is output per feature.  This could probably be accomplished simply by assigning the description to the tile containing the centroid, although a pathological feature could be devised that doesn't intersect the tile that contains its centroid.
  • How do you handle datasets with no reasonable attribute to use as 'prominence'?
    • The current approach is to simply sort by "geometry size" which works pretty well for polygons and lines.  For points there is basically no good option; however it is useful to order the features randomly to counteract any clustering done by the database (ie, we frequently saw points lining in a grid pattern when serving from PostGIS.)

 GeoServer Behavior

  • What size should the root tile be? That is, at what scale do features begin to show up?
    • Currently we use the smallest normalized (according to the WMS Tiling Recommendation) tile whose bounds contains the entire dataset. This seems pretty sensible.
    • Another option would be to start at world bounds regardless of dataset, but no strong argument was provided for this approach.
  • David suggested setting up the data regionator to automatically randomize when sorting point data by geometry.
    • This is good because then sorting by geometry will always give decent looking results.
    • This is bad because you asked GeoServer to sort by geometry and that's not what it would be doing.  Andrea suggests that GeoServer should return an error message on such a request; David thinks an error message is not in order since sorting by an attribute that has no variance should simply have no effect.
    • An option would be to have a "guess" strategy that looks at the data and tries to figure out something acceptable (like sorting by the attribute with the highest variance or similar).  If the documentation just stated that GeoServer would try to do its best we could do whatever we wanted for this really, and provide more specific strategies as well.
  • How do you activate the regionating code?
    • To get a single regionated tile, do whatever WMS request with format=KML that you normally would, and add '&format_options=regionateby:data;regionateattr:someattribute' to regionate based on the named attribute.
    • To get a KML document that uses NetworkLinks to load the regionated tiles at the 'right time' while zooming in, point Google Earth at http://...geoserver/gwc/service/kml/<layer>.kml.kmz .  Currently GeoWebCache is required for viewing regionated KML from GeoServer in Google Earth.
    • We should probably also support Google Earth viewing without GWC; this could be done through the KML reflector.

Implementation

Some implementation notes are provided in the parent page to this one; some clarifications were made to this proposal after this discussion as well.

Full Log

(01:55:32 PM) dwins: aaime: could you take a look at http://geoserver.org/display/GEOS/Regionating+Improvements please
(01:55:34 PM) sigq: Title: Regionating Improvements - GeoServer (at geoserver.org)
(01:55:53 PM) dwins: for GeoServer 1.6 you need jdk1.5 or better
(01:55:53 PM) aaime: am I getting you mad with my observations?
(01:56:31 PM) cholmes: arneke: submitted - said to wait up to 10 days
(01:56:35 PM) dwins: nope, just seems like arne and i came up with something that addresses a number of the issues you're bringing up
(01:56:53 PM) dwins: so maybe you can save some time by looking at it
(01:56:57 PM) arneke: cool, we've got until August 2nd
(01:57:41 PM) arneke: dwins: do you have a local, recent copy of geoserver running?
(01:57:50 PM) dwins: yes
(01:58:03 PM) dwins: i have local recent versions of trunk, 1.6.x, and 1.7.x
(01:58:08 PM) arneke: can you try adding two layers to a layer group
(01:58:10 PM) aaime: dwins, looks good, it just does not tell what's happening in memory, and what directly against the db
(01:58:11 PM) dwins: which one are you looking for?
(01:58:15 PM) arneke: and see whether max x and max y are junk ?
(01:58:23 PM) aaime: but I like the fact there is just one algorithm
(01:58:31 PM) dwins: me too
(01:59:02 PM) aaime: however, sortby=importance may not be always possible
(01:59:10 PM) aaime: how do you sort by geometry size?
(01:59:22 PM) aaime: it's a good strategy for the attribute based regionating
(01:59:37 PM) aaime: but not much for the geometry size based one
(01:59:46 PM) aaime: (only can only order on plain attributes)
(02:00:18 PM) dwins: arne and I talked about sorting when the db can do it, and doing a pass over the results from a bbox query otherwise
(02:00:23 PM) gdavis_ left the room.
(02:00:38 PM) aaime: it does not show in the snipped on the wiki
(02:00:40 PM) arneke: (geometry size is tricky with Google Earth. The features your draw first stay on top, so if you draw the big ones first you never see the small ones underneath)
(02:01:13 PM) dwins: so we should probably add a z index / altitude based on how deep in the hierarchy we are
(02:01:14 PM) aaime: arneke, that does not sound very nice...
(02:01:48 PM) arneke: aaime: it's not,, we can reverse the z index explicitly with additional configuration
(02:02:01 PM) aaime: actually, for geometry size based you could do one pass and store the fid and size in a h2 temp table, and then return the result of querying it sorting on the size
(02:02:23 PM) arneke: err.. that's not very easy to parse,, what I mean you can explicitly set a Z index in the KML
(02:02:29 PM) aaime: so if instead of returning a feature you return just the feature id, and have the subclass handle how that is done, you're gold no?
(02:03:23 PM) aaime: I think you can really build the whole tree in a single pass that way (or not?)
(02:03:23 PM) cholmes left the room (quit: Read error: 104 (Connection reset by peer)).
(02:03:33 PM) cholmes [n=cholmes@64.90.184.79.static.nyinternet.net] entered the room.
(02:03:52 PM) dwins: that sounds good, i think
(02:03:58 PM) ***dwins ponders for a moment
(02:04:11 PM) aaime: you're concerned with the time it takes to build the whole beast, right?
(02:04:25 PM) afabiani left the room (quit: "ChatZilla 0.9.83 [Firefox 3.0/2008052906]").
(02:04:39 PM) dwins: yes, right now it is quite slow when it's not doing everything in memory at once
(02:04:54 PM) aaime: for attribute based, lazy building of tiles could be done easily
(02:05:05 PM) aaime: ah, one thing
(02:05:17 PM) dwins: so how would the auxiliary h2 table handle bbox queries?
(02:05:30 PM) aaime: right
(02:05:57 PM) aaime: if you do all in one pass, you don't really need that, you just store the bbox as 4 floats
(02:06:29 PM) aaime: and decide whre to put each feature starting from the root going downwards
(02:06:55 PM) aaime: if you do it lazily hum... I stil think you can do it pretty quickly by properly indexing the 4 elements
(02:07:18 PM) aaime: there is a datastore in gt2 that handles geometryless datasets with bbox stored like that
(02:07:20 PM) aaime: let me see
(02:08:17 PM) iwillig left the room (quit: ).
(02:08:52 PM) dwins: eek. /me's compression filter is whining a lot on 1.6.x branch
(02:10:10 PM) jgarnett left the room.
(02:10:18 PM) dwins: arneke: i see the problem you're talking about on 1.6.x
(02:12:03 PM) aaime: dwins yes, it's a simple query
(02:12:17 PM) aaime: you just have to build an index on all the 4 columns
(02:12:52 PM) aaime: so you'd go thru the whole set of data once, build the support table, and then use it each time you have to build a tile
(02:12:54 PM) aaime: (lazily)
(02:13:16 PM) arneke: dwins: ok, cool, I'll file jira ?
(02:13:25 PM) arneke: the build on artois is 1.7 right?
(02:13:37 PM) dwins: arneke: yes for both
(02:14:18 PM) aaime: ok so my task would be to actually write this, right?
(02:14:27 PM) arneke:
(02:14:48 PM) dwins: aaime: so for bbox = [x1 y1 x2 y2] we would have 'SELECT feature FROM auxiliary WHERE maxX > x1 AND maxY > y1 AND minX < x2 AND minY < y2 ORDER BY importance;'
(02:15:00 PM) aaime: dwins, correct
(02:15:07 PM) aaime: (sort of)
(02:15:14 PM) dwins: i think it's up to you, cholmes has asked me to poke geoserver and look for bugs
(02:15:22 PM) dwins: i just want to add a note to the wiki page
(02:15:25 PM) aaime: sure thing
(02:15:27 PM) aaime: I just need
(02:15:41 PM) aaime: a) huge data sets... I just have tiger 2005, should be enoug (20 million roads?)
(02:15:53 PM) *arneke will hand you that
(02:15:58 PM) aaime: b) some examples of how you actually use the thing as a user
(02:16:00 PM) dwins: we have a 135.000.000 point data set i think
(02:16:16 PM) aaime: ok, and how many years will it take me to download that?
(02:16:25 PM) arneke: 1 >
(02:16:49 PM) arneke: it's 2Gb, but it's not that interesting because the columns are too big
(02:16:55 PM) arneke: i'll make you some
(02:17:11 PM) arneke: so we're dropping the indeces in favor of bboxes ?
(02:17:18 PM) arneke: indexes
(02:17:31 PM) aaime: what?
(02:17:32 PM) KevinIPS: what does this mean?!?! http://pastebin.com/d7c1b7068
(02:17:36 PM) aaime: which indexes?
(02:17:45 PM) dwins: no, the bbox is for an auxiliary table to help with generating the tiled index arneke
(02:17:45 PM) arneke: sorry, I meant tile index
(02:17:49 PM) arneke: ah
(02:17:53 PM) arneke: ok
(02:17:54 PM) aaime: no no, we keep that one
(02:18:04 PM) aaime: and what about the suggestion of splititng the storage in two tables?
(02:18:15 PM) aaime: x,y,z,tileid and tileid,featureid?
(02:18:15 PM) arneke: fine with me, I got that part I think
(02:18:26 PM) aaime: (it will take less space)
(02:18:35 PM) arneke: really ?
(02:18:54 PM) dwins: it will? each featureid is in exactly one tile
(02:19:30 PM) aaime: but with the current layout you repeat the 3 ints for all teh features
(02:19:36 PM) arneke: x,y,z,featureid , was what I thought, with

Unknown macro: {x,y,z}

and

Unknown macro: {featureid}

as the indexes
(02:20:00 PM) arneke: true, but I think the featureid is the killer
(02:20:13 PM) dwins: oh, i see
(02:20:19 PM) arneke: though it probably should just be the number, not the string
(02:20:19 PM) aaime: yeah... well, if we can use x,y,z as shorts
(02:20:30 PM) arneke: no
(02:20:38 PM) aaime: normal integers (not longs?)
(02:20:53 PM) arneke: hmmm
(02:21:02 PM) arneke: I think I up'ed it to long, but then went back
(02:21:14 PM) arneke: but they do get pretty huge pretty fasst
(02:21:34 PM) arneke: whatever is in there now I think works for just about every layer
(02:21:38 PM) aaime: ok, I will use just a simple flat table
(02:22:43 PM) aaime: Oh, I fear there is a hole in the algorithm in the wiki page
(02:22:48 PM) aaime: you stop when the node is full
(02:23:04 PM) aaime: but if the features afterwards cannot fit the sub-nodes?
(02:23:16 PM) aaime: dont' you have to make a full scan anyways?
(02:23:39 PM) dwins: this algorithm is intended to be lazy and only find the features in a single tile (and whatever other tiles it needs to do that)
(02:23:58 PM) dwins: i'm not sure what you mean about features not fitting the sub-nodes
(02:24:01 PM) arneke: yeah, I think you're right, we went back and forth on the assumption that you could request an ordered set
(02:24:20 PM) aaime: if the geometry is bigger than each sub-node area?
(02:24:28 PM) aaime: you have to move it into the parent no?
(02:24:32 PM) aaime: and you may have to recurse up
(02:24:57 PM) arneke: if the datastore can't sort, I agree, you have to do a tablescan
(02:25:01 PM) arneke: bah
(02:25:05 PM) dwins: right. damn features with area.
(02:25:14 PM) aaime: ah I see
(02:25:17 PM) dwins: KevinIPS: btw, I haven't seen that error before
(02:25:19 PM) aaime: for points the algorithm does work
(02:25:26 PM) aaime: it's for lines and polygons that it does not
(02:25:28 PM) KevinIPS: oh...damn
(02:25:40 PM) dwins: okay... so do we sort by geometry all the time?
(02:26:03 PM) dwins: so that those features always come up first?
(02:26:13 PM) aaime: well, it's not only about size, but also about location
(02:26:18 PM) dwins: (what if we have lots of big features and that doesn't work?)
(02:26:33 PM) aaime: the geom may be smaller than a sub-tile, but located at the borders of them
(02:26:57 PM) dwins: no cheap tricks then
(02:27:01 PM) aaime: it seems to me you cannot do a lazy loading
(02:27:39 PM) arneke: wait, what's the issue about location ?
(02:28:06 PM) aaime: you have to make a full tree build in one go
(02:28:06 PM) aaime: and try to make it as fast as possible
(02:28:06 PM) vhamer [n=vhamer@pool-162-83-230-190.ny5030.east.verizon.net] entered the room.
(02:28:06 PM) aaime: arneke, if that dataset if 4gb (compressed?) I can download it overnight
(02:28:09 PM) aaime: link?
(02:28:21 PM) aaime: consider a polygon that's smaller than each of the 4 sub tiles
(02:28:25 PM) dwins: so really we can't just do it halfway
(02:28:29 PM) aaime: but it happes to sit exactly at the center
(02:28:43 PM) aaime: so that it overlaps partly with all of them
(02:28:45 PM) arneke: aaime: need to dump, will do it now
(02:29:01 PM) aaime: you have to put it in the current node as opposed as in teh childs no?
(02:29:14 PM) dwins: currently we don't do that
(02:29:38 PM) arneke: no, we just do it based on the centroid
(02:30:33 PM) aaime: oh, then you have issues when panning around
(02:30:33 PM) aaime: like polygons that should be there, but they are not
(02:30:33 PM) aaime: because they ended up in a tile that only partially contains them?
(02:30:33 PM) aaime: (and they show up only when the tile that actually contains them shows up?)
(02:30:33 PM) dwinslow [n=david@64.90.184.79] entered the room.
(02:30:49 PM) aaime: so you have panning issues all right
(02:31:05 PM) dwinslow: other solutions for that issue were to split up the geometry (eek) or to push it up
(02:31:14 PM) dwins left the room (quit: Nick collision from services.).
(02:31:18 PM) aaime: yep, I'd go for pushing it up?
(02:31:35 PM) dwinslow: that seems like the easiest way to make it look right
(02:31:38 PM) dwinslow is now known as dwins
(02:31:52 PM) aaime: yeah, it's just that it means
(02:31:53 PM) arneke: I disagree
(02:33:20 PM) aaime: arneke, explain
(02:33:20 PM) dwins: although in practice the centroid technique seems to be working pretty well
(02:33:20 PM) aaime: ok, let me make and example
(02:33:20 PM) Kred left the room (quit: "ta ta!").
(02:33:20 PM) aaime: you are making a regionation based on wealth
(02:33:20 PM) aaime: so richest countries fill the top nodes
(02:33:20 PM) aaime: Russia is very big, but not very rich, so it ends up in a lower level node
(02:33:32 PM) aaime: you end up not showing it at all because it's poor until you get very close to its centroid?
(02:34:04 PM) aaime: did I explain myself?
(02:34:18 PM) dwins: aaime: wireless here has gone byebye
(02:34:29 PM) dwins: let me go see if arne is still on irc
(02:35:27 PM) arneke: sorry, cholmes here
(02:37:26 PM) vheurteaux left the room (quit: ).
(02:37:29 PM) arneke: aaime: makes sense, but you will also find examples where there is data piled up along the tile splits
(02:37:46 PM) arneke: more than you can fit into the super tile
(02:38:16 PM) dwins: in which case the only thing to do to make it look right is to include partial geometries in each tile
(02:38:20 PM) dwins: which is complicated
(02:38:20 PM) aaime: ok, yet that means you'll have some visualization issues
(02:38:23 PM) arneke: it's sort of like the metatiling issue, at some point you have to compromise
(02:38:32 PM) arneke: the boundaries have to go somewhere
(02:39:09 PM) arneke: a lot of the data i've been given doesnt have that kind of importance
(02:39:32 PM) arneke: like, for big datasets, quite often there's nothing sensible to regionate on
(02:39:50 PM) aaime: well, in that case, what do you do?
(02:40:06 PM) arneke: random is better actually
(02:40:12 PM) arneke: otherwise, if the data comes out lumped
(02:40:18 PM) aaime: for line and polygon size is a good one
(02:40:22 PM) arneke: the tiles fill up along the edges
(02:40:27 PM) aaime: but for point yes, random is probably best
(02:40:48 PM) arneke: yeah, but if you sort by polygon size
(02:40:50 PM) iwillig [n=iwillig@cpe-72-224-140-11.maine.res.rr.com] entered the room.
(02:40:56 PM) arneke: then they'll bubble up as far as they can go anyway
(02:41:13 PM) arneke: and even then
(02:41:24 PM) arneke: the polygons are usually fairly small om a world scale
(02:41:36 PM) arneke: so you have to put dots to be able to see it at all
(02:41:43 PM) arneke: unless it's a political map or something
(02:42:15 PM) aaime: sorry, if the dataset covers only a county you still start from the "world"?
(02:42:42 PM) aaime: wouldn't it be better to start from the normalized tile that does contain the data bounds?
(02:43:12 PM) arneke: no, we do start from normalized bounds
(02:43:18 PM) arneke: ie the biggest tile that contains all features
(02:43:29 PM) arneke: i chose my words poorly
(02:43:35 PM) arneke: but even for forest stewardship
(02:43:48 PM) arneke: the polygons are small when you look at a country
(02:43:55 PM) aaime: Yeah, I get it
(02:44:48 PM) arneke: crap.. our internet is mucking around,,
(02:45:01 PM) arneke: can't get to the database , because no new TCP connections go through the router
(02:45:24 PM) vhamer left the room (quit: ).
(02:45:24 PM) aaime: yet it still seems to me geometry size ordering make sense for quite a bit of data sets?
(02:45:24 PM) aaime: (if you don't have anything else to use?)
(02:45:24 PM) dwins: no doubt
(02:45:30 PM) dwins: definitely better than random
(02:45:32 PM) dwins: ugh
(02:45:41 PM) aaime: random is good at least for points
(02:45:46 PM) iwillig left the room (quit: Read error: 60 (Operation timed out)).
(02:46:17 PM) dwins: maybe we should have it automatically do a random ordering when you ask to sort by a geometry and the type is point?
(02:46:34 PM) dwins: and then the 'geometry' strategy will always be a pretty decent guess
(02:46:47 PM) arneke: that sounds good
(02:47:09 PM) arneke: i dunno whether we take a performance hit for that?
(02:47:20 PM) arneke: but it looks like we can do it lazily, right ?
(02:48:01 PM) aaime: I'd prefer to return a service exception suggesting to use random directly
(02:48:01 PM) aaime: so that it's clear you can use it for lines and polygons as well if you want to
(02:48:01 PM) aaime: What I actually don't like much is that you ask for something that does not make sense and the software does something other than you asked for
(02:48:01 PM) cholmes: (nice talk about kml stuff - if nothing else do copy this chat and post it on a page in the kml RnD section, so there's a record. Or write up what you talk about)
(02:48:06 PM) dwins: well, sorting by a point geometry doesn't really merit an exception
(02:48:12 PM) dwins: right?
(02:48:26 PM) aaime: As a user I'd prefer an exception telling me to use random
(02:48:34 PM) aaime: at least I know what's happening
(02:48:42 PM) aaime: (control freak here)
(02:49:12 PM) arneke: i'm fine with either
(02:49:26 PM) arneke: beats making a random column
(02:49:30 PM) arneke: like I've done
(02:49:38 PM) dwins: eew
(02:49:48 PM) aaime: Anyways, tomorrow I'll start implementing the algorithm you proposed, and if a geometry does not fit in the current node, I just throw it in a random child right?
(02:50:02 PM) aaime: (well using the centroid that will happen anyways)
(02:50:09 PM) dwins: we've been throwing it in the child containing the centroid
(02:50:12 PM) arneke: yeah, not random ?
(02:50:23 PM) aaime: ok ok, sorry!
(02:50:27 PM) arneke: i'd rather you throw an exception, im a bit of a control freak
(02:50:32 PM) arneke: j/k
(02:50:45 PM) aaime: (I've been looking at this stuff for only 3 hours)
(02:50:58 PM) dwins: no worries
(02:51:43 PM) aaime: I really have to go now, it's 20.50 here
(02:51:54 PM) aaime: we'll chat again about this tomorrow?
(02:51:58 PM) arneke: ok, will email you a link in 20 minutes with dumps
(02:51:59 PM) dwins: of course
(02:52:17 PM) dwins: i'm making some notes on the wiki to clarify about that pseudocode as well
(02:52:27 PM) aaime: (in the meantine I'll start implementing the algorithm, if we decide to change it, I'll refactor it)
(02:52:47 PM) aaime: ok, I'll come back to read the mail after dinner
(02:52:49 PM) aaime: see you!

Document generated by Confluence on May 14, 2014 23:00