Add grouped-linemerge generalization#2482
Conversation
|
From a first quick glimpse this looks really interesting. I'll need a while to digest this though. Can you look at the build failures and fix them? |
|
That's great to hear that it's interesting!!! I've fixed the clang-tidy, and I've fixed a performance mistake. Here are some numbers for what this does, and how long it takes, over downtown Los Angeles. Note that in practice, we would never expire an entire tile at Z13. More likely Z18 or Z19. I'm just showing it for clarity. Note that the most relevant column is probably As we get to higher zooms, the reason why
|
|
If I understand this correctly you tested this on Los Angeles, did you also run tests on the full planet (or larger areas)? From my experience it is often the case that something works fine on a small dataset because basically the whole dataset fits into memory, but performance goes dramatically down if you run it on the full planet, even though each individual query only looks at local data. Generally the performance requirements are that we need to be able to keep up with minutely changes on a reasonable machine. And ther still needs to be enough capacity left so that we can do the actual querying of the database for rendering. That's rather vague, I know. |
|
I have downloaded California, New York, New Jersey, Texas, Poland, Netherlands, Sweden, some of Japan, and some of England. It's larger than what fits into my ram. I think it's reasonable to compare "how long does it take to get all lines && the updated tile" with "how long is the rest of the incremental update"? I will look into trying to measure performance "cold" with nothing cached in memory. If performance is a problem, I know for sure I could make it much faster if the generalization could receive the exact ways that were edited, like carto's flex lua code, rather than Z/X/Y tiles. That way the slow part (the &&tile_box) can be skipped entirely, and the remaining steps would only walk the network of the actually edited way(s). |
|
Here is an updated table. I restarted postgres between each run, and I used drop_caches as well. It is several times slower which is probably what you expected? The last paragraph of the last comment ^ still applies. I did some rough tests and I can get the total time to maybe 80ms cold, 40ms warm for a single way (recursion depth was a few dozen, I picked a long way. and zoom level is now irrelevant) if the generalization could know the exact endpoints of the way that got expired (way_add, modify, delete).
|
|
But surely this can be afforded? Doesn't osm.org have on the order of ~1 change per second? And much of this is selects that are latency-bound not bandwidth-bound so given multithreading this should be basically fine I'd hope |
You don't need to base this on tiles, you can base this on the ways that changed! For instance, you can have a separate table where you store (the "way" processing function) all the ids of ways that changed, then use that table as input for the rest of the processing in the "gen" step. But it is not as easy as one would think to implement this. You have to correctly handle deleted ways and all the different kinds of change, including cases like this: Say you have two ways that are not connected, now a third way is changed which makes it connect to both existing and unchanged ways allowing all of them to be merged.
All of this is I/O bound so multithreading is not going to help. |
|
Okay, I will try to figure out how to get a table of ways that changed 👍 For the case of a new way that connects existing ways, yes that's the rub. Your case can already happen currently, when a way is itself larger than the tile. It's actually even worse than that, because ST_LineMerge will stop merging when the degree of a node goes from 2 to 3, so adding a way can actually split a road geometry. There were broadly two ways to do this: 1. write extremely complicated recursive CTE queries to try to fix up exactly the "edit" and no further (in your case, it would just connect those two geometries without recursively exploring them on the input table side) or 2. explore the entire geometry, delete the entire geometry, re-ST_LineMerge it. I picked the latter in this PR. I actually think multithreading would help specifically because of how the recursive CTE requires many serial lookups, that is I/O bound but there is a difference between being bound by I/O latency for dependent/sequential random lookup versus being bound by I/O throughput (whether the lookups are sequential or random). So if you were to have several tiles that are all expired, you would probably get a speedup with each thread expiring one of them, versus one thread expiring all sequentially (note: this PR doesn't quite do that, it uses joins to get postgres to do many tiles "at once" internally). But my point on multithreading was not actually about two threads that are both doing expiring, rather about render threads not being blocked by expiring threads, because the expiring threads (and the render threads) are doing dependent lookups through the GiST structure with inherent latency in the processor-side logic from one read to the next. In any event I will think about this more leijurv/openstreetmap-carto@b79c0fa This is how I had it, but I think I'll need to use a totally different method (not tile based at all). My initial hope is that whenever a way is added, updated, or deleted, I can simply add the exact (x,y) of both endpoints to a table of geometries to be updated, then the generalization can re-explore the geometry from there? That would be the happy / easy path, but I am probably missing something. |
lonvia
left a comment
There was a problem hiding this comment.
The mix of tile expiry and recursive query is interesting. I actually think that the solution is simpler because of your use of tile-based expiry instead of way-based. I did an implementation with way-based expiry for waymarkedtrails and only got it to work by tracking the IDs of the source ways for each merged way. That would make this implementation significantly more complex.
ST_LineMerge has a few quirks which might lead to unexpected results but that is something that can be easily addressed later.
Performance stands an falls with the complexity of the merged ways, which in turn depends on the source table and the grouping. I'd like to run the import query on a planet to see how long it takes. That should give us a ballpark figure how expensive merging is.
| AND ST_Y(ST_StartPoint(d."{geom_column}")) = n.y) | ||
| OR (ST_X(ST_EndPoint(d."{geom_column}")) = n.x | ||
| AND ST_Y(ST_EndPoint(d."{geom_column}")) = n.y) ) | ||
| )"); |
There was a problem hiding this comment.
Is this delete really necessary? Are there any cases of updating that wouldn't be covered by the DELETE below in line335?
There was a problem hiding this comment.
Yes because the recursive CTE scan, that we insert in Step 5, includes fragments that are entirely outside the current tile.
Consider the following four ways, that will LineMerge from A,B,C,D to AB,C,D
| /
| C
| /
--A--•--B--•
| \
| D
| \
^ tile boundary
The line 335 delete will only delete the AB geometry. The line 325 delete is what deletes C, D, which is necessary, because Step 5 will reinsert all AB,C,D.
And conversely, the reason why 335 is needed as well as 325, is much easier to justify: when you delete a way that is entirely within this tile, line 335 deletes it from the linemerged table too.
There was a problem hiding this comment.
The important question here is what exactly has changed. If A has changed its tags or geometry then both tiles right and left of the boundary will be expired and in your set or regions to recompute. While your CTE recurses outside the expiry tile area, it does not recurse to a merged way that doesn't intersect with a expired tile.
I might be wrong of course, which is why we are going to need lots and lots of tests for the update case, once the implementation has settled down a bit.
There was a problem hiding this comment.
In this first example, A has not changed. But we don't know that: imagine this tile expired for some unrelated reason. The tile to the right has not expired.
There was a problem hiding this comment.
The point is that if all I know is "something happened in this tile", I will explore every network of ways, some of which have changed, most of which haven't, but I don't know the difference.
| R"(CREATE INDEX IF NOT EXISTS "{idx_endpt}" ON {src} USING btree)" | ||
| R"( (ST_X(ST_EndPoint("{geom_column}")),)" | ||
| R"( ST_Y(ST_EndPoint("{geom_column}"))) {index_predicate})"); | ||
| } |
There was a problem hiding this comment.
I understand your reasoning for an extra index here but I think this still could be a geometry index on the points, i.e.
CREATE INDEX ... USING gist(ST_StartPoint(geometry))
The you can use ST_Intersect on the start and end points below making the code a lot more readable.
There was a problem hiding this comment.
Yes you are correct. GiST(ST_StartPoint(geometry)) is workable to replace btree(ST_X(ST_StartPoint(geometry)), ST_Y(ST_StartPoint(geometry))), however the btree is significantly faster. The reason is that we are only ever doing exact lookups with ==, so the GiST overhead to support range lookups && is unnecessary.
I do not have any concrete numbers for this use case, but back when I was doing a very similar recursive CTE here openstreetmap-carto/openstreetmap-carto#951 (comment) I found that the GiST made the recursive CTE overall about 2 or 3 times slower than the btree
I am happy to go either way but first I think I need a more up-to-date performance answer in tradeoff against what I completely agree is a loss to readability.
In order to get a real performance answer I think I will need to set up a whole planet osm in Postgres, and apply real osmchanges, which I have been putting off but might do this weekend
There was a problem hiding this comment.
Ah, I thought the main performance loss was because of the relatively large bboxes for ways. A gist of points wouldn't have that problem. But I wouldn't be surprised if gist is slower overall.
Is it possible to put a ST_Point in a btree and get equality comparison? That would give you the best of both worlds.
There was a problem hiding this comment.
Is it possible to put a ST_Point in a btree and get equality comparison?
Yes. I ran a quick test just now. Using the same data as this comment #2482 (comment) where I expire a Z13-size area (this tile) and explore 4430 total endpoints (some of those roads extend quite far)
It's interesting that hot vs cold doesn't matter much for GiST - my interpretation is that the btree index is more I/O-bound, while the GiST index does more nontrivial CPU work while traversing to its goal. The speedup from cold to warm is similar though, in total time (not relative time). So I would suspect that the total number of disk reads is about the same between btree and gist, it's just that gist is doing way more computations along the way.
| Variant | cold (median) | cold (min) | warm | warm per-probe (×4430 loops) | index disk space |
|---|---|---|---|---|---|
| A — (x,y) btree | ~247 ms | 235 ms | 67 ms | 0.002 ms | 364MB |
| B — point btree | ~282 ms | 229 ms | 110 ms | 0.007 ms | 576MB |
| C — point gist | ~1070 ms | 1033 ms | 928 ms | 0.09 ms | 479MB |
I have committed b0dd099
It is an improvement to readability, a bit slower to walk, but surprisingly a bit faster to delete the old geometries, compare to #2482 (comment)
| expired tile | region | walk | collect | delete | insert | total |
|---|---|---|---|---|---|---|
| z13 | 71 | 303 | 69 | 155 | 90 | 688 ms |
| z14 | 70 | 158 | 33 | 72 | 36 | 368 ms |
| z15 | 74 | 71 | 16 | 40 | 14 | 215 ms |
| z16 | 73 | 73 | 8 | 32 | 12 | 197 ms |
| z17 | 84 | 66 | 11 | 27 | 16 | 203 ms |
| z18 | 71 | 52 | 7 | 24 | 7 | 161 ms |
| z19 | 82 | 71 | 16 | 36 | 13 | 218 ms |
| z20 | 71 | 59 | 6 | 29 | 7 | 172 ms |
| AND ST_Y(ST_StartPoint(l."{geom_column}")) = n.y) | ||
| OR (ST_X(ST_EndPoint(l."{geom_column}")) = n.x | ||
| AND ST_Y(ST_EndPoint(l."{geom_column}")) = n.y) ) | ||
| AND {where} |
There was a problem hiding this comment.
One way to make this recursive CTE cheaper is to join here only with nodes added in the last recursion. You need to be very, very careful though not to create endless loops when doing that. So maybe an optimization for later rather than now.
There was a problem hiding this comment.
I think this is true already, I've looked at it in EXPLAIN ANALYZE and WorkTable Scan on nodes n (rows=43 loops=319) shows each iteration joins only ~43 frontier rows, not the 13613 accumulated. If it were re-joining the whole set, that scan would show thousands of rows. It works because I am doing WITH RECURSIVE ... UNION rather than WITH RECURSIVE ... UNION ALL.
|
(I have been focusing more on ST_LineMerge at render time (see here: openstreetmap-carto/openstreetmap-carto#951 (comment)) but I am still interested in this PR as an alternative possible path)
I admit I do not know how osm2pgsql works and clearly you know more than me But this is my plan: Right now we find deleted ways, updated ways, and created ways, by "rounding up" to a Z/X/Y tile. i.e. we get a "low resolution picture" of where edits have happened. Instead, I'm imagining saving in the exact same way, at the exact same time, expiry data that is simply more precise: the exact |
|
The tricky part with updates here is that your algorithm runs while the tables are in a in-between state: the source table has already all the updated tags and geometries. Deleted ways are gone from the source table without a trace. The destination table, on the other hand, has still the state from before the update was applied. So a simple endpoint matching won't really work. You need to track former way states and deleted ways as well. And that is somewhat tricky with osm2pgsql. The tile expiry kindly solves the problem for you because it does the expiry always on the old and the new geometry. Thus, the expired region cover where ways were before and after the update, nicely side-stepping the issue at the price of invalidating a bit more than necessary. |
|
I did a quick test on the planet with a table of named highway using highway type, name and ref as grouping criterion. The source table has 114_456_912 rows. Computing the destination with grouping and line merge takes 9min30s and produces a table with 35_462_494 rows. Utterly feasible. |
Wait sorry, I don't mean to be difficult, but just to clarify: Yes, by the time the generalization runs, a deleted way is completely gone from I thought this was possible because joto said above: Pseudocode: def way_created(way):
INSERT INTO glm_expired_points(pt) VALUES (ST_StartPoint(way));
INSERT INTO glm_expired_points(pt) VALUES (ST_EndPoint(way));
def way_updated(old_way, new_way): # not sure how this works
INSERT INTO glm_expired_points(pt) VALUES (ST_StartPoint(old_way));
INSERT INTO glm_expired_points(pt) VALUES (ST_EndPoint(old_way));
INSERT INTO glm_expired_points(pt) VALUES (ST_StartPoint(new_way));
INSERT INTO glm_expired_points(pt) VALUES (ST_EndPoint(new_way));
def way_deleted(way):
INSERT INTO glm_expired_points(pt) VALUES (ST_StartPoint(way));
INSERT INTO glm_expired_points(pt) VALUES (ST_EndPoint(way));I don't think I need to store the way itself, or delay its deletion/update in any way. I only need to know its endpoints.
Endpoint exact matching would not work because an endpoint of an input way is probably became internal point of an output way. Instead, I would have a GiST on the post-linemerge output table. The lookup of "a way was deleted (its linestring is lost), but I do know what its endpoints were" --> "which linemerged geometries are now expired and need to be rescanned" is just
Great! |
|
Okay, I implemented a version that uses exact endpoints, rather than tiles. ddb7de4 Here is an explanation. This generalization needs to know which connected components have been changed, so that it can re-merge only those. Currently in this PR, it gets expired Z/X/Y tiles, and so it must re-walk every road that touches that tile. This is quite slow due to the GiST A few concerns addressed Why not do this on the Lua side? Why change core C++? The flex geometry API doesn't have a method to get the startpoint/endpoint of a way. Could add one. But then still, deleting a way is still a problem, it doesn't call process_way. By the time you call process_deleted_way, the exact node locations could be gone (?). So, doing this in Lua would need new accessors and a new deleted geometry hook. It is just a less natural way to hook into deleted way endpoints. Why not expire only tiles at the endpoints of the way, not the middle? If I edit a long way, maybe we could expire just exactly two tiny tiles, like Z20, one at each endpoint? This is still not perfectly quantized, I cannot look up the way with Why add this code to expire output rather than a totally new system? It is true that endpoints are not the same as tiles but they are triggered by the same event on the same column. It is able to reuse quite a lot of code while staying backwards compatible. Summary: Performance? I've printed several tables above, but the short version is that there was a noise floor around 150ms where even for a single node update where the expiry is tiny I can't get it faster because of all the time spent on all the unrelated ways and slow GiST lookups. With this version, picking single random ways in the same tile (this one), it takes an average of ~68ms to fully reexplore the network. Note that this is a dense road network so the recursion depth here ranged from 1 to 51, and the total ways looked up ranged from 2 to 74. And this parallelizes quite well, if I go from 1 point to 10 points, it only gets about 2x slower not 10x (it goes to ~125ms). If I 10x it again, to 100 random points, it goes to average ~407ms. So osmchanges with many edited nodes are not a problem even if they all are touching dense road networks. And note every measurement in this paragraph was completely cold - I restart Postgres between every query, and I flush my filesystem cache too. If I rerun it warm, it's almost 3x faster, to average ~26ms for 1 point. I'd be perfectly happy to revert this though, if it's judged to not be worth it. |
Summary
This PR adds a new generalization strategy,
grouped-linemerge, toosm2pgsql-gen. It maintains a table of merged linestrings, the equivalent of:The key is that this table is global, rather than local to your tile. It is kept up to date incrementally as the source data changes (and it never re-scans the entire planet to do so).
It works similar to the existing generalizations
riversandvector-union. You configure it from your flex style likeosm2pgsql.run_gen('grouped-linemerge', {...}), and you runosm2pgsql-genthe same way (after import, and after update).Motivation
The motivation is merging ways in OSM-Carto with the same rendering. Here's the relevant issue: openstreetmap-carto/openstreetmap-carto#951
OSM ways are often extremely fragmented, only going a block or two at a time in urban areas, due to relations and tagging changes. This significantly impairs rendering (see images near the bottom of that issue). We could LineMerge these ways within each tile as it's rendered, but this makes it really challenging to have consistent geometries for long ways that span metatile boundaries. (see openstreetmap-carto/openstreetmap-carto#5229 and openstreetmap-carto/openstreetmap-carto@master...leijurv:openstreetmap-carto:recursive-cte)
This PR allows us to have a global LineMerge for all ways, grouped only by the colunms that actually affect rendering (like
name,ref,highway,layer, etc). There are no tile-boundary artifacts, and it can be maintained incrementally quite efficiently.How it works
Initial build
The initial build is very simple - we just run that exact query to do a grouped
ST_LineMergeacross the whole world. This takes a few minutes, because the number of ways in each group is often not so large. We also make an index on the start and end points of each way, this is not strictly required but it significantly helps performance of the incremental update.Incremental updates
Generalizations are able to receive the expire list of tiles whose geometry has changed. The expire list includes tiles where something was deleted, changed, or added. One possible approach here would be to select all roads in the entire planet that match any grouping key that was touched, and re-LineMerge them all. I ruled this out because some road names are very frequent. Imagine if moving a node on "Main Street" in some town, resulted in re-scanning every road named "Main Street" in the entire world. I assumed this was not an option from a performance perspective. Instead, I used a recursive CTE to spider through the geometry from the expired tile. Each changed area becomes a seed from which it walks outwards. To bound this walk, it must match an exact endpoint, and it must match every element of the grouping columns. This finds the full connected component, which replaces the old connected component geometry.
Usage
Then:
Details
I believe this works in general for incremental updates, but the recursive CTE was challenging to get correct and performant. The reason for the new index is so that we can look up ways by their exact endpoint match. Without the new index, we'd use the existing GiST to lookup the way, which means that even if we ask PostGIS for
ST_Intersector some such, it will actually mechanically do&&on the way's entire bbox. This works, to be clear, but it means that if I want to find a way that ends on an exact point, GiST will actually give me all the ways whose entire bbox covers that point, which means less than one percent of the ways that the index gives me are actually eligible. See here openstreetmap-carto/openstreetmap-carto#951 (comment) for discussion about that.Tests
I put in
tests/test-gen-grouped-linemerge.cppa differential fuzz test. It performs hundreds of random connect/disconnects in a small grid of points. I intentionally set it up so the degree of each node could vary from 0 to 4, which stresses howST_LineMergedoes not merge nodes with a degree above 2. At all times, the incrementally updated geometry must exactly match whatST_LineMergewould have done.I have run this locally and it works beautifully. Here's the commit to run it leijurv/openstreetmap-carto@b79c0fa and it does add a bunch of labels (looks like this openstreetmap-carto/openstreetmap-carto#951 (comment))