Tuesday, July 31. 2007
I am happy to report that the trick of encasing a postgresql pgsql function within a postgresql sql function was a big success.
Now for the Recap of the problems with using plain pgsql and why we went down the postgresql sql function encasement approach.
Recap: Plain old pgsql
Building on our expandoverlap_metric function, we add a new function and a new type.
CREATE TYPE pgis_nn AS
(gid integer, dist numeric(12,5));
CREATE OR REPLACE FUNCTION _pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))
RETURNS SETOF pgis_nn AS
$BODY$
DECLARE
strsql text;
rec pgis_nn;
ncollected integer;
it integer;
--NOTE: it: the iteration we are currently at
--start at the bounding box of the object (expand 0) and move up until it has collected more objects than we need or it = maxslices whichever event happens first
BEGIN
ncollected := 0; it := 0;
WHILE ncollected < numnn AND it <= maxslices LOOP
strsql := 'SELECT currentit.' || sgid2field || ', distance(ref.geom, currentit.' || sgeom2field || ') as dist FROM ' || lookupset || ' as currentit, (SELECT geometry(''' || CAST(geom1 As text) || ''') As geom) As ref WHERE ' || swhere || ' AND expand(ref.geom, ' || CAST(distguess*it/maxslices As varchar(100)) || ') && currentit.the_geom AND expandoverlap_metric(ref.geom, currentit.the_geom, ' || CAST(distguess As varchar(200)) || ', ' || CAST(maxslices As varchar(200)) || ') = ' || CAST(it As varchar(100)) || ' ORDER BY distance(ref.geom, currentit.the_geom) LIMIT ' || CAST((numnn - ncollected) As varchar(200));
--RAISE NOTICE 'sql: %', strsql;
FOR rec in EXECUTE (strsql) LOOP
IF ncollected < numnn THEN
ncollected := ncollected + 1;
RETURN NEXT rec;
ELSE
EXIT;
END IF;
END LOOP;
it := it + 1;
END LOOP;
END
$BODY$
LANGUAGE 'plpgsql' STABLE;
Plain old pgsql is tested and fails
/**Problem calculate the 5 nearest neighbors in table testpolys for the first 500 records **/
/**uncoated: pgsql -
fails with error ERROR: set-valued function called in context that cannot accept a set
SQL state: 0A000
Context: PL/pgSQL function "_pgis_fn_nn" line 16 at return next **/
SELECT g1.gid as gid_ref, (_pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'true', 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
Recap 2: PgSQL function wrapped in an sql function blanket
CREATE OR REPLACE FUNCTION pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))
RETURNS SETOF pgis_nn AS
$BODY$
SELECT * FROM _pgis_fn_nn($1,$2, $3, $4, $5, $6, $7, $8);
$BODY$
LANGUAGE 'sql' STABLE;
This time it works!
These tests were done on a slower unoptimized machine than my prior tests so all the numbers are higher.
This was done on a windows XP 2.66GHZ single processor, 2 GIG ram.
/**coated: pgsql - IT WORKS! and is faster than all other prior solutions**/
--Total query runtime: Total query runtime: 8075 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref, (pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(g1.gid As varchar(30)), 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 15401 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 44063 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 slices
SELECT g1.gid as gid_ref , (fntestpolys_nn_better(g1.gid, g1.the_geom, 10000, 5,100)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: Total query runtime: 109072 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors
SELECT g1.gid as gid_ref, (fntestpolys_nn(g1.gid, g1.the_geom, 10000, 5)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
/**Problem calculate the 5 nearest neighbors in table testpolys for all records in testpolys (3000 records) **/
--Total query runtime: 124411 ms. 15000 rows retrieved.
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
--Total query runtime: Total query runtime: 104105 ms. 15000 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref, (pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(g1.gid As varchar(30)), 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
Use Cases
Now it should be easy to answer all sorts of nearest neighbor questions with a simple SQL statement. Here is a simple one with our test data
--Grab additional attributes (in this case geometry) from the nearst neighbor
--Total query runtime: 11012 ms. 2500 rows retrieved.
SELECT g1.gid as gid_ref, g1.nn_dist, g1.nn_gid, g2.the_geom
FROM (SELECT tp.*, (pgis_fn_nn(tp.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(tp.gid As varchar(30)), 'gid', 'the_geom')).gid As nn_gid,
(pgis_fn_nn(tp.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(tp.gid As varchar(30)), 'gid', 'the_geom')).dist As nn_dist
FROM (SELECT * FROM testpolys ORDER BY gid LIMIT 500) tp) g1, testpolys g2
WHERE g2.gid = g1.nn_gid
ORDER BY g1.gid, nn_dist;
Here is a real live use case one, but I haven't tested it out. I'm hoping it will run and be decently fast.
/**All buildings where the nearest fire station is greater than 3 miles away (note my geometries are in meters) and give the name of the firestation
and sort by the building whose closest station is furthest away**/
/**Data sets: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm, firestations: http://www.mass.gov/mgis/firestations.htm **/
SELECT g1.gid as gid_ref, f.name, g1.nn_dist, g1.nn_gid
FROM (SELECT b.gid,
(pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).gid as nn_gid,
(pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).dist as nn_dist
FROM (SELECT * FROM buildings) b) As g1, fire_stations f
WHERE f.gid = g1.nn_gid AND g1.nn_dist > 1609.344*3
ORDER BY g1.nn_dist DESC
I'll publish a tutorial and some more use cases on the main BostonGIS site next. Hopefully the timings on these tests will be decent.
UPDATE: I have published this to main BostonGIS site. The article can be found here PostGIS Nearest Neighbor: A Generic Solution. I also changed some of the type definition slightly so I don't need to alias fields so much.
Monday, July 30. 2007
In our last attempt we used a recursive SQL function to do a nearest neighbor search in PostgreSQL. The problem with that solution is that it requires us to create a helper function for each dataset we want to scan for nearest neighbors.
The reason I chose this instead of a PLPGSQL solution was the following:
- Easier to prototype a solution in SQL than PLPGSQL
- PostgreSQL planner doesn't allow set returning PLPGSQL functions in the select part of a statement (has to be in the from) and currently the only way to return sets of records from a function where the function parameters dynamically depend on inputs from a from clause is putting it in the select. I described my annoyance with this in Solving the Nearest Neighbor Problem in PostGIS
- SQL functions are basically treated like inline functions for all intensive purposes which means whats going on inside the function is pretty transparent to the planner and the planner can use that information to optimize. With any other language, the planner will pretty much treat the function as a blackbox and only utilize the volatile, stable, immutable predicates to determine if it can use a cached answer.
Now what can PLPGSQL or any other PL language in PostgreSQL provide us that our trusty SQL can't? It allows us to write dynamic SQL statements and allows us procedural flow control which is very important in order to create an all purpose solution. If dynamic SQL statements and procedural flow were possible in the regular SQL language function, I would not have had to resort to writing a recursive function. So the problem now is to try to trick the planner into allowing us to use a PLPGSQL set returning argument dependency in the select part.
Originally I had thought that having the planner have visibility into the function would be useful, but since all the difficult sql processing is
happening within the confines of the function, I don't think it helps and actually could be harmful.
So my plan is to use PLPGSQL for the processing and try to compensate for PGSQL's incompatibilites by using the Coated pill AKA Trojan Horse approach.
If the above doesn't work, I'll have to resort to using a bound C function which is a bit out of the scope of my knowledge at this point.
Friday, July 27. 2007
Many a child has looked at a dog chasing its tail and wondered "What is the point of this game? and why do I not have a tail to chase?"
After much time pondering these intriguing questions of life, I have concluded that a dog chases its tail in a hope to catch it.
Once caught, the game then becomes "Can I catch my tail faster than I did before".
In this excerpt, I shall demonstrate the relational database version (PostgreSQL specifically) of the dog chasing its tail.
I call it this because it stirs up such vivid imagery and I have a great fondness for physical imagery that transforms seemingly stale abstract concepts into vibrant creatures.
I suppose that is also a statement of how my mind works: a stream of seemingly unrelated allegories strung together in a bizarre but interesting orchestra.
Getting back to our nearest neighbor problem. We know that expand uses indexes and distance does not. If only we could guess at the smallest expanse that encases
our k neighbors for each reference geometry, we would be in much better shape. We have the same problem as the dog. The faster the dog runs the faster his tail runs away from him. In our case, the larger we make our box, the more likely we would be to trap our neighbors, but then also the more difficult it would be to catalog all those distances of trapped objects.
So how do we guess at this magical expand box? We keep on looking back at the path we have traveled so far and if we have collected enough objects and then stop. We must also observe that our problem can be thought of as a sequence of rhetorical questions that are all asking "Are we there yet?" and that we can solve our problem by simply asking the same question over and over again "Are we there yet? Do we have enough k yet? Until the answer switches from no to YES!
For this particular approach, I am using an sql recursive function that will for each iteration calculate the difference between the current expand box and the previous and then continue with the next expand box if the sum of all these differences collected is still less than our target. Since each nested box is only returning the differential of the previous, it is already pretty much sorted by distance and even more sorted the more slices you slice and dice your problem into. So the cost is the effort to make slices vs. the cost to do distance against a larger slice. We stop calculating at the biggest box that has more neighbors than we need but where the previous expand box had fewer neighbors than we need. This is the solution that I was really hoping the PostgreSQL planner would reduce to with my previous exercises, but sadly didn't.
For this particular approach, we will use the expandoverlap_metric function we created in Battle for the Nearest Neighbor A small success.
Below is what the code looks like for this new function:
CREATE OR REPLACE FUNCTION fntestpolys_nn_evenbetter(gid integer, geom1 geometry, distguess double precision, numnn integer, maxslices integer, it integer)
RETURNS SETOF testpolys AS
$BODY$
--NOTE: it: the iteration we are currently at
--start at the bounding box of the object (expand 0) and move up until it has collected more objects than we need or it = maxslices whichever event ha
ppens first
SELECT thefinalit.* FROM
(SELECT currentit.*
FROM testpolys As currentit
WHERE $1 <> currentit.gid
AND expand($2,$3/$5*$6) && currentit.the_geom
AND expandoverlap_metric($2,currentit.the_geom, $3,$5) = $6
UNION
SELECT nextit.*
FROM fntestpolys_nn_evenbetter($1, $2, $3, $4, $5, ($6 + 1)) nextit
WHERE $1 <> nextit.gid AND $6 < $5 AND
(SELECT COUNT(tp3.gid)
FROM testpolys As tp3
WHERE $1 <> tp3.gid
AND expand($2,$3/$5*$6) && tp3.the_geom) < $4
) thefinalit
ORDER BY distance(thefinalit.the_geom, $2) LIMIT $4;
$BODY$
LANGUAGE 'sql' STABLE;
Here are some test results. The test data used is that which we produced in the Battle for the Nearest Neighbor A small success.
/**Problem calculate the 5 nearest neighbors in table testpolys for the first 500 records **/
--Total query runtime: 9156 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 44063 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 slices
SELECT g1.gid as gid_ref , (fntestpolys_nn_better(g1.gid, g1.the_geom, 10000, 5,100)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 98516 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors
SELECT g1.gid as gid_ref, (fntestpolys_nn(g1.gid, g1.the_geom, 10000, 5)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
This solution is on the order of 10 times faster than the original solution and on superficial inspection, seems to come up with the same answer. Better yet to handle our full 3000 simple polygon it still performs faster than the original when asked to calculate for the whole batch of 3000 simple polygons.
/**Problem calculate the 5 nearest neighbors in table testpolys for all records in testpolys (3000 records) **/
--Total query runtime: 95079 ms, 15000 rows retrieved.
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
I guess this final solution really doesn't look much like a dog chasing its tail. It actually looks more like a game of tag to me.
I'm really sorry I lead people astray with this dog chasing tail promise only to reduce it to a game of Who is it?.
Note this doesn't deal with ties so if you have 2 neighbors of equal distance away from your reference object at the tail end, then one may get left out.
|