How to force PostgreSQL to use a pre-calculated value

Saturday, April 18. 2009

This question has come up a number of times in the PostGIS newsgroups, worded in many different ways. The situation is this: if you call a function several times in a query without changing its arguments, PostgreSQL still insists on recalculating the value, even when the function is marked IMMUTABLE. I have tested this on 8.2 and 8.3 with similarly awful results.

This is not much of a problem when function calculations are fast, but spatial functions are pretty slow relative to most other functions you will use, especially when dealing with large geometries. As a result, your query could end up twice as slow. Even setting the cost of these functions relatively high does not help.

To demonstrate, here is a non-PostGIS version of the issue that everyone should be able to run. It shows that this is not a PostGIS-only issue.

CREATE OR REPLACE FUNCTION fn_very_slow(IN param_sleepsecs numeric) RETURNS numeric AS
$$
BEGIN
	PERFORM pg_sleep(param_sleepsecs);
	RETURN param_sleepsecs;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE STRICT ;

--runs in 4524 ms
SELECT fn_very_slow(i*0.5) As firstcall
FROM generate_series(1,5,2) As i;

--runs in 9032 ms - no cache, but with spatial functions (say ST_Distance)
-- where we have tried this, it does sometimes cache and return in 4524 ms
SELECT fn_very_slow(i*0.5) As firstcall,
	fn_very_slow(i*0.5)  As secondcallsame
FROM generate_series(1,5,2) As i;


--runs in 9032 ms - no cache
SELECT firstcall,
	firstcall + 1 As secondcalldifferent
FROM (SELECT fn_very_slow(i*0.5) As firstcall
FROM generate_series(1,5,2) As i
) As foo;

Solution:

I find our solution to this problem kind of ugly, hard to explain, and not ideal. We wrap the costly call in a subquery and put an ORDER BY in the subselect. It doesn't seem to matter what that ORDER BY is. You could do ORDER BY 1 and it works, though if you have a preferred order, you should use that. The ORDER BY seems to trick the planner into materializing the subselect with the costly function, so by the time it reaches the outer query, it treats the costly calculation as a constant.

Watch what happens when we throw in a meaningless ORDER BY clause (here combined with OFFSET 0):

--runs in 4524 ms - caches
SELECT firstcall,
	firstcall + 1 As secondcalldifferent
FROM (SELECT fn_very_slow(i*0.5) As firstcall
FROM generate_series(1,5,2) As i
ORDER BY 1 OFFSET 0) As foo;

Note that if you leave out the ORDER BY, the planner may or may not materialize the subquery, but ORDER BY seems to almost guarantee it.
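
One way to check what the planner actually did, rather than inferring it from timings, is to look at the query plan. The following is a minimal sketch, assuming fn_very_slow from the example above already exists. When the subquery has been kept as a separate step, it typically shows up as its own node in the plan (for example a Subquery Scan on foo) instead of being folded into the outer query.

```sql
-- Assumes fn_very_slow from the example above has already been created.

-- Shows the chosen plan without running the query.
EXPLAIN
SELECT firstcall,
    firstcall + 1 AS secondcalldifferent
FROM (SELECT fn_very_slow(i*0.5) AS firstcall
    FROM generate_series(1,5,2) AS i
    OFFSET 0) AS foo;

-- Runs the query and reports the actual time spent in each plan node.
EXPLAIN ANALYZE
SELECT firstcall,
    firstcall + 1 AS secondcalldifferent
FROM (SELECT fn_very_slow(i*0.5) AS firstcall
    FROM generate_series(1,5,2) AS i
    OFFSET 0) AS foo;
```

Comparing this output with the plan for the same query without OFFSET 0 should show whether the subquery was pulled up into the outer query.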

Note that this is not optimal: for large datasets, you just want the cached result to be reused. You don't want the whole result to be materialized, since you lose the usefulness of indexes.
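
For readers on PostgreSQL 8.4 or later, a common table expression is another way to have the costly subquery computed once, as suggested in the comments. The following is a minimal sketch that reuses fn_very_slow from above, not a tested recommendation. It also materializes the whole result, so the caveat about indexes still applies. The MATERIALIZED keyword is only accepted from PostgreSQL 12 on, where a side-effect-free CTE that is referenced once may otherwise be folded into the outer query. On 8.4 through 11, drop the keyword, since CTEs there are always computed separately.

```sql
-- Assumes fn_very_slow from the example above has already been created.
-- Sketch only: needs PostgreSQL 12+ because of the MATERIALIZED keyword.
-- On 8.4 through 11, write "WITH foo AS (" instead.
WITH foo AS MATERIALIZED (
    SELECT fn_very_slow(i*0.5) AS firstcall
    FROM generate_series(1,5,2) AS i
)
SELECT firstcall,
    firstcall + 1 AS secondcalldifferent
FROM foo;
```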

If anyone has any thoughts on the matter, I would love to hear them, since this is a big cause of frustration when you are trying to run spatial queries that have to return in 3 seconds or less.

Posted by Leo Hsu and Regina Obe in 8.2, 8.3, gis, intermediate, postgis, q&a at 22:33 | Comments (8) | Trackbacks (2)

Comments

It would be straightforward to put a caching function in front of the real function. The caching function would check if there's a precalculated result for the given arguments in the cache, and return the result from the cache if so. Otherwise, run the real function, and put the result in the cache.

#1 Heikki Linnakangas on 2009-04-19 09:39

Heikki,

That's an interesting thought. So are you thinking of creating a function such as

fn_cache_me(somekey,fn_slow_function(args))

and the cache would keep, say, a max of 1000 records and pull one out if it sees a match, or is there an easier way to do this? The users would guarantee that somekey is unique for a given call.

I guess what I'm finding strange is that sometimes it caches and sometimes it doesn't, and I'm not quite sure what controls that.

For example, as one user pointed out, the construct in PostGIS

ST_Distance(a1,b1) As dist1, ST_Distance(a1,b1) As dist2

almost always caches, though for the above example I gave with pg_sleep it doesn't. I'm not sure if that is because ST_Distance is implemented as a C function.

However, ST_Distance(a1,b1) + 1 As dist1, ST_Distance(a1,b1) + 2 As dist2 doesn't seem to cache the first call.

#1.1 Regina on 2009-04-19 11:19

Checking for immutability and repeated calls with the same arguments could be an optimization target. Other factors might include the function cost, which I'm guessing is (or should be) set high for at least some of the PostGIS functions.

In 8.4, you'll be able to use CTEs as a way to materialize whole result sets including such function calls.

#2 David Fetter (Homepage) on 2009-04-19 13:06

David,
I tried IMMUTABLE and also set the function cost high, to about 1000, and that didn't seem to help at all in the tests I have run. It does help, though, with the use of && versus costly intersects, and I think with which index it applies first when btree indexes are options.

#2.1 Regina on 2009-04-19 17:05

Using "OFFSET 0" instead of "ORDER BY 1" has the same effect, without the overhead of sorting. It's still an undesirable trick-the-planner hack, though :(

#3 moltonel on 2009-04-20 06:08

The main use of the caching is for repeated calls over large sets. So the caching is performed across rows!

\timing
select fn_very_slow(1) from generate_series(1,100) as i;
Time: 1002.019 ms

Even though fn_very_slow is called 100 times the overall time is still just about 1s.

select fn_very_slow(1), fn_very_slow(1) from generate_series(1,100) as i;
Time: 2003.386 ms

Here the function is executed twice (although we are selecting from a 100 row table)

Oracle (9i and 10g at least) behaves exactly the same.

#4 Lars on 2009-04-26 00:38

But what about:
select fn_very_slow(1) from thetable
where fn_very_slow(1) > 0;
(assuming "fn_very_slow(1) > 0" returns all the records)

Even in this case fn_very_slow is called twice.

A subquery won't help:

select a.theanswer from
(select fn_very_slow(1) as theanswer from thetable) a
where a.theanswer>0;

takes double the query run-time of

select 'constant' from
(select fn_very_slow(1) as theanswer from thetable) a
where a.theanswer>0;

or

select a.theanswer from
(select fn_very_slow(1) as theanswer from thetable) a;

#4.1 Nicklas on 2009-05-04 04:59

Nicklas,

This is pretty interesting - I was totally wrong about the ORDER BY. I forgot that ORDER BY a number means ORDER BY that column of the SELECT (so ORDER BY 1 is the first column) and not a constant. So I guess I should change this to OFFSET 0 as moltonel suggested. Do you get these timings?

But as Lars pointed out, it is caching across rows, not columns, since the time is not calltime*numrows but rather calltime*numcols, unless you use the OFFSET hack.

-- 1000 ms
select 'constant' from
(select fn_very_slow(1) as theanswer from testi where i> 0) a
where a.theanswer > 0;

-- 2000 ms
select theanswer from
(select fn_very_slow(1) as theanswer from testi where i > 0) a
WHERE a.theanswer > 0;

-- 1000 ms
SELECT a.theanswer from
(select fn_very_slow(1) as theanswer
from testi WHERE i > 0 OFFSET 0) a
where a.theanswer > 0;

--2000 ms
SELECT a.theanswer from
(select fn_very_slow(1) as theanswer
from testi WHERE i > 0 ORDER BY 1) a
where a.theanswer > 0;

#4.1.1 Regina on 2009-05-04 07:38
