Forum OpenACS Development: PREPARE considered beneficial

Posted by Mark Aufflick on
I am optimizing some tricky iterative SQL code, and am seeing some useful speed improvements from using PREPARE and EXECUTE.

It's something you do without thinking in Perl and PHP because the APIs provide easy ways to do it, but I can't see an API-supported way to do it under AOLserver.

Something that has occurred to me, though, is that with the query processor we already have a place where we cache queries (or query parts) by a unique identifier -- why not extend that to prepare expensive queries when they are first called and execute them later?

This could be controlled by a flag in the XML tag for the query, since it's not always possible or desirable given the way we sometimes use XQL query fragments etc.

This could very possibly bring us significant performance boosts.

I am investigating on the Postgres mailing lists whether queries inside PL/pgSQL functions are automatically pre-prepared. I suspect they are, so the benefit would be limited to queries outside PL/pgSQL.

Posted by Alfred Werner on
I'm not sure how big the "cache" is, but you can have multiple statements prepared -- I use that fairly often in Perl. I wonder if the right way is just to have the query processor prepare them all on startup...
Posted by Dirk Gomez on
Are you talking PG or Oracle, Alfred? In Oracle it depends on your shared pool size and thus can get really big. Clunky applications like SAP prepare (almost) *all* their statements on startup and then never re-prepare/parse them again.

But does this make a difference for PostgreSQL? It doesn't know bind variables, so how would it know whether a query is identical to another query except in its supplied variable values?

Posted by Alfred Werner on
If Postgres doesn't handle bind parameters, how in the world can a prepare make much of a performance difference? I.e., what's the difference between
stmt->execute("STMT") vs.
stmt->prepare("STMT"); stmt->execute
for a single query instance? In both cases it's running the same optimizer and query call, right? I guess in the case where the exact same query is used over and over it could help...

Here is an interesting thread I turned up on the topic:
http://www.dbforums.com/showthread.php?t=944729

Also - it does appear, indirectly, that bind vars are supported in prepared statements:
http://archives.postgresql.org/pgsql-general/2002-09/msg01216.php

From my casual browsing of the Postgres lists it seems like maybe that was only true pre-7.3?

Posted by Jeff Davis on
The difficulty with prepared statements is that you need to provide types when you prepare. We could declare them all TEXT, but I am not entirely sure whether the resulting query plan is any good when you do that (when you give explicit text strings, as we do with our fake bind vars, the planner might be smarter than it would be without knowing anything).

Prepared statements are per DB session, but in our case you would not want to prepare everything, since the session would get pretty big. Also, things like dynamic sort statements and restrictions like you get with listbuilder couldn't be prepared (well, they could, but you would need to generate a synthetic key like a hash to identify them rather than using the statement name).

Interestingly, I think you could do it all without any changes to the DB driver itself. Inside db_exec you would take a statement like dbqd.acs-tcl.tcl.file.func.stmnt

select ... from foo, goo, ... where foo.x = :val and goo.y = :val2 ...
change it to first do a
PREPARE dbqd.acs-tcl.tcl.file.func.stmnt (TEXT,TEXT) AS
select ... from foo, goo, ... where foo.x = $1 and goo.y = $2 ...
and have an nsv or something to record that that statement was prepared on that pool (since PREPAREs are per backend) and save the execute statement (which would be static). Then for that and all subsequent calls on that pool you would issue:
EXECUTE dbqd.acs-tcl.tcl.file.func.stmnt (:val, :val2)
to actually retrieve data.
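A rough sketch of that db_exec rewrite, in Python for illustration only (the `rewrite_binds` helper, the `prepared` set standing in for an nsv array, and the all-TEXT type declaration are all assumptions, not existing OpenACS code):

```python
import re

# Per-pool record of which statements have already been PREPAREd on a
# backend (stands in for the nsv array an AOLserver version would use).
prepared = set()

def rewrite_binds(sql):
    """Replace :name bind variables with $1..$N positional parameters.

    Returns the rewritten SQL plus the bind names in order. Hypothetical
    helper -- a real driver would also need to skip :x inside string
    literals and leave ::int-style casts alone.
    """
    names = []
    def repl(match):
        names.append(match.group(1))
        return "$%d" % len(names)
    return re.sub(r":(\w+)", repl, sql), names

def exec_prepared(pool, stmt_name, sql, values):
    """Return the SQL to issue: PREPARE on first use for this pool, then EXECUTE."""
    statements = []
    if (pool, stmt_name) not in prepared:
        body, names = rewrite_binds(sql)
        # Declare every parameter TEXT, per the discussion above.
        types = ", ".join(["TEXT"] * len(names))
        statements.append('PREPARE "%s" (%s) AS %s' % (stmt_name, types, body))
        prepared.add((pool, stmt_name))
    statements.append('EXECUTE "%s" (%s)' % (stmt_name, ", ".join(values)))
    return statements
```

The first call on a pool yields a PREPARE followed by an EXECUTE; later calls on the same pool yield only the EXECUTE. (Statement names containing dots need double-quoting, hence the quotes.)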

The problem with a blanket system like this is that if you do DDL, the prepared statement does not pick it up. E.g.:

xarg=# create table tmp(x integer);
CREATE TABLE
xarg=# insert into tmp values(1);
INSERT 254417 1
xarg=# prepare xp as select * from tmp;
PREPARE
xarg=# execute xp;
 x
---
 1
(1 row)

xarg=# alter table tmp add column y integer;
ALTER TABLE
xarg=# execute xp;
 x
---
 1
(1 row)
Also, if you watch files and a query changes, you would need to flush the prepared queries for those that changed (well, probably just mark them dirty and then flush if used).
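That mark-dirty-then-flush-on-use bookkeeping could look something like this sketch (Python for illustration; the function names and cache shape are made up, not existing code):

```python
# Cache of prepared statements: name -> {"sql": ..., "dirty": bool}.
# A file watcher that notices an .xql change would call mark_dirty();
# the next use re-prepares instead of reusing the stale plan.
cache = {}

def register(name, sql):
    cache[name] = {"sql": sql, "dirty": False}

def mark_dirty(name, new_sql):
    entry = cache.get(name)
    if entry is not None:
        entry["sql"] = new_sql
        entry["dirty"] = True

def statements_for(name):
    """SQL to issue before the next EXECUTE: re-prepare if dirty, else nothing."""
    entry = cache[name]
    if entry["dirty"]:
        entry["dirty"] = False
        return ['DEALLOCATE "%s"' % name,
                'PREPARE "%s" AS %s' % (name, entry["sql"])]
    return []
```

Note this only handles query-file changes; the DDL problem shown above would still need something extra, since the backend gives no signal that a prepared plan has gone stale.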
Posted by Dirk Gomez on
I was going through http://search.cpan.org/~rudy/DBD-Pg-1.32/Pg.pm - the docs for the Perl DBI Postgresql driver and it explicitly says this:

prepare

  $sth = $dbh->prepare($statement, \%attr);

    PostgreSQL does not have the concept of preparing a statement. Hence the prepare method just stores the statement after checking for place-holders. No information about the statement is available after preparing it.

(The driver seems to be quite old, and it's a rather recent PostgreSQL version (7.3 I think) that added some kind of prepared statements.)

Mark, on which database do you see the speed improvements?

Posted by Mark Aufflick on
Wow - so much discussion!

Postgres version -- in this case I'm using 7.4.2, but I have also used PREPARE in 7.3.4 (not sure exactly what version it appeared in -- either 7.2 or 7.3 something).

I first thought of preparing all queries on startup, but discarded that for the obvious reasons. I was imagining a tag in the xql query section for the developer to judiciously indicate "pre-prepare this query". The preparation could be done on first use rather than startup so we don't impact the (already long) startup time.

I didn't consider the fact that the prepare statement needs to know the data types. Since I am planning for the developer to have to flag that they want the query prepared, I can also extend the queryset group to have a way of specifying the types for a particular query.

Also, since it is developer-instigated per query, there won't be any problem with incomplete query fragments (like listbuilder queries with filters -- we just won't be able to prepare them unless we radically change the way filters work; shouldn't be too hard to make sort-by queries preparable, I imagine).

Posted by Jeff Davis on
> Shouldn't be too hard to make sort-by queries preparable, I imagine.

I am pretty sure that is not true. In order to sort, the query itself changes; you can't use a bind variable for the sort columns, for example.

I saw a benchmark for inserts where going from 30k straight inserts to 1 PREPARE + 30k EXECUTEs was ~25% faster. That's a conservative example though, since the inserts were pretty simple statements; for a complex query you might see something higher than that.

In terms of the timeline, PREPARE/EXECUTE showed up in 7.3, but the libpq support for prepare/execute with bind variables was a 7.4 thing (see PG History).

The 7.4 docs cover PQexecParams and PQexecPrepared which at some point we should start using (although not before we are ready to drop 7.3 compatibility).

Posted by Mark Aufflick on
Mmm, PQexecParams looks great -- couldn't the nsd driver do a version check and use it if it's available?

Anyway, it's not needed for using PREPARE statements; it would just take the hassle of quoting off our driver for both prepared and normal statements.

My case uses a lot of fairly simple statements as well. I am imagining we can get a good speed boost for permissions-type queries and for complex queries like the crazy ones used by news aggregator.

Re: sorting -- maybe you could have a fixed (unsorted) query prepared and use it as a subselect, within a select * from (execute foo) order by foo desc -- I haven't tested that form though.

I haven't heard back on the Postgres performance list about the effect or usefulness of PREPARE with queries in PL/pgSQL -- I would imagine that internally they would be "prepared" when the PL/pgSQL is compiled, but I don't know.

Posted by Alfred Werner on
To stretch it -- I wonder if PostgreSQL is smart enough to recognize invariant subselects and keep those plans, or if it runs the optimizer each time (in the case of sorted results). If it CAN recognize subselects, an additional layer of nesting might keep things fast (?)

e.g.

select a,b from foo where a=:a and b=:b order by $a_or_b;
select a,b from (select a,b from foo where a=:a and b=:b) as sub order by $a_or_b;

Posted by Don Baccus on
PL/pgSQL saves the execution plans for the queries in a function, has forever.

I would rather do "lazy preparing" than prepare all queries on start-up.  That would significantly slow down our already too-slow server restart time.

Posted by Don Baccus on
PQexecParams doesn't get us much without PREPARE. You write queries like this:
select * from foo where key = $1;
So the driver would still have to regexp the query, strip out the bind vars, map them to parameters $1 .. $N, then look up the Tcl values and pass them to the query.

The advantage is that value substitution makes each query different for every set of values used with it, whereas parameter passing of this sort means you'd just PREPARE the query once with $1 .. $N parameters.
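To illustrate the distinction (a toy comparison in Python, not driver code -- the function names are invented):

```python
def substituted(val):
    # Fake-bind-variable style: the value is interpolated into the SQL
    # text, so every distinct value yields a distinct statement that the
    # backend must plan separately.
    return "select * from foo where key = '%s'" % val

def parameterized():
    # PQexecParams/PREPARE style: the statement text is constant for any
    # value; the values travel separately, so one plan can be reused.
    return "select * from foo where key = $1"
```

With substitution, `substituted("a")` and `substituted("b")` are different statements; the parameterized form is the same string every time, which is what makes a single PREPARE reusable.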

This is only available with protocol 3.0, so if we want driver support I think the best way may be to provide a new driver, because the driver protocols are significantly different, I believe, and trying to support both in a single driver source would be messy.

Posted by Mark Aufflick on
You will be able to achieve that nicely once PL/pgSQL functions returning rowsets can be used as subselects.

I think that has made it into the 7.5 plan?

Posted by Jeff Davis on
In general you will get a better plan if you give the optimizer the option of flattening subselects, so I doubt that using a PL/pgSQL proc (or smarter subselect plan caching) is the right answer in general.

One of the most common ways we end up with really bad plans is deeply nested views that exceed the number of joins the optimizer will consider for flattening. It used to be GEQO_THRESHOLD/2 joins (6 I think) in <7.4, but they added a separate parameter for it in 7.4 -- from_collapse_limit = 8 (docs).
One of the most common way we end up with really bad plans is that we have deeply nested views which exceed the count of joins that the optimizer will consider for flattening. It used to be GEQO_THRESHOLD/2 joins (6 I think) in <7.4 but they added a seperate parameter for it in 7.4 -- from_collapse_limit = 8 (docs).