Forum OpenACS Q&A: Faster Selects with study of EXPLAIN

I need to optomize some of my scripts and I want to understand what is fastest. Here is an example of getting data from two unjoined tables and from both as a joined select. It appears that the two separate queries are faster by the numbers. (rounded:)

-- Separate: 320+192 = 512
-- Joined: 2440+320+1836 = 4596 

---1--
explain select * from users where user_id =55;

NOTICE:  QUERY PLAN:
Index Scan using users_pkey on users
(cost=0.00..319.91 rows=226 width=992)
EXPLAIN

---2--
explain select * from users_contact where user_id=55;

NOTICE:  QUERY PLAN:
Index Scan using users_contact_pkey on users_contact
(cost=0.00..192.23 rows=226 width=520)
EXPLAIN

---joined---
explain select * from users as u, users_contact as c
where u.user_id=c.user_id and u.user_id=55;

NOTICE:  QUERY PLAN:
Merge Join  (cost=0.00..2440.39 rows=50925 width=1512)
  ->  Index Scan using users_pkey on users u
       (cost=0.00..319.91 rows=226 width=992)
  ->  Index Scan using users_contact_pkey on users_contact c
       (cost=0.00..1835.59 rows=22566 width=520)
EXPLAIN

It appears from these relative numbers that the joined select takes almost 9 times longer than doing the two separate selects. I know that there are cases where a join is necessary but is it best to not join where possible?

Also, I haven't tested views, but I wonder if a view of joined tables is just as fast as an unjoined table.

I look forward to comments that will help make my queries faster than a locomotive, and able to leap tall buildings in a single bound!

-Bob

Collapse
Posted by Jonathan Ellis on
if you are running 7.2 try running "explain analyze" which actually executes the query so you're not just guesstimating.
Collapse
Posted by MaineBob OConnor on
I'm using version 7.1.2 and this may be a compelling reason to upgrade to 7.2
Collapse
Posted by Don Baccus on
First of all, the cost estimates should not be used to attempt to decide which of the two approaches is faster.

If you look at the estimate on the cost of the select on the users_contact table in the JOIN query you'll see it is estimating that roughly ten times as many rows will be returned than in the non-join case.  This is clearly bogus, as only one row will be returned for the entire query.  That 10x factor right there explains why EXPLAIN makes you think the query will run at least 9x slower than doing the two separately.

These cost estimates are only used by the optimizer to decide what kind of scans and joins to use, they're not intended for anything else.  The estimates are known to be relatively obtuse as they are based on very minimal statistics kept for the table.  PG 7.2 keeps additional statistics which should help the optimizer make better decisions, but the underlying notion that the cost estimate is just an estimate has not changed.

In general a join with a good plan will run faster than two queries.

Stacking the deck further against breaking your queries up, two queries mean two round-trips from the driver to the DB, two parses, two optimization passes, two executions with two resultsets being passed back to the drive, stuffed into Tcl vars, and then read by your code.

As far as your view question goes, PG executes views dynamically so normally the execution time will be roughly the time taken for the query written out by hand.

Collapse
Posted by James Thornton on
Have you tried using PG 7.1 explicit JOINs: http://www.postgresql.org/idocs/index.php?explicit-joins.html
Collapse
Posted by Jun Yamog on
Thanks James for the added info.  Now I have an idea why I use those join....hehehe.

Anyway it seems that explicit joins are good, is it advised that OpenACS use explicit joins if possible?

Collapse
Posted by James Thornton on
Quote from Tom Lane:

"Postgres absolutely does not care [about table order]: the optimizer will always consider both A-join-B and B-join-A orders for every join it has to do. [Y]ou can constrain the join pairs the optimizer will consider if you use explicit-JOIN syntax --- but each pair will be considered in both directions".

Collapse
Posted by Don Baccus on
No, I do *not* recommend using the explicit JOIN syntax as a general rule.  For one thing, as Tom Lane's note indicates, the optimizer doesn't fully optimize complex JOIN expressions in the FROM list.  For another the syntax isn't supported by Oracle 8i and we want to maintain portable queries as much as possible, so gratuitous differences are a no-no.

IMO the optimization issue is a bug but thus far the PG folk seem to take the point of view that it's a feature.  "You can constrain the optimizer by writing explicit JOINs or get full optimization by writing implicit JOINs!"  Hogwash.  In an ideal world all JOINs would be written in the FROM list, as this is the preferred SQL92 standard way to do it, but if the optimizer's going to punt on them then it isn't a good idea.  The reason the optimizer punts is because outer JOINs aren't associative, and the code doesn't differentiate between outer and inner JOINs when optimizing (i.e. it treats inner JOINs as though they're not associative).  The PG folks are used to the implicit join syntax and thus seem to think the only reason folks might use it is in order to take "advantage" of the fact that there's only limited optimization done on the explicit joins.

So ... the bottom line is that inner JOINs written in the FROM list are only optimized in a limited way.  Given that the best join order will vary depending on the size of various tables, in general it isn't possible to write queries to explicitly express the "best" way to JOIN, because the "best" will vary from installation to installation.  An installation with lots of users and few groups, for instance, may perform better with joins against parties and relationships done in a different order than an installation with lots of groups relative to the number of users.

Collapse
Posted by Jun Yamog on
Ok god has answered our prayers.  I just hope god or his minions will carve his word in the documentation.

I hope this statement of using explicit joins in postgres should not be used if there is no good reason to use it.