Forum OpenACS Development: Query optimization

Collapse
Posted by Don Baccus on
Throughout the ACS 4.2 codebase we inherited you'll find queries like this: select * from foo where foo.key in (select key from bar where ...); These queries are very inefficient in PostgreSQL. My presumption has been that aD adopted this style of query based on solid information that Oracle would properly optimize such queries (in principle they can be transformed into a sort of "join-exists" that can't be expressed in SQL directly, i.e. turn it into a join but stop the join as soon as the "in" condition is satisfied). But recently Janine's found some information on an Oracle discussion list that claims the opposite, i.e. that Oracle, like Postgres, does no such optimization but rather computes the subselect then performs the "in" operation on the returned rowset, guaranteed in many cases to be extremely slow. The one case that triggered the response I've included below showed the "in (subselect)" construct taking 8 seconds with the outer join taking 0.11 seconds (the argument applies to inner joins as well). Does anyone here with extended Oracle experience have any more data to share on this matter? The toolkit is still full of such queries. These certainly need to be weeded out of the Postgres version for our next release (no way we're going to hold up this release for such wholesale changes). The question is whether or not to do so for Oracle. There's a strong maintenance argument in favor regardless (we can continue to share many more queries if we switch to the join equivalent for both RDBMSs) but I'm curious as to whether or not there's also a strong performance argument in favor for Oracle (the argument for PG is a given)
.


>Outer joins will almost always perform better due to one simple fact:
> SQL statements in parenthesis are performed separately from the
calling SQL
> statements.  So a sql statement like this:
> select * from table_a A where A.column_a not in (select B.column_b from
> table_b B where B.column_b = 22);
> the way this executes is the select B.column_b ... is execute first.
 Then,
> the results from this statement are used almost like an array with
the first
> sql statement.  No indexes are used whatsoever in this kind of
query, as the
> Oracle optimizer can not figure out which index is appropriate.  If
you look
> at these queries in Oracle Database Studio (there's a way you can
see how the
> optimizer works on the query's, as well as costs - it's under session
> information, and the last run SQL), you'll see how this works.
> Outer joins however, can be optimized, as they're run as one single sql
> statement, that get's optimized.
> Cheers!
Collapse
Posted by John Mileham on
I'm frankly surprised that there are many in(subselect) queries in the core, as it was common knowledge around aD when I was there (straight through the release of 4.x) that in = bad, exists = good.  I can't provide a behind-the-scenes look at Oracle's behavior in these cases, but I've seen it in action on complex queries on Oracle (hacking on photo.net).  Not only are "in" queries immensely slow relative to "exists" queries, but they will error if the result set of the subselect is too large because it has to store the returned rows instead of streaming them through the rest of the query in parallel.
Collapse
Posted by Don Baccus on
Hmmm, interesting feedback.

It is clear when reading the code that avoiding JOINs in favor of "in (subselect)" or "= (subselect)" was an intentional strategy when the ACS 4 codebase was developed.  And not just the core, in application packages as well.  These two subselect constructs are used almost literally whenever possible.

This wasn't true in the ACS 3.x codebase.

Given the fact that use of such constructs seems clearly intentional, I've assumed aD knew for certain that Oracle would optimize them into the (theoretically possible, but not expressable in SQL) "join-exists"
style operation I mentioned above.

While apparently it is becoming clear that Oracle does not?

(I'm more familiar with the innards and details of PostgreSQL, and know for certain both subselect-style constructs are a poor choice in PG).

I'm doing a little digging around in the aD developer forum, and the best I can come up with thus far is that joining against complex views (like the permissions/party view) is handled poorly by Oracle and thus the "in" form may be faster.  This, though, assumes that you can "slice" the view down to reasonable size in your "where" clause.

Thinking in terms of another thread, this may be one reason why so many variants of "read" (etc) perms are created, i.e. one-per-package.  Using the above-described hack to get around slow JOINs on the permissions view works better if you can "slice" a small piece out of the generalized party/permissions view, and the more variants of "read", "write" etc you have the more likely it is that such a slice will be small.

In another words a couple of hack approaches to get around what might be fundamentally a relatively slow design ...

Collapse
Posted by Jonathan Ellis on
It's kind of a shame because to my mind the exists form is both more concise and easier to read than joins that need group by and having clauses.  Oh, well...
Collapse
Posted by Don Baccus on
My quibble is not with "exists" which in general is a very good way to write a query, just queries like
select *
from foo
where foo.key in (select bar.key
                  from bar
                  where bar.key = :something)
Apparently the above is better than a join if "bar" is actually a complex view in some cases.
Collapse
Posted by Jonathan Ellis on
I'm following you... but even where it's more a "real" exists like
select u.user_id, last_name, email, priv_email, url
from users u
where exists (select 1 from bf2_party_stats_mv p
              where p.user_id = u.user_id and score > 1)
(note each user can have multiple parties) it's a factor of 10 or so slower than
select u.user_id, last_name, email, priv_email, url
from users u, bf2_party_stats_mv p
where u.user_id = p.user_id
group by u.user_id, last_name, email, priv_email, url
having sum(score) > 1
Here, party_stats_mv is extremely volatile so I have no indexes on it and the double sequential scan PG does with the exists query kills it. However, similar queries on tables where I can have an index on the join key show the exists form to be faster by 50-100%. But, the exists form is about 20% slower for count(*) queries whether or not an index exists.

(Tested with 1.2k rows in users and 1.6k in parties.)

Collapse
Posted by Don Baccus on
What happens if you do declare an index on user_id, even though the table's volatile?  It will slow inserts and updates but should make selects go much faster.

Anyway, I'm glad that exists is fast when you've tested with an index!

Collapse
Posted by Jonathan Ellis on
right...  there's way more updating going on on that table than selects.  selects pretty much only gets hit when someone wants to see the high scores. :)
Collapse
Posted by Don Baccus on
Well, then, in a sense you've answered your own question, right?  You've decided to cripple the optimizer in the "select" case in order to speed up the "insert/update" case.

Perfectly OK but it's not really a true measure of the efficiency of one  form of query over another because you're not optimizing your datamodel for that query (the select) but rather for dml statements.

It is a great example of the trade-offs involved when you do traffic analysis, though!  In this case you know that there are many "insert/update" dml statements and rather few "select"s.

What you may not know is that PG is relatively poor, performance-wise, at "update" dml statements.  So your dropping of the index is helping to mask that.

(this isn't a PG knock, PG's extremely good at other things and insert/update performance is near top of the list for future banging in PG 7.3)