Forum OpenACS Development: acs_object__new slowness...

Posted by Jeff Davis on
I think I have tracked down the slow object creation problem on postgres. The problem is that creating the tree_sortkey is quite slow for things with lots of siblings...

Here is the code which generates the tree_sortkey:


create function acs_objects_insert_tr () returns opaque as '
declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
begin
        if new.context_id is null then 
            select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value 
              from acs_objects 
             where context_id is null;
        else 
            select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value 
              from acs_objects 
             where context_id = new.context_id;

            select tree_sortkey into v_parent_sk 
              from acs_objects 
             where object_id = new.context_id;
        end if;

        new.tree_sortkey := tree_next_key(v_parent_sk, v_max_value);

        return new;
end;' language 'plpgsql';

I created a site with about 90k users (and not much content), and adding users was taking about 4 seconds per user. The majority of that time was spent generating the tree_sortkey for the user and for the user's membership_rel for the site. The user has context_id null and the membership_rel has context_id -2; here are the query plans for these two circumstances:

clean=# explain analyze select max(tree_leaf_key_to_int(tree_sortkey)) from acs_objects where context_id = - 2;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4994.24..4994.24 rows=1 width=12) (actual time=2244.36..2244.36 rows=1 loops=1)
   ->  Seq Scan on acs_objects  (cost=0.00..4763.93 rows=92126 width=12) (actual time=0.16..294.26 rows=90053 loops=1)
         Filter: (context_id = -2)
 Total runtime: 2244.40 msec
(4 rows)
clean=# explain analyze select max(tree_leaf_key_to_int(tree_sortkey)) from acs_objects where context_id is null;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4532.39..4532.39 rows=1 width=12) (actual time=1892.85..1892.85 rows=1 loops=1)
   ->  Seq Scan on acs_objects  (cost=0.00..4305.74 rows=90660 width=12) (actual time=0.03..263.70 rows=92624 loops=1)
         Filter: (context_id IS NULL)
 Total runtime: 1892.90 msec
(4 rows)
I created functional indexes for tree_leaf_key_to_int(tree_sortkey):
create index tmp_ts_idx on acs_objects(tree_leaf_key_to_int(tree_sortkey));
create index tmp_ts_null_idx on acs_objects(tree_leaf_key_to_int(tree_sortkey)) where context_id is null;
create index tmp_ts_reguser_idx on acs_objects(tree_leaf_key_to_int(tree_sortkey)) where context_id = -2;
but none of them were used unless I did "set enable_seqscan to off;", and in that case they were typically just as slow as the table scans (not surprisingly, since each covers ~50% of the table).
clean=# explain analyze select max(tree_leaf_key_to_int(tree_sortkey)) from acs_objects where context_id is null;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=242645.75..242645.75 rows=1 width=12) (actual time=1847.50..1847.50 rows=1 loops=1)
   ->  Index Scan using tmp_ts_null_idx on acs_objects  (cost=0.00..242414.19 rows=92624 width=12) (actual time=0.13..306.28 rows=92624 loops=1)
         Filter: (context_id IS NULL)
 Total runtime: 1847.56 msec
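
(One possible workaround, not from the thread and untested by me: on PG of this vintage max() never uses an index, but the same value can sometimes be fetched via an equivalent ORDER BY ... DESC LIMIT 1, which the planner can satisfy with the functional index created above. Whether it actually picks the index here would need verifying.)

```sql
-- Hypothetical rewrite: fetch the maximum as the last entry of the
-- functional index instead of aggregating over every sibling row.
select tree_leaf_key_to_int(tree_sortkey)
  from acs_objects
 where context_id = -2
 order by tree_leaf_key_to_int(tree_sortkey) desc
 limit 1;
```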
The fundamental problem is that walking through an object's siblings to find the max(tree_leaf_key) is very slow. We can mitigate the problem somewhat by making sure things have a context_id that limits the number of siblings, but for users the registered_users context will always have a lot of siblings, so user creation will remain slow.

On openacs.org here are the contexts with lots of children:

openacs.org=# select *, acs_object__name(context_id) 
  from (select count(*), context_id from acs_objects group by context_id having count(*) > 100) a;

 count | context_id |               acs_object__name
-------+------------+-----------------------------------------------
   111 |       -400 | Trash
   105 |       -100 | Pages
  7024 |         -2 | Registered Users
 27304 |      14013 | OpenACS Q&A
  9725 |      14014 | OpenACS Development
   307 |      14015 | OpenACS CMS
  1292 |      14016 | OpenACS Testing
  2725 |      14017 | .LRN Q&A
  5090 |      46408 | CVS Commits
  1479 |      56106 | OpenACS Bugs
   260 |      73798 | #OpenACS IRC Log
   180 |      98190 |  Bookmarks Root Folder of shermira@wanadoo.es
   329 |     115570 | OpenACS Improvement Proposals (TIPs)
   125 |     137862 | doc/openacs-HEAD/
 25391 |            |
(15 rows)
One reason forums is still somewhat slow is that all messages have the context of the forum, so a post has to walk through thousands of rows to determine the max sibling. Another problem is the apm_package__new slowness: this happens because parameter_values are objects and have context_id null. Each parameter takes a while to create, since it has to traverse the other items which also have context_id null.

Again on openacs.org here are the objects created with no context_id:

openacs.org=# select object_type, count(*) from acs_objects where context_id is\
 null  group by object_type having count(*) > 10;
      object_type      | count
-----------------------+-------
 acs_sc_implementation |    22
 acs_sc_msg_type       |    46
 acs_sc_operation      |    19
 apm_package_version   |    50
 apm_parameter         |   176
 apm_parameter_value   |   800
 content_extlink       |   167
 content_revision      |   142
 etp_page_revision     |   767
 notification          |   215
 notification_request  | 14067
 site_node             |   118
 url                   |  1644
 user                  |  7024
 user_portrait_rel     |    82
(15 rows)
Notifications and Users here are the standouts but on a dotlrn site I think you can end up with a pretty large number of parameter_value entries as well (because there are a lot of params and a lot of site nodes).

Posted by Jeff Davis on
One datapoint: I modified apm_package__new to use the context_id of the package for the parameters, and now creating departments on dotlrn takes 8s rather than 80s.
Posted by Don Baccus on
Jeff ... good call on the parameters.

You've correctly identified the bottleneck as being the creation of tree_sortkey.  The general fix is to add an additional integer column called (say) "max_leaf" to each node, which can be used directly, avoiding the need to traverse all the children of a node when adding a new child.

This needs to be done for all types which carry tree_sortkeys, and should be done for 5.0.

There are still concurrency issues which AFAIK are unsolvable in today's PG because locking within PL/pgSQL doesn't work right (which I define as "the way Oracle does it" in this case).  Locking has to be done at the application level to work the way one expects and this is impractical for a system like OpenACS.  Fortunately in practice it's not clear anyone's actually run into the concurrency bug ...
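
A minimal sketch of what that might look like (the column name and the details here are my assumptions, not code from the thread; the null context_id case is omitted and is taken up below), with a select ... for update to serialize concurrent inserts under the same parent:

```sql
-- Hypothetical sketch: cache the highest child leaf value on the parent
-- row so that inserting a child no longer scans all of its siblings.
alter table acs_objects add max_child_sortkey integer;

create or replace function acs_objects_insert_tr () returns opaque as '
declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
begin
        -- lock the parent row; concurrent inserts under the same
        -- parent serialize here instead of racing
        select tree_sortkey, max_child_sortkey
          into v_parent_sk, v_max_value
          from acs_objects
         where object_id = new.context_id
           for update;

        new.tree_sortkey := tree_next_key(v_parent_sk, v_max_value);

        -- remember the new maximum for the next sibling insert
        update acs_objects
           set max_child_sortkey = tree_leaf_key_to_int(new.tree_sortkey)
         where object_id = new.context_id;

        return new;
end;' language 'plpgsql';
```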

Posted by Jeff Davis on
It looks like for things that have context_id null, using a sequence to generate the root tree_sortkey gives roughly an order of magnitude speedup (and for context_id -2 I found a similar result).

You have to add a sequence as well as a "max_leaf" column so that you can generate the root tree_sortkey when context_id is null; that or create a root object that would always exist -- which I guess is what security_context_root is.

The question is: is it preferable to force everything to be a child of security_context_root, or to add a sequence?

Adding the sequence and extra column is straightforward (and if anything it substantially improves the locking issues, since the resulting queries are so much faster). Here is the new acs_objects_insert_tr that I used, with a sequence to generate the root context tree_sortkey:

create sequence acs_object_root_sortkey;
select setval('acs_object_root_sortkey',
  (select max(tree_leaf_key_to_int(tree_sortkey)) 
   from acs_objects 
   where context_id is null));

create or replace function acs_objects_insert_tr () returns opaque as '
declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
begin
        if new.context_id is null or new.context_id = -2 then 
	    select nextval(''acs_object_root_sortkey'') into v_max_value from dual;
        else 
            select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value 
              from acs_objects 
             where context_id = new.context_id;
            select tree_sortkey into v_parent_sk 
              from acs_objects 
             where object_id = new.context_id;
        end if;

        new.tree_sortkey := tree_next_key(v_parent_sk, v_max_value);

        return new;
end;' language 'plpgsql';

(the context_id = -2 bit is a hack, but it is safe, and I wanted to see how much faster user creation would be with it since that is where I originally started from -- the answer is that user creation on a system with 100k users takes 175ms as opposed to 3750ms without this change).

I am willing to fix it for 5.0, but I think it needs OCT approval, so I will write a TIP for it...

Oh, and some information on where the speedups will show up. Malte generated this from aiesec (which has ~5M objects, 850k of them with null context_id), and you can see the ones with null context_id:

  COUNT(*) OBJECT_TYPE
---------- ---------------------
    152201 acs_activity

    126381 apm_parameter_value
     11079 cal_item
     48268 calendar
      8138 mailbox
      4654 notification
    205921 notification_request
     45549 portal
    146999 portal_page
      5097 rel_segment
     23760 site_node
      3866 static_portal_content
    113891 user
So creation of users, site nodes, packages, notification requests, calendar items and portal pages should all be significantly improved.
Posted by Andrei Popov on
<blockquote> The general fix is to add an additional integer column
called (say) "max_leaf" to each node , which can be
used directly, avoiding the need to traverse all the
children of a node when adding a new child.
</blockquote>

But wouldn't it require additional time to update it (albeit this may still be better than the current state)?

Maybe Oracle-like CONNECT BY for Pg should be given a try?

Posted by Don Baccus on
Originally I'd planned to use the security object (-4) to store the root leaf value when the context_id was null but ...

Is there any reason you couldn't just use the object_id in this case?  It is unique and increasing, the two attributes needed for the sortkey.

The only reason for generating adjacent keys at each level is to save space for skinny, deep trees (8 bits per level) ... clearly on big systems there will be more than 127 nodes at the root level so this isn't much of an issue for objects which have the root or null as parent!

Remember we need well-tested upgrade scripts if you're going to squeeze this into 5.0.
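
As a sketch, the trigger branch for parentless objects could then collapse to something like this (assuming tree_next_key(null, n) builds a root-level leaf key from the integer n, as the existing trigger's usage suggests; this is my illustration, not tested code):

```sql
-- Hypothetical fragment of acs_objects_insert_tr: with no real parent,
-- derive the first-level key from the object_id itself (unique and
-- increasing), so no sibling scan and no extra sequence is needed;
-- the non-null context branch stays as in the existing trigger.
if new.context_id is null then
    new.tree_sortkey := tree_next_key(null, new.object_id);
end if;
```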

Posted by Jeff Davis on
Don, I like using the object_id, but the problem is updates on an existing site where the hierarchy was created with the old-style key mechanism:
Node  ObjID   Context SortKey
A     1       null    1
  B   2       1       1.1
C     3       null    2
If you wanted to update B's context_id to null you couldn't assume its object_id was available, since it had been used by C for its sortkey.

Of course on insert it should be ok (although it's conceivable that if a site juggled its context_ids around enough the tree_sortkey could have walked past the object_id sequence, I don't think that has ever happened or ever would). To be certain and safe, though, you would need to bump the object_id sequence past the max tree_sortkey if it had, and when updating a context_id to null you would need to take a nextval on the object_id sequence, using it just for the sortkey rather than as an actual object_id.

Posted by Andrew Piskorski on
Don, could you say more about Postgres's "locking within PL/pgSQL doesn't work right" problem? I hadn't heard of that before and would like to understand how/why it differs from Oracle, what problems it can cause, what the outlook might be for Postgres fixing it, etc.

Hm, is that concurrency problem related to the race condition Patrick M. was talking about back in Feb. 2003? (Which was fixed since then, I hope?)

Also, a basic OpenACS data model tutorial question: In the Postgres version the acs_objects table has this tree_sortkey column, while the Oracle version appears to have no equivalent column. Why? Does the Oracle version just use a connect by on the context_id column somewhere, while for Postgres the tree_sortkey provides the same information? Or something completely different? (This has probably been answered several times elsewhere, but if so I couldn't find it...)

Posted by Jeff Davis on
Andrew, no, it has not been fixed yet (and thanks for pointing out Patrick's solution). I think there is still a locking problem with Patrick's code (not based on anything but hearsay though), but the more commonly encountered problem, besides the performance one that got me looking into it, is duped tree_sortkeys. There are some places where code like Patrick's has been implemented (forums, dotlrn_communities_all, and dotlrn_community_types all maintain a max_child_sortkey in the parent), but it's not in: sp_folders, bm_bookmarks, acs_messages, cr_keywords, cr_items, site_nodes, acs_privilege_hierarchy_index, acs_objects, and acs_object_types.

Yes, most of the Oracle datamodel uses connect by for this rather than tree_sortkey, but they are roughly equivalent. The tree_sortkey machinery has been implemented in Oracle and is used in dotlrn (for the dotlrn_communities_all and dotlrn_community_types tables). My gut tells me tree_sortkey is a better answer for hierarchies that are mostly static, but I am not aware of anyone having benchmarked things.

Posted by Don Baccus on
Patrick's code doesn't solve the concurrency problem, through no fault of his own.  If the lock (select for update) is executed at the application level it will, but it doesn't help within PL/pgSQL.  If another process has already entered the trigger, its "select for update" won't block.

Ain't it lovely?  I'm sure I've posted details to the forums before.

Now if PG locking worked right (like Oracle!) the code would be correct.

Posted by Don Baccus on
Actually I was thinking of changing *all* first-level keys to use the object's object_id, though that's not what I said :)  Changing to this scheme would fix the null context_id problem automatically, that's what I'm thinking.

So your A,B,C nodes would all have their object_id as their first-level sortkey ...

Now doing the update may still be difficult, because updates in PG aren't (logically) atomic, so you can get unique constraint errors generated mid-update; any update script has to be very careful, and I don't think you can just do one big UPDATE statement.

But this is the most attractive solution I've been able to think of ...

Posted by Jeff Davis on
I think I could write a script that would build the new tree sortkeys in 2 or 3 statements and flop them in without a problem (basically you pull the current ones out and renumber them in a tmp table, set all the current ones to the id, then update the ones with context_id not null -- which should be robust, reasonably fast, and not at all difficult). Oh, and drop the triggers first as well...

I suppose I should write a TIP for it since it should be approved (and I am not sure whether people want this to go in for 5.0 or not). It does make a big difference (and is essential for anything openacs.org-sized or bigger), and I don't think the code or the upgrades are at all difficult, but I know people want 5.0 to ship as well.

Btw, do table locks in plpgsql have the same problem as select for update?

Posted by Don Baccus on
When I ran my experiments I went all the way up to "LOCK TABLE".

However, now that I think about it ... doing a "SELECT FOR UPDATE" in two separate parallel calls to a function may actually work correctly IN THIS CASE, because we know the key exists at the beginning of each transaction that fires the trigger.  Thus both see (and lock) the same tuple.  Good news - I think :)

When I found the locking problem I was looking at another problem - the denormalized party members table I added in 4.6 that helped speed up permissions checking so much.

Check out the Oracle and Postgres implementations - you'll see that in Oracle I keep one party/membership pair and count the number of times a row's generated through group membership (multiple instances are a natural result of the groups datamodel and implementation).

In Postgres I don't count, I add the duplicate rows, adding to the space burden that results from this denormalization (but not materially slowing down permissions checking).

The reason is that in PG you can't solve the following problem entirely within PL/pgSQL:

If a tuple with key "k" exists in table "t", UPDATE its counter; otherwise INSERT a new tuple with key "k" and counter set to zero.

Logically you can do this (no exception handling in PG, remember):

LOCK t;
UPDATE t SET counter = counter + 1 WHERE key = 'k';
IF (no rows updated) THEN
  INSERT INTO t VALUES ('k', 0);
END;
UNLOCK t;

(in practice you only need a write lock of course)

Buried in PL/pgSQL this code will generate duplicate rows in "t" with key "k".  Done at PSQL it will not (of course!).

The problem is that a statement like "SELECT foo()" is executed atomically - statements inside PL/pgSQL don't change tuple visibility in other transactions because they're not considered "real commands".

So two calls to the above snippet can result in the following (call the two threads "A" and "B"):

A. locks t
B. waits for t because A has locked it
A. tries update, no tuples modified (row doesn't exist)
A. inserts row with key 'k'
A. unlocks table
B. resumes execution
B. tries update - NO TUPLES ARE MODIFIED because it doesn't
see the row inserted by A above!  Because the tuple set visible to the function is computed when its parent "SELECT foo()" begins execution, changes by other transactions aren't visible anywhere within the statement.

F**K!

Of course from PSQL each statement bumps the command counter (counts as a "real command") and doing the above from two PSQL sessions results in B "seeing" the tuple inserted by A, once A releases its lock and B resumes execution.

Oracle, on the other hand, makes tuples from other transactions visible within PL/SQL in such cases, in triggers and procs at least (maybe not functions - see below).

Now one can argue that PG is doing the correct - though damned inconvenient - thing here, because a SELECT statement is *supposed* to operate atomically on a set of tuples.  We don't have procedures in PG, just functions, and we don't have triggers distinct from functions (in other words, PG triggers call functions rather than containing their own block of code).

One could argue that functions shouldn't have side effects that bring up issues like tuple visibility (since functions with side effects are "unclean" in theory anyway) but since PG procs don't exist and PG triggers don't have their own block of code we're *forced* to write "unclean" functions where we very, very much want side effects to break the atomic nature of the calling SELECT statement ...
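
For concreteness, here is a runnable rendering of the logical snippet above (the table and function names are mine, not from any OpenACS datamodel). Note that PG has no UNLOCK statement -- the lock is simply held until the transaction ends -- and that there is deliberately no unique constraint on the key, which is why the race produces silent duplicate rows rather than an error:

```sql
create table t (key varchar, counter integer);

create function bump_counter (varchar) returns integer as '
begin
        lock table t in exclusive mode;  -- released at transaction end
        update t set counter = counter + 1 where key = $1;
        if not found then
            -- under concurrency, a row inserted by another backend
            -- after our outer SELECT began is invisible to this
            -- UPDATE, so we insert a duplicate instead of updating
            insert into t values ($1, 0);
        end if;
        return 0;
end;' language 'plpgsql';

-- invoked as:  select bump_counter('k');
```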

Now at the application level we could call LOCK before such function calls, but it is impractical to tell developers to LOCK every table that is used by any function called directly or indirectly by a trigger and modified in a way that could cause a race condition within that code.  Right? :)

I haven't raised this issue with the PG crowd because, as I've said, from one point of view the behavior's correct and because we have, well, philosophical differences in some areas and I've gotten tired of trying to explain language implementation issues to them in a way that they can understand.  They tend to be inconsistent and unpredictable.

Posted by Don Baccus on
In case the rant above causes anyone to miss my first point ...

the SELECT FOR UPDATE is correct ... and I think will actually work in this case because I don't think it hits the locking problem I've been talking about.

So if you:

1. Add the appropriate SELECT FOR UPDATE
2. Add the tracking of the maximum child key to objects
3. Switch to using object_id for the first level sortkey for
all objects (including those with NULL context_id)

I think we're home.

Now the odd question: why are objects with context_id set to the security root considered to be at level 0 rather than level 1, given that the security root object itself is at level 0 (and has a context_id set NULL)?

Answer - I'm not sure.  Dan Wickstrom wrote the original version of the tree sortkey code after talking to me, including the triggers.  Later I wrote the code to use varbit rather than char as the datatype to store the sort keys, including the short key space saving hack, but I didn't change anything else.

Logically it would seem to me that all objects with context_id set NULL - including the security root object - should be level 0, while all other objects - including those with the security root object as their context_id - should have level > 0.

No?  Comments, Jeff?

I really want to get this right once and for all, which is why I'd earlier decided to forget about it until 5.0 was out.

Posted by Jeff Davis on
Ok, select for update it is. I will do some concurrency testing as well and see if I can break it.

If by level 0 you mean null, then I think there is a good reason for things with context_id null to have a non-null tree sortkey.

Say you want all the parents of e, given a hierarchy:

Node   level0    level1
a      Null      1
  b    1         1.1
c      Null      2
  d    1         2.1
    e  1.1       2.1.1
If a and c are level 0, then as you walk up you have no way of picking out a or c as a parent of e, whereas if all tree_sortkeys start at level 1 it's just the same as for the other parents...

I also think, given how many are null (~20% for realistic data sets -- a problem in itself really), the optimizer would have better statistics if there were no null tree_sortkeys, although I guess with postgres partial indexes can make that much less of a problem.
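
To make the walk-up concrete, the usual ancestor lookup is a prefix match on the sortkey, along these lines (a sketch of the pattern; the kernel's helper functions may differ):

```sql
-- All ancestors of a given object: rows whose sortkey range encloses
-- the object's own sortkey.  This only finds root-level parents like
-- a and c if root-level nodes carry real (non-null) tree_sortkeys.
select p.object_id
  from acs_objects p, acs_objects e
 where e.object_id = 42            -- object "e" (example id)
   and p.object_id <> e.object_id
   and e.tree_sortkey between p.tree_sortkey
                          and tree_right(p.tree_sortkey);
```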

Posted by Don Baccus on
I was counting from zero, you're counting from one.  So we're saying the same thing as far as that goes.

Ignore the rest, I misunderstood how nodes with the security context root as parent are being treated, but went off and looked at the code.  It's doing what I want.

Posted by Jeff Davis on
I tracked down the last bottleneck with creating users (and it ends up not being a problem with acs_objects). The problem is the rel_constraint__violation check in membership_rels_in_tr. The second query in that function is very fast, but the first gets much slower as users are added to the system. Here it is:
 for constraint_violated in
     select constraint_id, constraint_name
       from rel_constraints_violated_one
       where rel_id = violation__rel_id
       LIMIT 1 
    LOOP
This is checking for violated rel_constraints with rel_side = 'one', and it takes about 2 seconds on my system even though rel_constraints has only one entry (and that with rel_side 'two'):
 constraint_id |               constraint_name               | rel_segment | rel_side | required_rel_segment 
---------------+---------------------------------------------+-------------+----------+----------------------
        924494 | Members of foo must be members of Main Site |      924493 | two      |                  432
In fact I can't find anywhere in the code where anything with rel_side 'one' ever gets created, nor does openacs.org or any other site I checked have any rel_side 'one' entries.

In any case, the query plan for that seemingly innocuous query is entertainingly horrible:


                                                                                        QUERY PLAN                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=2.04..16701.25 rows=1 width=262) (actual time=1208.71..1208.71 rows=0 loops=1)
   Merge Cond: ("outer".container_id = "inner".element_id)
   Join Filter: ("inner".segment_id = "outer".required_rel_segment)
   Filter: ("inner".element_id IS NULL)
   ->  Nested Loop  (cost=1.02..16310.71 rows=1 width=159) (actual time=1208.71..1208.71 rows=0 loops=1)
         Join Filter: (("inner".object_type = "outer".rel_type) AND ("inner".tree_sortkey >= "outer".tree_sortkey) AND ("inner".tree_sortkey <= tree_right("outer".tree_sortkey)))
         ->  Nested Loop  (cost=1.02..16307.39 rows=1 width=132) (actual time=1208.70..1208.70 rows=0 loops=1)
               Join Filter: ("inner".object_type = "outer".rel_type)
               ->  Nested Loop  (cost=1.02..16304.40 rows=1 width=105) (actual time=1208.70..1208.70 rows=0 loops=1)
                     Join Filter: ("outer".group_id = "inner".group_id)
                     ->  Index Scan using group_elem_idx_container_idx on group_element_index  (cost=0.00..16301.51 rows=2 width=26) (actual time=1067.93..1201.84 rows=2 loops=1)
                           Filter: (rel_id = 129804)
                     ->  Materialize  (cost=2.06..2.06 rows=1 width=79) (actual time=3.43..3.43 rows=0 loops=2)
                           ->  Hash Join  (cost=1.02..2.06 rows=1 width=79) (actual time=6.85..6.85 rows=0 loops=1)
                                 Hash Cond: ("outer".segment_id = "inner".rel_segment)
                                 ->  Seq Scan on rel_segments rs  (cost=0.00..1.02 rows=2 width=23) (actual time=0.03..0.04 rows=6 loops=1)
                                 ->  Hash  (cost=1.01..1.01 rows=1 width=56) (actual time=6.69..6.69 rows=0 loops=1)
                                       ->  Index Scan using jcd_rel_constraint_relside on rel_constraints rel  (cost=0.00..1.01 rows=1 width=56) (actual time=6.68..6.68 rows=0 loops=1)
                                             Index Cond: (rel_side = 'one'::bpchar)
               ->  Seq Scan on acs_object_types o2  (cost=0.00..2.44 rows=44 width=27) (never executed)
         ->  Seq Scan on acs_object_types o1  (cost=0.00..2.44 rows=44 width=27) (never executed)
   ->  Materialize  (cost=3742888.01..3742888.01 rows=67268 width=103) (never executed)
         ->  Nested Loop  (cost=1.02..3742888.01 rows=67268 width=103) (never executed)
               Join Filter: (("outer".tree_sortkey >= "inner".tree_sortkey) AND ("outer".tree_sortkey <= tree_right("inner".tree_sortkey)) AND ("outer".group_id = "inner".group_id))
               ->  Nested Loop  (cost=0.00..1804822.87 rows=598630 width=53) (never executed)
                     Join Filter: ("inner".object_type = "outer".rel_type)
                     ->  Index Scan using group_elem_idx_element_idx on group_element_index  (cost=0.00..14919.17 rows=598630 width=26) (never executed)
                     ->  Seq Scan on acs_object_types o1  (cost=0.00..2.44 rows=44 width=27) (never executed)
               ->  Materialize  (cost=3.71..3.71 rows=2 width=50) (never executed)
                     ->  Hash Join  (cost=1.02..3.71 rows=2 width=50) (never executed)
                           Hash Cond: ("outer".object_type = "inner".rel_type)
                           ->  Seq Scan on acs_object_types o2  (cost=0.00..2.44 rows=44 width=27) (never executed)
                           ->  Hash  (cost=1.02..1.02 rows=2 width=23) (never executed)
                                 ->  Seq Scan on rel_segments rs  (cost=0.00..1.02 rows=2 width=23) (never executed)
 Total runtime: 1209.02 msec
(35 rows)
Maybe someone who actually understands rel_segments (hint, Don :) ) can look into why this is so awful. Given how bad the query is, it might be worth adding a check that any rel_side 'one' constraints exist before running it.
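
Something along these lines inside the trigger function, say (a sketch, untested; v_count and the early-return placement are my illustration, and the quotes would need doubling inside a quoted PL/pgSQL function body):

```sql
-- Hypothetical guard: skip the expensive rel_constraints_violated_one
-- query entirely unless a rel_side = 'one' constraint actually exists
-- (on every site inspected above, none do).
select count(*) into v_count
  from rel_constraints
 where rel_side = 'one';

if v_count = 0 then
    return new;   -- nothing to violate on the 'one' side
end if;
```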
Posted by Don Baccus on
This looks like the worst execution plan I've ever seen out of Postgres!  We win the gold medal!  Are you really sure you want us to give this up for mere, fleeting seconds of CPU time we can use for other purposes?

I'll take a look at it this afternoon; thanks for the diagnosis work.

Posted by Don Baccus on
Well, it's easy to see why it's slow ... it's that three-deep nested loop that does an index scan on group_element_index.  That scan is being executed a shitpile of times if it accumulates over a second of execution time over the life of the query ...

I'll dig deeper tomorrow, we're dealing with a complex set of nested selects on views.

The second half of the query doesn't suck quite so badly -- only a two-deep nested loop ("only?") which in this case isn't executed at all, so it runs fast.

Posted by Don Baccus on
Just for future reference ... I haven't looked at the various views buried in this innocent-looking query but ...

"materialize" in a plan is a good indication that one will find a UNION in the query.  PG doesn't optimize UNION at this point in time, probably the optimizer's most glaring weakness.

Oracle will optimize UNIONs in some cases, not in others, it's somewhat hit or miss.

That's why I worked so hard to rework the permissions model to get rid of UNIONs in VIEWs.  This sped up every PG query against the model by orders of magnitude (when there are lots of objects) and in Oracle sped up a significant percentage of the equivalent queries a great deal.

Posted by Dirk Gomez on
In Oracle the biggest caveat with UNIONs is that they have an implicit sort attached: to weed out duplicates, Oracle needs to sort the result at some point.

E.g. in permissions, where you only want to know that there is a row, not how many, you would use a UNION ALL (which returns duplicates, so it doesn't need to sort).

The same probably holds true for Postgres.
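
The difference in miniature (the table names here are made up for illustration, not from the permissions datamodel):

```sql
-- UNION must sort (or hash) the combined result to remove duplicates
-- before returning anything:
select grantee_id from direct_grants
union
select grantee_id from group_grants;

-- UNION ALL simply concatenates; for a pure existence test the
-- duplicates are harmless and the dedup step is skipped:
select 1 where exists (
    select 1 from direct_grants where grantee_id = 42
    union all
    select 1 from group_grants  where grantee_id = 42);
```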

Posted by Dirk Gomez on
Oh and the really big caveat I just forgot about:

http://download-west.oracle.com/docs/cd/A87860_01/doc/server.817/a76992/optimops.htm

says:

The optimizer can merge a view into a referencing query block when the view has one or more base tables, provided the view does not contain the following:

    * Set operators (UNION, UNION ALL, INTERSECT, MINUS)

    * A CONNECT BY clause

    * A ROWNUM pseudocolumn

    * Aggregate functions (AVG, COUNT, MAX, MIN, SUM) in the select list

Posted by Don Baccus on
Yes, actually, Jeff and I switched a lot of UNION clauses to UNION ALL for 4.6.? for this very reason, when it was clear that there was no possibility of duplicate rows.

Thanks for the Oracle reference - Oracle's ability to optimize in the face of UNION is rather limited, indeed.

Posted by Don Baccus on
Jeff ... in my installation on PG 7.3 this query has a much nicer plan (no "materialize", which is a killer) and doesn't seem to run particularly slowly with 8000 users.  I'm going to try adding more and more users, but I have a few questions ...

1. What version of PG are you using?

2. Did you "analyze" and restart?  "analyze" has no effect on functions compiled into running backends.  You have to force new backends into existence, each of which will then recompile the function using the "analyze" data.  The easiest way to do this is to stop and restart AOLserver (you may already know this, of course)

3. How many groups and relsegs do you have?  It doesn't look like many ... what groups do the users belong to (or how many users are in your various groups)?
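For background on point 2: plpgsql saves the plan for each SQL statement the first time a backend executes the function, and that saved plan ignores any later ANALYZE until the backend exits. One hedged sketch of the classic workaround for the 7.x era, using dynamic SQL (the function name and query here are illustrative, not part of the datamodel; EXECUTE builds the query as a string, so it is planned fresh on every call):

```sql
create function count_children (integer) returns integer as '
declare
        v_rec   record;
begin
        -- A static query here would be planned once per backend and
        -- kept for the life of that backend.  Running it via EXECUTE
        -- forces a new plan each time, picking up fresh statistics:
        for v_rec in execute ''select count(*) as cnt
                                 from acs_objects
                                where context_id = '' || $1 loop
            return v_rec.cnt;
        end loop;
        return 0;
end;' language 'plpgsql';
```

The price is a re-plan on every call, so this only makes sense for queries whose best plan genuinely shifts with the data.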

Posted by Jeff Davis on
I created 500k users, but only 6 rel_segments (for the main site and 2 subsites) and 2 rel_constraints (one for each subsite). My daughter contributed to the experiment by turning my computer off, so yes, everything has been restarted after a vacuumdb -f -z.

I have a second db (on the same server) with 100k users, 20 rel_segs, and 3 rel_constraints, and there the query takes .5ms (rather than the 1-2 seconds with the dataset that is 5x larger). Same query.

I dropped the group_elem_idx_container_idx index, and on the 500k user db the query dropped to under 1ms with this plan:

explain analyze select constraint_id, constraint_name from rel_constraints_violated_one where rel_id = 20938;
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=26.42..34855.94 rows=1 width=262) (actual time=0.07..0.07 rows=0 loops=1)
   Join Filter: (("inner".segment_id = "outer".required_rel_segment) AND ("inner".element_id = "outer".container_id))
   Filter: ("inner".element_id IS NULL)
   ->  Merge Join  (cost=3.05..10.36 rows=1 width=159) (actual time=0.07..0.07 rows=0 loops=1)
         Merge Cond: ("outer".group_id = "inner".group_id)
         Join Filter: ("outer".object_type = "inner".rel_type)
         ->  Nested Loop  (cost=0.00..17.74 rows=5 width=133) (actual time=0.07..0.07 rows=0 loops=1)
               Join Filter: (("inner".tree_sortkey >= "outer".tree_sortkey) AND ("inner".tree_sortkey <= tree_right("outer".tree_sortkey)))
               ->  Nested Loop  (cost=0.00..14.53 rows=1 width=106) (actual time=0.07..0.07 rows=0 loops=1)
                     Join Filter: ("inner".object_type = "outer".rel_type)
                     ->  Nested Loop  (cost=0.00..11.54 rows=1 width=79) (actual time=0.07..0.07 rows=0 loops=1)
                           Join Filter: ("inner".rel_segment = "outer".segment_id)
                           ->  Index Scan using rel_segments_grp_rel_type_uq on rel_segments rs  (cost=0.00..5.32 rows=6 width=23) (actual time=0.02..0.03 rows=6 loops=1)
                           ->  Seq Scan on rel_constraints rel  (cost=0.00..1.02 rows=1 width=56) (actual time=0.00..0.00 rows=0 loops=6)
                                 Filter: (rel_side = 'one'::bpchar)
                     ->  Seq Scan on acs_object_types o2  (cost=0.00..2.44 rows=44 width=27) (never executed)
               ->  Seq Scan on acs_object_types o1  (cost=0.00..2.44 rows=44 width=27) (never executed)
         ->  Sort  (cost=3.05..3.06 rows=3 width=26) (never executed)
               Sort Key: public.group_element_index.group_id
               ->  Index Scan using group_elem_idx_rel_id_idx on group_element_index  (cost=0.00..3.04 rows=3 width=26) (never executed)
                     Index Cond: (rel_id = 20938)
   ->  Materialize  (cost=33280.71..33280.71 rows=104324 width=103) (never executed)
         ->  Hash Join  (cost=23.36..33280.71 rows=104324 width=103) (never executed)
               Hash Cond: ("outer".rel_type = "inner".object_type)
               Join Filter: ("outer".group_id = "inner".group_id)
               ->  Seq Scan on group_element_index  (cost=0.00..19349.74 rows=927174 width=26) (never executed)
               ->  Hash  (cost=23.29..23.29 rows=29 width=77) (never executed)
                     ->  Nested Loop  (cost=1.07..23.29 rows=29 width=77) (never executed)
                           Join Filter: (("inner".tree_sortkey >= "outer".tree_sortkey) AND ("inner".tree_sortkey <= tree_right("outer".tree_sortkey)))
                           ->  Hash Join  (cost=1.07..4.03 rows=6 width=50) (never executed)
                                 Hash Cond: ("outer".object_type = "inner".rel_type)
                                 ->  Seq Scan on acs_object_types o2  (cost=0.00..2.44 rows=44 width=27) (never executed)
                                 ->  Hash  (cost=1.06..1.06 rows=6 width=23) (never executed)
                                       ->  Seq Scan on rel_segments rs  (cost=0.00..1.06 rows=6 width=23) (never executed)
                           ->  Seq Scan on acs_object_types o1  (cost=0.00..2.44 rows=44 width=27) (never executed)
 Total runtime: 0.29 msec
(36 rows)

Posted by Don Baccus on
Man, you've hit something here ...

The execution plan my installation's using is very similar, if not identical, to the last one you've posted that runs in 1ms on your machine (my eyes cross whenever I try to verify it's actually identical :)

How much buffer space do you have allocated?  (-B from the postmaster command line or set in your PG startup control file)

I'm wondering if you might be disk thrashing?  That index scan does seem to be where the time is being spent, looking at your original explain with timing info.

Then again it may just be the fact that you've materialized the view.

Very strange ...

I think the view can be rewritten using "not exists" and that seems to generate a slightly nicer plan than the one above but I need to test to make sure my rewrite is actually equivalent ...
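A hedged sketch of the kind of "not exists" rewrite meant here. The real rel_constraints_violated_one view is more involved (it also joins rel_segments, acs_object_types, etc.), and the join columns below are simplified from the plan above, so treat this only as an illustration of the shape of the transformation:

```sql
-- Anti-join written as an outer join filtered on IS NULL,
-- roughly the style the view uses today:
select rc.constraint_id, rc.constraint_name
  from rel_constraints rc
       left outer join group_element_index gei
         on  gei.segment_id = rc.required_rel_segment
         and gei.element_id = rc.container_id
 where gei.element_id is null;

-- The same anti-join expressed with NOT EXISTS.  The planner can
-- drive this from the tiny rel_constraints side and probe the big
-- group_element_index table with an index lookup per candidate row,
-- instead of materializing the whole join input:
select rc.constraint_id, rc.constraint_name
  from rel_constraints rc
 where not exists (select 1
                     from group_element_index gei
                    where gei.segment_id = rc.required_rel_segment
                      and gei.element_id = rc.container_id);
```

Whether the two are truly equivalent depends on NULL handling in the join columns, which is exactly the kind of thing the rewrite needs testing for.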

Posted by Jeff Davis on
It could be thrashing, I suppose. I have shared_buffers at 8192 (64MB), which is not as large as it could be (although I intentionally set it a little low, since I want to catch performance problems where the db does not fit in memory). I will bump it up and see if it makes a difference.

Maybe we should drop the container_id index, since I don't think it ever has a particularly good key distribution and so is hardly ever better than a table scan (it's got 2 distinct values on openacs.org).

Even on sloanspace the two big container_id's make up 35% of the rows in group_element_index.

Do we ever really query on container_id w/o also having a group_id?

Posted by Jeff Davis on
I bumped shared_buffers up to 256MB and timed it w/o the index (.56ms first time, .26ms once things were cached), then recreated the index and timed it again: 1st time through 1198ms, 5th time through 1149ms, so no, probably not caching.

I dropped the index and the query was back to .38ms.

I vote do away with the index.
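For reference, the fix itself is a one-liner; a sketch of the statement involved, using the index name from the earlier post:

```sql
-- Only a couple of distinct container_id values exist, so this
-- low-selectivity index misleads the planner into a bad plan;
-- dropping it lets the planner fall back to the fast path:
drop index group_elem_idx_container_idx;
```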

Posted by Don Baccus on
Sounds good ... sounds like a brain-fart on the part of the optimizer, sheesh.

Posted by Jeff Davis on
I committed the changes and an upgrade script.  I did not test the upgrade script on postgres 7.2 but I think it should work there too (and I think it can be run on a 4.6.3 instance w/o problems as well).