Forum OpenACS Development: Re: Duplicate keys on assessment - possible CR bug

Collapse
Posted by Don Baccus on
cr_dummy is just a dummy table used because the design of ACS required each object type to supply a values table.

That's a requirement we've been talking about relaxing for ages.

Rules should be transactional - every top-level query in PG runs in its own transaction if no global (begin/end) transaction is specified.

Now with PG 8's nested transaction facility things might've changed, I've not dug into that stuff yet...

What kind of lock is causing the deadlock? Some code in assessment may need to be wrapped in an explicit transaction block and protected with a select for update or other form of lock.

I think is a concurrency problem. We are experimenting duplicate keys. This approach eliminates the innecesary use of cr_dummy, but maintain everything else equal. At least for the CR views seems to be a better approach.

So we propose to change not use anymore cr_dummy in CR views, instead call the function inline, comments?

Collapse
Posted by Derick Leony on
Thanks for the answers :)

The lock causing the deadlock is a ShareLock:

ERROR: deadlock detected
DETAIL: Process 27056 waits for ShareLock on transaction 21228135; blocked by process 27061.
Process 27061 waits for ShareLock on transaction 21228133; blocked by process 27056.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."users" x WHERE "user_id" = $1 FOR UPDATE OF x"
SQL statement "insert into acs_objects (object_id, object_type, context_id, creation_date, creation_user, creation_ip) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "acs_object__new" line 24 at SQL statement
PL/pgSQL function "content_item__new" line 84 at assignment

SQL:

select content_item__new('0-27860472-59F6B62A-D884-4D03-AC1C-2D74A2B6E45E','27859840',NULL,NULL,NULL,'0',NULL,'0.0.0.0','content_item','as_sessions',NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'file')

Derick,

you are right, there are two different problems in the CR showing up with heavy concurrency:

a) duplicate revision_id issue:

two concurrent writers cause a conflict with the cr_dummy update, causing a redo of the rule body with the duplicate insert attempt. Your proposed solution won't work for deeper derived types, where the computed value of content_revision__new() is used as first attribute for multiple type records (not only for one as in your example). Do a "\d xowiki_objecti" for an example when you have xowiki install. Using a generated stored procedure is better here, since the return value of content_revision__new() can be stored in a local variable of the procedure.

b) the deadlock problem:

this is the even harder problem, caused by foreign key references to acs_objects (i think). The problem happens not only for acs_objects() but also (for me much more frequent) in the materialized security context table acs_object_context_index(), which i have check in more detail. acs_object_context_index has three attributes, two of which are foreign keys to acs_objects. these foreign keys are the problem.

postgres performs a lock (share lock) on the tuples to which the foreign keys point, apparently to prevent other transactions from modifying the foreign key before this transaction commits. it is practically impossible to cause the references to be always in the same order, so a deadlock can occur. See for a nice simple cases in http://archives.postgresql.org/pgsql-general/2004-09/msg00442.php

the same can happen with the foreign key constraint of context_id in acs_objects as well.

The problem is especially bad for long transactions containing many operations (assessment might be a good example for this). the time of the locking can be shortened by making the constraints deferred (e.g. set these deferrable and use "SET CONSTRAINTS ALL DEFERRED" in the transactions of the CR. The problem won't vanish but occur less often. i would recommend this as a short term improvement.

For testing purposes, i changed the cr-repository such that revisions are no acs-objects any more (removed the acs_object__new withing content_revision__new). now, all deadlocks are gone. this is not a compatible change, since one looses the ability to give permissions on the revision granularity (one can still give permissions on the item_id) and attributes like modifiying-user would have to be stored in the revision tuple, and some queries have to be rewritten. i would believe, that for the sake of robustness and speed, this would be acceptable for many applications. Another idea would be to drop the FK constraint and add some more delete trigger plus maybe a sweeper for orphans. Maybe, some other other ideas will show up.

while this might be a working solution for the CR, i am pretty sure, the same problems with deadlocks caused by FKs can show up for many other acs-objects as well. The large acs_objects table with its many triggers has some issues. But addressing this issue will require many TIPs.

For duplicate key issue, I agree, a generated function will be the best, and drop the use of cr_dummy.

For the CR deadlock there is no easy solution, though your suggestions are interesting, specially since we don't need permission granularity at the revision level for most of the cases... As a short term, and due assessment is showing multiple deadlocks, we are trying to produce an assessment version which uses CR less.

rocael, if you are working on assessment, i would also recommend to store the items not in the file system, but in strings in the database. This is a simple change, which should make things faster, and reduce the deadlocks as well.

i did this early this week, since my first hypothesis was that maybe the locking for preventing simultaneous writes on the same file could cause the deadlock. the hypothesis proved wrong in a stress test, but i would believe, it improves the current state. If you are interested in a patch, grab it from here: http://media.wu-wien.ac.at/download/as-storage-type.patch

the patch provides a single place to change the storage type. since the cr allows for different instances of one content type different storage types, one can use this easily for testing without loosing old content. For me, the assessment code looks as a good example, why more object orientation is needed in openacs.

Collapse
Posted by Rocael Hernández Rizzardini on
I think we found a good solution for the deadlock problem.
We LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE.
In our tests (50 concurrent threads creating revisions) in a good oacs DB (7M acs_objets, 2M cr_revisions), we have NO deadlocks!!!

So you do in your plsql call:
Begin
LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE;
your code (usually the insert to the view in the CR case);
Return 0;
End;

You lock only the row that you care for (this locks the primary key of the table), and share the rest. Enjoy!

As a note, just adding Begin end to the transaction reduced the number of deadlocks, but still some happen, and in the way it is now, around ~75% are deadlocks. So a change in this direction must happen.

Collapse
Posted by Gustaf Neumann on
how is the performance? if i see corretly, we are loosing MVCC in these cases - maybe still the best choice. The same lock will be needed for other inserts/updates as well with FK to oacs_objects. anyhow, still bettern than the deadlock. i'll redo my changes an do some tests as well.

-gustaf

Collapse
Posted by Gustaf Neumann on
just to follow up my own note. My tests came to the following results:

- the particular lock does not help on duplicate key issue (would not have expected it). i was using my version of the rule with the stored procedure in it for my tests.

- the lock on acs_objects avoids the deadlock for my test cases, no matter whether the lock was in content_item__new or in content revision new (triggered via the rule).

- for my tests, the performance was 10% faster when the lock was used in in the content_item new transaction

- comparing storage_method "file" with "text": storage method text appears only slightly faster than storage_method file (tested with short contents)

concerning the duplicate key issue: by altering the rule into something like
CREATE or REPLACE RULE xowiki_page_r AS ON INSERT TO xowiki_pagei DO INSTEAD SELECT xowiki_pagei_f(new);
we have the effect, that an INSERT INTO xowiki_pagei (a dml statement) gets rewritten into a select statement. oacs complains, if one tries to run a select in a db_dml. The best trick to allow people (e.g. content::revision::new) still use the db_dml for the insert on VIEWi is to put an "begin; ... end;" around the sql-query in the db_dml implementation (the trick is due to Neophytos). Other suggestions?

Collapse
Posted by Gustaf Neumann on
rocael, can you check wether for your test cases the lock in content_item new is sufficient as well?
content_item__new is sufficient in our cases as well, did you add in specific place?
We plan to provide a patch for HEAD (deadlocks).
Will be good if other OCT members comment on what we have discussed, then we can apply the patch to HEAD.
Also, we need a patch for dropping the use of cr_dummy in favor of the function call.
Collapse
Posted by Gustaf Neumann on
today is familiy day. this evening i'll provide a patch/update.
Collapse
Posted by Gustaf Neumann on
Here is a patch + upgrade script.

The approach with the BEGIN .. END in dml statements (mentioned earlier) lead to other problems, so i changed for the time being the few instances, where sql statements inserts directly into the TABLENAMEi view, the command from db_dml to db_0or1row. This is necessary, since the rule processor of postgres rewrites now the insert rule into a select query calling the stored procedure. This happens only in three cases (acs-content-repository, dynamic-types, xowiki). There is as well an update to the TABLENAMEi view in form-procs of dynamic types, it might be necessary to check if this is ok.

The upgrade script contains the altered rule generator which generates now a rule + a stored procedure, which uses a variable to hold the temporary result. This should be faster than the old version based on cr_dummy, which essentially allows only one insert at a time.

The code was only tested with postgresql 8.0.

This patch is not intended for the general public,
but for developers for testing purposes. so, the
patch is against CVS head.

How to use this:
- make a dump of your database
- run the .sql file with
psql -U ... < cr-deadlock-duplicate-upgrade.sql
- cd packages; patch -p0 < cr-deadlock-duplicate.patch-HEAD

http://media.wu-wien.ac.at/download/cr-deadlock-duplicate-upgrade.sql
http://media.wu-wien.ac.at/download/cr-deadlock-duplicate.patch-HEAD

If we want to proceed with the patch, it will be necessary to provide a patch for content-types.sql as well.

enjoy.

Collapse
Posted by Derick Leony on
The patch worked great for one of our development instances with PG8.0.

I tried to modify the SQL script in order to apply the patch on PG 7.4.7 but since the "new" object of the rule can't be sent as a parameter the functions have to be generated with a parameter per column. This led to another problem because there are some revision views (e.g. as_sessionsi) with more than 32 attributes and 32 is the parameter limit for PG functions.

So, the chosen approach was to apply this change to every view with 32 or less attributes. Any advice will be really appreciated, though I think upgrading to PG8.1 would be the best to do :)

Collapse
Posted by Malte Sussdorff on
Can you provide a patch or commit your change to the repository? This would help a lot.