Forum OpenACS Development: Duplicate keys on assessment - possible CR bug
ERROR: duplicate key violates unique constraint "acs_objects_pk"
CONTEXT: SQL statement "insert into acs_objects (object_id, object_type, context_id, creation_date, creation_user, creation_ip) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "acs_object__new" line 24 at SQL statement
PL/pgSQL function "content_revision__new" line 22 at assignment
PL/pgSQL function "content_revision__new" line 14 at return
I'm running OpenACS 5.1 on PG 8.0.7.
Guided by Dave's post about cr_dummy http://www.openacs.org/forums/message-view?message_id=400513, I modified the content_type__refresh_view function so it doesn't use the cr_dummy table and the call to content_revision__new is done into the insert to the content type specific table.
For example, the as_sessionsi rule used to be:
ON INSERT TO as_sessionsi DO INSTEAD ( UPDATE cr_dummy SET val = ( SELECT content_revision__new(new.title, new.description::character varying, now(), new.mime_type, new.nls_language,
WHEN new.text IS NULL THEN new.data
END, content_symlink__resolve(new.item_id), new.revision_id, now(), new.creation_user, new.creation_ip) AS content_revision__new);
INSERT INTO as_sessions (session_id, assessment_id, subject_id, staff_id, target_datetime, creation_datetime, first_mod_datetime, last_mod_datetime, completed_datetime, session_status, assessment_status, percent_score)
VALUES (cr_dummy.val, new.assessment_id, new.subject_id, new.staff_id, new.target_datetime, new.creation_datetime, new.first_mod_datetime, new.last_mod_datetime, new.completed_datetime, new.session_status, new.assessment_status, new.percent_score);
after the change, it is:
ON INSERT TO as_sessionsi DO INSTEAD INSERT INTO as_sessions (session_id, assessment_id, subject_id, staff_id, target_datetime, creation_datetime, first_mod_datetime, last_mod_datetime, completed_datetime, session_status, assessment_status, percent_score)
VALUES (content_revision__new(new.title, new.description::character varying, now(), new.mime_type, new.nls_language,
WHEN new.text IS NULL THEN new.data
END, content_symlink__resolve(new.item_id), new.revision_id, now(), new.creation_user, new.creation_ip), new.assessment_id, new.subject_id, new.staff_id, new.target_datetime, new.creation_datetime, new.first_mod_datetime, new.last_mod_datetime, new.completed_datetime, new.session_status, new.assessment_status, new.percent_score)
I made this change willing to reduce the number of deadlocks, but it happened to stop rising unique constraint violations though deadlocks still happen.
I searched in postgresql forums wether rules are transactional or not, and found many different and contradictory answers, so I'm not sure about it but, because of the results I got, rules don't seem to be transactional.
Does cr_dummy have any other purpose than acting as a variable inside each view rule?
Does anyone know for sure if PG rules are transaction safe (specifically the rules of CR views)?
Would the inline function call be a better approach for this case?
Thanks in advance,
That's a requirement we've been talking about relaxing for ages.
Rules should be transactional - every top-level query in PG runs in its own transaction if no global (begin/end) transaction is specified.
Now with PG 8's nested transaction facility things might've changed, I've not dug into that stuff yet...
What kind of lock is causing the deadlock? Some code in assessment may need to be wrapped in an explicit transaction block and protected with a select for update or other form of lock.
So we propose to change not use anymore cr_dummy in CR views, instead call the function inline, comments?
The lock causing the deadlock is a ShareLock:
ERROR: deadlock detected
DETAIL: Process 27056 waits for ShareLock on transaction 21228135; blocked by process 27061.
Process 27061 waits for ShareLock on transaction 21228133; blocked by process 27056.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."users" x WHERE "user_id" = $1 FOR UPDATE OF x"
SQL statement "insert into acs_objects (object_id, object_type, context_id, creation_date, creation_user, creation_ip) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "acs_object__new" line 24 at SQL statement
PL/pgSQL function "content_item__new" line 84 at assignment
you are right, there are two different problems in the CR showing up with heavy concurrency:
a) duplicate revision_id issue:
two concurrent writers cause a conflict with the cr_dummy update, causing a redo of the rule body with the duplicate insert attempt. Your proposed solution won't work for deeper derived types, where the computed value of content_revision__new() is used as first attribute for multiple type records (not only for one as in your example). Do a "\d xowiki_objecti" for an example when you have xowiki install. Using a generated stored procedure is better here, since the return value of content_revision__new() can be stored in a local variable of the procedure.
b) the deadlock problem:
this is the even harder problem, caused by foreign key references to acs_objects (i think). The problem happens not only for acs_objects() but also (for me much more frequent) in the materialized security context table acs_object_context_index(), which i have check in more detail. acs_object_context_index has three attributes, two of which are foreign keys to acs_objects. these foreign keys are the problem.
postgres performs a lock (share lock) on the tuples to which the foreign keys point, apparently to prevent other transactions from modifying the foreign key before this transaction commits. it is practically impossible to cause the references to be always in the same order, so a deadlock can occur. See for a nice simple cases in http://archives.postgresql.org/pgsql-general/2004-09/msg00442.php
the same can happen with the foreign key constraint of context_id in acs_objects as well.
The problem is especially bad for long transactions containing many operations (assessment might be a good example for this). the time of the locking can be shortened by making the constraints deferred (e.g. set these deferrable and use "SET CONSTRAINTS ALL DEFERRED" in the transactions of the CR. The problem won't vanish but occur less often. i would recommend this as a short term improvement.
For testing purposes, i changed the cr-repository such that revisions are no acs-objects any more (removed the acs_object__new withing content_revision__new). now, all deadlocks are gone. this is not a compatible change, since one looses the ability to give permissions on the revision granularity (one can still give permissions on the item_id) and attributes like modifiying-user would have to be stored in the revision tuple, and some queries have to be rewritten. i would believe, that for the sake of robustness and speed, this would be acceptable for many applications. Another idea would be to drop the FK constraint and add some more delete trigger plus maybe a sweeper for orphans. Maybe, some other other ideas will show up.
while this might be a working solution for the CR, i am pretty sure, the same problems with deadlocks caused by FKs can show up for many other acs-objects as well. The large acs_objects table with its many triggers has some issues. But addressing this issue will require many TIPs.
For the CR deadlock there is no easy solution, though your suggestions are interesting, specially since we don't need permission granularity at the revision level for most of the cases... As a short term, and due assessment is showing multiple deadlocks, we are trying to produce an assessment version which uses CR less.
i did this early this week, since my first hypothesis was that maybe the locking for preventing simultaneous writes on the same file could cause the deadlock. the hypothesis proved wrong in a stress test, but i would believe, it improves the current state. If you are interested in a patch, grab it from here: http://media.wu-wien.ac.at/download/as-storage-type.patch
the patch provides a single place to change the storage type. since the cr allows for different instances of one content type different storage types, one can use this easily for testing without loosing old content. For me, the assessment code looks as a good example, why more object orientation is needed in openacs.
We LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE.
In our tests (50 concurrent threads creating revisions) in a good oacs DB (7M acs_objets, 2M cr_revisions), we have NO deadlocks!!!
So you do in your plsql call:
LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE;
your code (usually the insert to the view in the CR case);
You lock only the row that you care for (this locks the primary key of the table), and share the rest. Enjoy!
As a note, just adding Begin end to the transaction reduced the number of deadlocks, but still some happen, and in the way it is now, around ~75% are deadlocks. So a change in this direction must happen.
- the particular lock does not help on duplicate key issue (would not have expected it). i was using my version of the rule with the stored procedure in it for my tests.
- the lock on acs_objects avoids the deadlock for my test cases, no matter whether the lock was in content_item__new or in content revision new (triggered via the rule).
- for my tests, the performance was 10% faster when the lock was used in in the content_item new transaction
- comparing storage_method "file" with "text": storage method text appears only slightly faster than storage_method file (tested with short contents)
concerning the duplicate key issue: by altering the rule into something like
CREATE or REPLACE RULE xowiki_page_r AS ON INSERT TO xowiki_pagei DO INSTEAD SELECT xowiki_pagei_f(new);
we have the effect, that an INSERT INTO xowiki_pagei (a dml statement) gets rewritten into a select statement. oacs complains, if one tries to run a select in a db_dml. The best trick to allow people (e.g. content::revision::new) still use the db_dml for the insert on VIEWi is to put an "begin; ... end;" around the sql-query in the db_dml implementation (the trick is due to Neophytos). Other suggestions?
We plan to provide a patch for HEAD (deadlocks).
Will be good if other OCT members comment on what we have discussed, then we can apply the patch to HEAD.
Also, we need a patch for dropping the use of cr_dummy in favor of the function call.
The approach with the BEGIN .. END in dml statements (mentioned earlier) lead to other problems, so i changed for the time being the few instances, where sql statements inserts directly into the TABLENAMEi view, the command from db_dml to db_0or1row. This is necessary, since the rule processor of postgres rewrites now the insert rule into a select query calling the stored procedure. This happens only in three cases (acs-content-repository, dynamic-types, xowiki). There is as well an update to the TABLENAMEi view in form-procs of dynamic types, it might be necessary to check if this is ok.
The upgrade script contains the altered rule generator which generates now a rule + a stored procedure, which uses a variable to hold the temporary result. This should be faster than the old version based on cr_dummy, which essentially allows only one insert at a time.
The code was only tested with postgresql 8.0.
This patch is not intended for the general public,
but for developers for testing purposes. so, the
patch is against CVS head.
How to use this:
- make a dump of your database
- run the .sql file with
psql -U ... < cr-deadlock-duplicate-upgrade.sql
- cd packages; patch -p0 < cr-deadlock-duplicate.patch-HEAD
If we want to proceed with the patch, it will be necessary to provide a patch for content-types.sql as well.
I tried to modify the SQL script in order to apply the patch on PG 7.4.7 but since the "new" object of the rule can't be sent as a parameter the functions have to be generated with a parameter per column. This led to another problem because there are some revision views (e.g. as_sessionsi) with more than 32 attributes and 32 is the parameter limit for PG functions.
So, the chosen approach was to apply this change to every view with 32 or less attributes. Any advice will be really appreciated, though I think upgrading to PG8.1 would be the best to do :)
There are different options to solve this:
- Creating a function for each content type and call this function from the rule, the down side of this is that the insert would return a row and thus the db driver would return a "not dml nor ddl" exception, so the insert to the view would have to be invoked from a db_plpsql_exec and it would break any external direct insert to the view
- Having also a "dummy" dml call into the rule that would be invoked just to execute the function
- Taking content_revision__new call out of the rule, invoking it before the insert and the rule would just insert one row into the content type corresponding table
Which option makes more sense? (if any)
Should we add this to content::revision::new ?
I am not sure, how Oracle deals with updates on foreign keys, but i would expect, it does not have this problem as well. It is likely, that in times before MVCC, postgres did not have these issues either.
Both problems are postgres related. Oracle might have problems on its own, but not these. So far, with this patch, i have not seem both of the problems any more.
I am not sure, how Oracle deals with updates on foreign keys, but i would expect, it does not have this problem as well. It is likely, that in times before MVCC, postgres did not have these issues either.PG didn't implement foreign key constraints until well after MVCC was implemented (I was one of the people who worked on the first foreign key constraint implementation). Bottom line is that use of foreign key constraints should not cause deadlocks and should not require explicit locking by the application. We should view this as a PG bug that needs a workaround. Has anyone tried this with the latest version of PG? I know locking issues were being worked on for foreign key constraints oh ... a year or so ago???
i have committed the discussed patch with some additions to cvs head. the modifications contain
- a new version number for acs-content-repository (5.3.0d2)
- an upgrade script 5.3.0d1-5.3.0d2
- the lock for item_new and revision_new
- cr_dummy is removed
- the update to change db_dml to db_0or1row in dynamic-types
i have tested the upgrade from 5.3.0d1 and a new install
with acs-content-repository 5.3.0d2. i have not tested dynamic types. The update of xotcl-core will follow separately, since it accompanies a major update for xowiki.
note, that the head version requires now postgresql 8.*
In the University of Valencia we have the same problem with assessment (ERROR: duplicate key violates unique constraint "acs_objects_pk")
It is possible to apply the patch to branch oacs_5_2, somebody has tried it?
We have the assessment version 0.22d2.
We copied these files from head to 5_2 to solve the problem on our production instance:
At the moment it seems that it worked well.
Thanks for the help
This are great news, but this does not simplify release management. Interestingly enough, since OpenACS 5.3 seems to work under Oracle, i deduce that
db_dml lock_objects "LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE"
works under oracle as well. I would think, that in oracle, this locking is not needed at all, and in the postgres case, we should only do this test if the following returns 1:
select 1 where substring(version() from 'PostgreSQL #"[0-9]+.[0-9+]#".%' for '#') < 8.2;
I would hate to do the version check on every insert operation. not sure, what the best approach for acs-content-repository is. On the xotcl-side, the newest version of xotcl-core creates now depending on postgres+version conditionally a locking rule; so there is no runtime penalty for version checking.
a) Require PG 8.2 or higher for OpenACS 5.4
b) For 5.3 keep the lock table statements in, or, alternatively, provide a parameter in acs-content-repository "OldPGP" which is turned on if upon installation (or upgrade) you find that the system is running on an Old DB version. Or store it in an nsv_array and check upon starting the AOLserver.
Interestingly I got problems with "lock table acs-objects in share row exclusive mode" on a farily busy site running against PG 8.2.3.
So, if someone experiences more deadlocks, a simple test case would be certainly helpful to nail this down.
We did test and everything is fine on this regard, now we have in production that.
I´m sure oracle did not have this silly deadlock.
The new code is better and uses a function instead of the cr_dummy table. I don't see any reason to revert it. Luckily we were able to fix it by using new features in PG 8 what were not available in PG 7.
Does the currently committed new code include the explicit locking proposed by Rocael (LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE;)?
If so, someone mentioned that this may undermine MVCC in some cases. Is this correct and if so are we sure that the new code is as good as it can be for PG versions 8.2.x and later?
In my understanding about this thread, the Deadlock problem is solved when you use PostgreSQL 8.2.X, right? So, in OpenACS 5.4 is the lock removed? Because I'm watching the HEAD version of acs-content-repository and the lock is still there. I'm running an assessment with a lot of concurrent users and this lock seems a performance problem to me.
With PostgreSQL MVCC improvements for 8.2.X, can I just coment out the lock line in assessment? Or is somebody going to fix this for the next version?
So I'd encorage anyone who with a newer version of Postgres to make sure those LOCK TABLE statements are not being used. Look in acs-content-repository/tcl/content-item-procs.tcl and content-revision-procs.tcl.