Forum OpenACS Development: Duplicate keys on assessment - possible CR bug

While working on assessment performance I developed some stress test cases for the submission of an assessment answer. This test creates 10 concurrent sessions for a given assessment and when running it many deadlocks are risen but there are also many unique constraints violations:
ERROR: duplicate key violates unique constraint "acs_objects_pk"
CONTEXT: SQL statement "insert into acs_objects (object_id, object_type, context_id, creation_date, creation_user, creation_ip) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "acs_object__new" line 24 at SQL statement
PL/pgSQL function "content_revision__new" line 22 at assignment
PL/pgSQL function "content_revision__new" line 14 at return

I'm running OpenACS 5.1 on PG 8.0.7.

Guided by Dave's post about cr_dummy http://www.openacs.org/forums/message-view?message_id=400513, I modified the content_type__refresh_view function so it doesn't use the cr_dummy table and the call to content_revision__new is done into the insert to the content type specific table.

For example, the as_sessionsi rule used to be:
as_sessions_r AS
ON INSERT TO as_sessionsi DO INSTEAD ( UPDATE cr_dummy SET val = ( SELECT content_revision__new(new.title, new.description::character varying, now(), new.mime_type, new.nls_language,
CASE
WHEN new.text IS NULL THEN new.data
ELSE new.text
END, content_symlink__resolve(new.item_id), new.revision_id, now(), new.creation_user, new.creation_ip) AS content_revision__new);
INSERT INTO as_sessions (session_id, assessment_id, subject_id, staff_id, target_datetime, creation_datetime, first_mod_datetime, last_mod_datetime, completed_datetime, session_status, assessment_status, percent_score)
VALUES (cr_dummy.val, new.assessment_id, new.subject_id, new.staff_id, new.target_datetime, new.creation_datetime, new.first_mod_datetime, new.last_mod_datetime, new.completed_datetime, new.session_status, new.assessment_status, new.percent_score);
)

after the change, it is:
as_sessions_r AS
ON INSERT TO as_sessionsi DO INSTEAD INSERT INTO as_sessions (session_id, assessment_id, subject_id, staff_id, target_datetime, creation_datetime, first_mod_datetime, last_mod_datetime, completed_datetime, session_status, assessment_status, percent_score)
VALUES (content_revision__new(new.title, new.description::character varying, now(), new.mime_type, new.nls_language,
CASE
WHEN new.text IS NULL THEN new.data
ELSE new.text
END, content_symlink__resolve(new.item_id), new.revision_id, now(), new.creation_user, new.creation_ip), new.assessment_id, new.subject_id, new.staff_id, new.target_datetime, new.creation_datetime, new.first_mod_datetime, new.last_mod_datetime, new.completed_datetime, new.session_status, new.assessment_status, new.percent_score)

I made this change willing to reduce the number of deadlocks, but it happened to stop rising unique constraint violations though deadlocks still happen.

I searched in postgresql forums wether rules are transactional or not, and found many different and contradictory answers, so I'm not sure about it but, because of the results I got, rules don't seem to be transactional.

Does cr_dummy have any other purpose than acting as a variable inside each view rule?
Does anyone know for sure if PG rules are transaction safe (specifically the rules of CR views)?
Would the inline function call be a better approach for this case?

Thanks in advance,

Derick

Collapse
Posted by Derick Leony on
The CR version is 5.1.6d1
Collapse
Posted by Don Baccus on
cr_dummy is just a dummy table used because the design of ACS required each object type to supply a values table.

That's a requirement we've been talking about relaxing for ages.

Rules should be transactional - every top-level query in PG runs in its own transaction if no global (begin/end) transaction is specified.

Now with PG 8's nested transaction facility things might've changed, I've not dug into that stuff yet...

What kind of lock is causing the deadlock? Some code in assessment may need to be wrapped in an explicit transaction block and protected with a select for update or other form of lock.

I think is a concurrency problem. We are experimenting duplicate keys. This approach eliminates the innecesary use of cr_dummy, but maintain everything else equal. At least for the CR views seems to be a better approach.

So we propose to change not use anymore cr_dummy in CR views, instead call the function inline, comments?

Collapse
Posted by Derick Leony on
Thanks for the answers :)

The lock causing the deadlock is a ShareLock:

ERROR: deadlock detected
DETAIL: Process 27056 waits for ShareLock on transaction 21228135; blocked by process 27061.
Process 27061 waits for ShareLock on transaction 21228133; blocked by process 27056.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."users" x WHERE "user_id" = $1 FOR UPDATE OF x"
SQL statement "insert into acs_objects (object_id, object_type, context_id, creation_date, creation_user, creation_ip) values ( $1 , $2 , $3 , $4 , $5 , $6 )"
PL/pgSQL function "acs_object__new" line 24 at SQL statement
PL/pgSQL function "content_item__new" line 84 at assignment

SQL:

select content_item__new('0-27860472-59F6B62A-D884-4D03-AC1C-2D74A2B6E45E','27859840',NULL,NULL,NULL,'0',NULL,'0.0.0.0','content_item','as_sessions',NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'file')

Derick,

you are right, there are two different problems in the CR showing up with heavy concurrency:

a) duplicate revision_id issue:

two concurrent writers cause a conflict with the cr_dummy update, causing a redo of the rule body with the duplicate insert attempt. Your proposed solution won't work for deeper derived types, where the computed value of content_revision__new() is used as first attribute for multiple type records (not only for one as in your example). Do a "\d xowiki_objecti" for an example when you have xowiki install. Using a generated stored procedure is better here, since the return value of content_revision__new() can be stored in a local variable of the procedure.

b) the deadlock problem:

this is the even harder problem, caused by foreign key references to acs_objects (i think). The problem happens not only for acs_objects() but also (for me much more frequent) in the materialized security context table acs_object_context_index(), which i have check in more detail. acs_object_context_index has three attributes, two of which are foreign keys to acs_objects. these foreign keys are the problem.

postgres performs a lock (share lock) on the tuples to which the foreign keys point, apparently to prevent other transactions from modifying the foreign key before this transaction commits. it is practically impossible to cause the references to be always in the same order, so a deadlock can occur. See for a nice simple cases in http://archives.postgresql.org/pgsql-general/2004-09/msg00442.php

the same can happen with the foreign key constraint of context_id in acs_objects as well.

The problem is especially bad for long transactions containing many operations (assessment might be a good example for this). the time of the locking can be shortened by making the constraints deferred (e.g. set these deferrable and use "SET CONSTRAINTS ALL DEFERRED" in the transactions of the CR. The problem won't vanish but occur less often. i would recommend this as a short term improvement.

For testing purposes, i changed the cr-repository such that revisions are no acs-objects any more (removed the acs_object__new withing content_revision__new). now, all deadlocks are gone. this is not a compatible change, since one looses the ability to give permissions on the revision granularity (one can still give permissions on the item_id) and attributes like modifiying-user would have to be stored in the revision tuple, and some queries have to be rewritten. i would believe, that for the sake of robustness and speed, this would be acceptable for many applications. Another idea would be to drop the FK constraint and add some more delete trigger plus maybe a sweeper for orphans. Maybe, some other other ideas will show up.

while this might be a working solution for the CR, i am pretty sure, the same problems with deadlocks caused by FKs can show up for many other acs-objects as well. The large acs_objects table with its many triggers has some issues. But addressing this issue will require many TIPs.

For duplicate key issue, I agree, a generated function will be the best, and drop the use of cr_dummy.

For the CR deadlock there is no easy solution, though your suggestions are interesting, specially since we don't need permission granularity at the revision level for most of the cases... As a short term, and due assessment is showing multiple deadlocks, we are trying to produce an assessment version which uses CR less.

rocael, if you are working on assessment, i would also recommend to store the items not in the file system, but in strings in the database. This is a simple change, which should make things faster, and reduce the deadlocks as well.

i did this early this week, since my first hypothesis was that maybe the locking for preventing simultaneous writes on the same file could cause the deadlock. the hypothesis proved wrong in a stress test, but i would believe, it improves the current state. If you are interested in a patch, grab it from here: http://media.wu-wien.ac.at/download/as-storage-type.patch

the patch provides a single place to change the storage type. since the cr allows for different instances of one content type different storage types, one can use this easily for testing without loosing old content. For me, the assessment code looks as a good example, why more object orientation is needed in openacs.

Collapse
Posted by Rocael Hernández Rizzardini on
I think we found a good solution for the deadlock problem.
We LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE.
In our tests (50 concurrent threads creating revisions) in a good oacs DB (7M acs_objets, 2M cr_revisions), we have NO deadlocks!!!

So you do in your plsql call:
Begin
LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE;
your code (usually the insert to the view in the CR case);
Return 0;
End;

You lock only the row that you care for (this locks the primary key of the table), and share the rest. Enjoy!

As a note, just adding Begin end to the transaction reduced the number of deadlocks, but still some happen, and in the way it is now, around ~75% are deadlocks. So a change in this direction must happen.

Collapse
Posted by Gustaf Neumann on
how is the performance? if i see corretly, we are loosing MVCC in these cases - maybe still the best choice. The same lock will be needed for other inserts/updates as well with FK to oacs_objects. anyhow, still bettern than the deadlock. i'll redo my changes an do some tests as well.

-gustaf

Collapse
Posted by Gustaf Neumann on
just to follow up my own note. My tests came to the following results:

- the particular lock does not help on duplicate key issue (would not have expected it). i was using my version of the rule with the stored procedure in it for my tests.

- the lock on acs_objects avoids the deadlock for my test cases, no matter whether the lock was in content_item__new or in content revision new (triggered via the rule).

- for my tests, the performance was 10% faster when the lock was used in in the content_item new transaction

- comparing storage_method "file" with "text": storage method text appears only slightly faster than storage_method file (tested with short contents)

concerning the duplicate key issue: by altering the rule into something like
CREATE or REPLACE RULE xowiki_page_r AS ON INSERT TO xowiki_pagei DO INSTEAD SELECT xowiki_pagei_f(new);
we have the effect, that an INSERT INTO xowiki_pagei (a dml statement) gets rewritten into a select statement. oacs complains, if one tries to run a select in a db_dml. The best trick to allow people (e.g. content::revision::new) still use the db_dml for the insert on VIEWi is to put an "begin; ... end;" around the sql-query in the db_dml implementation (the trick is due to Neophytos). Other suggestions?

Collapse
Posted by Gustaf Neumann on
rocael, can you check wether for your test cases the lock in content_item new is sufficient as well?
content_item__new is sufficient in our cases as well, did you add in specific place?
We plan to provide a patch for HEAD (deadlocks).
Will be good if other OCT members comment on what we have discussed, then we can apply the patch to HEAD.
Also, we need a patch for dropping the use of cr_dummy in favor of the function call.
Collapse
Posted by Gustaf Neumann on
today is familiy day. this evening i'll provide a patch/update.
Collapse
Posted by Gustaf Neumann on
Here is a patch + upgrade script.

The approach with the BEGIN .. END in dml statements (mentioned earlier) lead to other problems, so i changed for the time being the few instances, where sql statements inserts directly into the TABLENAMEi view, the command from db_dml to db_0or1row. This is necessary, since the rule processor of postgres rewrites now the insert rule into a select query calling the stored procedure. This happens only in three cases (acs-content-repository, dynamic-types, xowiki). There is as well an update to the TABLENAMEi view in form-procs of dynamic types, it might be necessary to check if this is ok.

The upgrade script contains the altered rule generator which generates now a rule + a stored procedure, which uses a variable to hold the temporary result. This should be faster than the old version based on cr_dummy, which essentially allows only one insert at a time.

The code was only tested with postgresql 8.0.

This patch is not intended for the general public,
but for developers for testing purposes. so, the
patch is against CVS head.

How to use this:
- make a dump of your database
- run the .sql file with
psql -U ... < cr-deadlock-duplicate-upgrade.sql
- cd packages; patch -p0 < cr-deadlock-duplicate.patch-HEAD

http://media.wu-wien.ac.at/download/cr-deadlock-duplicate-upgrade.sql
http://media.wu-wien.ac.at/download/cr-deadlock-duplicate.patch-HEAD

If we want to proceed with the patch, it will be necessary to provide a patch for content-types.sql as well.

enjoy.

Collapse
Posted by Derick Leony on
The patch worked great for one of our development instances with PG8.0.

I tried to modify the SQL script in order to apply the patch on PG 7.4.7 but since the "new" object of the rule can't be sent as a parameter the functions have to be generated with a parameter per column. This led to another problem because there are some revision views (e.g. as_sessionsi) with more than 32 attributes and 32 is the parameter limit for PG functions.

So, the chosen approach was to apply this change to every view with 32 or less attributes. Any advice will be really appreciated, though I think upgrading to PG8.1 would be the best to do :)

Collapse
Posted by Malte Sussdorff on
Can you provide a patch or commit your change to the repository? This would help a lot.
Collapse
Posted by Derick Leony on
Ok, the inline function call is not a better approach since there could be more than one insert in the rule depending on the content type hirarchy.

There are different options to solve this:

  • Creating a function for each content type and call this function from the rule, the down side of this is that the insert would return a row and thus the db driver would return a "not dml nor ddl" exception, so the insert to the view would have to be invoked from a db_plpsql_exec and it would break any external direct insert to the view
  • Having also a "dummy" dml call into the rule that would be invoked just to execute the function
  • Taking content_revision__new call out of the rule, invoking it before the insert and the rule would just insert one row into the content type corresponding table

Which option makes more sense? (if any)

Collapse
Posted by Andrew Piskorski on
This sounds related to the duplicate primary key problems Malte was posting about in June and July.
Collapse
Posted by Gustaf Neumann on
this looks exactly like the same bug. forgot to point this out. the first hypothesis, that the sequence produces duplicate values was not correct. i have an heavily verbose installation that shows clearly that in cases where he error is raised the generated keys are unique.
Collapse
Posted by Dave Bauer on
Rocael,

Should we add this to content::revision::new ?

Right, that's the one!
Collapse
Posted by Andrew Piskorski on
Are the deadlock and duplicate revision_id problems discussed in this thread 100% specific to PostgreSQL? And OpenACS on Oracle is not subject to these problems at all? It sounds as if that's the case, but I haven't yet heard anyone say so for sure...
Collapse
Posted by Dave Bauer on
Oracle does not use cr_dummy, so it should not be subject to duplicate key issues.
Oracle has triggers on views, so it uses per default a stored procedure, id does not need the trick with cr_dummy in the generated postgres rule; oracle cannot have this issue with duplicates. The proposed version is closer to the oracle version.

I am not sure, how Oracle deals with updates on foreign keys, but i would expect, it does not have this problem as well. It is likely, that in times before MVCC, postgres did not have these issues either.

Both problems are postgres related. Oracle might have problems on its own, but not these. So far, with this patch, i have not seem both of the problems any more.

Collapse
Posted by Don Baccus on
I am not sure, how Oracle deals with updates on foreign keys, but i would expect, it does not have this problem as well. It is likely, that in times before MVCC, postgres did not have these issues either.
PG didn't implement foreign key constraints until well after MVCC was implemented (I was one of the people who worked on the first foreign key constraint implementation). Bottom line is that use of foreign key constraints should not cause deadlocks and should not require explicit locking by the application. We should view this as a PG bug that needs a workaround. Has anyone tried this with the latest version of PG? I know locking issues were being worked on for foreign key constraints oh ... a year or so ago???
Gustaf, after today's OCT meeting we decided that you can go ahead and commit to HEAD this patch.
Collapse
Posted by Gustaf Neumann on
Hi everybody,

i have committed the discussed patch with some additions to cvs head. the modifications contain
- a new version number for acs-content-repository (5.3.0d2)
- an upgrade script 5.3.0d1-5.3.0d2
- the lock for item_new and revision_new
- cr_dummy is removed
- the update to change db_dml to db_0or1row in dynamic-types

i have tested the upgrade from 5.3.0d1 and a new install
with acs-content-repository 5.3.0d2. i have not tested dynamic types. The update of xotcl-core will follow separately, since it accompanies a major update for xowiki.

note, that the head version requires now postgresql 8.*

Collapse
Posted by Dario Roig on
Hi!

In the University of Valencia we have the same problem with assessment (ERROR: duplicate key violates unique constraint "acs_objects_pk")
It is possible to apply the patch to branch oacs_5_2, somebody has tried it?

We have the assessment version 0.22d2.

Thanks.

yes, you can apply the patch to 5.1 or 5.2, the real requirement is to have PG 8.x. Galileo had that patch in 5.1 and now in 5.2
Collapse
Posted by Dario Roig on
Hi!

We copied these files from head to 5_2 to solve the problem on our production instance:

openacs-4/packages/acs-content-repository/tcl/content-folder-procs.tcl
openacs-4/packages/acs-content-repository/tcl/content-folder-procs.xql
openacs-4/packages/acs-content-repository/sql/postgresql/upgrade/upgrade-5.3.0d4-5.3.0d5.sql
openacs-4/packages/acs-content-repository/sql/postgresql/upgrade/upgrade-5.3.0d6-5.3.0d7.sql

At the moment it seems that it worked well.

Thanks for the help

Collapse
Posted by Dario Roig on
lack these files:

acs-content-repository/tcl/content-type-procs.xql
acs-content-repository/tcl/content-type-procs.tcl
acs-content-repository/tcl/content-item-procs.tcl
acs-content-repository/tcl/content-revision-procs.tcl
acs-content-repository/tcl/content-revision-procs.xql
acs-content-repository/tcl/content-revision-procs-postgresql.xql
acs-content-repository/sql/postgresql/upgrade/upgrade-5.3.0d1-5.3.0d2.sql

Thanks

This bug in PG is no longer present at PG 8.2.x version, the 8.2.0 release actually fixed that. We might return to the previous state that CR was. It is recommened to upgrade to 8.2.x since you might experiment that bug in other tables, we did in acs_objects as well until the upgrade was performed.
Rocael, the thread deals with two different bugs: (a) duplicate key issue due to the pg rules semantics, and (b) the deadlock topic with share locks for the foreign keys? Is there some documentation about the "bug" and the "fix" in postgres?
is related to the deadlock, so no longer needed the explicit lock in 8.2.x series.
i just did some tests on xowiki and xotcl-core, which has its own db-queries. The test creates 20 threads, each of these creates 100 content items with 10 revisions each in an eager fashion. So every thread creates 1100 acs_objects, altogether 22000. While earlier versions created deadlocks, 8.2.0 does not.

This are great news, but this does not simplify release management. Interestingly enough, since OpenACS 5.3 seems to work under Oracle, i deduce that

db_dml lock_objects "LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE"

works under oracle as well. I would think, that in oracle, this locking is not needed at all, and in the postgres case, we should only do this test if the following returns 1:

select 1 where substring(version() from 'PostgreSQL #"[0-9]+.[0-9+]#".%' for '#') < 8.2;

I would hate to do the version check on every insert operation. not sure, what the best approach for acs-content-repository is. On the xotcl-side, the newest version of xotcl-core creates now depending on postgres+version conditionally a locking rule; so there is no runtime penalty for version checking.

I would do the following:

a) Require PG 8.2 or higher for OpenACS 5.4
b) For 5.3 keep the lock table statements in, or, alternatively, provide a parameter in acs-content-repository "OldPGP" which is turned on if upon installation (or upgrade) you find that the system is running on an Old DB version. Or store it in an nsv_array and check upon starting the AOLserver.

Interestingly I got problems with "lock table acs-objects in share row exclusive mode" on a farily busy site running against PG 8.2.3.

my tests just showed that some deadlock situations disappeared (it tests mostly busy content revision creation). As for every test, it does not prove that there are some more deadlock situations, which are not tested.

So, if someone experiences more deadlocks, a simple test case would be certainly helpful to nail this down.

oacs 5.4 can rollback the explicit lock, although the only ones that were experimenting the deadlock are here, and will be probably aware that pg 8.2.x upgrade is recommended, but still works oacs in 8.x versions.

We did test and everything is fine on this regard, now we have in production that.

I´m sure oracle did not have this silly deadlock.

Collapse
Posted by Dave Bauer on
Rocael

The new code is better and uses a function instead of the cr_dummy table. I don't see any reason to revert it. Luckily we were able to fix it by using new features in PG 8 what were not available in PG 7.

Collapse
Posted by Richard Hamilton on
Now that it has been established that the Postgres bug that caused this issue has been fixed in PG 8.2.x, there is one thing that is not clear to me from the posts above:

Does the currently committed new code include the explicit locking proposed by Rocael (LOCK TABLE acs_objects IN SHARE ROW EXCLUSIVE MODE;)?

If so, someone mentioned that this may undermine MVCC in some cases. Is this correct and if so are we sure that the new code is as good as it can be for PG versions 8.2.x and later?

Regards
RIchard

Collapse
Posted by Eduardo Santos on
Hi everybody,

In my understanding about this thread, the Deadlock problem is solved when you use PostgreSQL 8.2.X, right? So, in OpenACS 5.4 is the lock removed? Because I'm watching the HEAD version of acs-content-repository and the lock is still there. I'm running an assessment with a lot of concurrent users and this lock seems a performance problem to me.

With PostgreSQL MVCC improvements for 8.2.X, can I just coment out the lock line in assessment? Or is somebody going to fix this for the next version?

Collapse
Posted by Gustaf Neumann on
You are right, i have removed the LOCK statements in CVS head.
Collapse
Posted by Luke Pond on
Thanks Gustaf, I just updated my CR to remove those LOCK TABLE statements and wanted to report that it made a significant improvement to the assessment page load times when posting a page full of questions. Also copying a large assessment with hundreds of content items is MUCH faster now.

So I'd encorage anyone who with a newer version of Postgres to make sure those LOCK TABLE statements are not being used. Look in acs-content-repository/tcl/content-item-procs.tcl and content-revision-procs.tcl.