Forum OpenACS Development: When OpenACS starts to behave weird after system upgrades

Dear all,

In the last days, one of our sites was struck by a problem, which was not easy to debug. One of the manifestations of this bug was, that suddenly the link like "/dotlrn/classes/math/xowiki/" could not be resolved any more (resolving the link returned the package "math" and not "xowiki'). Interestingly, a listing of all the child nodes of "math" included 'xowiki". The reason for the wrong resolution of the path was in essence the query

select * from site_nodes where name = 'xowiki' and parent_id = 41798;

which did not find a tuple. The following queries succeeded:

select * from site_nodes where name LIKE 'xowiki%' and parent_id = 41798;
select * from site_nodes where name LIKE '%xowiki' and parent_id = 41798;

These are symptoms of a broken PostgreSQL index. After a "reindex" operation, the query was fine again.

# select * from site_nodes where name = 'xowiki' and parent_id = 41798;
 node_id | parent_id | name | directory_p | pattern_p | object_id 
---------+-----------+------+-------------+-----------+-----------
(0 rows)

# REINDEX INDEX CONCURRENTLY site_nodes_un;
REINDEX

# select name from site_nodes where parent_id = 41798 and name = 'xowiki';
 name  
-------
 xowiki
(1 row)

The problem turned out to be a consequence of an upgrade of Ubuntu 18.04.6 LTS to Ubuntu 20.04.4 LTS, where libc was upgraded from 2.27 to 2.31, which lead to different collation orders. So, e.g. the result of a comparison of the strings "a-a" and "a+a" flipped in glibc 2.28. So, this is not Ubuntu specific, but will effect also all other Linux versions upgrading glibc.

A consequence of the changed collation order is that that sorted indices (stored on dik) are suddenly not sorted anymore, and index mechanisms depending on the sort order (like a BTree Index), will fail, leading to problems like lookup failures, where items (content, users, email-addresses, ...) cannot not be found/inserted/deleted, etc. - without having anyone changing Application/OpenACS/Postgres/...

In essence, every string based (non-numeric) index can be broken after such an update. The fix is to reindex the broken indices (or a full dump+restore). Fortunately, the postgres operation "REINDEX ...CONCURRENTLY ..." (see example above) allows to reindex tables with minimal locks, such these can be used also on productive environments.

Early symptoms:

Watch out for errors of the form

ERROR:  table tid from new index tuple (...,...) cannot find insert offset between offsets . and . of block ... in index "..." 
ERROR:  posting list tuple with ... items cannot be split at offset ...
ERROR:  failed to re-find parent key in index "..." for deletion target page ....

If you see such errors, use the wonderful extension "amcheck" which can check the consistency of PostgreSQL indices:

create extension if not exists amcheck;
SELECT bt_index_check(oid) FROM pg_class WHERE relname='...';

Actually, these problems are not PostgreSQL specific, but can hit every (database) system depending on a sort order.

For more details, see: https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/

all the best
-g