Forum OpenACS Q&A: PostgreSQL dying

Collapse
Posted by Roberto Mello on
Hi all,

I am facing this problem for the first time... I have this OpenACS 3.2.2 installation on an AMD K6-2 300 with 64 Mb of RAM. It originally had only 32 Mb, so I was having PostgreSQL start with no '-B x' or '-o "-S y"' options as Don recommends because it would swap too much.

I asked and 32 more megs were put in the box. I then set PostgreSQL with '-B 500 -o "-S 500"' so it would have some more room to work, but now PostgreSQL keeps dying on me very frequently. Apparently it brings itself up and does its own cleaning, but AOLserver (3 running nsd76) goes defunct immediately without returning an error message or anything.

What could be causing this ? Bad RAM ? Wrong settings (-B, etc...) ? And what can I do to solve this (aD's keepalive ?)?

Here's what AOLserver leaves in the logs... Any help is appreciated.

[10/Jul/2000:18:12:07][5211.9220][-conn4-] Notice: Querying 'update
clickthrough_log set click_count = click_count + 1
where local_url = 'index.adp'
and foreign_url = 'http://www.bookstore.usu.edu'
and trunc(entry_date) = trunc(sysdate());'
NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am going to
terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
Collapse
Posted by Roberto Mello on
Sometimes it also puts this in the logs:

[10/Jul/2000:18:29:27][5675.4101][-conn0-] Notice: starting: waiting for connections
[10/Jul/2000:18:29:37][5675.4101][-conn0-] Notice: Ns_PgExec: Trying to reopen database connection
[10/Jul/2000:18:29:37][5675.4101][-conn0-] Notice: Opening exchange on localhost
[10/Jul/2000:18:29:37][5675.4101][-conn0-] Error: Ns_PgOpenDb(postgres):  Could not connect to localhost::exchange:  The Data Base System is starting up

[10/Jul/2000:18:29:37][5675.4101][-conn0-] Error: nsd.db: error(localhost::exchange, pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
): update users
set last_visit = sysdate()
where user_id = 3
Collapse
Posted by Roberto Mello on
Here's the followup on what happenned... I forgot to mention that PostgreSQL only became this unstable immediately AFTER I reloaded my database from a pg_dump. I had been running it with different -B and -o flags fro a couple hours already.

Interesting thing is that even with AOLserver down, going to psql and trying to delete a row from a table would cause the backend to die. So it had nothing to do with AOLserver.

So after wandering for a while, I dropdb'd and reloaded my database from scratch from the openacs datamodel... Voila ! No more instabilities... PostgreSQL has been working like a charm.

Could this mean that there's a problem with pg_dump ?

Collapse
Posted by Don Baccus on
Roberto and I have discussed this in private e-mail, and I've asked that he try to bundle up a reproducible example to shuffle off to the PG group.

In reality, one normally would do a dropdb before a full restore just to clean out the various tables.  I'm also curious as to whether or not Roberto got any errors during his reload (Roberto?).

Anyway, it sounds like one or more system tables got screwed up by the  reload, and even if reloading fails it shouldn't cause backend crashes.

Collapse
Posted by Roberto Mello on
I didn't get any errors during the reload. Everything looked just fine. At first I didn't do a dropdb, I then did one and would still be unstable.

I then recompiled my PostgreSQL (I was intrigued and thought, why not ?), initdb'd, reloaded the database, and still unstable.

The only way that I solved this was reloading the DB from scratch from the datamodel and modifying the dump with just the tables I wanted (which was a pain in the butt).

Collapse
Posted by Roberto Mello on
Ok, I redid the db loading step, this time logging the errors to a file. Here are the errors that came out. I guess I didn't notice them before.

The errors showed also that close to the end of the dump, PG says that it can't find some tables, although their definition is in the dump (e.g. survsimp_responses). Which means that it didn't create them, for what reason I don't know because they seem pretty normal to me. I think the non-creation of these tables were due to the fact that PG didn't create the function sysdate():

ERROR:  Function 'sysdate()' does not exist
        Unable to identify a function that satisfies the given argument types
        You may need to add explicit typecasts
(repeated 4 times)
ERROR:  Relation 'users_special_quotas' does not exist
.
ERROR:  Relation 'survsimp_surveys' does not exist
(repeated several times)
ERROR:  DefineIndex: relation "survsimp_surveys" not found
ERROR:  Relation 'users_special_quotas' does not exist
ERROR:  Relation 'survsimp_responses' does not exist
ERROR:  select rules target entry 16 has different size from attribute m_address
ERROR:  select rules target entry 4 has different size from attribute team_leader_p
ERROR:  select rules target entry 3 has different size from attribute on_which_table
ERROR:  select rules target entry 2 has different size from attribute salary
ERROR:  view rule for ec_subsubcategories_augmented must be named _RETec_subsubcategories_augmented
ERROR:  view rule for ec_gift_certificates_approved must be named _RETec_gift_certificates_approved
ERROR:  view rule for ec_gift_certificates_purchased must be named _RETec_gift_certificates_purchased
ERROR:  view rule for ec_fin_transactions_reportable must be named _RETec_fin_transactions_reportable
ERROR:  select rules target entry 7 has different size from attribute read_permission_p
ERROR:  select rules target entry 8 has different size from attribute email_bouncing_p
ERROR:  select rules target entry 9 has different size from attribute approved_p
ERROR:  view rule for user_group_member_field_mapping must be named _RETuser_group_member_field_mapping
Collapse
Posted by Roberto Mello on
Sorry about filling up your inboxes like this, but I think this is pretty important...

I just forgot to add that the backend continues to crash with this last reload I made.

Collapse
Posted by Roberto Mello on
It's been two weeks and I haven't seen anyone report problems similar to what I had and described above. That can that either nbbody read or payed attention to this thread or it's only happenning with me.

However, I can consistently reproduce backend crashes by simply pg_dumping and reloading the OpenACS datamodel and some data in three different machines running different versions of PostgreSQL 7.

I don't know about you all, but this scares the heck out of me. What other ways are there to backup a PG database?

Collapse
Posted by Don Baccus on
Well, pg_dump's been considerably re-written in the last few weeks, though I'm not sure it will help you.

If you can reproduce this problem as easily as you say (which I know you can, of course), the pg_hackers mailing list should hear about it.

Visit www.postgresql.org and chase links and you'll find out how to subscribe (it's a majordomo list, of course).

There are people there who do nothing but work on PG, and they'll be interested.

Collapse
Posted by Babak Ashrafi on
Since July, has there been any further insight into the postgres backend terminating abnormally? We experienced postgres crashing and then aolserver crashed (with the acs error log reporting the same Message from Postgres...backend died abnormally and possibly corrupted shared memory...), on a site that was pgdump'd and restored on another machine, right on login. -Both machines are running postgres 7.0.2, and same setup of openACS, on FreeBSD.
We did:
on the to-be-backed-up machine:
1. pg_dump -o HRS > hrs.pgdump
on the restore machine: 
2. dropdb HRS
3. createdb HRS
4. psql -e < hrs.pgdump
The echoed commands did not show errors as when using cat hrs.pdump |psql HRS (which had errors)
5. Then we restarted nsd

Q: What is the proper way to backup and restore an entire openACS site 
thanks.