Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!
The following messages appear in the file below:
-----------------------------------------------------------
/var/lib/pgsql/data/pg_log/postgresql-2009-08-21_000000.log
-----------------------------------------------------------
LOG: checkpoints are occurring too frequently (16 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (13 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (15 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
--------------------------------------------------
The PostgreSQL log file is already giving you a hint that you need to tune PostgreSQL. You have to increase checkpoint_segments, which is a pretty common adjustment in OpenACS installs. Given your hardware, I would try increasing this value to 6, but I don't know if your system can handle it. Try it and see whether things improve.
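To judge whether a new checkpoint_segments value actually helps, it may be useful to count the warnings and look at the intervals they report, before and after the change. A minimal sketch, assuming the log lines look like the ones quoted above; the function names are just illustrative:

```python
import re
from statistics import mean

# Matches the warning text shown in the log excerpt above.
CHECKPOINT_RE = re.compile(
    r"checkpoints are occurring too frequently \((\d+) seconds? apart\)"
)


def checkpoint_intervals(log_lines):
    """Return the interval in seconds reported by each checkpoint warning."""
    intervals = []
    for line in log_lines:
        match = CHECKPOINT_RE.search(line)
        if match:
            intervals.append(int(match.group(1)))
    return intervals


def summarize(intervals):
    """Return (count, shortest, mean) of the intervals, or None if empty."""
    if not intervals:
        return None
    return len(intervals), min(intervals), mean(intervals)
```

Feeding it the lines of the log file (for example `summarize(checkpoint_intervals(open(path)))`, with `path` pointing at your own log) gives a quick number to compare between runs. Fewer warnings and longer intervals would suggest the setting is moving in the right direction.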
More generally, I would say you need to do some tuning in your PostgreSQL. I can see that you've increased shared_buffers, and maybe that's the reason you are facing this kind of trouble. You see: both PostgreSQL and AOLServer use shared memory, which is controlled by the shmmax parameter in the OS kernel. If you use too big a value for the PostgreSQL shared_buffers, it can start a race for the same resources with the server, causing the system to crash. It has already happened to me.
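A rough way to sanity-check shared_buffers against shmmax is sketched here. It assumes shared_buffers is given as a plain number of buffers and that the block size is the default 8192 bytes. kernel.shmmax caps the size of a single shared memory segment, and PostgreSQL needs somewhat more than shared_buffers alone, so the 25% headroom is only an illustrative guess, not a documented figure:

```python
def shared_buffers_bytes(n_buffers, block_size=8192):
    """Size in bytes of shared_buffers when set as a number of buffers."""
    return n_buffers * block_size


def fits_under_shmmax(buffers_bytes, shmmax_bytes, headroom=0.25):
    """True if shared_buffers plus the assumed headroom fits under shmmax.

    headroom is a hypothetical allowance for PostgreSQL's other shared
    structures; adjust it for your own installation.
    """
    return buffers_bytes * (1 + headroom) <= shmmax_bytes


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Read the current kernel.shmmax value in bytes (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

On a Linux box, `fits_under_shmmax(shared_buffers_bytes(n), read_shmmax())` with your own `n` gives a first indication of whether the setting is too ambitious for the kernel limit.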
However, I don't believe this is the problem in your case. Take a look at the queries being executed at the moment of the crash (usually a select * from pg_stat_activity will do the job) and try to find out what is happening at that time.
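Since most backends are usually idle, it can help to look only at the ones that are actually running something. A small sketch, assuming the older column names (procpid, current_query) and that the rows have already been fetched as dictionaries with string values by whatever driver you use; the helper name is made up for illustration:

```python
# SQL for the older pg_stat_activity layout (procpid, current_query).
ACTIVE_QUERY_SQL = """
SELECT procpid, usename, query_start, current_query
  FROM pg_stat_activity
 WHERE current_query <> '<IDLE>'
 ORDER BY query_start
"""


def active_backends(rows):
    """Keep rows that are not plain '<IDLE>', ordered oldest query first."""
    busy = [r for r in rows if r["current_query"].strip() != "<IDLE>"]
    # A missing query_start sorts first instead of raising an error.
    return sorted(busy, key=lambda r: r.get("query_start") or "")
```

Running the SQL directly in psql does the same job; the Python filter is only useful if you are capturing snapshots from a script shortly before the crash.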
nccedudotlrn=# SELECT * FROM pg_stat_activity;
datid | datname | procpid | usesysid | usename | current_query | waiting | query_start | backend_start | client_addr | client_port
----------+--------------+---------+----------+--------------+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn | 16145 | 10 | postgres | <IDLE> | f | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 | | -1
98657666 | nccedudotlrn | 21493 | 10 | postgres | <IDLE> | f | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 | | -1
98657666 | nccedudotlrn | 21549 | 10 | postgres | SELECT * FROM pg_stat_activity; | f | 2009-08-21 15:57:56.399604+01 | 2009-08-21 13:50:55.050635+01 | | -1
98657666 | nccedudotlrn | 25390 | 16388 | nccedudotlrn | | f | 2009-08-21 15:57:52.086032+01 | 2009-08-21 15:54:32.437033+01 | | -1
: select dotlrn_communities_all.*,
: dotlrn_community__url(dotlrn_communities_all.community_id) as url,
: (CASE
: WHEN
: dotlrn_communities_all.community_type = 'dotlrn_community'
: THEN 'dotlrn_community'
: WHEN dotlrn_communities_all.community_type = 'dotlrn_club'
: THEN 'dotlrn_club'
: WHEN dotlrn_communities_all.community_type = 'dotlrn_pers_community'
: THEN 'dotlrn_pers_community'
: ELSE 'dotlrn_class_instance'
: END) as simple_community_type,
: tree_level(dotlrn_communities_all.tree_sortkey) as tree_level,
: coalesce((select tree_level(dotlrn_community_types.tree_sortkey)
: from dotlrn_community_types
: where dotlrn_community_types.community_type = dotlrn_communities_all.community_type), 0) as community_
98657666 | nccedudotlrn | 25433 | 16388 | nccedudotlrn | <IDLE> | f | 2009-08-21 15:56:59.780954+01 | 2009-08-21 15:54:32.484492+01 | | -1
98657666 | nccedudotlrn | 25436 | 16388 | nccedudotlrn | <IDLE> | f | | 2009-08-21 15:54:32.488507+01 | | -1
98657666 | nccedudotlrn | 16801 | 16388 | nccedudotlrn | <IDLE> | f | 2009-08-21 15:57:48.476462+01 | 2009-08-21 15:57:48.462369+01 | | -1
98657666 | nccedudotlrn | 19890 | 16388 | nccedudotlrn | <IDLE> | f | 2009-08-21 15:57:53.895152+01 | 2009-08-21 15:57:53.890079+01 | | -1
(8 rows)
This was after it crashed...
nccedudotlrn=# SELECT * FROM pg_stat_activity;
datid | datname | procpid | usesysid | usename | current_query | waiting | query_start | backend_start | client_addr | client_port
----------+--------------+---------+----------+----------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn | 16145 | 10 | postgres | <IDLE> | f | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 | | -1
98657666 | nccedudotlrn | 21493 | 10 | postgres | <IDLE> | f | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 | | -1
98657666 | nccedudotlrn | 21549 | 10 | postgres | SELECT * FROM pg_stat_activity; | f | 2009-08-21 16:06:24.955515+01 | 2009-08-21 13:50:55.050635+01 | | -1
(3 rows)
We can't see anything out of the ordinary.
Maybe you could advise?
Here is the amended version:
-------------------------------
kernel.shmmax = 209715200
checkpoint_segments = 6
-------------------------------
We restarted the PostgreSQL service, but that did not make any difference. The system still crashed after 8 minutes.