Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!

Hi Shahid,

The PostgreSQL log file is already giving you a hint that you need to tune PostgreSQL. You have to increase checkpoint_segments, which is a pretty common adjustment in OpenACS installs. Given your hardware, I would suggest raising this value to 6, but I don't know if your system can handle it. Try it and see whether things improve.
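
To get a feel for whether the disk can handle it, here is a rough sketch of the arithmetic in Python. It assumes the default 16 MB WAL segment size and the rule of thumb from the PostgreSQL documentation that pg_xlog normally holds at most (2 + checkpoint_completion_target) * checkpoint_segments + 1 segment files; on releases without checkpoint_completion_target, treat that term as 0. The function names are mine, just for illustration.

```python
WAL_SEGMENT_MB = 16  # assumption: default WAL segment size of a stock build


def max_wal_files(checkpoint_segments, completion_target=0.0):
    """Rule-of-thumb upper bound on WAL segment files kept in pg_xlog.

    With completion_target=0.0 this reduces to 2 * checkpoint_segments + 1.
    """
    return int((2 + completion_target) * checkpoint_segments) + 1


def max_wal_mb(checkpoint_segments, completion_target=0.0):
    """Approximate disk space (in MB) those segment files can occupy."""
    return max_wal_files(checkpoint_segments, completion_target) * WAL_SEGMENT_MB


# Compare the usual default (3) with the suggested value (6).
for segments in (3, 6):
    print(segments, max_wal_files(segments), max_wal_mb(segments))
```

Under these assumptions, a value of 6 means up to 13 segment files, or about 208 MB of WAL on disk, so it is mostly a question of free space on the partition holding pg_xlog.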

Generally speaking, I would say your PostgreSQL needs some tuning. I can see that you've increased shared_buffers, and maybe that's the reason you are facing this kind of trouble. You see, both PostgreSQL and AOLServer use shared memory, which is controlled by the shmmax parameter in the OS kernel. If you set PostgreSQL's shared_buffers too high, it can end up competing with the server for the same resources, causing the system to crash. It has already happened to me.
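
A quick sanity check of that budget could look like the sketch below. The assumptions are mine: a bare shared_buffers number counts 8 kB buffers (the default block size), a kB/MB/GB suffix is a plain size, and PostgreSQL needs somewhat more shared memory than shared_buffers alone, which I model with an arbitrary 10% headroom. The 200 MB shmmax and the shared_buffers values are example figures only, and the helper names are hypothetical.

```python
PAGE_BYTES = 8192  # assumption: default PostgreSQL block size
UNITS = {"kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def shared_buffers_bytes(setting):
    """Convert a shared_buffers setting to bytes.

    A bare number is taken as a count of 8 kB buffers;
    a kB/MB/GB suffix is taken as a plain size.
    """
    text = str(setting).strip()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text) * PAGE_BYTES


def fits_in_shmmax(setting, shmmax, headroom=0.10):
    """True if shared_buffers plus the assumed headroom fits under shmmax."""
    needed = shared_buffers_bytes(setting) * (1 + headroom)
    return needed <= shmmax


SHMMAX = 209715200  # example kernel.shmmax: 200 MB, in bytes

for value in ("128MB", "256MB", 16384):
    print(value, shared_buffers_bytes(value), fits_in_shmmax(value, SHMMAX))
```

If the check fails, either lower shared_buffers or raise kernel.shmmax, keeping in mind that whatever PostgreSQL takes is no longer available to the web server.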

However, I don't believe this is the problem in your case. Take a look at the queries being executed at the moment of the crash (usually a select * from pg_stat_activity will do the job) and try to find out what is happening at that time.

Here is what was being executed just before the crash...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid   |   datname    | procpid | usesysid |   usename    |          current_query          | waiting |          query_start          |         backend_start         | client_addr | client_port
----------+--------------+---------+----------+--------------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
 98657666 | nccedudotlrn |   16145 |       10 | postgres     | <IDLE>                          | f       | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |             |          -1
 98657666 | nccedudotlrn |   21493 |       10 | postgres     | <IDLE>                          | f       | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |             |          -1
 98657666 | nccedudotlrn |   21549 |       10 | postgres     | SELECT * FROM pg_stat_activity; | f       | 2009-08-21 15:57:56.399604+01 | 2009-08-21 13:50:55.050635+01 |             |          -1
 98657666 | nccedudotlrn |   25390 |    16388 | nccedudotlrn |                                 | f       | 2009-08-21 15:57:52.086032+01 | 2009-08-21 15:54:32.437033+01 |             |          -1
                                                             : select dotlrn_communities_all.*,
                                                             :        dotlrn_community__url(dotlrn_communities_all.community_id) as url,
                                                             :        (CASE
                                                             :           WHEN dotlrn_communities_all.community_type = 'dotlrn_community'
                                                             :           THEN 'dotlrn_community'
                                                             :           WHEN dotlrn_communities_all.community_type = 'dotlrn_club'
                                                             :           THEN 'dotlrn_club'
                                                             :           WHEN dotlrn_communities_all.community_type = 'dotlrn_pers_community'
                                                             :           THEN 'dotlrn_pers_community'
                                                             :           ELSE 'dotlrn_class_instance'
                                                             :         END) as simple_community_type,
                                                             :        tree_level(dotlrn_communities_all.tree_sortkey) as tree_level,
                                                             :        coalesce((select tree_level(dotlrn_community_types.tree_sortkey)
                                                             :                  from dotlrn_community_types
                                                             :                  where dotlrn_community_types.community_type = dotlrn_communities_all.community_type), 0) as community_
 98657666 | nccedudotlrn |   25433 |    16388 | nccedudotlrn | <IDLE>                          | f       | 2009-08-21 15:56:59.780954+01 | 2009-08-21 15:54:32.484492+01 |             |          -1
 98657666 | nccedudotlrn |   25436 |    16388 | nccedudotlrn | <IDLE>                          | f       |                               | 2009-08-21 15:54:32.488507+01 |             |          -1
 98657666 | nccedudotlrn |   16801 |    16388 | nccedudotlrn | <IDLE>                          | f       | 2009-08-21 15:57:48.476462+01 | 2009-08-21 15:57:48.462369+01 |             |          -1
 98657666 | nccedudotlrn |   19890 |    16388 | nccedudotlrn | <IDLE>                          | f       | 2009-08-21 15:57:53.895152+01 | 2009-08-21 15:57:53.890079+01 |             |          -1
(8 rows)

This was after it crashed...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid   |   datname    | procpid | usesysid | usename  |          current_query          | waiting |          query_start          |         backend_start         | client_addr | client_port
----------+--------------+---------+----------+----------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
 98657666 | nccedudotlrn |   16145 |       10 | postgres | <IDLE>                          | f       | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |             |          -1
 98657666 | nccedudotlrn |   21493 |       10 | postgres | <IDLE>                          | f       | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |             |          -1
 98657666 | nccedudotlrn |   21549 |       10 | postgres | SELECT * FROM pg_stat_activity; | f       | 2009-08-21 16:06:24.955515+01 | 2009-08-21 13:50:55.050635+01 |             |          -1
(3 rows)

We can't see anything out of the ordinary.
Maybe you could advise?

Hi, we edited /etc/sysctl.conf.

Here is the amended version:
-------------------------------
kernel.shmmax = 209715200

checkpoint_segments = 6
-------------------------------

We restarted the PostgreSQL service, but that did not make any difference. The system still crashed after 8 minutes.