Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!

Hi everybody,

I'm following this thread because this used to happen a lot on my server, mostly because some queries were too large and PostgreSQL and AOLServer were competing for the same resources. Some data may help to find the problem:

1 - Is the system 32 or 64 bits?
2 - Is PostgreSQL on the same box? From your log, I can assume it is.
3 - What do you see in the database when the crash happens?

Something very similar can happen when you have the news aggregator installed and the sources table is so big that the system can't load it into memory, causing the server to receive a signal 11 and crash.

If you have monitoring, take a look at the scheduled procs being executed at the exact time of the crash. It could help.

Hi Eduardo, thank you for your response.

1 - Our system is 32 bits
2 - Yes PostgreSQL is on the same box
3 - Not too sure where the db log is...

Here is what I've found
----------------------------
/var/lib/pgsql/pgstartup.log
----------------------------
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets
LOG: logger shutting down
LOG: logger shutting down
LOG: logger shutting down
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5432001, size=140361728, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 140361728 bytes), reduce PostgreSQL's shared_buffers parameter (currently 16384) and/or its max_connections parameter (currently 100).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5432001, size=140361728, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 140361728 bytes), reduce PostgreSQL's shared_buffers parameter (currently 16384) and/or its max_connections parameter (currently 100).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.
LOG: logger shutting down
-----------------------------------------
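
As a rough cross-check of the numbers in the HINT above, the sketch below works out how much of the shmget request comes from shared_buffers. It assumes the default PostgreSQL block size of 8192 bytes (the log does not state it), and the function names are only illustrative.

```python
# Rough arithmetic on the shmget() failure reported in pgstartup.log.
# Assumption: PostgreSQL was built with the default 8192-byte block size.

BLOCK_SIZE = 8192  # bytes per shared buffer (assumed default)


def shared_buffers_bytes(pages, block_size=BLOCK_SIZE):
    """Memory taken by shared_buffers when it is given as a page count."""
    return pages * block_size


def fits_in_shmmax(request_bytes, shmmax_bytes):
    """True if a single segment of request_bytes is allowed by SHMMAX."""
    return request_bytes <= shmmax_bytes


if __name__ == "__main__":
    request = 140361728                    # size= value from the DETAIL line
    buffers = shared_buffers_bytes(16384)  # shared_buffers from the HINT line
    print("shared_buffers:", buffers, "bytes")
    print("other shared structures:", request - buffers, "bytes")
    print("request in MiB:", round(request / (1024 * 1024), 1))
```

Under that assumption shared_buffers alone accounts for 134217728 bytes (128 MiB), so kernel.shmmax has to cover at least the full 140361728 bytes before the postmaster can start.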

ALSO...

The following messages appear in the file below:
-----------------------------------------------------------
/var/lib/pgsql/data/pg_log/postgresql-2009-08-21_000000.log
-----------------------------------------------------------
LOG: checkpoints are occurring too frequently (16 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (13 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (15 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
--------------------------------------------------
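
To get a feel for how often the checkpoints fire, the warnings above can be summarised with a short script. This is only a sketch: it assumes the log lines have exactly the format shown in the excerpt, and `checkpoint_intervals` is a hypothetical helper name.

```python
import re

# Matches the warning text exactly as it appears in the log excerpt above.
PATTERN = re.compile(
    r"checkpoints are occurring too frequently \((\d+) seconds? apart\)"
)


def checkpoint_intervals(lines):
    """Return the interval (in seconds) from every matching log line."""
    intervals = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            intervals.append(int(match.group(1)))
    return intervals


if __name__ == "__main__":
    # A few lines copied from the excerpt; in practice, pass the open log file.
    sample = [
        "LOG: checkpoints are occurring too frequently (16 seconds apart)",
        'HINT: Consider increasing the configuration parameter "checkpoint_segments".',
        "LOG: checkpoints are occurring too frequently (8 seconds apart)",
        "LOG: checkpoints are occurring too frequently (7 seconds apart)",
    ]
    found = checkpoint_intervals(sample)
    print("warnings:", len(found))
    print("shortest gap:", min(found), "s")
    print("average gap:", sum(found) / len(found), "s")
```

The gaps in the excerpt above range from 7 to 16 seconds.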

Hi Shahid,

The PostgreSQL log file is already giving you a hint that you have to tune PostgreSQL. You need to increase checkpoint_segments, which is a pretty common change in OpenACS installs. Given your hardware, I would try raising this value to 6, but I don't know if your system can handle it. Try it and see whether things improve.

In general, I would say you need to do some PostgreSQL tuning. I can see that you've increased shared_buffers, and maybe that is why you are running into this kind of trouble. You see, both PostgreSQL and AOLServer use shared memory, which is controlled by the shmmax parameter in the OS kernel. If you set PostgreSQL's shared_buffers too high, it can end up competing with the web server for the same resources, causing the system to crash. It has already happened to me.

However, I don't believe this is the problem in your case. Take a look at the queries being executed at the moment of the crash (usually a select * from pg_stat_activity will do the job) and try to find out what is happening at that time.

How do we look at the scheduling procs being executed?
To see the scheduled procs, install the monitoring package (http://fisheye.openacs.org/browse/OpenACS/openacs-4/packages/monitoring)
Here is what was being executed just before the crash...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid  |  datname    | procpid | usesysid |  usename    |                                  current_query                  | waiting |          query_start          |        backend_start      | client_addr | client_port
----------+--------------+---------+----------+--------------+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn |  16145 |      10 | postgres    | <IDLE>                    | f      | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |            |          -1
98657666 | nccedudotlrn |  21493 |      10 | postgres    | <IDLE>                    | f      | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |            |          -1
98657666 | nccedudotlrn |  21549 |      10 | postgres    | SELECT * FROM pg_stat_activity;                  | f      | 2009-08-21 15:57:56.399604+01 | 2009-08-21 13:50:55.050635+01 |            |          -1
98657666 | nccedudotlrn |  25390 |    16388 | nccedudotlrn |                    | f      | 2009-08-21 15:57:52.086032+01 | 2009-08-21 15:54:32.437033+01 |            |          -1
                                                            :            select dotlrn_communities_all.*,
                                                            :                    dotlrn_community__url(dotlrn_communities_all.community_id) as url,
                                                            :                    (CASE
                                                            :                      WHEN
                                                            :  dotlrn_communities_all.community_type = 'dotlrn_community'
                                                            :                      THEN 'dotlrn_community'
                                                            :                      WHEN dotlrn_communities_all.community_type = 'dotlrn_club'
                                                            :                      THEN 'dotlrn_club'
                                                            :                      WHEN dotlrn_communities_all.community_type = 'dotlrn_pers_community'
                                                            :                      THEN 'dotlrn_pers_community'
                                                            :                      ELSE 'dotlrn_class_instance'
                                                            :                    END) as simple_community_type,
                                                            :                    tree_level(dotlrn_communities_all.tree_sortkey) as tree_level,
                                                            :                    coalesce((select tree_level(dotlrn_community_types.tree_sortkey)
                                                            :  from dotlrn_community_types
                                                            :  where dotlrn_community_types.community_type = dotlrn_communities_all.community_type), 0) as community_
98657666 | nccedudotlrn |  25433 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:56:59.780954+01 | 2009-08-21 15:54:32.484492+01 |            |          -1
98657666 | nccedudotlrn |  25436 |    16388 | nccedudotlrn | <IDLE>                    | f      |                              | 2009-08-21 15:54:32.488507+01 |            |          -1
98657666 | nccedudotlrn |  16801 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:57:48.476462+01 | 2009-08-21 15:57:48.462369+01 |            |          -1
98657666 | nccedudotlrn |  19890 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:57:53.895152+01 | 2009-08-21 15:57:53.890079+01 |            |          -1
(8 rows)

This was after it crashed...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid  |  datname    | procpid | usesysid | usename  |          current_query      | waiting |          query_start          |        backend_start        | client_addr | client_port
----------+--------------+---------+----------+----------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn |  16145 |      10 | postgres | <IDLE>      | f      | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |        |          -1
98657666 | nccedudotlrn |  21493 |      10 | postgres | <IDLE>      | f      | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |        |          -1
98657666 | nccedudotlrn |  21549 |      10 | postgres | SELECT * FROM pg_stat_activity; | f      | 2009-08-21 16:06:24.955515+01 | 2009-08-21 13:50:55.050635+01 |        |          -1
(3 rows)

We can't see anything out of the ordinary.
Maybe you could advise?

Hi, we edited /etc/sysctl.conf.

Here is the amended version:
-------------------------------
kernel.shmmax = 209715200

checkpoint_segments = 6
-------------------------------

We restarted the postgresql service, but that did not make any difference. The system still crashed after 8 minutes.
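
One thing that may be worth checking here: changes to /etc/sysctl.conf only take effect once they are loaded (for example with sysctl -p or at boot), and checkpoint_segments is a PostgreSQL parameter that is normally set in postgresql.conf rather than in sysctl.conf. The sketch below reads the SHMMAX value the running kernel actually uses and compares it with the 140361728-byte request from the startup log. It assumes a Linux box where /proc/sys/kernel/shmmax is readable; the function names are illustrative.

```python
# Check whether the running kernel's SHMMAX covers PostgreSQL's request.
# Assumption: Linux, with the live value exposed at /proc/sys/kernel/shmmax.

SHMMAX_PATH = "/proc/sys/kernel/shmmax"
REQUESTED_BYTES = 140361728  # size= value from the shmget() DETAIL line


def read_shmmax(path=SHMMAX_PATH):
    """Return the SHMMAX value (in bytes) currently in effect."""
    with open(path) as handle:
        return int(handle.read().strip())


def shortfall(requested, shmmax):
    """Bytes by which the request exceeds SHMMAX (0 if it fits)."""
    return max(0, requested - shmmax)


if __name__ == "__main__":
    live = read_shmmax()
    missing = shortfall(REQUESTED_BYTES, live)
    if missing:
        print("SHMMAX is", live, "- too small by", missing, "bytes")
    else:
        print("SHMMAX is", live, "- large enough for the request")
```

If the value printed is still smaller than the one written to sysctl.conf, the new setting has not been loaded yet.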