Forum OpenACS Q&A: Fatal: received fatal signal 11 - new error after years!

Request notifications

Hi,

Can someone please help?!

We've had OpenACS 5.3.2 running on Aolserver 4.0.1 and postgres 8.2.5 with no problems for about 2 years. Then last night for some reason the server crashed with an error message Fatal: received fatal signal 11. We haven't changed anything and now when we start up the server it crashes with the same error code after only a few minutes.

Does anyone know what is causing this or how we can fix it as we don't know where to start other than falling back to a week old backup.

Cheers

I've also noticed the exact log entry is:

[19281.3074333616][-sched] Fatal: received fatal signal 11

What do the numbers mean and is it a scheduled proc or something causing the bother?

Also, we have a failover server which automatically restores a backup from the live server and this has also broken today with the same error so I'm assuming it's some sort of data problem?

Cheers

Signal 11 can suggest bad RAM, or perhaps running out of memory on a process.

Any code or configuration changes for the server lately that may have reduced available memory for aolserver or perhaps requires aolserver to run with a larger stacksize specified in config.tcl?

What stacksize is specified in the config.tcl?

Do the events happen at about the same time of the day or interval? Look at the timestamp for the line just above the error to get an idea when it stopped. Timing patterns can indicate if scheduled process may be triggering and which one.

Check a few lines back in the error.log to get an idea of what was happening before each case. The number at the beginning of each line identifies a thread/process, so numerous lines may be associated with the same activity.

Can you post more lines from the error.log, say 30+ lines?

Also, sometimes there are wildly large https requests (megabytes) thrown at a form apparently trying to break the server process. I guess it's possible if the form's process is not prepared to handle it. Check (but don't post) server.log to rule this out and to get another perspective of what user requests were in progress.

HTH,

Torben

Hi I'm responding in place of Matthew as he is away today. Thank you for your response Torben.

Here are some of the lines from the error.log

********************************************

[19/Aug/2009:17:28:33][19281.2935122864][-sched:25-] Error: Transaction aborted: Database operation "0or1row" failed (exception NSINT, "Query returned more than one row.")

SQL:

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = ''

Database operation "0or1row" failed (exception NSINT, "Query returned more than one row.")

SQL:

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = ''


while executing
"ns_pg_bind 0or1row nsdb0 {

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = :email

..."
("uplevel" body line 1)
invoked from within
"uplevel $ulevel [list ns_pg_bind $type $db $sql]"
("postgresql" arm line 2)
invoked from within
"switch $driverkey {
oracle {
return [uplevel $ulevel [list ns_ora $type $db $sql] $args]
}
..."
invoked from within
"db_exec 0or1row $db $full_statement_name $sql"
invoked from within
"set selection [db_exec 0or1row $db $full_statement_name $sql]"
("uplevel" body line 2)
invoked from within
"uplevel 1 $code_block "
invoked from within
"db_with_handle -dbn $dbn db {
set selection [db_exec 0or1row $db $full_statement_name $sql]
}"
(procedure "db_0or1row" line 50)
invoked from within
"db_0or1row get_user_name_and_id """
(procedure "get_address_array" line 25)
invoked from within
"get_address_array -addresses [string map {\n "" \r ""} $to_addr]"
(procedure "acs_mail_lite::send" line 10)
invoked from within
"acs_mail_lite::send -to_addr $email -from_addr $from_email -subject $subject -body $content -extraheaders $extra_headers"
(procedure "notification::email::send" line 46)
invoked from within
"notification::email::send $from_user_id $to_user_id $reply_object_id $notification_type_id $subject $content_text $content_html"
(procedure "AcsSc.notificationdeliverymethod.send.notification_email" line 1)
invoked from within
"AcsSc.notificationdeliverymethod.send.notification_email 4000813 799119 {} 692 {File Storage Notification} {} {Notification for: File-Storage: New Fil..."
("uplevel" body line 1)
invoked from within
"uplevel $func_and_args"
(procedure "apply" line 3)
invoked from within
"apply $proc_name $arguments"
(procedure "acs_sc_call" line 6)
invoked from within
"acs_sc_call NotificationDeliveryMethod Send $args $impl_key"
(procedure "notification::delivery::send" line 16)
invoked from within
"notification::delivery::send -from_user_id [ns_set get $notif notif_user] -to_user_id [ns_set get $notif user_id] -notification_type_id [ns_set get..."
("uplevel" body line 3)
invoked from within
"uplevel 1 $transaction_code "
(procedure "db_transaction" line 39)
invoked from within
"db_transaction {
# Send it
notification::delivery::send -from_user_id [ns_set get $notif notif_user] -to_use..."
(procedure "notification::sweep::sweep_notifications" line 107)
invoked from within
"notification::sweep::sweep_notifications -interval_id 635 -batched_p 0"
("eval" body line 1)
invoked from within
"eval [concat [list $proc] $args]"
(procedure "ad_run_scheduled_proc" line 46)
invoked from within
"ad_run_scheduled_proc {t f 60 notification::sweep::sweep_notifications {-interval_id 635 -batched_p 0} 1250698816 0 f}"
[19/Aug/2009:17:28:49][19281.3074333616][-sched-] Fatal: received fatal signal 11

************************************

We have the following stacksize in our config.tcl

**************************
ns_section ns/threads
ns_param mutexmeter true ;# measure lock contention
# The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
ns_param stacksize [expr 128 * 8192]
**************************

Shahid Butt,
Looks like a couple of things are happening.

First, there's a query error related to a scheduled notification, probably repeatedly trying since it is failing..

Second, there may be a backlog of scheduled events that are using up available memory and causing the signal 11 (just a guess).

You'll want to fix the query error. Its cause appears to be two users with the same email address, where the query expects users to have unique emails. (I edited out the particular email that was posted here, but you can see it in your error.log version).
cheers,

Torben

I'm also replying in Matthews absence.
Thanks for the reply Torben.

It is not happening at a certain time of day, it happens around 10 minutes after a server restart. The site is up and available within this 10 minute window.

We checked the server.log whilst the site was up and I loaded the login page, nothing out of the ordinary was happening just requests for images/css which completed successfully. The web page loaded fully and was static as the fatal error occurred. (no requests were being made at the time in the server.log).

We increased the stacksize from...
ns_param stacksize [expr 128 * 8192], to
ns_param stacksize [expr 512 * 8192]
and the restarted, but still no joy.

We will have a go at removing the duplicate email addresses and restart, how do we clear the backlog of scheduled events?

Thanks,

Andy

You may have to dig into the scheduling procs to see where and how the stack is kept and thus how to modulate the load. Or, wait for someone else here to have a quick answer for you.

On the bright side, the server may process the stack without crashing once all blocking errors are fixed. Hopefully there is just the one error.

With regards to the duplicate email addresses in the error.log, our system has been modified to accept this, so this is not an issue for us, as we have been running like this for a few years now.

We have restored a backup of the database that was 1 week old but that too failed after running for 8 minutes with the same error message.

The server has 4GB of RAM. We are thinking of adding more RAM into it to see if this helps. But if the lack of memory was a problem wouldn't it just run more slowly rather than just fall over after 5-8 minutes?

If anyone has any further advice on what we can look into it would be a great help thanks.
Shahid

If there are errors in the log, be sure to fix them. Depending on the nature of the error(s), they may be adding to the instability of the server.

Regarding the previously discussed error, perhaps this is the first case of two users with same email address requesting notification of the same item? Anyway, the modification is not complete so long as that bug exists.

Per default, the scheduled procedure sweep_notifications runs every 10 minutes. That explains the repeating behavior.

concerning the crash: it looks like your server is running out of addressable memory per process (typically 2GB for 32bit processes). All threads share the same address space of the process. If the process runs close to 2GB, and a thread needs more memory, the OS says "no" and kills it. Running out of this limit can have multiple origins; often, calls to external programs can lead to this problem. Adding more physical memory only helps against running out of memory on the OS-level, including swap (rather uncommon, depends what is running aside openacs on the system), but it does not change the 2GB limit. To overcome it, one could compile the server + tcl + modules with 64bit, but in your case, i would not recommend it, since it won't address the cause of the problem.

In your case, i have the following suspicion: Since you wrote that your system has been modified to accept duplicate email addresses a while ago, i would expect that the delivery of notifications is broken since this time. This means that, since this time, notifications are piling up. If there is a huge number of notifications waiting for delivery, and these are loaded into memory, you might run out of resources.

Check in psql how many requests are waiting: select count(*) from notifications

The number should be low, typically 0.
To understand the notification sweeper, check:

http://openacs.org/api-doc/proc-view?proc=notification::sweep::sweep_notifications&source_p=1&version_id=
http://openacs.org/api-doc/proc-view?proc=notification::new&source_p=1&version_id=

Hope, this helps a little.

Hi, Thanks for the reply Gustav.

You were correct, there were over 4000 rows in the notifications table, we removed them and restarted.
The fatal error still occurred though. We re-checked the notifications table after the error and it was zero.

Is there a way to clear all notifications properly?

here are the last few lines of the error.log before the FATAL message...

[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code blocks...
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block filters_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tc l/tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 0 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth POST * rp_invoke_filter 1 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 2 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 3 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 4 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 5 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e GET /* rp_invoke_filter 6 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e POST /* rp_invoke_filter 7 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e HEAD /* rp_invoke_filter 8 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block procs_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tcl/ tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc GET / grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc POST /grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: AOLserver/4.0.1 0 running
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: security info: uid=502, euid=502, gid=501, egid=501
[21/Aug/2009:12:31:02][16716.3074354096][-sched-] Notice: sched: starting
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nssock: listening on 19 2.168.0.7:80
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: starting
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: driver: accepting con nections
[21/Aug/2009:12:31:36][16716.2843700144][-sched:idle0-] Notice: starting
[21/Aug/2009:12:35:33][16716.2930785200][-conn:nccedudotlrn::1] Notice: random: generating 1 seed
[21/Aug/2009:12:35:34][16716.2930785200][-conn:nccedudotlrn::1] Notice: --create d new persistent ::xotcl::THREAD as tid0xa8ff4bb0 pid=16716, ::throttle ::xotcl: :THREAD->do (1936ms)
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: dbdrv: opening dat abase 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: Opening nccedudotl rn on localhost
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: Ns_PgOpenDb(postgr es): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: +++ package_id of xotcl-request-monitor is 2576709
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: dbdrv: opening data base 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Opening nccedudotlr n on localhost
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Ns_PgOpenDb(postgre s): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:41][16716.2839501744][-sched:idle1-] Notice: starting
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --start ing... /campus/home/ , ::3551196 -> (1211ms)
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (1215ms, 3ms)
[21/Aug/2009:12:36:09][16716.2939182000][-conn:nccedudotlrn::13] Warning: db_exe c: longdb 11 seconds nsdb0 select dbqd.xowiki.www.portlets.dotlrn-main-portlet.s elect_communities
[21/Aug/2009:12:36:10][16716.2939182000][-conn:nccedudotlrn::13] Notice: --after options, ::3553858 ::xowiki::Page->view (14276ms, 13061ms)
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property title instead of doc(title).
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property header_stuff instead of head.
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Notice: --i ::3 551196 DONE, ::3551196 -> (18176ms, 3899ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --start ing... /campus/home/ , ::3551196 -> (847ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (852ms, 4ms)
[21/Aug/2009:12:39:10][16716.3074354096][-sched-] Fatal: received fatal signal 1 1

Thanks

Hi Gustaf,
We deleeted all the records from the NOTIFICATIONS table (around 400 records) and also deleted all the records from NOTICIFICATION_REQUESTS table (around 65,000 recs).

Restarted the service but received the fatal signal 11 error again.

Is there a proper procedure for clearing the notifications, rather than just deleting the records from the 2 tables I've mentioned?

Thanks,
Shahid

4000 notification entries are not so much to cause any serious trouble, couple of millions might be one.

by looking closer at the log, i see that the crash happens in -sched- (the master scheduler) and not in a scheduled client thread, so the hypothesis with notifications is getting unlikely.

how much memory does nsd occupy after startup (or better, before the crash)? check with ps.

my current guess is that the master scheduler tries to start a client scheduler thread and crashes, but how comes that this happens suddenly, without changing anything on the system? Are you sure, that nobody changed the configuration or updated some shared libraries?
E.g. when one is using some common installed libraries for nsd ot its modules, an update might kill nsd (e.g. when someone replaces libtcl). Check your libraries with "ldd .../nsd".

Checking the core file, where the crash happens, might give as well a hint.

ok, thanks we will look into that now. As far as we are aware nobody has changed the configuration or updated anything.

On startup, nsd occupies around 18.0
I was also watching the top command whilst it crashed and nsd did not get above 20.0

Another problem we have had with nsd for years now, is what seems to be some sort of memory leak. What happens is nsd memory usage will creep up over say 2 weeks, reaching around 65.0 and then the server crashes.
This then requires us to restart the service, where it starts up and nsd is using around 18.0 again, then the loop continues.
We have never been able to identify what causes this.

could this be a contributing factor?

Thanks,
Andy

Could there be a possible corruption with the AOLServer application? Do we need to rebuild it?
Hi everybody,

I'm following this thread because this used to happen a lot in my server, mostly because some queries where too large, and both PostgreSQL and AOLServer used to run for the same resources. Some data may help to find the problem:

1 - Is the system 32 or 64 bits?
2 - Is PostgreSQL on the same box? From your log, I can assume it is.
3 - What do you see in the database when the crash happens?

Some thing very similar can happen when you have news aggregator instaled and the sources table is so big that the system can't load it into the memory, causing the the server to get a signal 11 and crash.

If you have monitoring, take a look at the scheduling procs being executed at the exact crash time. It could help.

Hi Eduardo, thank you for your response.

1 - Our system is 32 bits
2 - Yes PostgreSQL is on the same box
3 - Not too sure where the db log is...

Here is what I've found
----------------------------
/var/lib/pgsql/pgstartup.log
----------------------------
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets
LOG: logger shutting down
LOG: logger shutting down
LOG: logger shutting down
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5432001, size=140361728, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 140361728 bytes), reduce PostgreSQL's shared_buffers parameter (currently 16384) and/or its max_connections parameter (currently 100).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5432001, size=140361728, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 140361728 bytes), reduce PostgreSQL's shared_buffers parameter (currently 16384) and/or its max_connections parameter (currently 100).
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration.
LOG: logger shutting down
-----------------------------------------

ALSO...

The following messages appears in the file below:
-----------------------------------------------------------
/var/lib/pgsql/data/pg_log/postgresql-2009-08-21_000000.log
-----------------------------------------------------------
LOG: checkpoints are occurring too frequently (16 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (13 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (15 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
--------------------------------------------------

Hi Shahid,

The PostgreSQL's log file is already giving you a hint that you have to tune your PostgreSQL. You have to increase checkpoint_segments, wich is pretty common in OpenACS installs. Concerning your Hardware, I would say you should try to increase to 6 this value, but I don't know if your system can handle it. Try it and see if you can see some improvements.

I would say in a general way that you have to do some tuning in your PostgreSQL. I can see that you've increased shared_buffers, and maybe that's the reason for wich you are facing this kind of trouble. You see: both PostgreSQL and AOLServer use shared memory, controle by the shmmax parameter in the So Kernel. If you use a too big value for the PostgreSQL shared_buffers, it can start a race for the same resources with the server, causing the system to crash. It already hapenned with me.

However, I don't believe this is the problem in your case. Take a look at the queries being executed at the crash moment (usually a select * from pg_stat_acitivity would do the job) and try to find out what is happening at that time.

How do we look at the scheduling procs being executed?
To see the scheduled procs, install the monitoring package (http://fisheye.openacs.org/browse/OpenACS/openacs-4/packages/monitoring)
Here is what was being executed just before the crash...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid  |  datname    | procpid | usesysid |  usename    |                                  current_query                  | waiting |          query_start          |        backend_start      | client_addr | client_port
----------+--------------+---------+----------+--------------+--------------------------------------------------------------------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn |  16145 |      10 | postgres    | <IDLE>                    | f      | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |            |          -1
98657666 | nccedudotlrn |  21493 |      10 | postgres    | <IDLE>                    | f      | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |            |          -1
98657666 | nccedudotlrn |  21549 |      10 | postgres    | SELECT * FROM pg_stat_activity;                  | f      | 2009-08-21 15:57:56.399604+01 | 2009-08-21 13:50:55.050635+01 |            |          -1
98657666 | nccedudotlrn |  25390 |    16388 | nccedudotlrn |                    | f      | 2009-08-21 15:57:52.086032+01 | 2009-08-21 15:54:32.437033+01 |            |          -1
                                                            :            select dotlrn_communities_all.*,
                                                            :                    dotlrn_community__url(dotlrn_communities_all.community_id) as url,
                                                            :                    (CASE
                                                            :                      WHEN
                                                            :  dotlrn_communities_all.community_type = 'dotlrn_community'
                                                            :                      THEN 'dotlrn_community'
                                                            :                      WHEN dotlrn_communities_all.community_type = 'dotlrn_club'
                                                            :                      THEN 'dotlrn_club'
                                                            :                      WHEN dotlrn_communities_all.community_type = 'dotlrn_pers_community'
                                                            :                      THEN 'dotlrn_pers_community'
                                                            :                      ELSE 'dotlrn_class_instance'
                                                            :                    END) as simple_community_type,
                                                            :                    tree_level(dotlrn_communities_all.tree_sortkey) as tree_level,
                                                            :                    coalesce((select tree_level(dotlrn_community_types.tree_sortkey)
                                                            :  from dotlrn_community_types
                                                            :  where dotlrn_community_types.community_type = dotlrn_communities_all.community_type), 0) as community_
98657666 | nccedudotlrn |  25433 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:56:59.780954+01 | 2009-08-21 15:54:32.484492+01 |            |          -1
98657666 | nccedudotlrn |  25436 |    16388 | nccedudotlrn | <IDLE>                    | f      |                              | 2009-08-21 15:54:32.488507+01 |            |          -1
98657666 | nccedudotlrn |  16801 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:57:48.476462+01 | 2009-08-21 15:57:48.462369+01 |            |          -1
98657666 | nccedudotlrn |  19890 |    16388 | nccedudotlrn | <IDLE>                    | f      | 2009-08-21 15:57:53.895152+01 | 2009-08-21 15:57:53.890079+01 |            |          -1
(8 rows)

This was after it crashed...

nccedudotlrn=# SELECT * FROM pg_stat_activity;
  datid  |  datname    | procpid | usesysid | usename  |          current_query      | waiting |          query_start          |        backend_start        | client_addr | client_port
----------+--------------+---------+----------+----------+---------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------
98657666 | nccedudotlrn |  16145 |      10 | postgres | <IDLE>      | f      | 2009-08-21 13:46:37.414982+01 | 2009-08-21 12:15:27.518698+01 |        |          -1
98657666 | nccedudotlrn |  21493 |      10 | postgres | <IDLE>      | f      | 2009-08-21 13:49:43.725297+01 | 2009-08-21 13:47:02.859488+01 |        |          -1
98657666 | nccedudotlrn |  21549 |      10 | postgres | SELECT * FROM pg_stat_activity; | f      | 2009-08-21 16:06:24.955515+01 | 2009-08-21 13:50:55.050635+01 |        |          -1
(3 rows)

We can't see anything out of the ordinary?
Maybe you could advise?

Hi, we edited the /etc/sysctl.conf

Here is the amended version:
-------------------------------
kernel.shmmax = 209715200

checkpoint_segments = 6
-------------------------------

Restarted the postgresql service
but that did not make any difference. System still crashed after 8 minutes

what units are "18.0" and "20.0"?
check the size with ps aux and take the value from VSZ.

The "memory leak" is rather "memory fragmentation", which is one of the reasons why busy sites should consider a daily restart.

have you checked the used libraries with ldd? What is your host operating system?

have you checked in the core dump file, where the crash happens? use gdb, see e.g. http://www.ibm.com/developerworks/library/l-gdb/.

It is true that you postgres configuration should be tuned, but that won't crash the aolserver. The error is a segmentation violation, which means that the process tries to access memory not part of its accessible memory. In most cases, this is due to an erratic pointer in a C-Program, but in general, the error can be triggered by multiple causes. Check the following thread and FAQ:
http://lists.debian.org/debian-user/1997/04/msg01588.html
http://www.bitwizard.nl/sig11/

If you are really confident that onbody has changed shared libraries etc. then it might be the hardware. If you have the option, move the installation to a different hardware and test it there. It is certainly worth to run extensive hardware checks.

-gustaf neumann

Matthew, to re-emphasize something Gustaf already mentioned: Look at your AOLserver core file with gdb. Get the backtrace and see what function it crashed in. This may or may not be useful, but it's essentially always one of the first things to check when any program crashes. I would typically check that well before much of the other stuff you've been digging into.
Hi all,
Thanks for the replies,

I would like to look into this, however, I have never used gdb before and have no idea where the AOLserver core dump file is located?

Could you provide any further info on how to do this?

Thanks,

Andy

The core dumps can be activated/deactivated via "ulimit" (if you use bash). Try ulimit -a on your system. Most likely, you will see a "core file size" of 0, which means "please, no core dumps"; set it to "unlimited" in the shell script, where you start the server. Then restart the server and you will get a core dump in the directory, where the scripts runs (on doubts, do in the script a "cd SOMEPATH" before invoking nsd). Once you have the core file, check the link provided earlier how to analyze the core dump using gdb.
This has just got a whole lot more bizarre.

We have a development machine which was working fine before which has now just started having the exact same problem. The development machine is probably 6 months old data and has had very little done to it.

could there possibly be some sort of expiration date

We have built an entirely new server and restored from an older backup and that is still giving us a Fatal Error 11.

Going to look into the core dump diagnostics now.

Cheers for all of the help so far.

Sorry, I didn't finish this sentence.

Could there possibly be some sort of expiration date on a notification being sent or something not happening which is causing a problem? The only pattern I can see between the live and development machines is that a similar amount of time has passed!

>Some thing very similar can happen when you have news aggregator instaled and the sources table is so big that the system can't load it into the memory, causing the the server to get a signal 11 and crash.

Just saw this and have had a look. We only had two entries in the na_sources table so I have removed them both along with all entries in the na_items table. I thought this may go with my theory that the only factor that seems consistent is the passing of time!!

With regards to the coredump. I have set the coredump to be unlimited and have set it so that all of the dumps go in the folder /coredumps on my server. When It crashed however it did not dump anything in that folder. Do I need to recompile Aolserver with some flags set? you mention that the dump will happen in the application folder. I have aolserver installed /usr/local/aolserver and the actually OpenACS instance at /var/lib/aolserver. Where abouts will this dump file go? For information we are using Red Hat ES 4.

Cheers again.

What a ridiculous scenario. We had 2 news feeds pointing to the BBC website and they were causing the crash. I looked at the scheduled procs on the monitoring application and saw the exact point when it crashed was when the news aggregator updated.

I removed them both and the backup server worked fine as it passed through the proc. It's now been up for an hour after 10 minutes crashes.

Unbelievable really, there were 55,000 rows in the na_items table but only 2 on the na_sources table. Perhaps there was a dodgy entry on the site that was being downloaded?

Anyway, thanks for all of the help and we will work on the other recommendations for tweaking and fixing the duplicate email address error. That problem is caused by a single proc which attempts to grab a user id by email address (very wrong way of doing something!).

Cheers

Hi Matthew,

Unfortunatelly, this issue is quite common. News aggregator UI doesn't have the option to delete old items, and this kind of thing usually happens when your server is significantly old to have lots of news feeds.

We have solved the problem building an UI where you can delete the feed entries. Take a look at your code and see if it can be usefull to you: http://svn.softwarepublico.gov.br/svn/openacs/branches/spb-2.0/packages/news-aggregator/

Most likely some malformed data is causing the server crash. Maybe parsing the XML?

Even if you have a huge amount of data, i had hundreds of thousands of entries in news aggregator and it was really slow but never crashed my server.

I'm encountering this problem after a year without incident.

My nsd process is taking up 150 mb of RAM and I have plenty of available RAM. No swapping is taking place.

However, every 10-20 minutes, the server goes down with a "fatal signal 11" error. This started a few days ago with an increase in traffic on the site. I have other nsds on the same server that are not experiencing this problem.

One odd thing I found:

Last request: 30/May/2012:18:57:22 -0400

Last signal 11:

[30/May/2012:18:58:48][17148.1105754448][-conn:1-] Fatal: received fatal signal 11

How could the server crash on a connection thread *after* the last connection?

Stacksize: 512 * 8192
minthreads: 5
maxthreads: 10

Any idea what might be happening?

Thanks

A few crash bugs in aolserver immediately come to mind.

One occurs when threads exit nearly as fast as they are started, which generally requires maxconnsperthread to be set very low - around 20 or less, which you wouldn't have on a production server - and high request rate, like benchmarking. This problem is reduced but I'm pretty sure not completely eliminated in aolserver HEAD.

Another comes up if you set an redirect for 500 error pages that doesn't exist. It looks for the error page, tries to redirect to not found, gets an error, looks for the error page, tries to redirect, etc, until it runs out of stack space and crashes. This problem is fixed in aolserver HEAD.

There's another one related to mmap failures (NULL is not MAP_FAILED) but I don't recall the specifics of that one.

A stack dump from a core file might help give a better idea.

Ryan,

was the last request from "2012:18:57:22 -0400" also on conn:1? it would help, if you could report the last view error log entries, including previous error.log entries from conn:1.

Thread start and thread end are not unusual places where nsd might crash. E.g. during thread cleanup, all thread specific resources are freed, so any memory corruption will lead to a crash there.

What version of aolserver + tcl + libthread are you using? If you have xotcl-core installed, the quickest way is to check http://YOURSERVER/xotcl/version-numbers

Did you compile Tcl + aolserver + modules yourself? What platform? What version of gcc? What optimization flags?

Background: we experienced many problems under load (more than 800 concurrent users) with gcc 4.1.2 on POWER6+. Most of these were in the thread-local-storage management of tcl 8.5.*, where one sees e.g. crashes during regepx, when the internal representation of regular expressions is kept in thread-local-storage. i rewrote some of these parts in tcl to use a simpler platform/compiler specific implementations of thread local storage, then the problem move to other places in the tcl implementation, also related with TLS. We were never able to produce simple test cases to trigger this crash. Interestingly enough, we never experienced the problem with the exactly same code base on intel, not even with 3000+ concurrent users. ... The message is, the platform/compiler/optimization flags might matter.

Did you get a core dump from the crash? if so, to find the problem space, use "gdb /SOMEPATH/nsd core.XXXX" and then type "bt" in gdb to see where the crash happened.

-gustaf neumann

Ryan ,

Just a small note: In order to have the nsd process generate a core dump just be sure to increase the maximun size of core files the user running the nsd process can create ( usually you want to have an unlimited size, just run 'ulimit -c unlimited' before you start the nsd process, you can add that temporarily to your start up script or so).

Then the core dump will be written to the Aolserver home directory ( e.g. /usr/local/aolserver/ ). Also you have to be sure that the user running the process has the correct permissions to write into that directory otherwise it wont be able to generate the core dump.

Also, in case you have nasty code that changes directories for whatever reason, it can happen that the core dump will be written to a different directory other than the Aolserver home directory, so just be aware of that.

Best,

Thanks to everyone for their quick response.

After a bit more log analysis, I noticed that the signal 11 occurred just after a 404.

Recently I had changed my file-not-found.adp 404 redirect from:

<%
rp_internal_redirect /www/global/404
ns_adp_abort
%>

to

<%
set rendered_page [ad_parse_template "/www/global/404"]
ns_return 404 text/html $rendered_page
%>

The new code seems to not even log 404s in AOLserver 4.5.0, and cause the fatal signal 11.

My test environment running 4.5.1 handled the new code fine, so I guess it's about time I matched the two AOL versions (oops). I know there were some fixes to 404 redirection in 4.5.1 but they seem to indicate the 404 page was not even appearing. In my case it was.

Can anyone confirm that upgrading to 4.5.1 is the solution to this particular issue? I'd like to use the second code fragment so I can use a templated adp page and still return a 404 error code.

Thanks again,
Ryan