Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!

With regards to the duplicate email addresses in the error.log, our system has been modified to accept this, so this is not an issue for us, as we have been running like this for a few years now.

We have restored a backup of the database that was 1 week old but that too failed after running for 8 minutes with the same error message.

The server has 4GB of RAM. We are thinking of adding more RAM into it to see if this helps. But if the lack of memory was a problem wouldn't it just run more slowly rather than just fall over after 5-8 minutes?

If anyone has any further advice on what we can look into it would be a great help thanks.
Shahid

Per default, the scheduled procedure sweep_notifications runs every 10 minutes. That explains the repeating behavior.

concerning the crash: it looks like your server is running out of addressable memory per process (typically 2GB for 32bit processes). All threads share the same address space of the process. If the process runs close to 2GB, and a thread needs more memory, the OS says "no" and kills it. Running out of this limit can have multiple origins; often, calls to external programs can lead to this problem. Adding more physical memory only helps against running out of memory on the OS-level, including swap (rather uncommon, depends what is running aside openacs on the system), but it does not change the 2GB limit. To overcome it, one could compile the server + tcl + modules with 64bit, but in your case, i would not recommend it, since it won't address the cause of the problem.

In your case, i have the following suspicion: Since you wrote that your system has been modified to accept duplicate email addresses a while ago, i would expect that the delivery of notifications is broken since this time. This means that, since this time, notifications are piling up. If there is a huge number of notifications waiting for delivery, and these are loaded into memory, you might run out of resources.

Check in psql how many requests are waiting: select count(*) from notifications

The number should be low, typically 0.
To understand the notification sweeper, check:

https://openacs.org/api-doc/proc-view?proc=notification::sweep::sweep_notifications&source_p=1&version_id=
https://openacs.org/api-doc/proc-view?proc=notification::new&source_p=1&version_id=

Hope, this helps a little.

Hi, Thanks for the reply Gustav.

You were correct, there were over 4000 rows in the notifications table, we removed them and restarted.
The fatal error still occurred though. We re-checked the notifications table after the error and it was zero.

Is there a way to clear all notifications properly?

here are the last few lines of the error.log before the FATAL message...

[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code blocks...
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block filters_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tc l/tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 0 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth POST * rp_invoke_filter 1 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 2 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 3 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 4 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 5 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e GET /* rp_invoke_filter 6 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e POST /* rp_invoke_filter 7 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e HEAD /* rp_invoke_filter 8 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block procs_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tcl/ tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc GET / grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc POST /grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: AOLserver/4.0.1 0 running
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: security info: uid=502, euid=502, gid=501, egid=501
[21/Aug/2009:12:31:02][16716.3074354096][-sched-] Notice: sched: starting
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nssock: listening on 19 2.168.0.7:80
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: starting
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: driver: accepting con nections
[21/Aug/2009:12:31:36][16716.2843700144][-sched:idle0-] Notice: starting
[21/Aug/2009:12:35:33][16716.2930785200][-conn:nccedudotlrn::1] Notice: random: generating 1 seed
[21/Aug/2009:12:35:34][16716.2930785200][-conn:nccedudotlrn::1] Notice: --create d new persistent ::xotcl::THREAD as tid0xa8ff4bb0 pid=16716, ::throttle ::xotcl: :THREAD->do (1936ms)
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: dbdrv: opening dat abase 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: Opening nccedudotl rn on localhost
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: Ns_PgOpenDb(postgr es): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: +++ package_id of xotcl-request-monitor is 2576709
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: dbdrv: opening data base 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Opening nccedudotlr n on localhost
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Ns_PgOpenDb(postgre s): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:41][16716.2839501744][-sched:idle1-] Notice: starting
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --start ing... /campus/home/ , ::3551196 -> (1211ms)
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (1215ms, 3ms)
[21/Aug/2009:12:36:09][16716.2939182000][-conn:nccedudotlrn::13] Warning: db_exe c: longdb 11 seconds nsdb0 select dbqd.xowiki.www.portlets.dotlrn-main-portlet.s elect_communities
[21/Aug/2009:12:36:10][16716.2939182000][-conn:nccedudotlrn::13] Notice: --after options, ::3553858 ::xowiki::Page->view (14276ms, 13061ms)
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property title instead of doc(title).
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property header_stuff instead of head.
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Notice: --i ::3 551196 DONE, ::3551196 -> (18176ms, 3899ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --start ing... /campus/home/ , ::3551196 -> (847ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (852ms, 4ms)
[21/Aug/2009:12:39:10][16716.3074354096][-sched-] Fatal: received fatal signal 1 1

Thanks

Hi Gustaf,
We deleeted all the records from the NOTIFICATIONS table (around 400 records) and also deleted all the records from NOTICIFICATION_REQUESTS table (around 65,000 recs).

Restarted the service but received the fatal signal 11 error again.

Is there a proper procedure for clearing the notifications, rather than just deleting the records from the 2 tables I've mentioned?

Thanks,
Shahid

4000 notification entries are not so much to cause any serious trouble, couple of millions might be one.

by looking closer at the log, i see that the crash happens in -sched- (the master scheduler) and not in a scheduled client thread, so the hypothesis with notifications is getting unlikely.

how much memory does nsd occupy after startup (or better, before the crash)? check with ps.

my current guess is that the master scheduler tries to start a client scheduler thread and crashes, but how comes that this happens suddenly, without changing anything on the system? Are you sure, that nobody changed the configuration or updated some shared libraries?
E.g. when one is using some common installed libraries for nsd ot its modules, an update might kill nsd (e.g. when someone replaces libtcl). Check your libraries with "ldd .../nsd".

Checking the core file, where the crash happens, might give as well a hint.

ok, thanks we will look into that now. As far as we are aware nobody has changed the configuration or updated anything.

On startup, nsd occupies around 18.0
I was also watching the top command whilst it crashed and nsd did not get above 20.0

Another problem we have had with nsd for years now, is what seems to be some sort of memory leak. What happens is nsd memory usage will creep up over say 2 weeks, reaching around 65.0 and then the server crashes.
This then requires us to restart the service, where it starts up and nsd is using around 18.0 again, then the loop continues.
We have never been able to identify what causes this.

could this be a contributing factor?

Thanks,
Andy

Could there be a possible corruption with the AOLServer application? Do we need to rebuild it?
what units are "18.0" and "20.0"?
check the size with ps aux and take the value from VSZ.

The "memory leak" is rather "memory fragmentation", which is one of the reasons why busy sites should consider a daily restart.

have you checked the used libraries with ldd? What is your host operating system?

have you checked in the core dump file, where the crash happens? use gdb, see e.g. http://www.ibm.com/developerworks/library/l-gdb/.

It is true that you postgres configuration should be tuned, but that won't crash the aolserver. The error is a segmentation violation, which means that the process tries to access memory not part of its accessible memory. In most cases, this is due to an erratic pointer in a C-Program, but in general, the error can be triggered by multiple causes. Check the following thread and FAQ:
http://lists.debian.org/debian-user/1997/04/msg01588.html
http://www.bitwizard.nl/sig11/

If you are really confident that onbody has changed shared libraries etc. then it might be the hardware. If you have the option, move the installation to a different hardware and test it there. It is certainly worth to run extensive hardware checks.

-gustaf neumann