Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!

Hi, Thanks for the reply Gustav.

You were correct, there were over 4000 rows in the notifications table, we removed them and restarted.
The fatal error still occurred though. We re-checked the notifications table after the error and it was zero.

Is there a way to clear all notifications properly?

here are the last few lines of the error.log before the FATAL message...

[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code blocks...
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block filters_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tc l/tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 0 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth POST * rp_invoke_filter 1 0 1 throttle {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 2 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 3 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 4 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter post auth GET * rp_invoke_filter 5 0 3 curriculum::curriculum_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e GET /* rp_invoke_filter 6 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e POST /* rp_invoke_filter 7 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_filter trac e HEAD /* rp_invoke_filter 8 0 3 ds_trace_filter {}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: Executing initializatio n code block procs_register in /var/lib/aolserver/nccedudotlrn/packages/acs-tcl/ tcl/request-processor-init.tcl
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc GET / grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:00][16716.3086948576][-main-] Notice: ns_register_proc POST /grades-sheet-csv* rp_invoke_proc {0 0 0 evaluation::generate_grades_sheet {}}
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: AOLserver/4.0.1 0 running
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nsmain: security info: uid=502, euid=502, gid=501, egid=501
[21/Aug/2009:12:31:02][16716.3074354096][-sched-] Notice: sched: starting
[21/Aug/2009:12:31:02][16716.3086948576][-main-] Notice: nssock: listening on 19 2.168.0.7:80
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: starting
[21/Aug/2009:12:31:02][16716.2847898544][-driver-] Notice: driver: accepting con nections
[21/Aug/2009:12:31:36][16716.2843700144][-sched:idle0-] Notice: starting
[21/Aug/2009:12:35:33][16716.2930785200][-conn:nccedudotlrn::1] Notice: random: generating 1 seed
[21/Aug/2009:12:35:34][16716.2930785200][-conn:nccedudotlrn::1] Notice: --create d new persistent ::xotcl::THREAD as tid0xa8ff4bb0 pid=16716, ::throttle ::xotcl: :THREAD->do (1936ms)
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: dbdrv: opening dat abase 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:34][16716.2835303344][-tclthread-] Notice: Opening nccedudotl rn on localhost
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: Ns_PgOpenDb(postgr es): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:35][16716.2835303344][-tclthread-] Notice: +++ package_id of xotcl-request-monitor is 2576709
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: dbdrv: opening data base 'postgres:localhost::nccedudotlrn'
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Opening nccedudotlr n on localhost
[21/Aug/2009:12:35:40][16716.2843700144][-sched:18-] Notice: Ns_PgOpenDb(postgre s): Openned connection to localhost::nccedudotlrn.
[21/Aug/2009:12:35:41][16716.2839501744][-sched:idle1-] Notice: starting
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:45][16716.2895297456][-conn:nccedudotlrn::5] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property title instead of doc(title).
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: /var/lib/aolserver/nccedudotlrn/packages/acs-subsite/www/register/index.a dp uses deprecated property header_stuff instead of head.
[21/Aug/2009:12:35:50][16716.2886900656][-conn:nccedudotlrn::4] Warning: blank-c ompat: property focus is deprecated in blank-master - focus should be handled in site-master.
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --start ing... /campus/home/ , ::3551196 -> (1211ms)
[21/Aug/2009:12:35:57][16716.2939182000][-conn:nccedudotlrn::13] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (1215ms, 3ms)
[21/Aug/2009:12:36:09][16716.2939182000][-conn:nccedudotlrn::13] Warning: db_exe c: longdb 11 seconds nsdb0 select dbqd.xowiki.www.portlets.dotlrn-main-portlet.s elect_communities
[21/Aug/2009:12:36:10][16716.2939182000][-conn:nccedudotlrn::13] Notice: --after options, ::3553858 ::xowiki::Page->view (14276ms, 13061ms)
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property title instead of doc(title).
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Warning: blank- compat: /var/lib/aolserver/nccedudotlrn/packages/xowiki/www/index.vuh uses depre cated property header_stuff instead of head.
[21/Aug/2009:12:36:14][16716.2939182000][-conn:nccedudotlrn::13] Notice: --i ::3 551196 DONE, ::3551196 -> (18176ms, 3899ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --start ing... /campus/home/ , ::3551196 -> (847ms)
[21/Aug/2009:12:39:06][16716.2943380400][-conn:nccedudotlrn::12] Notice: --try i ndex -> 0, ::3551196 ::xowiki::Package->resolve_request (852ms, 4ms)
[21/Aug/2009:12:39:10][16716.3074354096][-sched-] Fatal: received fatal signal 1 1

Thanks

Hi Gustaf,
We deleeted all the records from the NOTIFICATIONS table (around 400 records) and also deleted all the records from NOTICIFICATION_REQUESTS table (around 65,000 recs).

Restarted the service but received the fatal signal 11 error again.

Is there a proper procedure for clearing the notifications, rather than just deleting the records from the 2 tables I've mentioned?

Thanks,
Shahid

4000 notification entries are not so much to cause any serious trouble, couple of millions might be one.

by looking closer at the log, i see that the crash happens in -sched- (the master scheduler) and not in a scheduled client thread, so the hypothesis with notifications is getting unlikely.

how much memory does nsd occupy after startup (or better, before the crash)? check with ps.

my current guess is that the master scheduler tries to start a client scheduler thread and crashes, but how comes that this happens suddenly, without changing anything on the system? Are you sure, that nobody changed the configuration or updated some shared libraries?
E.g. when one is using some common installed libraries for nsd ot its modules, an update might kill nsd (e.g. when someone replaces libtcl). Check your libraries with "ldd .../nsd".

Checking the core file, where the crash happens, might give as well a hint.

ok, thanks we will look into that now. As far as we are aware nobody has changed the configuration or updated anything.

On startup, nsd occupies around 18.0
I was also watching the top command whilst it crashed and nsd did not get above 20.0

Another problem we have had with nsd for years now, is what seems to be some sort of memory leak. What happens is nsd memory usage will creep up over say 2 weeks, reaching around 65.0 and then the server crashes.
This then requires us to restart the service, where it starts up and nsd is using around 18.0 again, then the loop continues.
We have never been able to identify what causes this.

could this be a contributing factor?

Thanks,
Andy

Could there be a possible corruption with the AOLServer application? Do we need to rebuild it?
what units are "18.0" and "20.0"?
check the size with ps aux and take the value from VSZ.

The "memory leak" is rather "memory fragmentation", which is one of the reasons why busy sites should consider a daily restart.

have you checked the used libraries with ldd? What is your host operating system?

have you checked in the core dump file, where the crash happens? use gdb, see e.g. http://www.ibm.com/developerworks/library/l-gdb/.

It is true that you postgres configuration should be tuned, but that won't crash the aolserver. The error is a segmentation violation, which means that the process tries to access memory not part of its accessible memory. In most cases, this is due to an erratic pointer in a C-Program, but in general, the error can be triggered by multiple causes. Check the following thread and FAQ:
http://lists.debian.org/debian-user/1997/04/msg01588.html
http://www.bitwizard.nl/sig11/

If you are really confident that onbody has changed shared libraries etc. then it might be the hardware. If you have the option, move the installation to a different hardware and test it there. It is certainly worth to run extensive hardware checks.

-gustaf neumann