Forum OpenACS Q&A: Re: Fatal: received fatal signal 11

13: Re: Fatal: received fatal signal 11 - new error after years! (response to 12)

Posted by Gustaf Neumann on 08/21/09 03:25 PM

4000 notification entries are not enough to cause any serious trouble; a couple of million might.

Looking closer at the log, I see that the crash happens in -sched- (the master scheduler) and not in a scheduled client thread, so the notifications hypothesis is becoming unlikely.

how much memory does nsd occupy after startup (or better, before the crash)? check with ps.

My current guess is that the master scheduler tries to start a client scheduler thread and crashes, but how come this happens suddenly, without anything changing on the system? Are you sure that nobody changed the configuration or updated some shared libraries?
E.g. when nsd or its modules use shared libraries installed system-wide, an update might kill nsd (e.g. when someone replaces libtcl). Check your libraries with "ldd .../nsd".
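A minimal sketch of that library check, assuming a Linux-style ldd whose output has the form "name => /path (address)". NSD_BIN is a placeholder for the full path of your nsd binary, and lib_paths is a hypothetical helper name. The sketch first shows any libraries that cannot be resolved, then lists the resolved ones newest first, so a recently replaced libtcl would appear near the top.

```shell
# Extract the resolved library paths from `ldd` output read on stdin.
# Lines such as "libfoo.so => not found" or the dynamic loader line
# are skipped because their third field is not an absolute path.
lib_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# NSD_BIN is a placeholder: set it to the full path of your nsd binary.
NSD_BIN=${NSD_BIN:-}
if [ -n "$NSD_BIN" ] && [ -x "$NSD_BIN" ]; then
    # Libraries the dynamic linker cannot find at all.
    ldd "$NSD_BIN" | grep 'not found'
    # Resolved libraries, sorted by modification time (newest first).
    ldd "$NSD_BIN" | lib_paths | xargs ls -lLt
fi
```

A modification date that is much newer than the last known-good restart of nsd is a hint, not proof, that a library update is involved.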

Checking the core file to see where the crash happens might also give a hint.

15: Re: Fatal: received fatal signal 11 - new error after years! (response to 13)

Posted by Andy Black on 08/21/09 03:54 PM

ok, thanks we will look into that now. As far as we are aware nobody has changed the configuration or updated anything.

On startup, nsd occupies around 18.0
I was also watching the top command whilst it crashed and nsd did not get above 20.0

Another problem we have had with nsd for years now is what seems to be some sort of memory leak. nsd memory usage creeps up over, say, 2 weeks, reaching around 65.0, and then the server crashes.
We then have to restart the service, after which nsd is back at around 18.0 and the cycle repeats.
We have never been able to identify what causes this.

could this be a contributing factor?

Thanks,
Andy

16: Re: Fatal: received fatal signal 11 - new error after years! (response to 13)

Posted by Shahid Butt on 08/21/09 03:57 PM

Could the AOLserver installation itself be corrupted? Do we need to rebuild it?

25: Re: Fatal: received fatal signal 11 - new error after years! (response to 15)

Posted by Gustaf Neumann on 08/21/09 11:09 PM

what units are "18.0" and "20.0"?
check the size with ps aux and take the value from VSZ.
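A small sketch of that check, assuming the usual "ps aux" column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND), in which VSZ and RSS are reported in KiB. nsd_mem is a hypothetical helper name; it keeps only the PID, VSZ and RSS columns for processes whose command is nsd.

```shell
# Filter `ps aux` output read on stdin: print PID, VSZ and RSS for the
# header line and for every process whose command (column 11) is the
# given name, with or without a leading directory path.
nsd_mem() {
    awk -v name="${1:-nsd}" '
        NR == 1 { print $2, $5, $6; next }
        {
            n = split($11, parts, "/")
            if (parts[n] == name) print $2, $5, $6
        }
    '
}

ps aux | nsd_mem nsd
```

Running this once after startup and again shortly before an expected crash gives two comparable VSZ figures with known units.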

What looks like a "memory leak" is rather memory fragmentation, which is one of the reasons why busy sites should consider a daily restart.

have you checked the used libraries with ldd? What is your host operating system?

have you checked the core dump file to see where the crash happens? use gdb, see e.g. http://www.ibm.com/developerworks/library/l-gdb/.
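As a rough sketch of the gdb step, assuming gdb is installed and a core file was actually written (core dumps may need to be enabled first, e.g. with "ulimit -c unlimited" in the shell that starts nsd). core_bt is a hypothetical helper name and both paths are placeholders. It prints a backtrace for every thread in the core, which should show whether the crashing frame is in nsd itself, in libtcl, or in another library.

```shell
# Print a backtrace of all threads from a core file, non-interactively.
# $1: path to the nsd binary that produced the core
# $2: path to the core file
core_bt() {
    bin=$1
    core=$2
    if [ ! -x "$bin" ] || [ ! -r "$core" ]; then
        echo "usage: core_bt /path/to/nsd /path/to/core" >&2
        return 2
    fi
    # -batch exits after running the commands given with -ex.
    gdb -batch -ex 'thread apply all bt' "$bin" "$core"
}
```

Call it as "core_bt /path/to/nsd /path/to/core" with your own paths. The backtrace is most useful when nsd and its libraries were built with debugging symbols.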

It is true that your postgres configuration should be tuned, but that won't crash aolserver. The error is a segmentation violation, which means that the process tried to access memory outside its accessible address space. In most cases this is due to an errant pointer in a C program, but in general the error can have multiple causes. Check the following thread and FAQ:
http://lists.debian.org/debian-user/1997/04/msg01588.html
http://www.bitwizard.nl/sig11/

If you are really confident that nobody has changed shared libraries etc., then it might be the hardware. If you have the option, move the installation to different hardware and test it there. It is certainly worth running extensive hardware checks.

-gustaf neumann
