Forum OpenACS Q&A: Re: Fatal: received fatal signal 11

15: Re: Fatal: received fatal signal 11 - new error after years! (response to 13)

Posted by Andy Black on 08/21/09 03:54 PM

OK, thanks, we will look into that now. As far as we are aware, nobody has changed the configuration or updated anything.

On startup, nsd occupies around 18.0.
I was also watching the top command when it crashed, and nsd did not get above 20.0.

Another problem we have had with nsd for years now is what seems to be some sort of memory leak. The nsd memory usage creeps up over about two weeks, reaching around 65.0, and then the server crashes.
We then have to restart the service, after which nsd is using around 18.0 again, and the cycle repeats.
We have never been able to identify what causes this.

Could this be a contributing factor?

Thanks,
Andy

25: Re: Fatal: received fatal signal 11 - new error after years! (response to 15)

Posted by Gustaf Neumann on 08/21/09 11:09 PM

What units are "18.0" and "20.0"?
Check the size with ps aux and take the value from VSZ.
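
A minimal sketch of pulling VSZ out of `ps aux` output, assuming the usual Linux column layout (VSZ reported in KiB, COMMAND as the last column). The helper name `vsz_kib`, the sample output and the paths in it are made up for illustration and are not taken from the server discussed here.

```python
# Hypothetical sample of `ps aux` output; all values are invented.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nsadmin   1234  2.0  3.5 612345 370000 ?       Sl   Aug21   5:12 /usr/local/aolserver/bin/nsd -t config.tcl
postgres  2222  0.1  1.0 100000  20000 ?       S    Aug21   0:01 postgres: writer process
"""


def vsz_kib(ps_output, name):
    """Return (pid, VSZ) pairs for processes whose command contains `name`."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    vsz_col = header.index("VSZ")
    cmd_col = header.index("COMMAND")
    result = []
    for line in lines[1:]:
        # Split only up to the command column so arguments stay in one field.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue
        if name in fields[cmd_col]:
            result.append((int(fields[pid_col]), int(fields[vsz_col])))
    return result


# To read the live process table instead of the sample:
#   import subprocess
#   out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
#   print(vsz_kib(out, "nsd"))
```

Logging this value at regular intervals would show how fast the process grows between restarts.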

The "memory leak" is rather "memory fragmentation", which is one of the reasons why busy sites should consider a daily restart.

Have you checked the libraries in use with ldd? What is your host operating system?
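
A rough sketch of summarising `ldd` output, assuming the common glibc format `name => path (address)`, with `not found` for unresolved libraries. The function name `parse_ldd` and the sample lines are illustrative only. Keeping such a listing from a known-good state and comparing it later is one way to notice a changed or missing library.

```python
# Hypothetical `ldd` output; library names and addresses are invented.
SAMPLE_LDD = """\
\tlinux-gate.so.1 =>  (0xffffe000)
\tlibtcl8.4.so => /usr/lib/libtcl8.4.so (0xb7e00000)
\tlibexample.so => not found
\t/lib/ld-linux.so.2 (0xb7f00000)
"""


def parse_ldd(output):
    """Map each library name to its resolved path, or None if not found."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no "=>"
        name, _, target = line.partition("=>")
        name, target = name.strip(), target.strip()
        if target == "not found":
            libs[name] = None
        elif target.startswith("("):
            continue  # virtual object with a load address but no file path
        else:
            # Drop the trailing "(0x...)" load address.
            libs[name] = target.split(" (")[0].strip()
    return libs


def missing_libraries(libs):
    """Return the names of libraries that ldd could not resolve."""
    return sorted(name for name, path in libs.items() if path is None)
```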

Have you checked in the core dump file where the crash happens? Use gdb; see e.g. http://www.ibm.com/developerworks/library/l-gdb/.

It is true that your postgres configuration should be tuned, but that won't crash AOLserver. The error is a segmentation violation, which means that the process tried to access memory that is not part of its accessible memory. In most cases, this is due to a stray pointer in a C program, but in general, the error can have multiple causes. Check the following thread and FAQ:
http://lists.debian.org/debian-user/1997/04/msg01588.html
http://www.bitwizard.nl/sig11/
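
To illustrate what "signal 11" means, here is a small sketch that provokes a segmentation violation in a throwaway child process and reports the signal that killed it. It assumes a POSIX system where reading address 0 is not permitted; the names `run_faulting_child` and `describe` are invented for this example, and it has nothing to do with the actual cause of the nsd crash.

```python
import signal
import subprocess
import sys

# Child program: read a C string from address 0 via ctypes. That address is
# assumed to lie outside the process's accessible memory.
CHILD = "import ctypes; ctypes.string_at(0)"


def run_faulting_child():
    """Run the child and return its return code."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def describe(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        return signal.Signals(-returncode).name
    return "exit status %d" % returncode


# Usage:
#   print(describe(run_faulting_child()))
```

On Linux, SIGSEGV is signal number 11, which is the number the server log reports.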

If you are really confident that nobody has changed shared libraries etc., then it might be the hardware. If you have the option, move the installation to different hardware and test it there. It is certainly worth running extensive hardware checks.

-gustaf neumann