Forum OpenACS Q&A: Re: Fatal: received fatal signal 11 - new error after years!
My nsd process is taking up 150 MB of RAM and I have plenty of available RAM. No swapping is taking place.
However, every 10-20 minutes, the server goes down with a "fatal signal 11" error. This started a few days ago with an increase in traffic on the site. I have other nsds on the same server that are not experiencing this problem.
One odd thing I found:
Last request: 30/May/2012:18:57:22 -0400
Last signal 11:
[30/May/2012:18:58:48][17148.1105754448][-conn:1-] Fatal: received fatal signal 11
How could the server crash on a connection thread *after* the last connection?
Stacksize: 512 * 8192
Any idea what might be happening?
One occurs when threads exit nearly as fast as they are started, which generally requires maxconnsperthread to be set very low (around 20 or less, which you wouldn't have on a production server) and a high request rate, as in benchmarking. This problem is reduced, but I'm pretty sure not completely eliminated, in aolserver HEAD.
Another comes up if you set a redirect for 500 errors to a page that doesn't exist. The server looks for the error page, tries to redirect to not found, gets an error, looks for the error page again, tries to redirect, and so on, until it runs out of stack space and crashes. This problem is fixed in aolserver HEAD.
There's another one related to mmap failures (NULL is not MAP_FAILED) but I don't recall the specifics of that one.
A stack dump from a core file might help give a better idea.
Was the last request from "2012:18:57:22 -0400" also on conn:1? It would help if you could report the last few error log entries, including previous error.log entries from conn:1.
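To pull the recent entries for one connection thread out of the error log, something like the following might do. It is a minimal sketch, assuming the log lines carry the thread name in square brackets as in the excerpt quoted above; the function name conn_entries and the log path are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the last COUNT error-log lines written by one thread.
# Usage: conn_entries LOGFILE THREAD [COUNT]
conn_entries() {
    logfile=$1
    thread=$2
    count=${3:-20}
    # -F treats the pattern as a fixed string, so the brackets are literal.
    grep -F -- "[-${thread}-]" "$logfile" | tail -n "$count"
}

# Example (the log path is an assumption; adjust it to your installation):
# conn_entries /path/to/error.log conn:1 50
```

Comparing the timestamps of those lines with the last access log entry should show whether conn:1 was still busy when the signal arrived.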
Thread start and thread end are not unusual places where nsd might crash. E.g. during thread cleanup, all thread specific resources are freed, so any memory corruption will lead to a crash there.
What version of aolserver + tcl + libthread are you using? If you have xotcl-core installed, the quickest way is to check http://YOURSERVER/xotcl/version-numbers
Did you compile Tcl + aolserver + modules yourself? What platform? What version of gcc? What optimization flags?
Background: we experienced many problems under load (more than 800 concurrent users) with gcc 4.1.2 on POWER6+. Most of these were in the thread-local-storage management of tcl 8.5.*, where one sees e.g. crashes during regexp, when the internal representation of regular expressions is kept in thread-local storage. I rewrote some of these parts in tcl to use simpler platform/compiler-specific implementations of thread-local storage; then the problem moved to other places in the tcl implementation, also related to TLS. We were never able to produce simple test cases to trigger this crash. Interestingly enough, we never experienced the problem with exactly the same code base on intel, not even with 3000+ concurrent users. ... The message is, the platform/compiler/optimization flags might matter.
Did you get a core dump from the crash? If so, to narrow down the problem, run "gdb /SOMEPATH/nsd core.XXXX" and then type "bt" in gdb to see where the crash happened.
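If you would rather capture the backtrace into a file than work interactively, gdb can run the same command in batch mode. A rough sketch, assuming gdb is installed and the binary was built with debug symbols; the wrapper name core_backtrace is hypothetical, and the paths are whatever your nsd and core file actually are.

```shell
#!/bin/sh
# Hypothetical helper: print backtraces from a core file without an
# interactive gdb session.
# Usage: core_backtrace NSD_BINARY CORE_FILE
core_backtrace() {
    binary=$1
    core=$2
    if [ ! -f "$binary" ] || [ ! -f "$core" ]; then
        echo "core_backtrace: need an existing nsd binary and core file" >&2
        return 1
    fi
    # -batch makes gdb exit after running the -ex commands.
    # "bt" shows the crashing thread; "thread apply all bt" adds the
    # stacks of all other threads, which often helps with nsd.
    gdb -batch -ex "bt" -ex "thread apply all bt" "$binary" "$core"
}

# Example: core_backtrace /SOMEPATH/nsd core.XXXX > backtrace.txt 2>&1
```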
Just a small note: for the nsd process to generate a core dump, increase the maximum size of core files that the user running nsd can create. Usually you want an unlimited size: just run 'ulimit -c unlimited' before you start the nsd process; you can add that temporarily to your start-up script.
Then the core dump will be written to the Aolserver home directory (e.g. /usr/local/aolserver/). Also make sure that the user running the process has permission to write into that directory; otherwise it won't be able to generate the core dump.
Also, if you have nasty code that changes directories for whatever reason, the core dump may be written to a directory other than the Aolserver home directory, so just be aware of that.
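The checks above could be put into the start-up script roughly like this. It is only a sketch: the function name prepare_core_dumps is made up, it has to run as the same user that runs nsd for the write check to mean anything, and the core_pattern lookup is an assumption that applies to Linux only (there the kernel setting can rename or relocate core files).

```shell
#!/bin/sh
# Hypothetical pre-start check for core dumps; call it from the start-up
# script, as the nsd user, before launching nsd.
# Usage: prepare_core_dumps SERVER_HOME
prepare_core_dumps() {
    home=$1
    # Raise the soft limit; this fails if the hard limit is lower.
    if ! ulimit -c unlimited 2>/dev/null; then
        echo "warning: could not raise core size limit (now: $(ulimit -c))" >&2
    fi
    # The user running nsd must be able to write into the directory.
    if [ ! -d "$home" ] || [ ! -w "$home" ]; then
        echo "error: $home is not a writable directory" >&2
        return 1
    fi
    # Linux only (assumption): core_pattern may redirect or rename core files.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
    echo "core limit: $(ulimit -c)"
}

# Example: prepare_core_dumps /usr/local/aolserver || exit 1
```

Because ulimit only affects the current shell and its children, the function has to be called in the same script that later starts nsd, not in a separate one.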