Forum OpenACS Q&A: AOLserver dying unexpectedly
We have a brand-new 4.6.2 site up and running. AOLserver is dying unexpectedly at random intervals. No messages in syslog or the server log, no indication that a certain URL is causing it. Machine has 4 GB of RAM, is not swapping, and hosts 4 other AOLserver instances that have uptimes in the months. Debian 3.0, kernel 2.4.20. We've tried it under AOLserver 3.3+ad13 and AOLserver 3.5.6.
At Jon Griffin's suggestion we made sure that custom ErrorDocuments were defined and are being served.
Since I first tried to post this we have discovered that our old site (ACS 3.4.x based, aol33+ad13) also does the same thing, just not as frequently. It has never done that before; it has run for months on end.
I am baffled. Any suggestions? Stacksize setting? (currently 500000)
One truly bizarre thing I noticed is *very* large values of /proc/sys/fs/file-max (like in the 300000+ range). On our development server that number never goes above the set value of 10240 (which I put there for Oracle).
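For anyone wanting to compare handle counts against the limit: on Linux, /proc/sys/fs/file-max holds the configured ceiling, while the neighbouring /proc/sys/fs/file-nr reports three numbers, the first being the handles currently allocated and the last being the same ceiling. Below is a minimal Python sketch that reads file-nr and reports how close the box is to the limit. The function names are made up for illustration, and the middle field is ignored because its meaning has varied between kernel versions.

```python
def parse_file_nr(text):
    """Parse the three whitespace-separated numbers in /proc/sys/fs/file-nr."""
    allocated, middle, maximum = (int(x) for x in text.split())
    # 'middle' is kept only for completeness; its meaning differs by kernel.
    return {"allocated": allocated, "middle": middle, "max": maximum}


def handle_usage(text):
    """Return (allocated, maximum, allocated/maximum) from file-nr contents."""
    info = parse_file_nr(text)
    return info["allocated"], info["max"], info["allocated"] / info["max"]


if __name__ == "__main__":
    # Assumes a Linux /proc filesystem is mounted.
    with open("/proc/sys/fs/file-nr") as f:
        allocated, maximum, ratio = handle_usage(f.read())
    print("allocated=%d max=%d (%.1f%% of limit)" % (allocated, maximum, ratio * 100))
```

Running it from cron every few minutes and logging the output would show whether the count climbs before a crash.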
Any suggestions are welcome.
Like you, many other aolserver instances run just fine, and it happens that the one that does this is a very strangely modified aolserver. AOLserver 3.3+ad13 & OpenACS 3.2.5 + ACS 3.4 + ACS 4 code + a bunch of other stuff I felt compelled to agglomerate into the mix. And it's doing this on PG 7.3, (or whichever the latest PG is that OpenACS 3.2.5 isn't supposed to support.)
So I've always suspected it was some of my own brain damage.
On the other hand, this is a very lightly used aolserver instance, and the roughly twelve hour period makes me feel it is some scheduled proc or cron job screwing around with my head.
I haven't checked /proc/sys/fs/file-max, maybe I should....
I'd try this even if you're not on Redhat, because the result of a too-small stacksize is exactly what you describe, sudden and immediate nsd death.
We had analyzed our server logs and just couldn't come up with a pattern. Today, however, I bit the bullet and am running the production server under gdb. It is consistently crashing in the PayFlowPro module.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 9226 (LWP 5320)]
0x409ae28c in __umoddi3 ()
(it sometimes crashes elsewhere in the library, often in pfproVersion, but the segfault is always in libpfpro.so)
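When the debugger shows only a raw address or an unhelpful symbol, one way to confirm which shared object a crash address belongs to is to look it up in the process's /proc/<pid>/maps. Here is a rough Python sketch of that lookup. The function name and the sample mapping line are hypothetical (the address range and path are invented for illustration, not taken from the real server); only the crash address comes from the gdb output above.

```python
def find_mapping(maps_text, address):
    """Return the path of the mapping containing 'address', or None.

    maps_text is the content of /proc/<pid>/maps.
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if not fields:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            # Anonymous mappings have no sixth field.
            return fields[5].strip() if len(fields) > 5 else "[anonymous]"
    return None


# Hypothetical maps excerpt; real ranges and paths will differ.
SAMPLE_MAPS = """\
08048000-080a0000 r-xp 00000000 03:01 1111 /usr/local/aolserver/bin/nsd
40990000-409c0000 r-xp 00000000 03:01 2222 /usr/local/lib/libpfpro.so
409c0000-409c8000 rw-p 00000000 00:00 0
"""

if __name__ == "__main__":
    print(find_mapping(SAMPLE_MAPS, 0x409AE28C))
```

On a live process the text would come from reading /proc/<pid>/maps while nsd is stopped in gdb.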
I found a reference to this on Google:
Which basically says that the payflowpro library is statically linked with OpenSSL 0.9.5, but we are running 0.9.6c. The PHP folks have this problem too.
This explains why our dev servers didn't crash with this problem--the pfpro module just doesn't get as much stress there as it does on production.
Now, how to fix this? Not sure, we are trying to get a beta version of the lib from Verisign right now.
So for the record, it appears fixed.
Thanks to everyone that responded here and on the AOLserver list. As I said over there, helpful people like you are what makes open source work.
They have no plans to change this design, therefore we will be changing payment processors.
ar -x and then re-link it to the appropriate OpenSSL library with ld. I know nothing about PayflowPro, but in the past I've used ar -x to take apart .a libraries and re-build them into a .so.
However, what is the real problem causing the segfaults here? Is it that libpfpro.so (with OpenSSL 0.9.5 statically linked in) just isn't thread safe, period? Or is it some kind of version skew between the OpenSSL 0.9.5 and 0.9.6c libraries? If libpfpro.so has OpenSSL 0.9.5 statically linked in, then why would it matter at all what other version of OpenSSL you have installed on your machine?