Forum OpenACS Q&A: AOLserver dying unexpectedly

Posted by C. R. Oldham on

We have a brand-new 4.6.2 site up and running.  AOLserver is dying unexpectedly at random intervals.  No messages in syslog or the server log, no indication that a certain URL is causing it.  Machine has 4 GB of RAM, is not swapping, and hosts 4 other AOLserver instances that have uptimes in the months.  Debian 3.0, kernel 2.4.20.  We've tried it under AOLserver 3.3+ad13 and AOLserver 3.5.6.

At Jon Griffin's suggestion we made sure that custom ErrorDocuments were defined and are being served.

Since I first tried to post this we have discovered that our old site (ACS 3.4.x based, aol33+ad13) also does the same thing, just not as frequently.  It has never done that before, it has run for months on end.

I am baffled.  Any suggestions?  Stacksize setting? (currently 500000)

One truly bizarre thing I noticed is *very* large values of /proc/sys/fs/file-max (in the 300000+ range).  On our development server that number never goes above the set value of 10240 (which I put there for Oracle).
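
For reference, on Linux /proc/sys/fs/file-max holds the system-wide limit on open file handles, while /proc/sys/fs/file-nr reports how many are currently allocated. A minimal sketch for comparing the two follows. The helper name `fd_usage` is made up for illustration, and it assumes the three-field file-nr layout (allocated, free, maximum).

```shell
#!/bin/sh
# fd_usage: hypothetical helper, not part of AOLserver or any package.
# Takes one line in /proc/sys/fs/file-nr format: "allocated free maximum".
fd_usage() {
    # Split the line into positional parameters on whitespace.
    set -- $1
    allocated=$1
    maximum=$3
    echo "allocated=$allocated max=$maximum percent=$((allocated * 100 / maximum))"
}

# On a live Linux box, feed it the kernel's current numbers.
if [ -r /proc/sys/fs/file-nr ]; then
    fd_usage "$(cat /proc/sys/fs/file-nr)"
fi
```

If the allocated count sits far below the maximum, file handle exhaustion is probably not what is killing the server.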

Any suggestions are welcome.

Posted by Jerry Asher on
I have one instance that does that, about every twelve hours, and I haven't had enough aolserver-fu to have been able to diagnose it.

Like you, many other aolserver instances run just fine, and it happens that the one that does this is a very strangely modified aolserver.  AOLserver 3.3+ad13 & OpenACS 3.2.5 + ACS 3.4 + ACS 4 code + a bunch of other stuff I felt compelled to agglomerate into the mix. And it's doing this on PG 7.3, (or whichever the latest PG is that OpenACS 3.2.5 isn't supposed to support.)

So I've always suspected it was some of my own brain damage.

On the other hand, this is a very lightly used aolserver instance, and the roughly twelve-hour period makes me feel it is some scheduled proc or cron job screwing around with my head.

I haven't checked /proc/sys/fs/file-max, maybe I should....

Posted by Janine Ohmer on
Are you running Redhat?  There was a RH version in the 7s, I've forgotten which one, which changed something that required everyone to up their stack size.  I currently have "expr 1024*1034" for the main stacksize (in ns/parameters) and "expr 512*1024" in the per-thread stacksize.

I'd try this even if you're not on Redhat, because the result of a too-small stacksize is exactly what you describe, sudden and immediate nsd death.

Posted by Tilmann Singer on
Is there a particular reason for the 1034 in "expr 1024*1034" or is it a typo?
Posted by Janine Ohmer on
Typo, sorry.
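
For comparison with the 500000 in the original post, the suggested values work out as shown below. This is plain shell arithmetic used only to illustrate the numbers, not the syntax used in the server config, and the variable names are made up.

```shell
#!/bin/sh
# Stack sizes discussed in the thread, in bytes.
current=500000                 # value from the original post
main_stack=$((1024 * 1024))    # suggested main stacksize, typo corrected
thread_stack=$((512 * 1024))   # suggested per-thread stacksize
typo_stack=$((1024 * 1034))    # what the mistyped expression evaluates to

echo "current=$current main=$main_stack thread=$thread_stack typo=$typo_stack"
```

So the current setting is slightly below the suggested per-thread size and roughly half the suggested main size.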
Posted by Jade Rubick on
Pretty unlikely, but I'd also check the log file sizes. If the logs are growing rapidly before being rolled, perhaps that could cause it.
Posted by C. R. Oldham on
I'm reposting this here for the benefit of anyone who might not read the AOLserver list.

We had analyzed our server logs and just couldn't come up with a pattern.  Today, however, I bit the bullet and am running the production server under gdb.  It is consistently crashing in the PayFlowPro module.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 9226 (LWP 5320)]
0x409ae28c in __umoddi3 ()
  from /usr/local/verisign/payflowpro/linux/lib/

(it sometimes crashes elsewhere in the library, often in pfproVersion, but the segfault is always in

I found a reference to this on Google:

Which basically says that the payflowpro library is statically linked with OpenSSL 0.9.5, but we are running 0.9.6c.  The PHP folks have this problem too.
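
One rough way to check a claim like this is to look for OpenSSL's version text inside the vendor binary, since OpenSSL builds embed a string such as "OpenSSL 0.9.5a" in their compiled code. A minimal sketch, assuming a grep that supports -a and -o; the helper name is hypothetical and the path in the usage comment is a placeholder, not the real SDK file name.

```shell
#!/bin/sh
# embedded_openssl_versions: hypothetical helper name.
# Prints each distinct "OpenSSL <version>" string found inside a binary file.
embedded_openssl_versions() {
    # -a treats binary data as text; -o prints only the matching part.
    LC_ALL=C grep -a -o 'OpenSSL [0-9][0-9.]*[a-z]*' "$1" | sort -u
}

# Usage (placeholder path):
# embedded_openssl_versions /path/to/vendor/library.so
```

A hit suggests, but does not prove, that a copy of OpenSSL was linked into the file; no output means the string was not found.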

This explains why our dev servers didn't crash with this problem--the pfpro module just doesn't get as much stress there as it does on production.

Now, how to fix this?  Not sure, we are trying to get a beta version of the lib from Verisign right now.

Posted by C. R. Oldham on
Brad just corrected me.  To be specific, our crashes are occurring in Verisign's PayFlowPro SDK library, *not* the PayFlowPro AOLserver module that he wrote to enable AOLserver to talk to Verisign.
Posted by C. R. Oldham on
Last night we reversed the module load order in nsd.tcl so the verisign module (which subsequently loads the PayFlowPro libraries) is loaded *after* nsopenssl.  Our server has been up for almost 24 hours, a record.  I think that is at least a working solution.  The real fix, of course, would be to get a copy of dynamically linked against OpenSSL, but Verisign still has not returned our calls to level 2 technical support.

So for the record, it appears fixed. :-)

Thanks to everyone that responded here and on the AOLserver list.  As I said over there, helpful people like you are what makes open source work.

Posted by Brad Duell on
I have verified, through VeriSign, that the SDK of PayflowPro (currently, v. 3.06) is statically linked against the openssl libraries.

They have no plans to change this design; therefore we will be changing payment processors.

Good luck!

Posted by C. R. Oldham on
I forgot to mention earlier that our server does still crash, it just does it much less frequently now.  Since it can take up to 90 seconds for it to auto-restart, we've decided (as Brad mentioned above) that Verisign is no longer an option.
Posted by Andrew Piskorski on
C.R., just what compiled libraries does PayflowPro ship with? Because I'm not sure, but you might be able to de-link into object files with ar -x and then re-link it to the appropriate OpenSSL library with ld. I know nothing about PayflowPro, but in the past I've used ar -x to take apart .a libraries and re-build them into a .so.

However, what is the real problem causing the segfaults here? Is it that (with OpenSSL 0.9.5 statically linked in) just isn't thread safe, period? Or is it some kind of version skew between the OpenSSL 0.9.5 and 0.9.6c libraries? If has OpenSSL 0.9.5 statically linked in, then why would it matter at all what other version of OpenSSL you have installed on your machine?
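
The ar -x idea above could be sketched roughly as follows. This only applies if the vendor ships a static .a archive; a .so that already has OpenSSL linked in cannot be taken apart this way. The helper name `relink_archive` is hypothetical, and the sketch uses the cc driver (which invokes ld) rather than calling ld directly.

```shell
#!/bin/sh
# relink_archive: hypothetical helper, sketching the "ar -x then re-link" idea.
#   $1 = static archive (.a), $2 = output shared library (.so),
#   remaining arguments = extra linker flags such as -lssl -lcrypto.
relink_archive() {
    # Resolve both paths to absolute ones, since we change directory below.
    archive=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    output=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
    shift 2
    workdir=$(mktemp -d) || return 1
    # Extract the member object files into a scratch directory.
    (cd "$workdir" && ar -x "$archive") || return 1
    # Link the objects into a shared library; the cc driver invokes ld.
    cc -shared -o "$output" "$workdir"/*.o "$@" || return 1
    rm -rf "$workdir"
}

# Usage (placeholder names):
# relink_archive libvendor.a libvendor.so -lssl -lcrypto
```

On many platforms the extracted objects must have been compiled as position-independent code for the final link to succeed, which is not something you control with a vendor-supplied archive.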