Forum OpenACS Q&A: Strange Crash in OpenACS: ns_rand

Posted by Tom Jackson on

Over the weekend an OpenACS 4.6 checkout developed a very nasty bug. Upon startup, everything goes fine. The last log line before the crash and restart is:

[12/Nov/2002:13:23:33][23597.4101][-conn0-] Notice: random: generating 1 seeds

This log message comes from the function Ns_GenSeeds in nsd/random.c.

The result was a server restarting every minute or so. I initially thought this had something to do with watchdog, which I had just activated, and which ran just before the ns_rand code. I deactivated watchdog, which was sending me error mail every minute or less, but the problem still existed.

I finally used the same nsd.tcl startup script on a simple tcl directory, just an init.tcl file which contained a call to ns_rand. The server started up just fine. ns_rand didn't seem to cause a crash in this situation.

Posted by Dan Wickstrom on
That's interesting.  After downloading and installing openacs 4.6 and dotlrn-1-0, I can consistently crash (segfault) aolserver by selecting the "my portal" link in dotlrn.  The backtrace is very cryptic, but it shows that aolserver was executing in the regexp code for tcl8.3.  I've tried unloading all unnecessary modules without any change.  I'm using aolserver3.3+ad13 on solaris 2.8.

What version of aolserver are you using?  Have you tried loading the core file to see where the crash occurred?

Posted by Tom Jackson on

I'm using AOLserver/3.4.2.

I should probably stick in a few more Ns_Log calls to see if the crash happens in Ns_GenSeeds.

Posted by Dave Bauer on
Dan,

Your problem looks like the stacksize issue. If your stacksize in the nsd.tcl is less than 500000 try increasing it.
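
For reference, a minimal sketch of where this setting usually lives. It assumes the parameter is read from the ns/threads section, as in the sample nsd.tcl distributed with OpenACS; check your own config file, since section names can differ between AOLserver versions.

```tcl
# nsd.tcl fragment (sketch, not a complete config file).
# Assumption: stacksize is read from the ns/threads section.
ns_section "ns/threads"
ns_param   stacksize 500000   ;# bytes; values below this are the ones suspected of causing crashes
```

The server has to be restarted for a change to nsd.tcl to take effect.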

Posted by Janine Ohmer on
Dan, I had the same problem once, on a new install.  I went around and around in circles trying to figure it out, including reinstalling several times.  Finally in desperation I installed exactly the same code on a different system;  no problems there.  The only difference was that the one that didn't work was running a slightly different version of Redhat than the others (and no, I'm afraid I don't remember what version it was).  I think it was a Postgres install, though I'm not 100% positive.

I know you're on Solaris so it's not directly related but it might be interesting to try it on a different system and see what happens.

Posted by Dan Wickstrom on
Dave, my stacksize is set to 500000.  Does it need to be any bigger than that?

Janine, I'm going to try an install on redhat 8, and see if I can get it to work there.  Speaking of redhat, has anyone installed oracle on redhat 8?

Posted by Jun Yamog on
Hi Dan,

I have installed Oracle8i on RH 8.  You will need RH 7.3 (in particular disk 2) for the compat libs to RH 6.2.  RH 8 does not have the compat libs anymore.  So you can use 8i on RH 8.  Works ok on OpenACS and CCM.  I have also renamed the ctxhx0 to ctxhx... works ok.

I can email you later the compat libs listing (rpm -qa | grep compat) that I have installed in case you want them.

Do you think 4.7 will make use of 9i?  The only reason that I am sticking to 8i right now is because of OpenACS.

Oh yeah... in case you are using JDBC, the latest downloadable Oracle drivers (classes12.zip) do not seem to work.  So I use thin mode, or if I use oci8 I use an older version.

Posted by Cheng-Yi Hsu on
Yes, I installed Oracle 9.0.1 on Redhat 8 with no problem; you can refer to this document. After installing oracle, I downgraded gcc to 2.96 and built the oracle driver. Redhat 8 seems to have better thread support: if you run ps -ef|grep nsd, you will find only one nsd process. I think aolserver performance on a PIII 1G/Redhat 8 will beat a SUN 280R, and we have started suggesting that our customers use Redhat 8 to replace SUN or HP application servers.
Posted by Dan Wickstrom on
Thanks for the tip Jun, it looks like the install docs which refer to redhat 7.3 can also be applied to redhat 8.0.  Once I installed the compat libs the installer started working correctly.  I haven't quite finished the install, but when I'm done, I'll update the install docs to add a note about installing with redhat 8.

As far as oracle 9, there were some past discussions about using oracle 9 with openacs 4.  I think there are still some issues with regards to running oracle 9 with openacs 4.  If there are still any issues left, I think we should think about rectifying them for the 4.7 release.

Posted by Dan Wickstrom on
Janine, I upgraded to aolserver 3.5.1 and the crash went away on solaris.  Possibly there was a bug in aolserver.  Hopefully, aol will adopt the i18n patches from aolserver3.3+ad13, so we can get back to using the latest version of aolserver for openacs.
Posted by Tom Jackson on

This is an update on the crash associated with ns_rand. I am using an old version of glibc, 2.2.4, which apparently doesn't have the stacksize issue mentioned by Dan. Carl Garland suggested that I move the initial call to ns_rand to my init.tcl file so that there would be less of a noticeable delay for the first call. Also, the C function Ns_GenSeeds uses some interesting thread/semaphore code, which might actually be the cause of the problem.

Since this is a new issue, which might not be clear to future readers of this thread, it would be great if the new forums package would allow moving the stacksize discussion to a separate new thread.

Until ns_rand is fixed, OpenACS installations should probably call it in their init.tcl file:

ns_log Notice "init.tcl: about to initialize ns_rand"
set x [ns_rand 1000000]
ns_log Notice "init.tcl: ns_rand initialized x=$x"
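
Since part of the reason for calling ns_rand early is the delay on the first call, a variant of the same workaround can also log roughly how long the seeding took. This is only a sketch: the variable names are illustrative, and clock seconds has one-second resolution, so short delays will show up as 0.

```tcl
# init.tcl fragment (sketch): seed ns_rand at startup and log the approximate delay.
# Assumption: this runs inside AOLserver, where ns_log and ns_rand are available.
ns_log Notice "init.tcl: about to initialize ns_rand"
set start [clock seconds]
set x [ns_rand 1000000]                        ;# first call triggers seed generation
set elapsed [expr {[clock seconds] - $start}]  ;# whole seconds only
ns_log Notice "init.tcl: ns_rand initialized x=$x after about $elapsed second(s)"
```

If the server dies before the second log line appears, that points at the seeding step rather than at later OpenACS code.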