Forum OpenACS Q&A: Site keeps going down (and automatically restarted)

Hi everyone:

Most of my OpenACS sites are running just swimmingly, but one site is having very peculiar behavior.

Ever since I set up the site, it has had a recurring problem where it just stops responding to web requests.

I have set up the keepalive scripts, and they restart the server whenever it becomes unavailable. However, the scripts seem to take a while to take effect -- like they are waiting for a timeout or something. Thus, when the restart happens, I get 5-7 restarts at the same time.

I have the keepalive script set up to email me when the server restarts. Here are the times from the recent restart (between 1/19/05 and 1/25/05):

1/19/05 10:46 PM
1/19/05 10:46 PM
1/19/05 10:46 PM
1/19/05 10:46 PM
1/19/05 10:47 PM
1/19/05 10:47 PM
1/20/05 4:14 PM
1/20/05 4:14 PM
1/20/05 4:14 PM
1/20/05 4:15 PM
1/20/05 4:15 PM
1/21/05 3:13 PM
1/21/05 3:13 PM
1/21/05 3:13 PM
1/21/05 3:14 PM
1/21/05 3:14 PM
1/21/05 3:14 PM
1/21/05 3:14 PM
1/21/05 3:15 PM
1/23/05 7:26 PM
1/24/05 2:14 PM
1/24/05 2:14 PM
1/24/05 2:14 PM
1/24/05 2:14 PM
1/24/05 2:14 PM
1/24/05 2:15 PM
1/25/05 10:46 AM
1/25/05 10:46 AM
1/25/05 10:46 AM
1/25/05 10:46 AM
1/25/05 10:47 AM
1/25/05 1:50 PM
1/25/05 1:50 PM
1/25/05 1:50 PM
1/25/05 1:50 PM
1/25/05 1:51 PM
1/25/05 1:51 PM
1/25/05 1:52 PM

One thing I did recently is set the server to restart once a night. That hasn't seemed to help. I'm disabling that now.

I currently have the crontab for the user owning the AOLserver process set as follows:

*/4 * * * * /bin/sh /var/lib/aolserver/usb/etc/keepalive/keepalive-cron.sh > /dev/null 2>&1

I will switch that now to */5 to see if that helps prevent the multiple restarts.

I've looked at the config.tcl file, and there are no substantial changes from the standard OpenACS 5.1.2 config.tcl file (I did a diff and compared them).

Any suggestions? Should I up the maxconnections or anything else here?

ns_section ns/server/${server}
ns_param directoryfile $directoryfile
ns_param pageroot $pageroot
ns_param maxconnections 5
ns_param maxdropped 0
ns_param maxthreads 5
ns_param minthreads 5
ns_param threadtimeout 120
ns_param globalstats false ;# Enable built-in statistics
ns_param urlstats false ;# Enable URL statistics
ns_param maxurlstats 1000 ;# Max number of URL's to do stats on
#ns_param directoryadp $pageroot/dirlist.adp ;# Choose one or the other
#ns_param directoryproc _ns_dirlist ;# ...but not both!
#ns_param directorylisting fancy ;# Can be simple or fancy

I would really appreciate some suggestions.

Posted by Jeff Davis on
wget has a default read timeout of 900 seconds according to the man page. What I think happens is that the first one takes 15 minutes to time out and restart the server. Then the other 3, which have been trying to read as well, fail immediately when the server is restarted, which gives you the next 3 restarts.

Adding --timeout=20 to the wget might fix it (it's worth a try anyway). I suspect that if the server is taking more than 20 seconds to serve the dbtest page, it's probably in a sick state.
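As a rough sketch of that suggestion (not the actual OpenACS keepalive script), a probe with a bounded wait might look like the following. The names check_server and PROBE_TIMEOUT are made up for illustration, the URL in the usage comment is an assumption about where the dbtest page lives, and the check assumes that page returns the word "success" when healthy. The --tries=1 flag goes beyond what was suggested above; it is there because wget retries by default.

```shell
#!/bin/sh
# Sketch only: a keepalive probe that cannot wait longer than PROBE_TIMEOUT.
# check_server and PROBE_TIMEOUT are hypothetical names, not from the real script.

PROBE_TIMEOUT=20   # seconds, the value suggested in this thread

# Returns 0 if the page arrives within the timeout and contains "success"
# (assumption: the test page prints that word when the database is reachable).
check_server() {
    url=$1
    # --timeout caps the DNS, connect and read waits; --tries=1 turns off retries
    wget -q -O - --timeout="$PROBE_TIMEOUT" --tries=1 "$url" 2>/dev/null \
        | grep -q "success"
}

# Example use (the URL is an assumption; the restart step is left to your setup):
# check_server "http://127.0.0.1:8000/SYSTEM/dbtest" || echo "server looks unhealthy"
```

If wget fails or times out, the pipe is empty, grep finds nothing, and the function returns non-zero, so a hung server and a wrong answer are treated the same way.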

Posted by Torben Brosten on
Regarding maxconnections, Dossy writes on 20040818 (from aolserver list, where the archives are apparently down right now):

> BTW, what's the guidelines on setting maxconnections? Should it be the
> same as max and min threads?

There's no hard rules, but I'd recommend setting maxconns up around
100-150. Essentially, 30x-50x the threads, as a rough guide.

-- Dossy

Posted by Janine Ohmer on
We (furfly) had a client site doing this recently, and it turned out that they weren't setting the stacksize in the ns/parameters section, so it was defaulting to some too-small value. You might want to check to make sure you've got that, and if you do, then maybe increase it.

Your AOLserver config settings look kind of crappy. First of all, your 120 s threadtimeout is much too low. Secondly, why do you have max and minthreads both set to only 5? Are you sure that your site isn't hanging simply because you have 6 clients trying to load a page at once?

I'm not sure what maxconnections should be set to, but I simply never set it at all. Looks like it defaults to 100 (at least in AOLserver 4.0.x).

What is different about this problem site vs. your other OpenACS sites?

Posted by Don Baccus on
If you have maxconnections set to five and six people hit the site, it isn't going to hang. The webserver queues requests and assigns them to threads as they become available ... there is a configurable limit on how many requests it will queue, too, but if you hit that limit AOLserver returns an error to the user. It's not supposed to hang or crash in such situations.

Just thought I'd correct one simplistic misconception.

Posted by Jade Rubick on
Thank you all for your help.

Bumping the crontab job to every five minutes seemed to stop the issue from happening at all. Which is awfully weird.

I haven't changed the keepalive scripts, but I put a link to this thread in them, since I've run into this problem before.

I was running the server on the OpenACS defaults. I'm committing changes to those defaults based on this thread:

safe4all-dev@safe4all:~/oacs-5-1/etc$ cvs diff -u config.tcl
jader@cvs.openacs.org's password:
Index: config.tcl
===================================================================
RCS file: /cvsroot/openacs-4/etc/config.tcl,v
retrieving revision 1.19.2.18
diff -u -r1.19.2.18 config.tcl
--- config.tcl  19 Jan 2005 00:46:47 -0000      1.19.2.18
+++ config.tcl  7 Feb 2005 00:34:05 -0000
@@ -131,11 +131,11 @@
ns_section ns/server/${server}
    ns_param  directoryfile      $directoryfile
    ns_param  pageroot          $pageroot
-    ns_param  maxconnections    5
+    ns_param  maxconnections    100      ;# Max connections to put on queue
    ns_param  maxdropped        0
-    ns_param  maxthreads        5
+    ns_param  maxthreads        10
    ns_param  minthreads        5
-    ns_param  threadtimeout      120
+    ns_param  threadtimeout      120      ;# Idle threads die at this rate
    ns_param  globalstats        false    ;# Enable built-in statistics
    ns_param  urlstats          false    ;# Enable URL statistics
    ns_param  maxurlstats        1000    ;# Max number of URL's to do stats on

Posted by Alex Kroman on
I've been running into the same problem that Jade mentions above and the problem is exactly what Jeff says. Something is hanging the server and every 3 minutes another keepalive cron is spawned because the previous one is still waiting for the 15 minute timeout.

To prevent this I simply added --timeout=20 to the wget script. I'm not sure if this timeout threshold would be good for everyone, but we should add some sort of timeout to the script and/or advise people to make sure the timeout they set is shorter than the length of time between cron runs.
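To make that rule concrete, here is a small sanity check one could run before picking a cron interval. It is only a rough estimate: it multiplies the timeout by the number of tries and ignores any pause wget adds between retries. The function name probe_fits_interval is made up for this example.

```shell
#!/bin/sh
# Sketch only: does the worst-case probe time fit inside one cron interval?
# probe_fits_interval is a hypothetical helper, not part of the keepalive scripts.

# Arguments: timeout in seconds, number of tries, cron interval in minutes.
# Returns 0 when timeout * tries is strictly less than the interval.
probe_fits_interval() {
    timeout_s=$1
    tries=$2
    interval_min=$3
    [ $((timeout_s * tries)) -lt $((interval_min * 60)) ]
}

# wget's default 900 s read timeout against a */4 crontab: does not fit
probe_fits_interval 900 1 4 || echo "probe can outlive the cron interval"

# --timeout=20 with a single try against a */5 crontab: fits
probe_fits_interval 20 1 5 && echo "probe finishes before the next cron run"
```

With the defaults, each cron run can start while the previous one is still waiting, which matches the pile-up of restarts described earlier in the thread.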

Posted by Jade Rubick on
Another problem is that the script is not threadsafe. Or at least the version I'm looking at.

If multiple copies of keepalive are being run, they all look at the same file.
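One common way around this, sketched here as an idea rather than as a patch to the real script, is to take a lock at the top of the script so that only one copy runs at a time. This version uses mkdir, which either creates the directory or fails, so two copies cannot both get the lock. The lock path and the function names are assumptions for the example.

```shell
#!/bin/sh
# Sketch only: let one keepalive run at a time by using a directory as a lock.
# LOCK_DIR, acquire_lock and release_lock are hypothetical names.

LOCK_DIR=${LOCK_DIR:-/tmp/keepalive.lock}

# mkdir fails if the directory already exists, so only one caller succeeds.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

# Remove the lock so the next cron run can proceed.
release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

# Intended use at the top of the keepalive script:
# if acquire_lock; then
#     trap release_lock EXIT
#     ... probe the server and restart it if needed ...
# else
#     exit 0   # another copy is still running, so leave it alone
# fi
```

The trade-off is that a copy killed without running its trap leaves a stale lock behind, so this works best together with a short wget timeout that keeps each run brief.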