Forum OpenACS Q&A: Problem with OpenACS site that stops responding (and keepalive script)

I'm seeing some very strange behavior on one of my OpenACS installations (http://www.usbakery.com).

First of all, the site itself seems to die randomly every 24 hours or so. I'm running OpenACS 5.1.2, AOLserver 4.08, and the newest nsopenssl.

The extra packages I'm using are:

file-storage
edit-this-page
notifications
oacs-dav
postcard
survey

What is strange is that I don't see any unusual behavior in the error.log. The site is very simple, with no real customization except for templates and so on.

The site just stops responding. The AOLserver processes are still running, however.

I should have telnetted to the port to see how it responds. I'll try that next time.

I've got the etc/keepalive script running, and it also doesn't seem to work as advertised. It takes between 3 and 20 minutes to actually restart the server. I wonder if the wget command, which retries a number of times, is being too cautious or something. An older version of the keepalive script didn't seem to have these problems.
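
One possible explanation for the slow reaction, offered as a guess rather than a diagnosis: GNU wget retries up to 20 times by default and its read timeout defaults to 900 seconds, so unless the script passes explicit limits, a probe against a server that accepts connections but never answers can block for a long time. A minimal sketch of a bounded probe follows. The function name `probe_url` and the 10-second default are illustrative and are not taken from the shipped keepalive script.

```shell
#!/bin/sh
# probe_url URL [TIMEOUT_SECONDS]
# Returns 0 if the URL answers within the time limit, non-zero otherwise.
# Assumes GNU wget is installed.
probe_url() {
    url=$1
    timeout=${2:-10}
    # --tries=1 disables retries; --timeout sets the DNS, connect and
    # read timeouts to the same value.
    wget --quiet --tries=1 --timeout="$timeout" \
         --output-document=/dev/null "$url"
}
```

Calling `probe_url http://69.93.192.95 10` by hand while the site is hung would show whether a single, time-limited request fails quickly, which is what a keepalive check needs.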

Here's how I have it set up:

usb@www:~/usb-site/log$ crontab -l
3 1-23 * * * /usr/lib/postgresql/bin/vacuumdb --analyze usb > /dev/null 2>&1
3 0 * * * /usr/lib/postgresql/bin/vacuumdb --full --analyze usb > /dev/null 2>&1
30 0 * * * /usr/bin/pg_dump -f /var/lib/aolserver/usb/database-backup/backup.dmp usb
*/4 * * * * /bin/sh /var/lib/aolserver/usb/etc/keepalive/keepalive-cron.sh
30 0 * * * /usr/share/analog-5.32/analog -G -g/var/lib/aolserver/usb/etc/analog.cfg

keepalive-cron.sh uses the keepalive-config script:

# Config file for the keepalive.sh script
#
# @author Peter Marklund

# The servers_to_monitor variable should be a flat list with URLs to monitor
# on even indices and the commands to execute if the server doesn't respond
# on odd indices, like this:
# {server_url1 restart_command1 server_url2 restart_command2 ...}
set servers_to_monitor {http://69.93.192.95 "/home/usb/bin/restart-server"}

# How long the keepalive script waits until it attempts another restart
set seconds_between_restarts [expr 10*60]
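
The `seconds_between_restarts` setting implies a simple throttle: skip the restart command if the previous restart happened too recently. The shipped script has its own implementation, which is not shown in this thread. The shell sketch below only illustrates the idea; the function names `restart_allowed` and `record_restart` and the stamp-file approach are assumptions.

```shell
#!/bin/sh
# restart_allowed STAMP_FILE MIN_INTERVAL [NOW]
# Returns 0 if no restart is recorded yet, or if the last recorded
# restart is at least MIN_INTERVAL seconds old; returns 1 otherwise.
# NOW defaults to the current epoch time (date +%s, as on GNU and BSD).
restart_allowed() {
    stamp_file=$1
    min_interval=$2
    now=${3:-$(date +%s)}
    # No stamp file means no restart has been recorded yet.
    [ -f "$stamp_file" ] || return 0
    last=$(cat "$stamp_file")
    [ $(( now - last )) -ge "$min_interval" ]
}

# record_restart STAMP_FILE [NOW]
# Stores the time of the restart that was just issued.
record_restart() {
    echo "${2:-$(date +%s)}" > "$1"
}
```

Under this sketch, with a cron job every four minutes and a 600-second interval, a restart issued on one run would block the next two runs, so a second restart could happen twelve minutes later at the earliest.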

And the restart-server script is as follows:

tail -n 100 /home/usb/usb-site/log/error.log | /usr/bin/mail -s "*** USB Server Restart" myemailaddress@myservername.com
/usr/local/bin/svc -t /service/usb

Is anyone else using the keepalive script exactly as it comes with the 5.1.2 installation, and finding that it works perfectly?

Is anyone else having their servers stop responding like this? It reminds me of the pre-AOLserver 4.08 + nsopenssl HEAD problems we were having earlier...

Any suggestions?

Oh, and another thing:

Every once in a while, when I do restart the service with svc -t /service/usb, I get this error in the error.log:

[17/Nov/2004:10:52:40][22685.1024][-main-] Error: nsopenssl: failed to listen on 0.0.0.0:443: Permission denied
[17/Nov/2004:10:52:40][22685.1024][-main-] Error: nssock: failed to listen on 0.0.0.0:80: Permission denied
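
On Linux, binding to ports below 1024 normally requires root privileges, so one thing worth ruling out is that some restarts start nsd without them. The sketch below is a hypothetical pre-flight check, not part of AOLserver or daemontools; the function name `can_bind_port` is made up for illustration.

```shell
#!/bin/sh
# can_bind_port PORT [UID]
# Returns 0 if PORT is unprivileged (>= 1024) or the given uid is root (0).
# UID defaults to the effective uid of the caller.
# This ignores Linux capabilities such as CAP_NET_BIND_SERVICE.
can_bind_port() {
    port=$1
    uid=${2:-$(id -u)}
    [ "$port" -ge 1024 ] || [ "$uid" -eq 0 ]
}
```

Logging the result of `can_bind_port 80` and `can_bind_port 443` just before nsd is started would show whether the failing restarts coincide with a non-root start.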

This might not be relevant to what you are experiencing, but after running a site for a long time I have noticed that the server will stop responding and refuse to start when the aolserver-error.log grows too big.

Jay, good point. However, more recent versions of OpenACS have built-in log rolling. This used to be a very annoying problem!

Just for the record, the limit on the nsd-error.log file size appears to be 2147483647 bytes.

In OpenACS 4.6.3 the error log is not rotated automatically, and when the nsd-error.log file reaches this size the nsd server process dies and will not restart.
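
For installations without automatic log rolling, a cron job could warn before the log reaches that size. A minimal sketch, assuming a plain size check is enough; the function name `log_too_big` and any threshold passed to it are illustrative.

```shell
#!/bin/sh
# log_too_big FILE LIMIT_BYTES
# Returns 0 if FILE exists and its size is at least LIMIT_BYTES,
# and 1 otherwise (including when FILE does not exist).
log_too_big() {
    [ -f "$1" ] || return 1
    # The arithmetic expansion strips any padding that wc may print.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -ge "$2" ]
}
```

For example, `log_too_big /home/usb/usb-site/log/error.log 1900000000 && echo "rotate the log soon"` would flag the file while it is still comfortably below 2147483647 bytes.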

Jay,
that's not just the AOLserver log file size limit; it's the 2 GB file size limit on Linux. I believe it may be possible to change this, but I have never done so.

Brian