Forum OpenACS Q&A: Daemontools problems and nsd not dying

We have a dotLRN server running on Oracle on RedHat 7.2 that
daemontools is not able to start properly. When putting the server
under supervise the nsd processes are created but no logs are ever
created and the server is not responding. When executing the run
script from the command-line the server comes up nicely. However, we
noticed that when hitting C-c to kill the server one or two nsd
processes would be left around and refuse to die. Sometimes however,
hitting C-C would successfully kill all nsd processes of the server.
Our suspicion is that it is this problem to stop AOLServer that is
confusing Daemontools but there might be other problems involved as
well. To give you some more information, here are the commands I am
using and the output that I get:

command line:

/usr/local/aolserver/bin/nsd-oracle -fzt
/web/dotlrn-demo/dotlrn-demo.tcl -u nsadmin -g web

nsd-oracle script:

#!/bin/sh

#source /etc/shell-mods.sh

export ORACLE_BASE=/ora8/m01/app/oracle
export ORACLE_HOME=$ORACLE_BASE/product/8.1.7
export PATH=$PATH:$ORACLE_HOME/bin:$ORACLE_HOME/ctx/lib:.
export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib:$ORACLE_HOME/jdbc/lib:$ORACLE_HOME/ctx/lib
export ORACLE_SID=ora8
export ORACLE_TERM=vt100
export NLS_DATE_FORMAT=YYYY-MM-DD
export ORA_NLS33=$ORACLE_HOME/ocommon/nls/admin/data
export NLS_LANG=.UTF8

exec /usr/local/aolserver/bin/nsd $*

After running the command line above and hitting C-c I see:

[22/Jul/2002:13:00:00][20483.3076][-thread3076-] Warning: nsunix:
accept(12) failed: Interrupted system call(4)
[22/Jul/2002:13:00:00][20483.3076][-thread3076-] Error: nsunix:
drvaccept: accept returned error: Interrupted system call(4)
[22/Jul/2002:13:00:00][20483.3076][-thread3076-] Error: nsunix:
DriverContext is:
location: http://www.collaboraid.biz
host: www.collaboraid.biz
    port:80
udsFilename:modules/nsunix/dotlrn-demo.nsunix
listenSocket: 12
: lock:138326456
[22/Jul/2002:13:00:00][20483.3076][-thread3076-] Error: drv: driver
'nsunix' failed: error -1
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: nsmain:
AOLserver/3.3.1+ad13 stopping
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: nssock: triggering
shutdown
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: serv: stopping
connection threads
[22/Jul/2002:13:00:00][20483.4101][-nssock-] Notice: exiting
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: serv: connection
threads stopped
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: sched: shutdown pending
[22/Jul/2002:13:00:00][20483.2051][-sched-] Notice: Running scheduled
proc search_indexer...
[22/Jul/2002:13:00:00][20483.2051][-sched-] Notice: Done running
scheduled proc search_indexer.
[22/Jul/2002:13:00:00][20483.2051][-sched-] Notice: sched: shutdown
started
[22/Jul/2002:13:00:00][20483.2051][-sched-] Notice: sched: shutdown
complete
[22/Jul/2002:13:00:00][20483.8196][-shutdown-] Notice: nslog: closing
'/web/dotlrn-demo/log/access.log'
[22/Jul/2002:13:00:00][20483.8196][-shutdown-] Notice: nssock:
shutdown complete
[22/Jul/2002:13:00:00][20483.1024][-main-] Notice: nsmain:
AOLserver/3.3.1+ad13 exiting

and there are no processes left. However, sometimes when hitting C-c I
get only:

[22/Jul/2002:13:04:24][20625.2051][-sched-] Notice: Done running
scheduled proc search_indexer.
[22/Jul/2002:13:04:35][20625.1024][-main-] Notice: nsmain:
AOLserver/3.3.1+ad13 stopping
[22/Jul/2002:13:04:35][20625.1024][-main-] Notice: nssock: triggering
shutdown

and there will be two processes left:

nsadmin  20631  0.0  2.4 30488 25444 pts/5  S    13:03  0:00
/usr/local/aolserver/bin/nsd -fzt /web/dotlrn-demo/dotlrn-demo.tcl -u
nsadmi...
nsadmin  20632  0.4  2.4 30488 25444 pts/5  S    13:03  0:00
/usr/local/aolserver/bin/nsd -fzt /web/dotlrn-demo/dotlrn-demo.tcl -u
nsadmi...

Those processes need to be killed with -9. At yet other times I will get:

[22/Jul/2002:13:07:11][20654.2051][-sched-] Notice: Done running
scheduled proc search_indexer
[22/Jul/2002:13:07:21][20654.1024][-main-] Notice: nsmain:
AOLserver/3.3.1+ad13 stopping
[22/Jul/2002:13:07:21][20654.1024][-main-] Notice: nssock: triggering
shutdown
[22/Jul/2002:13:07:21][20654.1024][-main-] Notice: serv: stopping
connection threads
[22/Jul/2002:13:07:21][20654.1024][-main-] Notice: serv: connection
threads stopped
[22/Jul/2002:13:07:21][20654.1024][-main-] Notice: sched: shutdown pending
[22/Jul/2002:13:07:21][20654.2051][-sched-] Notice: sched: shutdown
started
[22/Jul/2002:13:07:21][20654.2051][-sched-] Notice: sched: shutdown
complete

and there is one process left.

Has anybody else had similar problems. I noticed Dave Bowers and Simon
Millwards posts ( see
https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0003RT&topic_id=11&topic=OpenACS
) mentioning that they sometimes had one nsd process left that refused
to die.

Collapse
Posted by Jade Rubick on
I really have no idea, but did you check that you don't have
Aolserver starting from any other method, like /etc/inittab?
Collapse
Posted by Sanjeev Sharma on
Is it a live process or a zombie?

Specifically, does ps show <defunct>?

if it does I think you can ignore it until you start getting dozens.

Collapse
Posted by Stephen . on
I haven't had this problem (yet...), but I'm not using nsunix.  Do you need nsunix on this server or is that a cut 'n paste error in the config file?
Collapse
Posted by Arjun Sanyal on
peter: try "nsd -zit". The "-f" flag is for running aolserver the foreground.
Collapse
Posted by Lars Pind on
Peter and I were troubleshooting this together this morning, so I'll post the reply here:

1) AOLserver's not being started by any other method. We can start it on the command line with -f, and kill it with C-c, and one or two processes will usually keep running. It's like they're busy. They don't respond to a regular "kill" command, only to "kill -9" (of the ones we've tried).

2) The process looks like to "ps aux", but it doesn't respond to requests to the IP and port number, though it doesn't refuse connections, either.

3) Yes, we do need nsunix, as we're doing virtual hosting on this box. It works perfectly for all other servers on the box.

4) Daemontools requires that you run AOLserver in the foreground, as it pipes the output on stdout to the log process (multilog). If the process moves itself into the background, like it does with -i, daemontools will assume that the process quit, and start a new process, until you have hundreds of them :)

Normally when you stop AOLserver, when you hit C-c, you'll see "stopping", and then about a dozen lines, then "exiting". In this situation, sometimes there's only that one single "stopping" line, sometimes a handful more, but we never see "exiting". So appararantly some of the processes are not repsonding to the signal to kill themselves. After control has returned to the shell, you can still get output from the processes with e.g. scheduled procs being run.

Hope this clarifies things. We're really at a loss here. Btw, it's a rackspace box, and they have their own compilations of Red Hat, which *may* be causing problems. Other bit of info: It's a dual-processor box. Not sure if that could've anything to do with it.

Thanks so much for your help.

/Lars

Collapse
Posted by Gilbert Wong on

Lars,

I use the -i option and do not get multiple nsd processes and aolserver generates the normal log files. Here is the command I use in my daemontools run file (freebsd 4.x, AOLServer 3.4):

#!/bin/sh

exec /usr/local/aolserver/bin/nsd -it /usr/local/aolserver/openacs-4.tcl -u nsadmin -g nsadmin 2>&1

Then again, when I run the script in the foreground, I am also able to kill the server with control-C. When I use "svc -d openacs-4", it might take some time before the server shuts down and sometimes I will need to use "svc -t openacs-4" to kill the remaining threads.

Collapse
Posted by Jun Yamog on
Hi,

We also use deamontools.  Just like gilbert we use -i.  The only problem is with a server restart.  Most of the time deamon tools will up nsd but the db (postgres or oracle) is not yet up.  So we need to start some of those servers using svc -t.  Hamilton is going to make a init.d script to help those.  The basic idea it do have all nsd down by default then after the postgres or oracle runscript run svc -u on those servers.

Collapse
Posted by Tomasz Kosiak on
Difference between -i and -f is that -f writes server log to standard output and -i writes server log directly to file. Both don't put aolserver in the background. I use -f with daemontools in order to manage server log with multilog.

When I had used aolserver-3.2ad8 with oracle on RedHat 6.2 it restarted with no problems on  single CPU server and there where problems on dual CPU server. On dual CPU before starting new server I have to wait 4 seconds (sleep 4) for Linux to free socket of killed servers. I remember that I have to do kill -SIGTERM and than kill -SIGKILL. Those days I used init to control nsd processes and quite often I have to kill extra init processes or even to reboot the system because init stop responding to signals. I also remember that when using init you have to redirect stdout & stderr to file or you run into problems with init - it stops responding.

Now I mostly use RedHat 7.2 and aolserver-3.4 with postgres under daemontools. I noticed that somtimes I have to give svc -t twice: first time it terminates connections and the second  time it terminates the server. Anyway you can allways run svc -k.

Maybe my expirences will be somehow helpfull.

Collapse
Posted by Malte Sussdorff on
We use the same method as Gilbert and wondered why our AOLserver wont TERM (svc -t). It seems like there really are some busy processes which do not respond to the TERM signal, so using svc -k gets your server restarted very nicely (and svc -d works as well). Maybe we should stress this at the daemontools installation instructions on our site.
Collapse
Posted by Tom Jackson on

When using daemontools on a heavily loaded AOLserver, you need to use svc -k, or be willing to wait a few minutes or more for server restart if you use svc -t. I think you might also try moving the contents of your nsd-oracle to your run file that daemontools uses. Also, use exec in your call to start nsd. This ensures that supervise is actually controlling nsd and not a shell script. Here is what one of my run scripts looks like, for Linux:

#!/bin/sh
 
sleep 4

export ORACLE_HOME="/ora8/m01/app/oracle/product/8.1.6"
export ORACLE_BASE="/ora8/m01/app/oracle"
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$ORACLE_HOME/ctx/lib:/usr/lib:/lib:/usr/X11R6/lib
export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/ctx/lib:$PATH
export ORACLE_SID='ora8'
export ORACLE_TERM='vt100'
export ORAENV_ASK=NO
export NLS_DATE_FORMAT="YYYY-MM-DD"
 
exec /home/aol31/bin/nsd -ic /home/aol31/servers/choicedev/choicedev.ini -u nsadmin -g web 

I have more instructions at http://zmbh.com/daemontools-aolserver/.

Collapse
Posted by Janine Ohmer on
We use inittab rather than daemontools but we occasionally see the same thing.  I have always assumed that the threads that don't die are ones which are busy talking to the database, or think they are.  I suspect that sometimes a connection gets confused, and stays in limbo forever.
Collapse
Posted by Mat Kovach on
I run into this quite a bit.  If you just send a SIGTERM to the process(es) you will have to wait for threads to close (this happens quite often on the uptime server).  It can sometimes take a while for a scheduled proc or thread to die.

sending a SIGKILL will get rid of a "stuck" process but you should see why that process is stuck.

Collapse
Posted by Stephen . on
Does the response time change if you adjust the shutdowntimeout parameter?
ns_section "ns/parameters"
ns_param shutdowntimeout 20
Collapse
15: Has this been resolved? (response to 1)
Posted by Sanjeev Sharma on
When this happens again perhaps you could run lsof to find out if aolserver is blocking on something.
Collapse
Posted by Lars Pind on
This still happens from time to time. Here's an lsof of the hanging processes. Does this mean anything to anybody?
[root@cph01 root]# lsof |grep 21815
nsd       21815     root  cwd    DIR       3,65      4096    2654585 /usr0/web/collaboraid-dev
nsd       21815     root  rtd    DIR        3,2      4096          2 /
nsd       21815     root  txt    REG        3,2   2800256    1253646 /usr/local/aolserver-3.3.1-ad13/bin/nsd8x
nsd       21815     root  mem    REG        3,2    485171    1007908 /lib/ld-2.2.4.so
nsd       21815     root  mem    REG        3,2     65997    1007921 /lib/libdl-2.2.4.so
nsd       21815     root  mem    REG        3,2    530021      48812 /lib/i686/libpthread-0.9.so
nsd       21815     root  mem    REG        3,2    622317      48810 /lib/i686/libm-2.2.4.so
nsd       21815     root  mem    REG        3,2   5772268      48808 /lib/i686/libc-2.2.4.so
nsd       21815     root  mem    REG        3,2    261460    1007942 /lib/libnss_files-2.2.4.so
nsd       21815     root  mem    REG        3,2    355236    1007950 /lib/libnss_nisplus-2.2.4.so
nsd       21815     root  mem    REG        3,2    436784    1007926 /lib/libnsl-2.2.4.so
nsd       21815     root  mem    REG        3,2     72296    1007939 /lib/libnss_dns-2.2.4.so
nsd       21815     root  mem    REG        3,2    261196    1007955 /lib/libresolv-2.2.4.so
nsd       21815     root    0r   CHR        1,3                66235 /dev/null
nsd       21815     root    1w  FIFO        0,0              1123523 pipe
nsd       21815     root    2w  FIFO        0,0              1123523 pipe
nsd       21815     root    3u  unix 0xddee9a40              1771226 socket
nsd       21815     root    4u  unix 0xddee9040              1771227 socket
nsd       21815     root    5u  unix 0xc68ab560              1771228 socket
nsd       21815     root    6u  unix 0xd4ebb540              1771229 socket
nsd       21815     root    7r  FIFO        0,0              1771230 pipe
nsd       21815     root    8w  FIFO        0,0              1771230 pipe
nsd       21815     root    9r  FIFO        0,0              1771231 pipe
nsd       21815     root   10w  FIFO        0,0              1771231 pipe


[root@cph01 root]# lsof |grep 21820
nsd       21820     root  cwd    DIR       3,65      4096    2654585 /usr0/web/collaboraid-dev
nsd       21820     root  rtd    DIR        3,2      4096          2 /
nsd       21820     root  txt    REG        3,2   2800256    1253646 /usr/local/aolserver-3.3.1-ad13/bin/nsd8x
nsd       21820     root  mem    REG        3,2    485171    1007908 /lib/ld-2.2.4.so
nsd       21820     root  mem    REG        3,2     65997    1007921 /lib/libdl-2.2.4.so
nsd       21820     root  mem    REG        3,2    530021      48812 /lib/i686/libpthread-0.9.so
nsd       21820     root  mem    REG        3,2    622317      48810 /lib/i686/libm-2.2.4.so
nsd       21820     root  mem    REG        3,2   5772268      48808 /lib/i686/libc-2.2.4.so
nsd       21820     root  mem    REG        3,2    261460    1007942 /lib/libnss_files-2.2.4.so
nsd       21820     root  mem    REG        3,2    355236    1007950 /lib/libnss_nisplus-2.2.4.so
nsd       21820     root  mem    REG        3,2    436784    1007926 /lib/libnsl-2.2.4.so
nsd       21820     root  mem    REG        3,2     72296    1007939 /lib/libnss_dns-2.2.4.so
nsd       21820     root  mem    REG        3,2    261196    1007955 /lib/libresolv-2.2.4.so
nsd       21820     root    0r   CHR        1,3                66235 /dev/null
nsd       21820     root    1w  FIFO        0,0              1123523 pipe
nsd       21820     root    2w  FIFO        0,0              1123523 pipe
nsd       21820     root    3u  unix 0xddee9a40              1771226 socket
nsd       21820     root    4u  unix 0xddee9040              1771227 socket
nsd       21820     root    5u  unix 0xc68ab560              1771228 socket
nsd       21820     root    6u  unix 0xd4ebb540              1771229 socket
nsd       21820     root    7r  FIFO        0,0              1771230 pipe
nsd       21820     root    8w  FIFO        0,0              1771230 pipe
nsd       21820     root    9r  FIFO        0,0              1771231 pipe
nsd       21820     root   10w  FIFO        0,0              1771231 pipe



[root@cph01 root]# lsof |grep 21821
nsd       21821     root  cwd    DIR       3,65      4096    2654585 /usr0/web/collaboraid-dev
nsd       21821     root  rtd    DIR        3,2      4096          2 /
nsd       21821     root  txt    REG        3,2   2800256    1253646 /usr/local/aolserver-3.3.1-ad13/bin/nsd8x
nsd       21821     root  mem    REG        3,2    485171    1007908 /lib/ld-2.2.4.so
nsd       21821     root  mem    REG        3,2     65997    1007921 /lib/libdl-2.2.4.so
nsd       21821     root  mem    REG        3,2    530021      48812 /lib/i686/libpthread-0.9.so
nsd       21821     root  mem    REG        3,2    622317      48810 /lib/i686/libm-2.2.4.so
nsd       21821     root  mem    REG        3,2   5772268      48808 /lib/i686/libc-2.2.4.so
nsd       21821     root  mem    REG        3,2    261460    1007942 /lib/libnss_files-2.2.4.so
nsd       21821     root  mem    REG        3,2    355236    1007950 /lib/libnss_nisplus-2.2.4.so
nsd       21821     root  mem    REG        3,2    436784    1007926 /lib/libnsl-2.2.4.so
nsd       21821     root  mem    REG        3,2     72296    1007939 /lib/libnss_dns-2.2.4.so
nsd       21821     root  mem    REG        3,2    261196    1007955 /lib/libresolv-2.2.4.so
nsd       21821     root    0r   CHR        1,3                66235 /dev/null
nsd       21821     root    1w  FIFO        0,0              1123523 pipe
nsd       21821     root    2w  FIFO        0,0              1123523 pipe
nsd       21821     root    3u  unix 0xddee9a40              1771226 socket
nsd       21821     root    4u  unix 0xddee9040              1771227 socket
nsd       21821     root    5u  unix 0xc68ab560              1771228 socket
nsd       21821     root    6u  unix 0xd4ebb540              1771229 socket
nsd       21821     root    7r  FIFO        0,0              1771230 pipe
nsd       21821     root    8w  FIFO        0,0              1771230 pipe
nsd       21821     root    9r  FIFO        0,0              1771231 pipe
nsd       21821     root   10w  FIFO        0,0              1771231 pipe
/Lars