Forum OpenACS Q&A: zombie perl processes

Posted by R. Joseph Wright on
I continue to get this message emailed to me at regular intervals:
<pre>
[18/Oct/2000:23:01:13]
    Error: nsd.tcl: error reading output from command: interrupted system call
    error reading output from command: interrupted system call
        while executing
    "exec $command $options $error_log"
        (procedure "wd_errors" line 17)
        invoked from within
    "wd_errors $num_minutes"
        (procedure "wd_mail_errors" line 8)
        invoked from within
    "wd_mail_errors"
        ("eval" body line 1)
        invoked from within
    "eval [concat [list $proc] $args]"
        (procedure "ad_run_scheduled_proc" line 43)
        invoked from within
    "ad_run_scheduled_proc {f f 900 wd_mail_errors {} 971917273 0 t}"
    Notice: Running scheduled proc process_email_queue...
</pre>
Along with this I am getting large numbers of zombie perl processes at
regular intervals, and their timing coincides with some of these
messages.  The perl processes are running as the user "nsadmin", which is
the user I have set up to run aolserver.
I saw an old thread where someone had a similar problem, but no
resolution was given.
Posted by Cynthia Kiser on
The irony is that Watch Dog works some of the time - else you would not get the mails saying that it was failing occasionally. I think you need to start debugging this by checking whether you can make the perl script fail from the command line. Try logging in as your aolserver user and running /web/$server/bin/aolserver-errors.pl and see if you can get it to error - hopefully informatively this time. The "interrupted system call" could be the call to the perl script - or a call within it. I can't really tell in the absence of a system to play with myself.
Posted by David Eison on
I've only ever seen "Interrupted system call" on a misconfigured Mandrake system left up for more than a day - after about a day or so, most websites loaded would result in "Interrupted system call" the first time they were loaded.  If you figure out what causes it, I'd love to know.

Last I checked, exec did not fail gracefully - you'll see the "zombie" behavior if you try to exec a program that doesn't exist, for example. (ACS 3.3 did this a lot because the exec command was written improperly, so it would try to run a program named "aolserver-errors -15m" rather than "aolserver-errors" with a parameter of "-15m".)
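
To illustrate the exec point, here is a minimal C sketch. It assumes a POSIX system and is not the actual AOLserver or ACS code; the helper name run_and_reap is made up for this example. It shows two things: each element of the argument vector is taken literally, so a program name with the option glued onto it is looked up as one (nonexistent) program, and the parent has to wait for the child either way, or the child lingers as a zombie.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, exec argv[0] with the given argument vector,
 * then wait for the child so it never lingers as a zombie.
 * Returns the child's exit status, 127 if the exec itself failed,
 * or -1 if fork/waitpid failed or the child was killed by a signal. */
int run_and_reap(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: argv[0] is searched for on the PATH as one literal name. */
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: reap the child, retrying if a signal interrupts the wait. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With a helper like this, an argument vector of { "aolserver-errors -15m", NULL } fails in the child (assuming no program with that literal name is on the PATH), whereas { "aolserver-errors", "-15m", NULL } passes -15m as a parameter. In both cases the waitpid loop collects the child.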

Posted by R. Joseph Wright on
Cynthia, I am not getting the error message every time the script runs, but I am getting a zombie perl process every time.  Also, if you hadn't said it was the aolserver-errors.pl script, I wouldn't have known where to look, as there is no mention of it in the error message.  I've looked through it, and will continue to try and figure out what's wrong.
Posted by Matthew Braithwaite on

What's wrong is sloppy code in AOLserver. Both zombie processes, and the message `Interrupted system call', are indications of an incorrect program, whenever they occur, in whatever program. So unless you're prepared to hack on AOLserver's code there isn't much you can do about this.

I despair of ever fixing these in AOLserver, because they need to be fixed in a jillion places: rather than an API that obscures EINTR, there are a bunch of system calls that all need to be wrapped individually. Some are, some aren't. The code betrays some slight awareness of the existence of SA_RESTART, but it isn't used.
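
For readers who haven't met EINTR before, the two approaches look roughly like this. This is a sketch assuming a POSIX system, not AOLserver's code; the names read_retry and install_restarting_handler are invented for the example. The first function is the "wrap each call individually" style, the second installs a signal handler with SA_RESTART so the kernel restarts many (though not all) interrupted calls by itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper: read() that retries when a signal interrupts
 * the call before any data was transferred (errno == EINTR). */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Hypothetical helper: install a handler for signo with SA_RESTART set,
 * so that restartable system calls interrupted by this signal are resumed
 * instead of failing with EINTR. Returns 0 on success, -1 on error. */
int install_restarting_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

The wrapper has to be repeated for every call that can return EINTR (write, waitpid, accept and so on), which is the "jillion places" problem. SA_RESTART is set once per signal, but some calls are never restarted even with it, so in practice code often needs a bit of both.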

I'm going to pop over to the AOLserver mailing list and try to make some constructive suggestions about this.