Forum OpenACS Development: Possible 10g Solaris Problem

Posted by Barry Books on 05/29/04 01:43 PM

I installed OpenACS 5.1 on Solaris 9 with Oracle 10g and I'm having a strange problem that I'm trying to track down. Whenever the Tcl exec command is used, I get

error waiting for process to exit: child process lost (is SIGCHLD ignored or trapped?)

in my error log and an error on the page. This happened on installation, but I thought the problem was caused by a change to sqlplus. I built the Tcl Signal package, and if I delete the SIGCHLD handler the problem goes away. I have not done enough testing to know if this is a fix.

The reason I think it's a 10g problem is that I've run on 9i without the problem. I also found this link, which leads me to believe OCI may now send a SIGCHLD when database operations complete.

I don't fully understand the problem yet and I was curious if anyone else has seen anything like it.

2: Re: Possible 10g Solaris Problem (response to 1)

Posted by Andrew Piskorski on 05/30/04 07:38 AM

Barry, did you try the Tcl exec from your AOLserver without your Oracle driver loaded at all? If it works then, with no other changes, you're probably right that Oracle has something to do with it.

When you do load nsoracle, what happens if you try a Tcl exec after connecting successfully to Oracle but before doing any database queries at all? Does the exec work then?

Also, which version of AOLserver, Tcl, and nsoracle are you using?

The OTN link you gave above does not seem to work (due to some stupid Oracle session tracking stuff in the URL?), this one seems ok.

The guy posting the problem on OTN was awfully vague. He doesn't even say what version of Oracle he's running, never mind other important info like which of Oracle's three different connect methods he's using. Perhaps someone intimately familiar with OCI could infer what specifically he must have meant, but it all sounds sketchy to me:

I encountered the "defunct" problem when I try to connect oracle database using OCI. The connection is successful but after each connection there's a defunct process left in the system, which leads to the exhaust of system process resources. Any ideas or suggestion?
[...]
This problem has been settled. After each oci operation, the created database thread would send a signal SIGCHLD to the parent process to indicate the child process has finished doing the requested operation. In my former program I didn't do anything with this signal, which at last led to the child process becoming a defunct process, and the total number of which would increase as more and more OCI operations are called. That's why I've got so many defunct processes.
It's easy to settle this problem. You should handle the signal SIGCHLD in your process which performing the data base operations. The simplest way is adding the following lines into the program:
signal(SIGCHLD, SIG_IGN);
That's all. Hope the above information would be helpful to other people.

"created database thread", huh? Since when is there any such thing? Perhaps he means the Oracle connection process, and that (to translate into the OpenACS case), it is now sending SIGCHLD to the AOLserver process.

If so, and this is new behavior on Oracle's part, then some sort of signal handler somewhere in AOLserver probably does need to be changed to deal with this. I couldn't tell you where or how though. Also, maybe there is some way to get Oracle OCI not to generate the signal instead?

3: Re: Possible 10g Solaris Problem (response to 1)

Posted by Barry Books on 05/30/04 02:13 PM

I'm running aol 4_r3, tcl 8.4.6, and nsoracle 2.7. I do know exec works before the Oracle driver is loaded, but I haven't tracked down exactly when it quits. I don't think it's on a query because if I reset the signal everything seems to work from then on.

I'm not sure the Oracle link is relevant but it's the only thing I found. I did look up the OCI connect call and it does not say anything about signals.

4: Re: Possible 10g Solaris Problem (response to 1)

Posted by Bruno Mattarollo on 05/30/04 08:36 PM

Hello Barry,

I can't comment on the Solaris 9/Oracle 10g combination. I am now trying Oracle 10g on a redistribution of RHEL 3.0 and I am not having the issue you describe. I am using OpenACS 5.1 as well, with AOLServer 4.01 and nsoracle 2.7 (I did a checkout of nsoracle from HEAD, but I thought that might be too much on top of running 10g, so I decided to be a bit more conservative).

I am having other strange issues, for example, when I start AOLServer, sometimes, I get these errors:

[30/May/2004:19:51:15][843.3073884288][-main-] Error: nsoracle.c:3526:Ns_OracleGetRow: error in `OCIStmtFetch ()': ORA-01406: fetched column value was truncated

SQL: [nil]

I can't seem to reproduce this consistently, and when I restart, in most cases, it works fine.

Is anyone using Oracle 10g with OACS 5.1?

5: Re: Possible 10g Solaris Problem (response to 1)

Posted by Simos Gabrielidis on 07/23/05 11:53 AM

Hello,

I was getting the same error message with AOLServer 4.0.10 + nsoracle 2.7 under Mac OS X Tiger.

Since the issue appears to be the Oracle 10g OCI client trapping the SIGCHLD signal, I modified both oci_error_p and tcl_error_p in nsoracle.c (version 2.7 of the driver) to restore the signal, as follows:

static int
oci_error_p (char *file, int line, char *fn,
	     Ns_DbHandle *dbh, char *ocifn, char *query,
	     oci_status_t oci_status)
{
  /* for info we get from Oracle */
  char *msgbuf;
  /* what we will actually print out in the log */
  char *buf;
  /* exception code for Ns_DbSetException */
  char exceptbuf[EXCEPTION_CODE_SIZE + 1];
  ora_connection_t *connection = 0;
  ub2 offset = 0;
  sb4 errorcode = 0;
  
  if (dbh)
    connection = dbh->connection;
  
  /* Restore SIGCHLD since Oracle10 client has trapped it **SG** */
  signal(SIGCHLD, SIG_DFL);

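My modification is the last two lines of the excerpt above (the "Restore SIGCHLD" comment and the signal() call); the rest of the function is unchanged. I applied exactly the same one-liner to tcl_error_p as well. You may also need to include the signal header file at the top of nsoracle.c: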


/* Signal processing interface */
#include <sys/signal.h>

I hope this can be of some help to people experiencing the "error waiting for process to exit: child process lost (is SIGCHLD ignored or trapped?)" problem.