Forum OpenACS Q&A: postgresql driver taking nsd down?

Posted by Jonathan Ellis on
I've noticed that whenever lines like these appear in /var/log/messages, nsd dies immediately after the pq_recvbuf messages finish arriving:
logger: DEBUG:  pq_recvbuf: unexpected EOF on client connection
last message repeated 12 times
The number of messages varies, and it may or may not match the number of backends connected to nsd at the time.

I went through my logs after noticing this and it's happened dozens of times. I switched to nspostgres 3.5 after being on 2.01, but it is still happening. (Postgresql 7.2.3-RH and nsd 3.5.1 if it matters.)

I've run with Verbose and there seems to be no pattern as to any particular query causing problems. I ran several manually and had no problems. postmaster and backends appear to be fine.

I've seen no decrease in frequency after doubling the TCL stack size to 256k. /proc says I've never gotten close to my filehandle max (16k) so that shouldn't be a problem. No core is dumped, which is a shame.

It looks like Jon Griffin had the same problem a few months ago but nothing was really resolved.

Posted by Jon Griffin on
Posted by Jonathan Ellis on
looks like I have a different problem. :/

the pg disconnect seems to be a symptom of nsd dying, unrelated to the actual cause.  so I'm back to flailing around in the dark.

I ran under gdb for a while but of course it refused to cooperate and crash.  Any ideas how to get nsd to dump core when it goes down?  (ulimit is set to 1GB so that's not the problem.  and other programs are happy to dump core on segfault...  just not nsd.)

Posted by Andrew Piskorski on
Jonathan, to get AOLserver to dump core, on Solaris I start it as a NON-root user from /etc/inittab, like this:
/bin/su nsadmin -c "/web/aol3/bin/nsd-oracle -i -t /web/mysite-staging/nsd.tcl"

If you need to start AOLserver as root (e.g., to listen on port 80), there are other ways, but I don't remember them. It's all in the man pages somewhere though, and I think has been discussed here before too.

Posted by Jonathan Ellis on
I'm running on a linux 2.4.18 system.  Like I said, other programs will happily dump core for me, both as root and nsadmin.  So I'm not sure why nsd doesn't...
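
One possible explanation, offered as an assumption rather than a diagnosis: on Linux the kernel clears a process's "dumpable" flag when it changes user ID, so a daemon started as root that then switches to an unprivileged user may be denied a core file even with a generous ulimit. Whether that applies to nsd here is not confirmed anywhere in this thread. The sketch below shows how a process could check and re-enable core dumps using the standard getrlimit/setrlimit and prctl calls. The function names core_dump_possible and enable_core_dumps are hypothetical and are not part of AOLserver.

```c
#include <sys/prctl.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if this process currently has a non-zero
 * core size limit and the kernel's "dumpable" flag is set, otherwise 0. */
int core_dump_possible(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return 0;
    if (rl.rlim_cur == 0)
        return 0;               /* soft limit of 0 means no core file */
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 1;
}

/* Hypothetical helper: raise the soft core limit to the hard limit and set
 * the dumpable flag again. Meant to be called after any setuid()/setgid()
 * step, because that is the point where the kernel may clear the flag.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may go up to the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    return prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);
}
```

If the hard limit itself is 0, enable_core_dumps still succeeds but core_dump_possible keeps returning 0, so the limit has to be raised by whatever starts the daemon.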
Posted by Jonathan Ellis on
looks like I may be getting bit by the (in)famous gcc 2.96 included with mandrake 8.0. :/

when I ran nsd under gdb, it would consistently crash in TclParseBackslash which is a long-ish function to scan for potential errors when you have very little idea how it's supposed to work. :0

So I recompiled tcl 8.4.1 with -g instead of -O2, and now it says it's crashing on a completely innocuous line of ParseComment (also in tclParse.c).  Specifically,

scanned = TclParseWhiteSpace(p, numBytes, parsePtr, &type);

The only conclusion I come to is that gcc and/or gdb is smoking something.  I'm not sure yet how to go about upgrading gcc on this system but I suspect Dependency Hell will be far too mild a description. :/