Forum OpenACS Development: AolServer 3.4 dies frequently

Posted by Jon Griffin on
I updated a system about 3 weeks ago and am running Gentoo, with everything compiled from scratch. I doubt the system itself is the problem; I suspect the PG driver or the PG/OACS interaction. Any help would be appreciated.

System is AolServer 3.4
OpenACS 4.5 and some using dev branch
PG 7.2.1
Apache mod_proxy for the proxy front end

Everything loaded and started OK, but as I hit pages, AOLserver restarts frequently. I don't see any pattern in the nsd log, even with debugging turned all the way on. In the PG log I get:


 pq_recvbuf: unexpected EOF on client connection
@400000003d5490ea100cb9fc DEBUG:  pq_recvbuf: unexpected EOF on client connection
@400000003d5490ea100cc5b4 DEBUG:  pq_recvbuf: unexpected EOF on client connection
@400000003d5490ea100cd16c DEBUG:  pq_recvbuf: unexpected EOF on client connection
@400000003d5490ea100cd93c DEBUG:  pq_recvbuf: unexpected EOF on client connection
@400000003d5490ea100ce4f4 DEBUG:  pq_recvbuf: unexpected EOF on client connection
At first I thought it was the Apache proxy, but it really is AOLserver restarting. I am beginning to think it is another PG or resource issue, although if resources were low I would expect other things to misbehave too.

Also, there appears to be an error in search_observer_dequeue that I haven't had time to look at.

ERROR:  parse error at or near ""
@400000003d54904406096324 NOTICE:  plpgsql: ERROR during compile of __exec_20506_search_observer_de near line 2
Posted by carl garland on
An interesting aside: on my alpha production server I have had PG 7.1x running for over 7 months, while my AOLserver 3.3.1 has probably been restarted about 30 times (on purpose; it has never crashed). The VSZ of the postmaster process has grown to about 275 MB, with RSS under 10 MB, for each open postgres process (no timeout on connection threads in my setup).
I often see the "unexpected EOF" message in the log but have never had a crash or restart of the nsd process. That said, I am running 7.6, so the instability could be an 8.x problem.
Posted by Patrick Giagnocavo on
Jon,

There are a bunch of things to try; I am sure you have already tried some of them.

The pq_recvbuf message is a generic one.

First, run ulimit -a (if running ksh) or the related limit command for your shell.  Try increasing the max datasize, number of file handles, etc.
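
A minimal sketch of the limit check suggested above, assuming a bash-like shell from which AOLserver is later started (child processes inherit these limits). Raising the soft open-files limit to the hard limit is only an illustrative choice; nothing in this thread establishes which limit, if any, is the one that matters.

```shell
#!/bin/bash
# Show every resource limit in effect for this shell and its children.
ulimit -a

# Look at the limits named in the post individually.
echo "data seg size (soft): $(ulimit -S -d)"
echo "stack size    (soft): $(ulimit -S -s)"
echo "open files    (soft): $(ulimit -S -n)"
echo "open files    (hard): $(ulimit -H -n)"

# Raise the soft open-files limit as far as the hard limit allows.
# Going beyond the hard limit requires root.
ulimit -S -n "$(ulimit -H -n)"
echo "open files (soft, after): $(ulimit -S -n)"
```

The new limits apply only to this shell and to processes started from it, so nsd would have to be launched from the same shell (or from a startup script that sets them) to pick them up.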

Increase the stack size in the nsd.tcl config file.  Be sure the MaxOpen and MaxIdle settings for the database connection are set.

It could be an issue with the version of glibc.

My question would be whether increasing the number of threads and/or db connections (maxthreads) would change the behavior; perhaps a db connection or a thread is getting stuck with each search_observer_dequeue problem.

Posted by Lamar Owen on

Having hacked on the PG driver in the past, I would wonder about it. The current means of statically linking libpq.a may be the culprit here, as libpq.a isn't built with -fPIC, as a library statically linked into a dynamically loaded shared object should be. The driver should be dynamically linked to libpq -- and the make structure should be changed to make this the default.
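
One rough way to check how an installed driver was linked: `ldd` lists the shared libraries a binary depends on, so a driver dynamically linked to libpq should show a `libpq.so` entry, while a statically linked one should not. This is only a sketch; the helper name and the driver path in the usage comment are hypothetical.

```shell
#!/bin/bash
# links_libpq_dynamically FILE
# Succeeds (exit 0) if ldd reports a libpq shared library dependency for FILE.
# Hypothetical helper name; not part of AOLserver or OpenACS.
links_libpq_dynamically() {
    ldd "$1" 2>/dev/null | grep -q 'libpq\.so'
}

# Usage (the path is an assumption; substitute your own driver location):
#   if links_libpq_dynamically /path/to/aolserver/bin/postgres.so; then
#       echo "driver is dynamically linked to libpq"
#   else
#       echo "no libpq.so dependency found (static link, or file not readable by ldd)"
#   fi
```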

It could be this issue -- it could also be postmaster dropping out from under the driver, which the driver has never really handled very well. Hack the driver and see if you can find a leak.

It may even be libpq not really being threadsafe -- can we say 'proxy daemon' -- eeewwwww, that was ugly sounding. Incidentally, libpq is not advertised as being threadsafe -- going the proxy daemon route may be the safest way of doing it. Don? Dan?

Was there an nsdb API change between 3.3.x and 3.4.x?

Scott Goodwin is hacking on the AOLserver fork of the OpenACS PostgreSQL driver, and has substantially reorganized the code. Is it worth looking at a resync here? What's the CVS module name for the current OpenACS pgdriver, so I can check out a HEAD of it?

Again, I say these things having more than just a passing familiarity with that driver. Don made good progress on it -- but maybe it's time to revisit how the driver is handling things. It certainly is the most taxed of all the pieces of the OpenACS puzzle.

Posted by Don Baccus on
Well, the driver's extremely stable with 3.3 ... if PIC were a problem it wouldn't run at all, so I can't see that being the problem.

As far as thread safety goes, I did check with the PG crew, and the basic operations were thought to be threadsafe; at least that was the theory a couple of years ago.  There are a couple of places where it's known not to be, but they're not things that the driver uses.

So I think we're OK in this regard, too.

When the Postmaster dies out from under it (not a backend, the driver spawns a new connection when a backend crashes) all bets are off, Lamar's absolutely right there.  Ironically I just had this happen on my personal birdnotes.net server two or three days ago.  The first time ever after three years.  The Postmaster was there but not responding, very strange.  You do have to restart Postmaster and restart AOLserver.

The same is true of the Oracle driver, as I learned when working on the Greenpeace project.  We learned how to reliably crash an Oracle server running on a Solaris box from a client using the OCI on a Linux box (OpenACS in this case).  Just set a blob to "empty_blob" and Oracle drops like it's been shot in the head.  Literally.  On the server.  Any user can kill it if they have the right to update a table with a blob in it.

Cool.

I think Jon's tracked his situation down to problems doing old-style index.vuh internal redirects even though he's not using performance mode.

Posted by Lamar Owen on
Don:

I remember the query about libpq's threadsafeness quite well; but, as you said, that's been two or three years ago, and I guess it's possible it could have broken in some way.

However, if Jon thinks he has the problem tracked down a little, that's good.
Posted by Jon Griffin on
I guess I forgot to update this thread. The problem was the lack of /global/error pages.

I have had no problems since I added these. I also believe they were added to either the head or 4.6 release.

Posted by Jonathan Ellis on
You mean the ServerInternalErrorResponse, etc., parameters?
Posted by Jonathan Ellis on
...because I set those up last night and it's still crashing.