Forum OpenACS Development: Disconnected PG clients hose the server

We have a number of cases where an ADP page calls
a long-running SQL query and the user either hits stop
or actually closes the browser.

We see that the Postgres process that does the SELECT
continues to churn large percentages of CPU time even
though the http request doesnt exists anymore.

What is the correct remedy to prevent zombie backend postmaster processes from hosing the CPU forever?

This seems like a pretty trivial thing to ask from the Web application server - tell the backend DB processor to backout of the query since the http thread went away.

Shouldnt there be some parameter that could be set in AOLserver
?
thanks
Danny

Collapse
Posted by Lamar Owen on
You say that you have an adp calling an sql select that runs for a significant length of time.  Are you communicating to the user that the query may take awhile, and to not hit stop?

Why is the select taking so long?

I think you might want to look at why you have this problem; while I understand what you're talking about, and even see a certain validity in having the functionality, I wonder if that is the 'correct' solution to your problem.

It isn't correct to call the postgres backend processes held open in this way a 'zombie' -- a zombie will never go away and cannot be killed, whereas these postgres processes will indeed finish eventually.

Now, AOLserver uses pooled database connections, which by design don't go away at all.  It is desireable for the postgres backends to stay up forever -- so really what we're after is a way of aborting a currently executing query, but keeping the connection open (taking the connection down and bringing it back up imposes some performance penalties, but the driver knows how to recover -- it just doesn't know how to drop the connection).

Furthermore, the driver is blocked during the query execution, so it can't do anything until the libpq query exec call returns.  Using the asynchronous interface to libpq might work, if it were threadsafe, though.  The synchronous call with the driver blocked prevents what you want from being implemented using the current database driver API, AFAICT.  Even if the async libpq interface can be used, AFAICS AOLserver doesn't have a mechanism for signalling the driver that the client has backed out.

Collapse
Posted by Dan Lieberman on
Lamar,
As a rule an SQL query on an ADP page should deliver the goods very quickly. We have a browser-based report writer application that incorporates a wizard that generates an SQL query - so there ARE cases where a user may wait for a minute or so to get the results when the query is very complicated and there is a fair volume of data to write to the browser. Unfortunately - you can tell a user not to hit stop and they will get pissed off and still hit stop and sometimes even hit stop and refresh several times!

I am tempted to redesign the report writer to run in a ns_schedule process and return the results in HTML by email.  This would probably make the problem go away.

Having said that - there is a curious problem with PG. When the user disconnects by STOP or closing the browser - the PG process running the SELECT NEVER finishes.
Thats why i called it a zombie (even tho' thats an incorrect definion by unix terms) - I would have thought that the process WOULD complete the SELECT and drop its results into the bit bucket.

This is whats really troubling me. It smells like a bug in the AOLServer PG driver
Danny

Collapse
Posted by Jonathan Ellis on
In this situation I've added a per-user non-blocking mutex  so stupider than average users can't have more than one expensive query at once...
Collapse
Posted by Lamar Owen on
Dan, I'll see if I can't trace through and see if there's something afoul in the handling of this situation in the nspostgres driver.  It may take awhile to thoroughly debug, however, if it is indeed a bug in the driver.
Collapse
Posted by Lamar Owen on
Ok, there's a thread going on right now on the pgsql-hackers list about this very issue.  It seems this may not be just an AOLserver thing, but is endemic to the whole PostgreSQL backend.  I'll report back when I know more.
Collapse
Posted by Michael Bluett on
Would the recent fix in PostgreSQL 7.3.2 ("Fix possible memory leak and core dump during disconnect in libpgtcl") have any bearing on this?