Forum OpenACS Development: "interesting" db pool behavior

Posted by Janine Ohmer on 01/26/06 06:58 AM

I have a client site which uses Oracle on a separate server across a not-very-reliable network. The site periodically goes into a state where when you try to access the site you get the bootstrap error page, complaining about being unable to allocate a handle from "pool2" (this site happens to have pools named pool1, pool2 and pool3). Meanwhile, keepalive is busily testing the database and finding no problems at all, because the other pools are working.

I found this behavior bizarre, but I wasn't going to spend a lot of time debugging something on a network I have no control over. So my bright idea was to modify the dbtest.tcl script to access all three pools, figuring I'd catch any strange errors that way.

What I found is that on every site I have tried, both Oracle and postgres, two of the pools work fine and the third one throws an error, all the time as far as I can tell.

Some of my sites have pool1/pool2/pool3, defined in that order with pool1 as the default, and on those sites it is always pool2 that fails. Other sites have main/log/subquery, with main as default, and on those it is the subquery pool that fails. So it is not as simple as "it's always the second pool". Also I have tried changing the default pool from main to subquery but it is still only the subquery pool that fails.

The exact error is "could not allocate 1 handle from pool "subquery"". As I recall from the ACS 3.x days this usually happens when the current script already has a handle open from that pool, but that is not the case here. The script is located in www/SYSTEM, which (I think) means it is bypassing templating completely.

At this point I've sort of lost interest in my original problem and am fascinated by this unexpected behavior. Why is this happening? What does it mean? Does it indicate some kind of problem? Enquiring minds want to know these things! :)

Any ideas of what might be going on here or how to debug it further are welcome.

2: Re: "interesting" db pool behavior (response to 1)

Posted by Andrew Piskorski on 01/26/06 04:23 PM

The OpenACS request processor normally allocates at least one db handle in order to do authentication checks and the like, no? So my guess is that it is still holding that handle at the time you try to allocate another from the same pool, which of course breaks, as AOLserver does not allow that.

Note there is no particular reason to do that sort of check in a conn thread, did you try it from the scheduler thread instead? You can run that internally on the AOLserver, there is no need to have a separate keepalive server hitting a Tcl page to do the same thing.

Even better might be to also have AOLserver directly monitor an open TCP/IP connection to the remote Oracle server, and somehow get immediate notification when the network flakes out, although I don't know if that's feasible.

Clearly what you really want for this client is to fix AOLserver so that it can re-connect to Oracle without restarting AOLserver. Many people have wanted that over the years, but in normal use you almost never really need it so AFAIK no one actually implemented the feature.

(I won't ask why this client insists on keeping their main OpenACS database instance on the other side of a flaky network... which seems insane.)

3: Re: "interesting" db pool behavior (response to 1)

Posted by Janine Ohmer on 01/29/06 10:43 PM

Andrew, you are correct - if I run my script from a scheduled proc, all three pools work.

I've been looking more at the situation I'm trying to solve. Some of this is just thinking out loud, but someone else might encounter this problem one day so I thought it would be good to document it here.

The first step down the road to trouble is that keepalive's request of dbtest.tcl fails with a "no connection" error. Presumably the connection to the Oracle installation on the remote server has been lost. I suspect they shut down Oracle periodically for cold backups, though I've never been able to confirm that.

Keepalive restarts the site. The site comes up and spits out this error:

[29/Jan/2006:01:14:06][12638.16384][-main-] Notice: dbdrv: opening database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: nsoracle.c:1085:ora_open_db: error in `OCISessionBegin ()': ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory

SQL: [nil]
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: dbdrv: failed to open database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Warning: Database API: couldn't allocate a handle from database pool "pool2".
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: Database API: RDBMS type could not be determined for any pool.

So, presumably Oracle is still not up on the remote server. The site is now displaying the Installation Error page where the bootstrap code reports the pool2 error to the user.

So far, all is behaving as expected. But now we come to the real problem. The keepalive script fails a couple more times, and then succeeds. Because it's set up to allow a few more failures before it restarts, a second restart is never triggered and the site is left in this broken state.

I find it rather puzzling that the db connection comes back to life at all. As Andrew pointed out, AOLserver doesn't usually do that. All I can guess is that I was able to get a valid connection to the local listener, which then sprang to life when the remote database started responding.

So, what to do? I could change keepalive's settings to restart every time it fails, but that would not necessarily solve the problem. It would still be possible for Oracle to come back to life after the db check failed but before the next time dbtest.tcl was requested, and we'd be in the same situation. In order to make the window small enough for this to be unlikely to happen I'd have to test very frequently, once a minute or less.

What I really need to do is to detect that the site is in this state and force a restart as soon as the database starts working again. Something like grabbing the home page and looking for the words "OpenACS could not allocate a handle" or something like that. Cludgy, but I'm not sure I have an alternative.

Suggestions?

6: Re: Re: "interesting" db pool behavior (response to 3)

Posted by Andrew Piskorski on 01/30/06 10:42 AM

Janine, since apparently you can't talk to the people maintaining Oracle on the other end, and are stuck doing things solely on your end, the main problem seems to be, "Once Oracle goes down, how do I know when it's back up, reliably and ASAP?"

I would write a special purpose script to do exactly that. I don't know whether you could get AOLserver to work for answering that question, but writing a script using Oratcl or even Sqlplus to repeatedly try to connect to Oracle is definitely feasible.

I'd change your OpenACS code so that anytime the Oracle connection breaks, you serve some sort of, "Sorry, our Oracle database is down for maintenance." message, then immediately fire off the to script to poll Oracle until it's back up. Once the script says Oracle is back up, call ns_shutdown to restart. (Of course, if due to some bug or other problem the script always says, "Yes, Oracle is up.", but AOLserver's db connection is still broken, you'd go into an infinite loop of AOLserver restarts - ugh - so you need to handle those sorts of cases.)

I would change your external Keepalive to check only a *.tcl page which does not need any database connectivity at all. (Or just turn Keepalive off entirely, you probably don't really need it.)

4: Re: "interesting" db pool behavior (response to 1)

Posted by Torben Brosten on 01/29/06 11:12 PM

Someone recently posted a shell script that verifies a local pg's postmaster is running before starting aolserver.

Maybe there is a way to modify Keepalive to only restart aolserver after it verifies a connection to the db can be made?

5: Re: "interesting" db pool behavior (response to 1)

Posted by Janine Ohmer on 01/29/06 11:26 PM

Hmm, that's a good idea. Sort of attacking the problem before it happens. I'll look into that, thanks.