Forum OpenACS Development: Re: "interesting" db pool behavior
I've been looking more at the situation I'm trying to solve. Some of this is just thinking out loud, but someone else might encounter this problem one day so I thought it would be good to document it here.
The first step down the road to trouble is that keepalive's request of dbtest.tcl fails with a "no connection" error. Presumably the connection to the Oracle installation on the remote server has been lost. I suspect they shut down Oracle periodically for cold backups, though I've never been able to confirm that.
Keepalive restarts the site. The site comes up and spits out this error:
[29/Jan/2006:01:14:06][12638.16384][-main-] Notice: dbdrv: opening database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: nsoracle.c:1085:ora_open_db: error in `OCISessionBegin ()': ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: dbdrv: failed to open database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Warning: Database API: couldn't allocate a handle from database pool "pool2".
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: Database API: RDBMS type could not be determined for any pool.
So, presumably Oracle is still not up on the remote server. The site is now displaying the Installation Error page where the bootstrap code reports the pool2 error to the user.
So far, all is behaving as expected. But now we come to the real problem. The keepalive script fails a couple more times, and then succeeds. Because it's set up to allow a few more failures before it restarts, a second restart is never triggered and the site is left in this broken state.
I find it rather puzzling that the db connection comes back to life at all. As Andrew pointed out, AOLserver doesn't usually do that. All I can guess is that I was able to get a valid connection to the local listener, which then sprang to life when the remote database started responding.
So, what to do? I could change keepalive's settings to restart every time it fails, but that would not necessarily solve the problem. It would still be possible for Oracle to come back to life after the db check failed but before the next time dbtest.tcl was requested, and we'd be in the same situation. In order to make the window small enough for this to be unlikely to happen I'd have to test very frequently, once a minute or less.
What I really need to do is to detect that the site is in this state and force a restart as soon as the database starts working again. Something like grabbing the home page and looking for the words "OpenACS could not allocate a handle" or something like that. Cludgy, but I'm not sure I have an alternative.
I would write a special purpose script to do exactly that. I don't know whether you could get AOLserver to work for answering that question, but writing a script using Oratcl or even Sqlplus to repeatedly try to connect to Oracle is definitely feasible.
I'd change your OpenACS code so that anytime the Oracle connection breaks, you serve some sort of, "Sorry, our Oracle database is down for maintenance." message, then immediately fire off the to script to poll Oracle until it's back up. Once the script says Oracle is back up, call ns_shutdown to restart. (Of course, if due to some bug or other problem the script always says, "Yes, Oracle is up.", but AOLserver's db connection is still broken, you'd go into an infinite loop of AOLserver restarts - ugh - so you need to handle those sorts of cases.)
I would change your external Keepalive to check only a *.tcl page which does not need any database connectivity at all. (Or just turn Keepalive off entirely, you probably don't really need it.)