Andrew, you are correct - if I run my script from a scheduled proc, all three pools work.
I've been looking more at the situation I'm trying to solve. Some of this is just thinking out loud, but someone else might encounter this problem one day so I thought it would be good to document it here.
The first step down the road to trouble is that keepalive's request of dbtest.tcl fails with a "no connection" error. Presumably the connection to the Oracle installation on the remote server has been lost. I suspect they shut down Oracle periodically for cold backups, though I've never been able to confirm that.
Keepalive restarts the site. The site comes up and spits out this error:
[29/Jan/2006:01:14:06][12638.16384][-main-] Notice: dbdrv: opening database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: nsoracle.c:1085:ora_open_db: error in `OCISessionBegin ()': ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
SVR4 Error: 2: No such file or directory
SQL: [nil]
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: dbdrv: failed to open database 'oracle:'
[29/Jan/2006:01:14:06][12638.16384][-main-] Warning: Database API: couldn't allocate a handle from database pool "pool2".
[29/Jan/2006:01:14:06][12638.16384][-main-] Error: Database API: RDBMS type could not be determined for any pool.
So, presumably Oracle is still not up on the remote server. The site is now displaying the Installation Error page where the bootstrap code reports the pool2 error to the user.
So far, all is behaving as expected. But now we come to the real problem. The keepalive script fails a couple more times, and then succeeds. Because it's set up to allow a few more failures before it restarts, a second restart is never triggered and the site is left in this broken state.
I find it rather puzzling that the db connection comes back to life at all. As Andrew pointed out, AOLserver doesn't usually do that. All I can guess is that I was able to get a valid connection to the local listener, which then sprang to life when the remote database started responding.
So, what to do? I could change keepalive's settings to restart every time it fails, but that would not necessarily solve the problem. It would still be possible for Oracle to come back to life after the db check failed but before the next time dbtest.tcl was requested, and we'd be in the same situation. In order to make the window small enough for this to be unlikely to happen I'd have to test very frequently, once a minute or less.
What I really need to do is to detect that the site is in this state and force a restart as soon as the database starts working again. Something like grabbing the home page and looking for the words "OpenACS could not allocate a handle" or something like that. Cludgy, but I'm not sure I have an alternative.
Suggestions?