Forum OpenACS Development: Watchdog fails to run if database is broken

I recently encountered an interesting failure mode in Watchdog. This was in our lightly modified OpenACS 5.2.3 intranet, but in CVS the OpenACS Monitoring package hasn't been changed since well before then, so this should still apply in OpenACS 5.4.3 and later.

We schedule Watchdog to run every 5 minutes. However, last Saturday, it didn't run at all during this 3-hour period:

[15/Nov/2008:13:35:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:13:40:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...

[15/Nov/2008:14:22:07][14058.1094724224][-conn:outpost-prod::23] Error: nsoracle.c:2994:Ns_OracleOpenDb: error in `OCISessionBegin ()': ORA-00257: archiver error. Connect internal only, until freed.

[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Errors found.
[15/Nov/2008:16:48:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...

The reason is the Oracle error in the middle there. Our database broke, and was returning errors to all queries. We fixed that, and then Watchdog quickly emailed us about all the errors in our AOLserver log, too late to do any good.

So, I think Watchdog currently depends on the database being available, and silently fails if it's not.

The little wd_email_frequency helper proc (which I may have written) looks suspicious, as its call to ad_parameter can implicitly do a database query. I bet there are other places in Watchdog where a broken database will stop all error reporting.
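For illustration, the suspect pattern looks roughly like this. This is a hypothetical sketch, not the actual monitoring package source; the parameter and package names are my guesses:

proc wd_email_frequency {} {
    # Hypothetical sketch. If the value isn't already cached, ad_parameter
    # may go to the database, and if the database is down this call never
    # returns, so wd_mail_errors never even logs its first message.
    return [ad_parameter WatchDogFrequency monitoring 15]
}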

What do you think is the best approach to fixing this?

Calls like ad_parameter make it harder for me to reason about possible database dependencies. ad_parameter and friends (parameter::get, ad_parameter_cache, etc.) clearly support caching of fetched values in some fashion, but I don't know if or when it is ever safe to conclude that the cache will definitely be populated, and thus that it's safe to use these sorts of calls from Watchdog.

My instinct is to eliminate all database use from Watchdog entirely, and fetch any necessary settings (like WatchDogFrequency) solely from the AOLserver config file via ns_config.
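Something like this, perhaps. A minimal sketch, where the config section name and the default value are my own inventions:

# Assumes a config-file section along these lines:
#   ns_section "ns/server/${server}/watchdog"
#     ns_param WatchDogFrequency 15
proc wd_email_frequency {} {
    # Reads straight from the AOLserver config file; no database involved.
    return [ns_config -int "ns/server/[ns_info server]/watchdog" \
                WatchDogFrequency 15]
}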

To test any of that, I'd want to selectively break database access in the Watchdog thread, probably by redefining some key DB API call to fail (or at least log warnings). That would be useful both for tracking down database dependencies in the first place, and for eventually verifying that Watchdog works even when the database is broken. What would be the best place to add that instrumentation: db_exec, or ns_db itself?
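For example, something along these lines might do. This is only a sketch, wrapping the Tcl-level db_exec with the standard rename trick; the argument forwarding would need checking against the real proc:

rename db_exec db_exec_orig
proc db_exec {args} {
    # Hypothetical instrumentation: log every database call, and optionally
    # make it fail to simulate a broken database.
    ns_log Warning "db_exec called: $args"
    # Uncomment to simulate a broken database:
    # error "simulated database failure"
    return [uplevel 1 [linsert $args 0 db_exec_orig]]
}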

Posted by Steve Manning on
Andrew

It also looks, from the [-sched-] in your logs, as if the watchdog runs in the main scheduler thread rather than its own thread. Doesn't that mean it could potentially be blocked by another long-running process in the main scheduler?

Might be better to move it to its own schedule thread.

- Steve

Posted by Ryan Gallimore on
Hi Andrew,

Along with Watchdog, I usually run uptime.openacs.org on all my servers to ensure the instance and db are up and running.

Just point uptime at /SYSTEM/dbtest.tcl

Together, they cover most errors.

Ryan

Posted by Andrew Piskorski on
Ryan, yes, Uptime is a very useful service, and will catch cases like this one where all database access broke site-wide.

However, I can imagine a scenario where only some database access was broken, such that dbtest.tcl continued to work but Watchdog silently failed to run. That would be a Bad Thing.

Posted by Andrew Piskorski on
Steve, you raise a good point: Since Watchdog was running in the main scheduler thread, the failure could probably also have been caused by another scheduled proc going into an infinite loop or hanging indefinitely.

Here's the same section of my server log, but also showing the most recent occurrences of "[-sched-]" immediately before and after the gap:

[15/Nov/2008:13:35:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:13:40:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...

[15/Nov/2008:13:45:02][14058.1082632832][-sched-] Notice: SQL():
        delete from sec_session_properties
        where last_hit < :expires
[15/Nov/2008:13:45:02][14058.1082632832][-sched-] Notice: bind variable 'expires' = '1226702702'

[15/Nov/2008:14:22:07][14058.1094724224][-conn:outpost-prod::23] Error: nsoracle.c:2994:Ns_OracleOpenDb: error in `OCISessionBegin ()': ORA-00257: archiver error. Connect internal only, until freed.

[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Warning: db_exec: longdb 10679637469 nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_sweep_sessions.sessions_sweep
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Warning: sched: excessive time taken by proc 7 (10679 seconds)
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: SQL():
        select acs_sc_contract.get_id('RssGenerationSubscriber') from dual
[15/Nov/2008:16:43:01][14058.1096739456][-sched:21-] Notice: SQL():
        begin Ctx_Ddl.Optimize_Index ('swi_index','FAST'); end;

[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Errors found.

Code can run in the AOLserver scheduler without logging anything to the error log, so the above log snippets are not necessarily the whole picture. But they are informative, at least.

Note that the wd_mail_errors proc that actually gets scheduled is very simple; it starts out:

ad_proc wd_mail_errors {} "" {
    set proc_name {wd_mail_errors}
    set num_minutes [wd_email_frequency]
    ns_log Notice "Watchdog($proc_name): Looking for errors..."
    set errors [wd_errors -num_minutes $num_minutes]

For those 3 hours, there are no "Looking for errors..." messages in the log at all. Therefore, we know the blockage happened before that ns_log got called. This means that the blockage must have happened either in wd_email_frequency, or earlier, in something that prevented the scheduled wd_mail_errors proc from ever even starting to run.

I just noticed that "excessive time taken by proc 7 (10679 seconds)" in my log above. 10679 seconds is about 3 hours. So that was the actual problem: due to the database access failure, this little bit of code apparently hung for the whole 3 hours, completely tying up the AOLserver scheduler thread:
ad_proc -private sec_sweep_sessions {} { 
    set expires [expr {[ns_time] - [sec_session_lifetime]}] 
    db_dml sessions_sweep {}  
    db_release_unused_handles 
} 
And the SQL that proc ran is also very simple, just:
delete from sec_session_properties 
where last_hit < :expires 

So at least on this Oracle 10g (10.2.0.2.0) server, under certain Oracle-wide error conditions a simple delete statement can hang forever, and will never time out. That makes some degree of sense. The delete statement needs to generate rollback, and the failure which froze up Oracle for those 3 hours was running out of either online or archived redo log space, which is intimately related to rollback.

Somewhat oddly, the delete statement above, which normally takes less than 1 second, started at 13:45, and ran for 37 minutes before another thread triggered the very first Oracle error in the AOLserver log. There was also a lot of other activity in the log during those 37 minutes, all of which worked fine without errors. Yet the delete was hanging all that time, which presumably means that Oracle was starting to have trouble well before it finally triggered client errors.

Posted by Steve Manning on

Yes, I was about to point out that "excessive time" entry, as it's exactly what happened to us with two scheduled events sharing the main thread.

It's actually easy to set up a scheduled proc in its own thread; you just use the -thread switch, thus:

ad_schedule_proc -thread t -debug t 600 backoffice::export::send_promos
or
ad_schedule_proc -thread t -debug t -schedule_proc ns_schedule_daily {04 30} backoffice::export::send_stock_status

The downside, I believe, is that it sets up and tears down its own interp, but that's not usually too much of a problem.

- Steve

Posted by Andrew Piskorski on

Steve, yes, the -thread option to ns_schedule_proc (which ad_schedule_proc just passes along) is what I referred to above. If I remember correctly, that starts up a new thread for every individual run of the job, and then kills it. If the main scheduler thread is locked up, will the thread for the new job ever get spawned? I'd need to test it, but I suspect the answer is no, it will not.
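Testing that should be easy enough. A hypothetical sketch, with made-up intervals and log messages:

# Deliberately tie up the main scheduler thread...
ns_schedule_proc -once 10 {
    ns_log Notice "blocker: sleeping in the scheduler thread"
    ns_sleep 600
}
# ...then watch whether this -thread job still fires while it is blocked.
ns_schedule_proc -thread 60 {
    ns_log Notice "threaded job: still running"
}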
Posted by Andrew Piskorski on
Ideally, Watchdog should probably have its own single long-lived thread, which it and it alone uses. I don't think ns_schedule_* supports that directly, but it's probably not that hard to do. Starting up a new fresh thread for every Watchdog run, which ns_schedule_* does readily support, doesn't seem ideal but might work ok in practice.
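A minimal sketch of that idea, assuming wd_mail_errors is safe to call from a fresh thread and hard-coding a 5 minute period:

ns_thread begindetached {
    # One long-lived Watchdog thread, independent of the AOLserver
    # scheduler. A hung database call here can only ever block Watchdog
    # itself, not any other scheduled jobs.
    while {1} {
        if {[catch {wd_mail_errors} errmsg]} {
            ns_log Error "Watchdog thread: $errmsg"
        }
        ns_sleep 300
    }
}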
One of these days, it might be worth using the newer threadpools stuff (e.g., tpool, ns_job, or conceivably even Mayoff's ancient dqd_threadpool) and/or other features to have AOLserver never run any jobs directly in the scheduler thread, and yet still achieve the same best-case efficiency and speed as the current simpler scheduler. A real scheduler should never allow client code to interfere with the scheduler's own housekeeping.

That would also change some of the ns_schedule edge-case semantics, as independent jobs would no longer be implicitly serialized by running in the single scheduler thread. But in the unlikely event that the implicit serialization was actually desired, it should be easy to shove a bunch of legacy scheduled jobs into a single thread.
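For instance, one way to keep the scheduler thread purely administrative might look like this. A sketch only, based on my reading of the ns_job API, so double-check it against your AOLserver version:

# The scheduler thread only queues work; an ns_job thread pool actually
# runs it, so a hung job ties up a pool thread, never the scheduler.
ns_job create watchdog_pool 1
ns_schedule_proc 300 {
    ns_job queue -detached watchdog_pool wd_mail_errors
}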