Forum OpenACS Development: Re: Watchdog fails to run if database is broken

Collapse
Posted by Steve Manning on
Andrew

It also looks, from the [-sched-] in your logs, as if the watchdog also runs in the main schedule rather than its own thread. Doesn't that mean it could potentially be blocked by another long running process in the main schedule?

Might be better to move it to its own schedule thread.

- Steve

Collapse
Posted by Ryan Gallimore on
Hi Andrew,

I usually run uptime.openacs.org on all my servers to ensure the instance and db are up and running, along with watchdog.

Just point uptime at /SYSTEM/dbtest.tcl

Together, they cover most errors.

Ryan

Collapse
Posted by Andrew Piskorski on
Ryan, yes, Uptime is a very useful service, and will catch cases like this one where all database access broke site-wide.

However, I can imagine a scenario where only some database access was broken, such that dbtest.tcl continued to work but Watchdog silently failed to run. That would be a Bad Thing.

Collapse
Posted by Andrew Piskorski on
Steve, you raise a good point: Since Watchdog was running in the main scheduler thread, the failure could probably also have been caused by another scheduled proc going into an infinite loop or hanging indefinitely.

Here's the same section of my server log, but also showing the very most recent before and after occurrence of "[-sched-]":

[15/Nov/2008:13:35:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:13:40:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...

[15/Nov/2008:13:45:02][14058.1082632832][-sched-] Notice: SQL():
        delete from sec_session_properties
        where last_hit < :expires
[15/Nov/2008:13:45:02][14058.1082632832][-sched-] Notice: bind variable 'expires' = '1226702702'

[15/Nov/2008:14:22:07][14058.1094724224][-conn:outpost-prod::23] Error: nsoracle.c:2994:Ns_OracleOpenDb: error in `OCISessionBegin ()': ORA-00257: archiver error. Connect internal only, until freed.

[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Warning: db_exec: longdb 10679637469 nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_sweep_sessions.sessions_sweep
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Warning: sched: excessive time taken by proc 7 (10679 seconds)
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: SQL():
        select acs_sc_contract.get_id('RssGenerationSubscriber') from dual
[15/Nov/2008:16:43:01][14058.1096739456][-sched:21-] Notice: SQL():
        begin Ctx_Ddl.Optimize_Index ('swi_index','FAST'); end;

[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Errors found.

Code can run in the AOLserver scheduler without logging anything to the error log, so the above log snippets are not necessarily the whole picture. But they are informative, at least.

Note that the wd_mail_errors that actually gets schedule is very simple, it starts out:

ad_proc wd_mail_errors {} "" {
    set proc_name {wd_mail_errors}
    set num_minutes [wd_email_frequency]
    ns_log Notice "Watchdog($proc_name): Looking for errors..."
    set errors [wd_errors -num_minutes $num_minutes]

For those 3 hours, there are no "Looking for errors..." messages in the log at all. Therefore, we know the blockage happened before that ns_log got called. This means that the blockage must have happened either in wd_email_frequency, or earlier, in something that prevented the scheduled wd_mail_errors proc from ever even starting to run.

I just noticed that "excessive time taken by proc 7 (10679 seconds)" in my log above. 10679 seconds is about 3 hours. So that was the actual problem - due to the database access failure, this little bit of code apparently hung for the whole 3 hours, completely tieing up the AOLserver scheduler thread:
ad_proc -private sec_sweep_sessions {} { 
    set expires [expr {[ns_time] - [sec_session_lifetime]}] 
    db_dml sessions_sweep {}  
    db_release_unused_handles 
} 
And the SQL that proc ran is also very simple, just:
delete from sec_session_properties 
where last_hit < :expires 

So at least on this Oracle 10g (10.2.0.2.0) server, under certain Oracle-wide error conditions a simple delete statement can hang forever, and will never time out. That makes some degree of sense. The delete statement needs to generate rollback, and the failure which froze up Oracle for those 3 hours was running out of either online or archived redo logs space, which is intimately related to rollback.

Somewhat oddly, the delete statement above, which normally takes less than 1 second, started at 13:45, and ran for 37 minutes before another thread triggered the very first Oracle error in the AOLserver log. There was also a lot of other activity in the log during those 37 minutes, all which worked fine without errors. Yet the delete was hanging all that time, which presumably means that Oracle was starting to have trouble well before it finally triggered client errors.

Yes I was about to point that out that "excessive time" entry as its exactly what happened to us with two scheduled events sharing the main thread.

Its actually easy to setup a schedule in its own thread you just use the -thread switch thus:

ad_schedule_proc -thread t -debug t 600 backoffice::export::send_promos
or
ad_schedule_proc -thread t -debug t -schedule_proc ns_schedule_daily {04 30} backoffice::export::send_stock_status

The downside I believe is that it sets up and tears down its own interp - but thats not usually too much of a problem.

- Steve

Steve, yes, the -thread option to ns_schedule_proc (which ad_schedule_proc just passes along) is what I referred to above. If I remember correctly, that starts up a new thread for every individual run of the job, and then kills it. If the main scheduler thread is locked up, will the thread for the new job ever get spawned? I'd need to test it, but I suspect the answer is no, it will not.
Collapse
Posted by Andrew Piskorski on
Ideally, Watchdog should probably have its own single long-lived thread, which it and it alone uses. I don't think ns_schedule_* supports that directly, but it's probably not that hard to do. Starting up a new fresh thread for every Watchdog run, which ns_schedule_* does readily support, doesn't seem ideal but might work ok in practice.