I recently encountered an interesting failure mode in Watchdog. This
was in our lightly modified OpenACS 5.2.3 intranet, but in CVS the
OpenACS Monitoring package hasn't been changed since well before then,
so this should still apply in OpenACS 5.4.3 and later.
We schedule Watchdog to run every 5 minutes. However, last Saturday,
it didn't run at all for this roughly 3-hour period:
[15/Nov/2008:13:35:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:13:40:14][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:14:22:07][14058.1094724224][-conn:outpost-prod::23] Error: nsoracle.c:2994:Ns_OracleOpenDb: error in `OCISessionBegin ()': ORA-00257: archiver error. Connect internal only, until freed.
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
[15/Nov/2008:16:43:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Errors found.
[15/Nov/2008:16:48:01][14058.1082632832][-sched-] Notice: Watchdog(wd_mail_errors): Looking for errors...
The reason is that Oracle error in the middle there. Our database
broke, and was returning errors to all queries. We fixed
that, and then Watchdog quickly emailed us about all the
errors in our AOLserver log, too late to do any good.
So, I think Watchdog currently depends on the database being
available, and silently fails if it's not.
The little wd_email_frequency helper proc (which I may have written)
looks suspicious, as its call to ad_parameter can implicitly do a
database query. I bet there are other places in Watchdog where a broken
database will stop all error reporting.
What do you think is the best approach to fixing this?
Calls like ad_parameter make possible database dependencies harder for
me to understand. ad_parameter and friends (parameter::get,
ad_parameter_cache, etc.) clearly support caching of fetched values in
some fashion, but I don't know if or when it is ever safe to conclude
that the cache will definitely be populated, and that it's thus safe to
use these sorts of calls from Watchdog.
My instinct is to eliminate all database use from Watchdog entirely,
and fetch any necessary settings (like WatchDogFrequency) solely from
the AOLserver config file via ns_config.
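For concreteness, here is a rough sketch of what a config-file-only version of the frequency lookup might look like. The proc name wd_email_frequency_from_config, the config section path, and the fallback of 15 minutes are all assumptions of mine for illustration, not what the Monitoring package actually uses.

```tcl
# Sketch only: the proc name, the section path, and the 15 minute
# fallback are hypothetical. Adjust them to match your own config file.
proc wd_email_frequency_from_config {} {
    set section "ns/server/[ns_info server]/acs/monitoring"
    set minutes [ns_config $section WatchDogFrequency]
    # ns_config returns an empty string when the key is missing, so
    # fall back to a hard-coded value instead of asking the database.
    if {![string is integer -strict $minutes] || $minutes <= 0} {
        set minutes 15
    }
    return $minutes
}
```

The point of the fallback is that nothing in this path can touch the database, so a missing or malformed setting degrades to a fixed interval rather than a silent failure.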
To test any of that, I'd want to selectively break database access in
the Watchdog thread, probably by redefining some key DB API call to
fail (or log warnings). That would be useful for both tracking down
database dependencies in the first place, and eventually verifying
that Watchdog works even if the database is broken. What would be the
best place to add that instrumentation, perhaps either in db_exec or in
ns_db itself?
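To make the question more concrete, this is the kind of wrapper I have in mind if the instrumentation went at the ns_db level. It is only a sketch: wd_with_db_trap and wd_real_ns_db are names I made up, and I'm assuming that rename affects only the current interpreter, so the trap stays confined to the thread that installs it. That assumption should be checked, since the -sched- thread is shared with other scheduled procs.

```tcl
# Sketch only: wd_with_db_trap and wd_real_ns_db are hypothetical names.
# mode "fail" makes every ns_db call raise an error;
# any other mode logs a warning and passes the call through.
proc wd_with_db_trap {mode script} {
    set fail [expr {$mode eq "fail"}]
    rename ::ns_db ::wd_real_ns_db
    proc ::ns_db {args} [string map [list @FAIL@ $fail] {
        # Record which proc invoked ns_db, if any.
        set where "top level"
        if {[info level] > 1} { set where [info level -1] }
        set msg "wd_db_trap: ns_db $args (called from: $where)"
        if {@FAIL@} { error $msg }
        ns_log Warning $msg
        return [uplevel 1 [linsert $args 0 ::wd_real_ns_db]]
    }]
    # Run the script in the caller's scope, then always restore the
    # real ns_db, even if the script raised an error.
    set code [catch {uplevel 1 $script} result]
    rename ::ns_db {}
    rename ::wd_real_ns_db ::ns_db
    return -code $code $result
}
```

Usage would be something like `wd_with_db_trap warn { wd_mail_errors }` to list the database calls Watchdog makes, and later `wd_with_db_trap fail { wd_mail_errors }` to check that it still reports errors with the database "broken". Trapping at ns_db rather than db_exec should also catch code that bypasses the DB API, though I have not verified whether every Oracle call path goes through ns_db.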