I just noticed the message "excessive time taken by proc 7 (10679
seconds)" in my log above. 10679 seconds is about 3 hours. So that was
the actual problem - due to the database access failure, this little
bit of code apparently hung for the whole 3 hours, completely tying up
the AOLserver scheduler thread:
    ad_proc -private sec_sweep_sessions {} {
        set expires [expr {[ns_time] - [sec_session_lifetime]}]
        db_dml sessions_sweep {}
        db_release_unused_handles
    }
And the SQL that proc ran is also very simple, just:
    delete from sec_session_properties
    where last_hit < :expires
So at least on this Oracle 10g (10.2.0.2.0) server, under certain
Oracle-wide error conditions a simple delete statement can hang
forever, and will never time out. That makes some degree of sense.
The delete statement needs to generate rollback, and the failure which
froze up Oracle for those 3 hours was running out of either online or
archived redo log space, which is intimately related to rollback.
Somewhat oddly, the delete statement above, which normally takes less
than 1 second, started at 13:45, and ran for 37 minutes before another
thread triggered the very first Oracle error in the AOLserver log.
There was also a lot of other activity in the log during those 37
minutes, all of which worked fine without errors. Yet the delete was
hanging all that time, which presumably means that Oracle was starting
to have trouble well before it finally triggered client errors.
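
For anyone checking the numbers, the arithmetic can be reproduced with
a few lines of plain Tcl. This is only a sketch: the helper procs hms
and add_minutes are hypothetical names made up for illustration, they
are not part of OpenACS or AOLserver, and the inputs are just the
figures quoted above.

```tcl
# Hypothetical helpers for illustration; not part of OpenACS or AOLserver.

# Break a duration in seconds into hours, minutes and seconds.
proc hms {secs} {
    set h [expr {$secs / 3600}]
    set m [expr {($secs % 3600) / 60}]
    set s [expr {$secs % 60}]
    return [format "%dh %dm %ds" $h $m $s]
}

# Add a number of minutes to an HH:MM wall-clock time (same day assumed).
proc add_minutes {hhmm minutes} {
    scan $hhmm "%d:%d" h m
    set total [expr {($h * 60 + $m + $minutes) % 1440}]
    return [format "%02d:%02d" [expr {$total / 60}] [expr {$total % 60}]]
}

puts [hms 10679]             ;# duration the scheduler reported for proc 7
puts [add_minutes 13:45 37]  ;# delete start time plus the 37 minute gap
```

Run under tclsh, this prints 2h 57m 59s and 14:22, that is, just under
3 hours for the scheduled proc, and 14:22 as the time the first Oracle
error showed up in the AOLserver log.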