Forum OpenACS Q&A: Re: Re: Re: ad_schedule_proc seems to be failing

Posted by Stan Kaufman on
You should post the "what's different between AOLserver 3.3+ad13 and 4.0.10 re this problem" question over at the AOLserver list. I posted such a question over the weekend, but no one has commented or explained so far. Maybe now that there are reports of problems with 4.x, someone will have a look.

Have you tried setting both MaxOpen and MaxIdle to 0? Or simply undefining them?

Posted by Brian Fenton on
Hi Stan,

Yes, I've set MaxOpen and MaxIdle to 0 on all my systems. I was just trying to understand the problem more clearly - I haven't yet seen an explanation I can understand. And I'm not even sure whether the problems my client has been seeing are related to this year 2038 issue.

Unfortunately, I've been unable to mail the AOLserver list due to a reverse DNS issue we're having here, but I have my mail ready to go once the DNS is sorted.

thanks
Brian

Let me try and explain:

During AOLserver startup, if MaxOpen or MaxIdle is a positive number (not zero, which means "forever"), AOLserver schedules a job to check and reset the database connections at [expr [clock seconds] + $MaxOpen] (as it were).

After May 12, 2006, the current time in seconds since the beginning of the epoch, plus a MaxIdle/MaxOpen setting of 1 billion seconds, gives a scheduled time that overflows a 32-bit signed integer. (It wraps around and becomes a negative value.)
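For anyone who wants to check the arithmetic, here is a rough Python sketch of the wrap-around. It is only a model of the addition described above, not AOLserver's actual code; the names wrap_int32 and scheduled_time are made up for illustration, and the two restart dates are just examples on either side of the cutoff.

```python
from datetime import datetime, timezone

ONE_BILLION = 1_000_000_000  # the MaxIdle/MaxOpen value discussed in this thread


def wrap_int32(value: int) -> int:
    """Reduce an integer to what a 32-bit signed (two's complement) field would hold."""
    return (value + 2**31) % 2**32 - 2**31


def scheduled_time(start: datetime, setting_seconds: int) -> int:
    """Hypothetical model: start time plus the setting, stored in 32 signed bits."""
    return wrap_int32(int(start.timestamp()) + setting_seconds)


# Example restart times (assumed for illustration), one before and one after the cutoff.
before = datetime(2006, 5, 11, 12, 0, tzinfo=timezone.utc)
after = datetime(2006, 5, 14, 12, 0, tzinfo=timezone.utc)

for start in (before, after):
    when = scheduled_time(start, ONE_BILLION)
    if when < 0:
        print(f"restart {start:%Y-%m-%d}: scheduled time wrapped to {when}")
    else:
        fires = datetime.fromtimestamp(when, tz=timezone.utc)
        print(f"restart {start:%Y-%m-%d}: reset scheduled for {fires:%Y-%m-%d}")
```

The restart before the cutoff gets a scheduled time in January 2038, while the one after it gets a negative number.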

From what I gather from the AOLserver list, on Solaris this leads to a hard crash in some pthread function call. On Linux it just seems to hang the processing of scheduled events forever (because it can't cope with a negative time and every negative number is less than any positive number).

On Linux, people who don't have MaxIdle or MaxOpen set at 1000000000, or who haven't restarted AOLserver since May 12th, won't have experienced the problem. (For someone with a 1 billion setting who last restarted on May 11th, AOLserver is right now scheduled to reset the database connections in mid-January 2038...)

A setting of 100 million, instead of 1 billion, wouldn't have exposed this condition on AOLserver 3.x for another twenty-eight years or so. Zero is the right value to use now. (Apparently 1 billion was chosen, instead of zero, due to some bug in the Oracle driver or the Oracle client libraries... 1 billion being "effectively" forever... until this month!)
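To put rough dates on the 1 billion versus 100 million comparison, here is a small Python sketch that works out the earliest start time at which "now + setting" no longer fits in a 32-bit signed integer. The function name first_overflowing_start is hypothetical, and the dates are UTC, so local dates can be a day earlier.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed value


def first_overflowing_start(setting_seconds: int) -> datetime:
    """Earliest UTC start time for which start + setting exceeds INT32_MAX."""
    return datetime.fromtimestamp(INT32_MAX - setting_seconds + 1, tz=timezone.utc)


for setting in (1_000_000_000, 100_000_000):
    cutoff = first_overflowing_start(setting)
    print(f"setting {setting}: overflows for starts on or after {cutoff:%Y-%m-%d %H:%M:%S} UTC")
```

With 1 billion the cutoff lands on 2006-05-13 UTC (the evening of May 12 in US time zones), and with 100 million it lands in November 2034, which is where the "twenty-eight years or so" comes from.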

Thanks Michael! Very clear explanation.

Brian

Yay for this thread and OpenACS. My AOLserver restarted today for the first time since May 12th, 2006, and I saw the error. I found this thread and fixed it quickly.