Forum OpenACS Q&A: Re: ad_schedule_proc seems to be failing

Posted by Zachary Shaw on
Since no one else has posted it here yet: 'Jesus' Jeff Rogers has solved the problem and posted a solution on the AOLserver mailing list.

Amusingly enough, it's a Y2038 problem.

http://news.gmane.org/gmane.comp.web.aolserver/

"I fixed it by simply changing my MaxOpen/MaxIdle settings to "0" which is interpreted as "forever" which is probably what the original "one BILLION seconds" was undoubtedly intended to be."

Posted by Stan Kaufman on
Interestingly, anyone using the stock OpenACS 3.2.5 config file avoided this problem, as MaxOpen and MaxIdle are not defined as 1B; they're in fact not defined at all -- those lines were commented out. Anyone know why that was?

In any case, that presumably is why those of us running OpenACS 3.2.5/AOLServer 3.3.1+ad13 systems have been OK, while those with problems were running ACS-derived systems prior to 3.2.5 -- in which MaxOpen and MaxIdle were defined as 1B.

Posted by Brian Fenton on
I don't fully understand the effects of this problem. If it's an Oracle driver problem, shouldn't it affect all AOLserver versions, not just 3.3+ad13? Does it only affect 32-bit operating systems? One particular client of mine has recently been having a lot of problems with scheduled procs not always running, and I wonder whether this issue is a factor. They're seeing the problem on both AOLserver 3.3+ad13 and 4.0.10 (64 bit).

thanks
Brian

Posted by Stan Kaufman on
You should post the "what's different between AOLserver 3.3+ad13 and 4.0.10 re this problem" question over at the AOLserver list. I posted such a question over the weekend, but no one has commented or explained so far. Maybe now that there are reports of problems with 4.x, someone will have a look.

Have you tried setting both MaxOpen and MaxIdle to 0? Or simply undefining them?

Posted by Brian Fenton on
Hi Stan,

Yes, I've set MaxOpen and MaxIdle to 0 on all my systems. I was just trying to understand the problem more clearly - I haven't really yet seen an explanation I can understand. And I'm not even sure if the problems my client has been seeing are related to this year 2038 issue.

Unfortunately, I've been unable to mail the AOLserver list due to a reverse DNS issue we're having here, but I have my mail ready to go once the DNS is sorted.

thanks
Brian

Let me try and explain:

During AOLserver startup, if MaxOpen or MaxIdle is a positive number (not zero, which means "forever"), then AOLserver schedules a job that checks and resets the database connections at [expr [clock seconds] + $MaxOpen] (as it were).

After May 12, 2006, the current time in seconds since the beginning of the epoch, plus a MaxIdle/MaxOpen setting of 1 billion seconds, produced a scheduled time that overflowed a 32-bit signed integer. (It wrapped around and became a negative value.)
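To make the arithmetic concrete, here is a rough Python sketch of the overflow. It is not AOLserver code: `wrap_int32` and `scheduled_time` are hypothetical helpers, and the sketch assumes the scheduled time lives in a signed 32-bit integer that wraps around in two's-complement fashion.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold
ONE_BILLION = 1_000_000_000


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range (two's-complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


def scheduled_time(now: int, max_open: int) -> int:
    """Hypothetical stand-in for the 'now + MaxOpen' computation described above."""
    return wrap_int32(now + max_open)


# Last epoch second at which now + 1 billion still fits in 32 bits.
last_safe = INT32_MAX - ONE_BILLION

print(datetime.fromtimestamp(last_safe, tz=timezone.utc))  # 2006-05-13 01:27:27+00:00
print(scheduled_time(last_safe, ONE_BILLION))              # 2147483647
print(scheduled_time(last_safe + 1, ONE_BILLION))          # -2147483648
```

The cutoff falls at 01:27:27 UTC on May 13, 2006, which was still the evening of May 12 in US time zones.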

From what I gather from the AOLserver list, on Solaris this leads to a hard crash in some pthread function call. On Linux it just seems to forever hang up processing of scheduled events (because it can't cope with a negative time and every negative number is less than any positive number).

On Linux, people who don't have MaxIdle or MaxOpen set to 1000000000, or who haven't restarted AOLserver since May 12th, won't have experienced the problem. (For someone with a 1 billion setting who last restarted on May 11th, AOLserver is right now scheduled to reset the database connections in mid-January 2038...)
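The mid-January 2038 figure can be checked with a few lines of Python. The restart time of noon UTC is an assumption made purely for illustration, since only the date is given above.

```python
from datetime import datetime, timedelta, timezone

# Assumed restart time: noon UTC on May 11, 2006 (only the date is known).
restart = datetime(2006, 5, 11, 12, 0, 0, tzinfo=timezone.utc)

# When a job scheduled 1 billion seconds after that restart would fire.
reset_at = restart + timedelta(seconds=1_000_000_000)

# The last instant representable in a signed 32-bit seconds counter.
overflow_at = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)

print(reset_at)                # 2038-01-17 13:46:40+00:00
print(overflow_at)             # 2038-01-19 03:14:07+00:00
print(reset_at < overflow_at)  # True
```

Because the scheduled time still lands just before the 32-bit limit, a server restarted on May 11th never computed an overflowing value.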

A setting of 100 million, instead of 1 billion, wouldn't have exposed this condition on AOLserver 3.x for another twenty-eight years or so. Zero is the right value to use now. (Apparently 1 billion was chosen, instead of zero, due to some bug in the Oracle driver or the Oracle client libraries... 1 billion being "effectively" forever... until this month!)
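As a rough check on the "twenty-eight years" figure, the sketch below computes the first moment at which each setting would overflow. `first_overflow` is a hypothetical helper, and the same assumption applies as before: the scheduled time is a signed 32-bit count of seconds.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def first_overflow(setting_seconds: int) -> datetime:
    """First UTC instant at which now + setting exceeds the signed 32-bit range."""
    return datetime.fromtimestamp(INT32_MAX - setting_seconds + 1, tz=timezone.utc)


billion = first_overflow(1_000_000_000)
hundred_million = first_overflow(100_000_000)

print("1 billion:  ", billion)
print("100 million:", hundred_million)

# Gap between the two cutoffs, in years (using 365.25-day years).
gap_years = (hundred_million - billion).total_seconds() / (365.25 * 86400)
print(round(gap_years, 1))
```

The gap between the two cutoffs is exactly 900 million seconds, which puts the 100 million cutoff in late 2034.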

Thanks Michael! Very clear explanation.

Brian

Yay for this thread and OpenACS. My AOLServer restarted for the first time today since May 12th 2006 and I saw the error. I found this thread and fixed it quickly.