Forum OpenACS Q&A: ad_schedule_proc seems to be failing

Posted by Paul Gilbert on
Hi all, I seem to have had a serious problem since restarting a server I work with over the weekend (13/5/06) - during startup, ad_schedule_proc is called to queue up several repeating scheduled procedures, but they're no longer being called.

The system is running Linux 2.4.2-2, with ACS AOLserver/3.2+ad12 (I know it's not OpenACS, but I'm hoping I can nevertheless get some help).

I've traced through the ad_schedule_proc code and couldn't see any problems; it passes the request on to ns_schedule_proc as expected. The strange thing is that when I set the system date back several months, the server restarted fine again and ran the scheduled procs smoothly.

I'm kind of worried that I may have fallen victim to some kind of integer overflow problem with the representation of date/times as integers (that's the only thing I can think of), and I'm pretty much in over my head. I'm hoping someone can offer some suggestions.

Thanks.

Posted by Gustaf Neumann on
Check out the message of Guan Yang in

http://news.gmane.org/gmane.comp.web.aolserver/

There seems to be something weird going on with AOLserver 3.* installations. I would try to upgrade to AOLserver 4.0.10 and a recent tcl version (>8.4.11).

Posted by Janine Ohmer on
I seem to be seeing this as well, on our last remaining 3.3+ad13 installation. Scheduled procs aren't firing anymore.

Unfortunately upgrading is a last, last resort (client's wishes) so I need to try to figure this out I guess. Very, very strange.

I'll go post to the AOLserver list also.

Posted by Paul Gilbert on
It seems it must have been a known problem - an associate who'd been experimenting with a prototype of our software using a newer version of AOLServer reported that everything still worked fine.

In any case, he was able to provide a temporary fix for the problem by setting up a crontab entry to run a script file that contains a wget to get a page from our website, and we put some code in that page to call all the required scheduled methods. It's a bit roundabout, but at least it should save us from having to frantically update a production website in the short term.
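
For anyone wanting to try the same workaround, here is a minimal sketch of such a script in Python rather than wget. The URL, file name and schedule are placeholders (the thread does not give the real ones), and the page it fetches is assumed to call the scheduled procedures itself.

```python
#!/usr/bin/env python3
"""Fetch a page that runs the scheduled procedures; meant to be run from cron."""
import sys
import urllib.request


def trigger(url, opener=urllib.request.urlopen, timeout=60):
    """Request the page and return the HTTP status code."""
    with opener(url, timeout=timeout) as response:
        return response.status


# The URL is passed on the command line; nothing runs without one.
if __name__ == "__main__" and len(sys.argv) > 1:
    try:
        status = trigger(sys.argv[1])
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        print(f"trigger failed: {exc}", file=sys.stderr)
        sys.exit(1)
    sys.exit(0 if status == 200 else 1)
```

A crontab line along the lines of `*/5 * * * * /usr/bin/python3 /path/to/trigger_procs.py http://www.example.com/run-scheduled-procs` would then call it every five minutes; the path, URL and interval here are made up for illustration.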

Posted by Stan Kaufman on
If this is a "known" problem, what's known about it? Something in AOLServer? Some OS problem? Some weird date-related problem? Something entirely unrelated to OpenACS? Where has this been discussed before? The discussion over on the AOLServer list by people who know vastly more than I doesn't allude to any prior knowledge of this (http://news.gmane.org/gmane.comp.web.aolserver)

Seems pretty strange for production sites to suddenly develop problems after running fine for years -- and for other sites apparently not to manifest problems. I've restarted clones of my OpenACS 3.2.5/AOLserver/3.3.1+ad13 sites on a dev box and found no problems with VM or with scheduled procs not firing. I'm not restarting the production sites to see what happens until I learn more about what might be going on.

This is way further into the plumbing than I enjoy venturing, so I certainly hope those who can peer deeply into the stack will let us know what is afoot.

Posted by Zachary Shaw on
For what it's worth we're having the same problem at Brandeis University and when we change our system date AOLServer works fine.

I even brought up a vanilla AOLserver 3.3.1+ad13 with no ACS. All I'm doing is loading a page that schedules a number of procedures (ns_log calls), and depending on the system date they stop firing.

If anyone has any insight, it would be greatly appreciated.

Posted by Stan Kaufman on
Other than the VM and scheduled proc problems, has anyone had any other difficulties with AOLserver/3.3.1+ad13? Specifically, has anyone seen problems with Jerry Asher's nsvhr/nsunix virtual hosting method? Is anyone else even still using Jerry Asher's nsvhr/nsunix virtual hosting method?
Posted by Zachary Shaw on
We saw some weird log behavior.

It seems our production logs didn't roll from May 13 to May 16. Not sure if this is another piece of the puzzle or not.

Posted by Geert De Witte on
I am also using AOLserver/3.3.1+ad13 with Jerry Asher's nsvhr and nsunix.

All seemed to be working well, until yesterday. After restarting a server instance today (18 May 2006) the nsd server starts listening on the appropriate socket but then shuts down with:

nsthread(19614) error: pthread_create failed in NsThreadCreate: Cannot allocate memory

I have seen the same error on more than one system. No changes have been made at the OS or database levels prior to these errors cropping up.

Does anybody have an idea how to fix or work around this error? Does it mean upgrading to AOLServer 4? I am experiencing these errors on production sites, so would like to get them back up and running ASAP. I'm not having any problems with AOLServer 4.0.10.

Last few lines of error log:
-----------------------------
[18/May/2006:12:19:56][19614.1076616112][-sched-] Notice: sched: starting
[18/May/2006:12:19:57][19614.1074158336][-main-] Notice: serv: warmed up
[18/May/2006:12:19:57][19614.1074158336][-main-] Notice: socks: idle
[18/May/2006:12:19:57][19614.1074158336][-main-] Notice: sched: idle
[18/May/2006:12:19:57][19614.1074158336][-main-] Notice: nsunix: DrvStart starting: listenSocket = 9, modules/nsunix/jdio.nsunix
[18/May/2006:12:19:57][19614.1074158336][-main-] Notice: nssock: listening on 0.0.0.0:8010
[18/May/2006:12:19:57][19614.1138777008][-nssock-] Notice: nssock: starting
[18/May/2006:12:19:57][19614.1136671664][-thread1136671664-] Notice: nsunix: accepting
[18/May/2006:12:19:57][19614.1138777008][-nssock-] Notice: nssock: accepting connections
nsthread(19614) error: pthread_create failed in NsThreadCreate: Cannot allocate memory

Posted by Stan Kaufman on
Geert, what version of tcl are you using on the affected boxes? Discussion on the AOLServer list (http://news.gmane.org/gmane.comp.web.aolserver) suggests that this may be a problem with tcl 8.3.x. Or maybe with glibc. Dossy wants people to post this info about their systems:

1) ns_info version
2) uname -a
3) glibc version
4) info patchlevel

If you're going to move your sites to AOLServer 4, what are you going to do to replace Jerry's nsvhr/nsunix? Use pound? The default config file in OpenACS 5.2 appears to work fine for 3.2.5 sites with only minor modifications (including sourcing your specific instance's config file), so for a single site there is little reason not to move directly to AOLServer 4. But what is the best way to handle virtual hosting for multiple sites on the same server box?

11: Virtual Hosting (response to 1)
Posted by Vinod Kurup on
But what is the best way to handle virtual hosting for multiple sites on the same server box?

Hey Stan,

AOLserver 4 does have virtual hosting built-in but pound is more flexible. You can stop/start individual servers and it gives you the option of running non-AOLserver web servers if you want 😊

Built-in AOLserver virtual hosting instructions are here: http://panoptic.com/wiki/aolserver/Virtual_Hosting

If you want to use pound, it's pretty simple to set up. My /etc/pound/pound.cfg file looks something like this:

...

ListenHTTP
  Address     66.98.222.124
  Port        80
  xHTTP       0
  WebDAV      0
  Change30x   1
End

Service
  HeadRequire "Host:.*kurup.org.*"
  # make sure not to match other servers
  HeadDeny    "Host:.*dev.kurup.org.*"
  BackEnd
    Address   127.0.0.1
    Port      8001
  End
  Session
    Type      COOKIE
    ID        "ad_session_id"
    TTL       1800
  End
End
...
12: Re: Virtual Hosting (response to 11)
Posted by Stan Kaufman on
Many thanks for the pointers, Vinod! I had planned to emerge scratching and yawning from the Dark Ages of 3.2.5/3.3.1+ad13/nsunix into the Glorious Dawn of 5.x/4.0.10/pound, but this situation (even though I'm not certain I'm afflicted) has put a pointed boot to my backside to hasten me along.
13: Re: Virtual Hosting (response to 11)
Posted by Tracy Adams on
Also, if you want to have different domains and have ssl on each one, AOLServer with virtual hosting will not handle it.

As I understand it, the domain name in the http request will be encrypted, so AOLServer will not know which ssl certificate to present.

Pound can help you do this as well (I haven't tried it, but so I've read).

Posted by Geert De Witte on
Apologies for the delay - just been busy upgrading to AOLServer 4.

I have been "lucky", in that my front end server has not had the same problem as I mentioned previously (pthread_create failed), even though it is running with the aolserver-3.3.1+ad13 version.

I also have a static website (does not use OpenACS software and does not connect to database) which runs with aolserver-3.3.1+ad13 and which was also giving me the pthread_create error. After commenting out the Database pool sections in the config.tcl file, I could start the static website without any errors.

I will probably use pound to replace the nsvhr/nsunix functionality. Judging from the other comments in this thread, this seems to be the best approach for http as well as https connections.

Posted by Zachary Shaw on
Since no one else has posted it here yet.

'Jesus' Jeff Rogers has solved the problem and posted a solution on the AOLServer mailing list.

It's amusingly enough a y2038 problem.

http://news.gmane.org/gmane.comp.web.aolserver/

"I fixed it by simply changing my MaxOpen/MaxIdle settings to "0" which is interpreted as "forever" which is probably what the original "one BILLION seconds" was undoubtedly intended to be."

Posted by Stan Kaufman on
Interestingly, anyone using the stock OpenACS 3.2.5 config file avoided this problem, as MaxOpen and MaxIdle are not defined as 1B; they're in fact not defined at all -- those lines were commented out. Anyone know why that was?

In any case, that presumably is why those of us running OpenACS 3.2.5/AOLServer 3.3.1+ad13 systems have been OK, while those with problems were running ACS-derived systems prior to 3.2.5 -- in which MaxOpen and MaxIdle were defined as 1B.

Posted by Brian Fenton on
I don't fully understand the effects of this problem. If it's an Oracle driver problem, shouldn't it affect all AOLserver versions, not just 3.3+ad13? Does it only affect 32-bit operating systems? One particular client of mine has recently been having a lot of problems with scheduled procs not always running, and I wonder whether this issue is a factor. They're seeing the problem on both AOLserver 3.3+ad13 and 4.0.10 (64 bit).

thanks
Brian

Posted by Stan Kaufman on
You should post the "what's different between AOLserver 3.3+ad13 and 4.0.10 re this problem" over at the AOLServer list. I posted such a question over the weekend, but no one has commented/explained so far. Maybe now that there are reports of problems with 4.x, someone will have a look.

Have you tried setting both MaxOpen and MaxIdle to 0? Or simply undefining them?

Posted by Brian Fenton on
Hi Stan,

Yes, I've set MaxOpen and MaxIdle to 0 on all my systems. I was just trying to understand the problem more clearly - I haven't really yet seen an explanation I can understand. And I'm not even sure if the problems my client has been seeing are related to this year 2038 issue.

Unfortunately, I've been unable to mail the AOLserver list due to a reverse DNS issue we're having here, but I have my mail ready to go once the DNS is sorted.

thanks
Brian

Let me try and explain:

During AOLserver startup, if MaxOpen or MaxIdle is a positive number (not zero, which means "forever"), then AOLserver schedules a job to reset the database connections at [expr [clock seconds] + $MaxOpen] (as it were).

After May 12, 2006, the current time since the beginning of the epoch, plus a MaxIdle/MaxOpen setting of 1 billion seconds resulted in a scheduled event that overflowed a 32-bit signed integer. (It wrapped around and became a negative value.)
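
The arithmetic can be checked with a short Python sketch. The `wrap_int32` helper is hypothetical and only simulates how a 32-bit signed integer would store the sum; it is not AOLserver code.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1       # largest value a 32-bit signed integer can hold
MAX_OPEN = 1_000_000_000    # the "one billion seconds" setting


def wrap_int32(value):
    """Simulate storing an integer in a 32-bit signed variable (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


# Last startup time (seconds since the epoch) for which now + MAX_OPEN still fits.
last_safe_start = INT32_MAX - MAX_OPEN
print(datetime.fromtimestamp(last_safe_start, tz=timezone.utc))
# prints 2006-05-13 01:27:27+00:00

# One second later the sum no longer fits and wraps to a negative number.
print(wrap_int32(last_safe_start + 1 + MAX_OPEN))
# prints -2147483648
```

The cutoff is early on May 13 in UTC, which is still the evening of May 12 in US time zones.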

From what I gather from the AOLserver list, on Solaris this leads to a hard crash in some pthread function call. On Linux it just seems to forever hang up processing of scheduled events (because it can't cope with a negative time and every negative number is less than any positive number).

On Linux, people who don't have MaxIdle or MaxOpen set at 1000000000, or who haven't restarted AOLserver since May 12th, won't have experienced the problem. (For someone with a 1 billion setting who last restarted on May 11th, AOLserver is right now scheduled to reset the database connections in mid-January 2038...)

A setting of 100 million, instead of 1 billion, wouldn't have exposed this condition on AOLserver 3.x for another twenty-eight years or so. Zero is the right value to use now. (Apparently 1 billion was chosen, instead of zero, due to some bug in the Oracle driver or the Oracle client libraries... 1 billion being "effectively" forever... until this month!)
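
As a rough check on the "twenty-eight years" figure, this Python sketch compares the two settings. It assumes the only thing that matters is whether the startup time plus the setting still fits in a 32-bit signed integer; `first_unsafe_start` is a made-up helper name.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # 32-bit signed limit for seconds since the epoch


def first_unsafe_start(setting_seconds):
    """Earliest startup time (UTC) at which now + setting exceeds INT32_MAX."""
    return datetime.fromtimestamp(INT32_MAX - setting_seconds + 1, tz=timezone.utc)


for setting in (1_000_000_000, 100_000_000):
    print(setting, first_unsafe_start(setting))
```

With 1 billion the first unsafe startup falls in May 2006; with 100 million it falls in 2034.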

Thanks Michael! Very clear explanation.

Brian

Yay for this thread and OpenACS. My AOLServer restarted for the first time today since May 12th 2006 and I saw the error. I found this thread and fixed it quickly.