Forum OpenACS Development: Can two ns_schedule procs hose a server?

We have a production system running ACS 3.2 / AOLserver 3.3+ad13 and PG 7.1.3 on RH 7.2. The system collects data from an external source with an ns_schedule_proc that runs every 15 minutes; the ns_schedule_proc calls a PL/pgSQL function that does the heavy lifting. Because of the volume of data being processed, the script usually runs for about 60 seconds but can run as long as 10-15 minutes if there is a lot of content to process.

We are seeing a situation where AOLserver spontaneously dies 2-4 minutes AFTER the script completes.

The last two times it happened were a Monday at 7pm and a Wednesday at 7am, which may coincide with jobs run by gc-defs.tcl. Unfortunately NOTHING appears in the log.

Other ACS 3 production boxes without this insert script have been running non-stop for about a year.

The documentation on ns_schedule is very thin and it isn't clear what is happening.

I would like to be able to reproduce the problem consistently but so far have not been successful.

Any help you guys can give is greatly appreciated.

Danny Lieberman

Posted by Janine Ohmer on
How do you define "spontaneously dies"?

I have many sites running 3.3+ad13 and all of the busy ones have a tendency to freeze.  All the nsd threads are still running, but they stop serving pages.  No errors in the error log but I have noticed that very often (but not always) the last thing in the log is the ns_log statement announcing the end of a scheduled proc.  Which would be consistent with what Danny is reporting.

It only happens on busy sites.  I have several sites where the staging and live sites share the same database, so there are two sets of scheduled procs running.  Only the live sites freeze.  So it's not *just* the procs - it's the procs in combination with nsd serving other pages at the same time.

All of our sites that do this use Oracle, not Postgres, since our Oracle sites are the bigger, busier ones.  However, I'm not surprised to see that Danny is running Postgres;  this doesn't strike me as a database related problem.

I can't shed any more light on this, unfortunately - since these are live sites I can't take the time to do any debugging, I have to restart them right away.  I can say, from the few times when things have not gotten restarted as quickly as they should have, that it doesn't seem to resolve itself with time.  Once nsd loses its mind, it's gone forever.

Posted by Dan Lieberman on
Thx for the response:
"Spontaneously dies" means this:
19:18 - end of schedule proc, excessive time taken (61 seconds)

19:22 - ns_getform select user_id etc...from users
(Somebody trying to login)
19:22 select passwd
19:22 update sec_sessions
19:22 select from wt_users (our extension to users)
19:22 update users set last_visit

The user logging in then gets a message: page not available.
When you do ps ax | grep nsd, all the nsd threads are gone.

I agree with you - with such simple queries - it is definitely not a db issue (Ora vs PG)

Posted by Jonathan Ellis on
If Unicode doesn't matter to you, definitely upgrade to nsd 3.5.1 -- I have had zero crashes since, where I used to have many per day.

Posted by Dan Lieberman on
good input.
How is Unicode broken in nsd 3.5.1?
We are serving up UTF-8 from the database, but we don't use any ACS routines. We HAVE added to nsd.tcl:

        ns_param  HackContentType 1
        ns_param  URLCharset  utf-8
        ns_param  OutputCharset utf-8
        ns_param  HttpOpenCharset  utf-8

Are these from the ad13 patch?

What kind of issues would we have with the ad tcl library and the Postgres 2 db driver?

Posted by Jonathan Ellis on
yes, those are from ad13.

the PG driver works out of the box.  I've had no issues with the tcl library, but I am using 3.2.5.  I can't think of any reason 4.x would have trouble, though.

Posted by Dan Lieberman on
I am also using 3.2.5.

So what we're saying here is that PG and Tcl aren't an issue, but serving up UTF-8 pages won't work.

That would definitely break our application.
Maybe recompile the ArsDigita contribution for UTF-8 support?

I have to find a way of stabilizing this server. Maybe I can get away with daemontools restarting AOLserver whenever it crashes, which can be anywhere from once in 2 weeks to twice in one day.

Posted by Dan Lieberman on
Well - it turns out it WAS a db related issue.
Using PG 7.1.3 we had a query that was passing float8 to decimal arguments and time to integer and such - the typing mismatch would clobber the backend from time to time.
After we upgraded to PG 7.3.1, which is much stricter on typing (and very nastily did away with the interval function), we discovered the issue.  FYI PG 7.3.1 does mildly break ACS ... be glad to supply the functions if anybody wants.

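For anyone hitting the same thing, here is a minimal sketch of stating such conversions explicitly instead of leaving them to implicit coercion. The literal values are made up for illustration, and reading "time to integer" as "an elapsed time in whole seconds" is an assumption; the actual query is not shown in this thread.

```sql
-- A float8 value passed where a decimal (numeric) argument is expected:
-- spell out the conversion rather than leaving it to the backend.
select cast(float8 '3.25' as numeric(10,2));

-- A time span used where an integer is expected (assumed meaning:
-- whole seconds): extract the seconds first, then cast the result.
select cast(extract(epoch from interval '90 seconds') as integer);
```

The same pattern applies to function arguments: wrap each argument in a cast to the declared parameter type so the call does not depend on how a particular PG version resolves the mismatch.
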
Posted by Bart Teeuwisse on

OpenACS 4.6.1 is PG 7.3.1 compatible. Upgrade to that version if you haven't already done so. If you are at 4.6.1 and find bugs please report them in the bugtracker.


Posted by Jeff Davis on
4.6.1 hasn't really been released.  You can check out the head of the 4.6 branch with -r oacs-4-6 to get the patched version. It should be released at the end of next week if you would rather wait.

Posted by David Walker on
I have not managed to track down the full story of the interval function but I did find out this much through trial and error. The function does still exist but cannot be used directly for whatever reason.

  • select interval('1 day') fails
  • select interval '1 day' succeeds
  • select '1 day'::interval succeeds
  • select "interval"('1 day') succeeds
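
A possible explanation for the results above, offered as a sketch rather than the full story: PostgreSQL accepts an optional precision after the type name, as in interval(2), so an unquoted interval( is read as the start of a type name rather than a function call. The PostgreSQL documentation on type casts notes that interval, time and timestamp can only be used in function-call style when double-quoted. The forms below avoid the function-call spelling altogether; the subselect and its column name span are made up for illustration.

```sql
-- Three equivalent ways to get an interval from a literal without
-- calling interval() as a function.
select cast('1 day' as interval);   -- SQL-standard cast syntax
select '1 day'::interval;           -- PostgreSQL shorthand for the same cast
select interval '1 day';            -- typed literal

-- Typical use in date arithmetic.
select now() + interval '1 day' as tomorrow;

-- The typed-literal form only works for constants. For a column or
-- expression, use a cast instead (hypothetical text column "span").
select cast(x.span as interval)
  from (select '2 hours'::text as span) as x;
```

Old code that calls interval(some_column) would need to be changed to one of the cast forms, or to the double-quoted "interval"(some_column) spelling from the list above.
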