Forum OpenACS Q&A: Long-running AOLServer is forgetful?

I'm working on a project that uses OpenACS 4.7 and PostgreSQL 7.3 on Red Hat Linux 7.3.  As you might expect during such a project, I've created several packages.  And as you also might expect, I've set watches on the .tcl files in those packages.

I've just had a heckuva week, and haven't had much time to work on this project, let alone edit files (.tcl, .adp, or otherwise) in the packages that I wrote.  So you can imagine my surprise when the JavaScript/HTML guru on the project complained that the various changes to the JavaScript that I produce had been rolled back to a previous version.  In other words, it's as if my last three weeks worth of work were no longer being executed.

But the real kicker was that when I checked the .tcl source code (in both the "tcl" and "www" directories), it was all 100 percent OK.  That is, my code was correct, but somehow things were getting lost between the Moreover, restarting the server solved the problem.

So I'm really confused, and even a bit worried.  Why would AOLServer (and/or OpenACS) be keeping around old versions of my .tcl files?  Why would the old ones suddenly come to life?  Why should restarting the server fix this?  (Does it have something to do with ADP caching?)

Has anyone else seen this problem?  It was quite disturbing to see this happen, and I'm beginning to wonder what I should do to avoid it in the future.

Reuven

Collapse
Posted by Jade Rubick on
Hmmmm, Aolserver does cache .tcl files in memory, I believe. It probably does this by thread, so unless you restarted Aolserver, you could get old behavior with one thread, and new behavior with another thread.

But I've never seen behavior like this. Usually if something changes, Aolserver recaches the .tcl bytecode, I believe.

Collapse
Posted by Andrew Piskorski on
Reuven, the problem is probably not with AOLserver at all, but with the hackery OpenACS does to implement watches on those .tcl files. That's the place to start looking anyway.

What are your maxthreads, minthreads, and threadtimeout settings on that server?

Did you, maybe, turn on PerformanceModeP sometime during that week? That might well kill all file watching for any new threads created. Actually, if changing the PerformanceModeP flag (to either on or off) is obeyed at anything other than server start time, it would probably give you weird and nasty results - old threads would have the latest watched files, new threads would have older Tcl procs without the watched files.

If I remember correctly, the watched files stuff never changes the contents of the AOLserver master thread (or whatever it's called) at all, as that thread is written to only during AOLserver startup. When a new thread is spawned, AOLserver copies the master thread to the new thread. The OpenACS file watching stuff roughly does two things: One, it sources all the watched files into all existing threads. Two, after each new AOLserver thread is spawned, it sources the watched files into those threads too. Naturally these extra steps make thread spawning less efficient.

I'd never purposely leave a server running for a week with watched files, btw. I believe the whole file watching business was intended strictly as a development tool. If it's a Development server, I would simply restart the server every time I finish testing something and commit the code to CVS. A Production server probably shouldn't ever use watched files at all.

Oh yeah, and from the mailing list, it sounds like there's some last minute work going into AOLserver 4.0 to give us lazy loading of Tcl procs there. They're doing it in order to speed up start time of AOLservers with really large amounts of Tcl code, but it may end up giving us better tools for implementing stuff like the OpenACS file watching as well.

Collapse
Posted by Don Baccus on
Watching is indeed turned off if performance mode is enabled.

Andrew's description of how it works is incorrect.  Each thread will reload the watched files when they next serve a request, with the call to the apm proc located in the request processor.  The mtime for the file at watch time is cached for all threads in an nsv array, then each thread does a "file mtime" and compares it to the cached mtime.  If it differs, the file is sourced into the current interpreter.  The code looks like it might be susceptible to race conditions but as Andrew points out it's really meant for development.  That why setting performance mode "true" turns off the check.

Also ... if you're making changes to /www .tcl files the caching is being done by AOLserver, with mtime also being used to determine when to recompile the file.  You don't need to "watch" because AOLserver checks each time the page is served (/tcl libarary files are different because they're never served, they just consist of procs called from served pages)

So ISTM that something else is the culprit here.  Are there any proxy servers in the loop?  What are the browser cache settings? etc etc

Collapse
Posted by Reuven Lerner on
I'm willing to believe you guys in terms of assigning blame, since it seems reasonable to assume that the "watch" functionality is what's causing this trouble, especially if it's left on for days at a time.

That said:

(a) I didn't do anything with the .tcl files after turning on the watch.  But at some point, the watching began to ignore chnages to the files, and kept the in-memory version.  It's almost as if watching was turned off in practice, but kept on as far as the system flags were concerned.

(b) It sounds reasonable that watches should only be active for a short period of time.  The performance issues with watches were always obvious to me, but on a development machine, I didn't really care.  Indeed, with a lot of changes going on, I found it very useful and helpful to keep a file watched for long periods of time -- days or weeks.  It might be nice to warn people against keeping watches active for more than a short period of time.

In other words: Yeah, I shouldn't have kept the watching on for so long, and I'll avoid doing that in the future.  But I feel like hackers should get a warning about keeping watches active for too long, because there does seem to be some sort of corruption/caching issue here, and I would hate for someone else to encounter the same problem in the future.

Reuven

Collapse
Posted by Andrew Piskorski on
Reuven, ".tcl files" is ambiguous, when you say that I'm not sure whether you mean tcl page files under www/, or tcl library files in the tcl/ directory.

But actually, I think Don was also saying that he suspects the file watches may not have been the problem...

Collapse
Posted by Janine Ohmer on
I had a somewhat similar scary experience this week.  I changed the name of a function paramter in a /tcl tcl script, which was being watched, but I kept getting an error when calling it.  Finally I tried changing the name of the parameter in the call back to the old name, and it worked.  So I restarted the server - and *the error did not go away*.  I was finally able to get it to go away by editing the tcl file, which caused the watch to reload it.

The really bizarre thing is that later that day, the file reverted back to it's previous form again, and I had to go through the above again.  I was definitely hearing Twilight Zone music that afternoon....

Collapse
Posted by Tom Jackson on

I'm not sure if this is related or not, but sometimes I have to empty my browser cache and then kill and restart the browser. It may have to do with the mtime on the file plus the cache's behavior. I essentially had the same problem: an error kept appearing.

So question: did the log also report the error?

Also debugging a request should be easier than it is. I started using index.vuh files, and it seems that once the rp hits that file, it stops logging. It actually reports the wrong file (it is the adp template it thinks it is going to use, but if the tcl file uses ad_return_template with something else, the rp has no clue).

Collapse
Posted by Lars Pind on
Hm.

Watching is only relevant for .tcl library files.

However, the templating system does its own caching of compiled .adp files, in that it compiles the .adp file into a proc in a certain namespace, which is the reused.

I'm actually not too familiar with the details, but I do recall having been bitten by an .adp page not being up-to-date wrt the .adp in the file system.

Reuven, can you help us by clarifying whether this was a library /tcl/*-procs.tcl file, a www/*.tcl file, or a www/*.adp file?

/Lars

Collapse
Posted by Janine Ohmer on
Tom,

Yes, in my case the error did show up in the log.

Collapse
Posted by Reuven Lerner on
Sorry for taking so long to respond, guys; I just moved to Chicago, and moving yourself, consulting work, a wife, and two children is not for the faint of heart.

I'm pretty  sure that the problem was with a .tcl file in the tcl/ directory.  I looked through my mail archive from that project, and I don't see a 100 percent answer on this topic.

But the fact that I checked to see if the file was being watched, and the proc names that I mention in my internal e-mail messages are in .tcl files lead me to believe that it was  in the tcl/ directory.

But it makes me feel *really* good to know that I'm not the only one scratching his head and experiencing some weird behavior here.  As someone else wrote, it was definitely Twighlight Zone-ish to modify a file and not see the results reflected on the Web page.

Reuven

Collapse
Posted by Janine Ohmer on
The Twilight Zone music is really blaring now...

I modified the -html_p parameter to forum::message::new, renaming it -format.  There are no traces of html_p left anywhere that matters (I'm on Oracle):

[janine@pacad2 forums]$ fgrep -rl html_p *
sql/oracle/upgrade-0.1d-0.2d.sql
sql/oracle/upgrade-0.4d-0.5d.sql
sql/postgresql/forums-messages-create.sql
sql/postgresql/forums-messages-package-create.sql
sql/postgresql/upgrade-0.1d-0.2d.sql

However, every time I restart the server, it reverts back to thinking that the parameter is called -html_p, and when I call it with -format I get an error.  I turn on watching for the forums package and go into tcl/messages-post.tcl and add a comment to the ::new function.  Voila, everything works again until the next restart, even if I remove the comment.

I've even verified that it thinks I should be using -html_p by changing the name back in message-post.tcl and verifying that the error goes away.  This is truly bizarre.

Collapse
Posted by Andrew Piskorski on
At server startup, are you somehow sourcing in a second older copy of the forum::message::new Tcl proc from elsewhere in your AOLserver? Do you see anything suspicious in your server log up though the "Notice: Done loading OpenACS."?

Hm, I haven't paid any attention yet to how the new Tcl service contracts work. Is it possible that some service contract type thing is reading the PL/SQL function definitions with the "html_p" switch and dynamically defining Tcl procs that match? I kind of doubt it, but...

Collapse
Posted by Brian Fenton on
Hi Janine,

make sure to do a case-insensitive grep:

fgrep -ril html_p *

Brian

Collapse
Posted by Tilmann Singer on
A little contribution to the twilight spookieness in here: I remember being bitten once by a left-over emacs backup file that had a tilde at the end, like 'forums-procs.tcl~'. It was sourced sometimes instead of the original version but I didn't track down in which situations exactly it happened (maybe it was after deleting the original file and the backup file was still sourced). Anyway, propably irrelevant for your problem, according to the grep output.
Collapse
Posted by Reuven Lerner on
Emacs backup files!?!  Well, that *would* explain things, at least to a certain degree, given that I use Emacs.  (Er, doesn't everyone?)

Is the ADP parser (and/or the OpenACS templating system) so stupid that it tries to load backup files?  Can this possibly be the case?  (I'll try to look through the source code to find out, but someone else here is probably quite an expert on the subject.)

Collapse
Posted by Frank N. on
Reuven,

If it really is the emacs backups, which causes problems with watching files, then for a quick fix just check your new packages into your own CVS repository. Emacs will not generate the backups when it notices it is working in a file under CVS control.