Forum OpenACS Q&A: Template corruption?

Posted by Reuven Lerner on
I'm running a stock OpenACS 4.5 system against PostgreSQL.  Over the
last few weeks, I've noticed a number of odd error messages when
trying to access ADP pages.  Reloading the page numerous times results
in a number of different error messages -- extraneous text from other
pages, Tcl errors complaining that variables weren't set, and other
such things.

For example, I created a (mostly) empty foo.tcl, and a (mostly) empty
foo.adp.  foo.adp contained little more than a call to <master>,
followed by <h2>foo</h2>.  When I went back to the index page, I
suddenly began to see "Foo" in a headline, or a Tcl stack trace, or
just the "<" bracket, or a totally OK page.  Reloading cycled me
through these errors.

It's possible that one of the designers played fast and loose with the
ADP, and I'll try to identify and fix that on my end.  But it's a bit
disturbing to think that one ADP page can affect another, be it in
functionality or in the content displayed.

Am I totally crazy?  Is ADP broken?

Posted by Jeff Davis on
It sounds strange. Have you set your DefaultParser to fancy?
ns_section ns/server/${server}/adp
ns_param   DefaultParser fancy
What modules do you have loaded?
Posted by Reuven Lerner on
Yes, I have DefaultParser set to fancy.  Overall, the OpenACS 4.x ADP templates are working just fine, and have been working fine since I first installed the system.  It's just recently that we've started to notice these weird instabilities.

Jeff, are you asking what OpenACS modules I have loaded, or what AOLServer modules I have loaded?

Posted by Jeff Davis on
AOLserver modules.  Whenever I hear of things working sometimes and not
others, I suspect data being corrupted in one thread and not
another (more likely caused by AOLserver modules than by
anything a designer might do).  You might post the stack trace as well,
since that would probably be informative.
Posted by Samir Joshi on

Reuven, when I upgraded to Red Hat 7.3, I started getting problems similar to the ones you mention when using the Konqueror browser (3.0.0-12). I suspected that the browser was not fetching or submitting the right information from or to the server. Changing the browser's cache settings did not help, but things work fine if I use a different browser (Netscape / Mozilla / IE). So I concluded that it was a browser-specific problem.

Posted by Reuven Lerner on

Samir, I would probably chalk it up to browser problems if we didn't see this on multiple browsers. Maybe Galeon is giving me problems, but I find it hard to believe that Galeon *and* two different installations of IE are having these problems. Besides, when a Tcl error occurs, I see the Tcl error message (and stack trace) in the server's error log.

Speaking of which, here's the stack trace that I see every fourth or fifth reload of the page:

[11/Sep/2002:13:51:13][17659.4101][-conn1-] Error: GET /  can't read "signatory": no such variable
    while executing
"append __adp_output "

<hr>
<address><a href="mailto:${signatory}">${signatory}</a></address>
${ds_link}
</body>
</html>
""
    ("uplevel" body line 3)
    invoked from within
"uplevel {
          set __adp_output ""
append __adp_output "

<hr>
<address><a href="mailto:${signatory}">${signatory}</a></address>
${ds_link}
</body..."
    (procedure "template::code::adp::/web/melton/www/index" line 2)
    invoked from within
"template::code::${template_extension}::$__adp_stub"
    (procedure "template::adp_parse" line 57)
    invoked from within
"template::adp_parse [file root [ad_conn file]] {}"
    (procedure "adp_parse_ad_conn_file" line 7)
    invoked from within
"$handler"
    ("uplevel" body line 2)
    invoked from within
"uplevel $code"
    invoked from within
"ad_try {
        $handler
      } ad_script_abort val {
        # do nothing
      }"
    invoked from within
"rp_serve_concrete_file [ad_conn file]"
    (procedure "rp_serve_abstract_file" line 60)
    invoked from within
"rp_serve_abstract_file "$root/$path""
    ("uplevel" body line 2)
    invoked from within
"uplevel $code"
    invoked from within
"ad_try {
        rp_serve_abstract_file "$root/$path"
        set tcl_url2file([ad_conn url]) [ad_conn file]
        set tcl_url2path_info([ad_conn url]) [ad_conn path_inf..."

Normally, this error means, "you forgot to define the variable signatory in your .tcl page, dummy!" But I see this only sometimes, not whenever I reload the page. Some reloads are fine, some have that corrupted HTML output, and others produce this stack trace.

(Of course, since I restarted the server, everything has been fine...)

As for what modules I'm using:

[reuven@mail-gw aolserver]$ ls bin/*.so
bin/nscache.so  bin/nsext.so   bin/nsrewrite.so  bin/nsxml.so
bin/nscgi.so    bin/nslog.so   bin/nssha1.so     bin/postgres.so
bin/nscp.so     bin/nsperm.so  bin/nssock.so

Oh, one other tidbit: One of the developers on the project, I've discovered, has been using ns_puts inside of his .tcl pages for verifying and debugging. Could this cause such problems?

Posted by defunct defunct on
Hmm.. ns_puts... I guess it depends on where and how he's used it..

This is looking more and more like a code bug though.... are you sure there are no circumstances under which the Tcl page might exit without having set up signatory?

Is it possible to post up the source for the TCL/ADP pair?

Posted by Andrei Popov on
I have seen something similar over the last couple of days on my test server.  Funnier yet, the same variable gets properly evaluated in one statement (db_string) and then fails completely in another (db_multirow).  But there I at least have consistency, in that it always fails in the same place.
Posted by C. R. Oldham on
By chance does anyone who is experiencing this have the OpenACS developer-support module loaded and enabled?

--cro

Posted by Reuven Lerner on
Well, our failures are global, which leads me to believe that

Things are a bit complicated on our server, since we've implemented (don't shudder too much, now) an HTTP-based chat client that updates its list of currently online users with every HTTP request to the server.  With each request, we're thus performing some database work *and* writing to an XML file on disk that lists the currently available users.  (There is some method to this madness, trust me.)

In going over things with the programmer who implemented this, I saw that with each HTTP request, we were opening a filehandle (named file) that was declared to be global at the top of a Tcl proc.  (I know, I know, global variables are evil.  You don't have to tell me.)  In any event, some of the corruption that took place in other pages on the system looked identical to what was supposed to be dumped into the disk file.  My guess, although I can't prove it, is that the combination of global variables and threads meant that the filehandle wrote to the user's Web browser rather than to the disk file.  This would account for many of the errors we've seen so far, but not all of them.

I'll send more questions or answers to the forum, as they arise!

Posted by defunct defunct on
On that note, please be aware that each thread has its own Tcl interpreter, and therefore there is no guarantee that a global will persist from one call to another! (depending on re-use of threads etc)

I suspect this may be your problem... perhaps you need to consider using  an alternative storage method.

Posted by Reuven Lerner on
Simon --

The reason why I began to suspect threads is precisely the fact that we saw the problem inconsistently, which meant that different Tcl interpreters were getting different versions of that global -- some of which worked, and some of which didn't.  Things look better now that the global variable has been made into a parameter, but I want to run through some more testing before declaring victory.
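
A minimal sketch of the parameter-based approach described above, in plain Tcl. The proc name write_online_users, the XML layout, and the argument names are hypothetical and are not the code from the site in question. The point is only that the path arrives as an argument and the channel is held in a local variable, so no other code running in the same interpreter can overwrite or reuse a global named file.

```tcl
# Hypothetical helper (illustrative names, not the original site's code).
# Assumes the user names are already safe to embed in XML.
proc write_online_users {path users} {
    # Local channel handle: invisible to every other proc and request.
    set fh [open $path w]
    if {[catch {
        puts $fh "<users>"
        foreach user $users {
            puts $fh "  <user>$user</user>"
        }
        puts $fh "</users>"
    } err]} {
        # Close the file before re-raising, so the handle is never leaked.
        close $fh
        error $err
    }
    close $fh
}
```

A caller would pass the destination explicitly, for example write_online_users /tmp/online.xml {alice bob}, where the path and names are again only placeholders.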

Posted by Don Baccus on
If you're trying to share global data across threads, you must use nsv variables.
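
A rough sketch of what sharing data through nsv might look like. nsv_set, nsv_get, and nsv_exists are AOLserver commands; the array name chat_online and the two chat_* procs are hypothetical. The stand-in definitions at the top are an assumption added only so the sketch also runs under plain tclsh; they are not thread-safe and are skipped inside AOLserver. Note that nsv arrays hold plain values, so this suits data such as a list of online users rather than an open file handle.

```tcl
# Stand-ins for plain tclsh only (assumption, NOT thread-safe).
# Inside AOLserver the real nsv_* commands exist and this block is skipped.
if {[llength [info commands nsv_set]] == 0} {
    proc nsv_set {array key value} { set ::__nsv(${array},$key) $value }
    proc nsv_get {array key} { return $::__nsv(${array},$key) }
    proc nsv_exists {array key} { info exists ::__nsv(${array},$key) }
}

# Hypothetical: record when a user was last seen, visible to all threads.
proc chat_mark_online {user_id} {
    nsv_set chat_online $user_id [clock seconds]
}

# Hypothetical: has any thread recorded this user as online?
proc chat_is_online {user_id} {
    return [nsv_exists chat_online $user_id]
}
```
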
Posted by Tom Jackson on

I have seen this type of problem often: usually the server cycles through three different errors, sometimes displaying the correct page. I think what I figured out is that sometimes your script is supposed to end execution, but doesn't, and the ADP gets executed anyway, even though the .tcl script was aborted.

For instance (I am running from memory here), ad_maybe_redirect_for_registration might redirect, but the proc itself returns to the Tcl script, so you get two pages returned, more or less. Also, there are similar problems with ad_return_complaint. I think I had to use something like:

ad_return_complaint 1 "Your data is bad"
ad_return_template "blank.adp"
return -code return
in order to get the system to work. This required an empty file called blank.adp in every directory.
Posted by Reuven Lerner on
Sure enough, turning the global variable into a parameter solved the problem.  I feel good about debugging this issue (especially since it wasn't my code), but I have to thank Jeff, who first got me thinking that one thread was corrupting the others.

The moral of the story, of course, is that global variables are almost always a bad idea, and that you should have a really good reason for using them -- particularly in a multithreaded environment.

Thanks to everyone for the excellent help!

Posted by Jeff Davis on
You should use ad_script_abort rather than return -code return, since
in some cases you are more than two calls deep and ad_script_abort
will still work in that circumstance.
Posted by Tom Jackson on

It looks like ad_script_abort calls ad_raise, which just calls return -code error ....

It isn't always an error to stop script execution, so that seems incorrect. Are you sure return -code return is limited by the Tcl level? I have never seen problems using this.

Posted by Jeff Davis on
Tom, return -code return just unwinds two levels rather than one. You can see the difference with this small example:
proc IamTheTop {} { 
     puts "Starting"
     catch { a } errMsg
     puts "Finishing $errMsg"
}

proc a {} { 
     puts "to b" 
     b
     puts "from b"
}

proc b {} { 
     puts "to c" 
     c     
     puts "from c"
}

proc c {} { 
     puts "return -code return"
     return -code return
}
which should produce:
Starting
to b
to c
return -code return
from b
Finishing
You can see that proc "a" picked up processing, which might be what you want, but in most places where I see return -code return it is probably not what people expect. Changing proc "c" to use return -code error (and updating its puts to match) yields
Starting
to b
to c
return -code error
Finishing
You generally will not get in trouble with return -code return in a .tcl/.adp page, but it can definitely be a problem in library calls and in some of the code blocks like on_error, where the return codes are not necessarily all handled correctly.
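
The error-based approach discussed above can be sketched in plain Tcl. This is not the OpenACS implementation: my_script_abort, validate, page_body, and run_page are hypothetical stand-ins, and the MY_SCRIPT_ABORT error code is an assumption. The idea is that the abort is raised as an error with a distinctive errorCode, unwinds through any number of callers, and is swallowed quietly at the top, while real errors are re-raised.

```tcl
# Hypothetical stand-in for an abort proc: raise an error tagged with a
# distinctive errorCode so the top level can tell it from a real failure.
proc my_script_abort {} {
    return -code error -errorcode MY_SCRIPT_ABORT "script aborted"
}

# Two levels below the top: decides the request cannot continue.
proc validate {log_var} {
    upvar 1 $log_var log
    lappend log "validate: bad input, aborting"
    my_script_abort
    lappend log "validate: not reached"
}

# One level below the top: never resumes after the abort.
proc page_body {log_var} {
    upvar 1 $log_var log
    lappend log "page: start"
    validate log
    lappend log "page: not reached"
}

# Top level: treat the tagged abort as a normal stop, re-raise anything else.
proc run_page {} {
    set log {}
    if {[catch { page_body log } msg]} {
        if {$::errorCode eq "MY_SCRIPT_ABORT"} {
            lappend log "top: aborted quietly"
        } else {
            error $msg $::errorInfo $::errorCode
        }
    }
    return $log
}
```

Calling run_page returns a three-element log ending in "top: aborted quietly"; neither "not reached" entry appears, no matter how deep the abort was raised.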