Forum OpenACS Development: What issues remain for OpenACS on AOLserver4
The AOLserver Core Team would like to know what outstanding issues exist for using OpenACS on AOLserver4.
Also, it looks like the Site Map returning "no data" or "broken url" is AOLserver4-specific (I haven't run earlier AOLserver versions with OACS, but others have commented that it worked).
Thanks Tammy, it looks like the filter error bug has been fixed. Vinod also seems to have figured out that the (incorrect) use of ns_eval was causing the problem with the site node map. I don't know what his solution will be, but he will probably substitute an nsv array for the bad code.
I think the nsv bug is still open; this should be a high priority to fix.
It is very helpful to report these bugs. OpenACS has a huge code base and it serves as a great test of AOLserver code.
>When trying to access the webstats area or some static
>html pages everything works ok (pages are delivered),
>but when I try to get a page that is using the openacs
>templating system there is no answer. I could trace the
>problem to the function "ns_adp_parse". That function
>does not return and does not deliver an error either. It
>seems to break when compiling tags defined by openacs.
And I haven't seen the site map problem, though perhaps I didn't understand the bug report and just haven't visited it in a way that triggers the problem.
The biggest issue has nothing to do with OpenACS - the fact that code that's not threadsafe has snuck into the implementation of the "file" command, leading to seemingly random crashes or (even worse) memory corruption that can cause files to be written to seemingly random directories etc etc.
What's the status of this problem?
It's a Tcl 8.4.x problem, not an AOLserver problem, but since AOLserver 4.0 requires Tcl 8.4.x it means one can't use AOLserver 4.0 in a production system.
Don, thanks for posting this one. It is like we have our hands tied, since it isn't an AOLserver bug.
The bug is due to passing a Tcl_Obj between threads (in the cd command), and it hasn't been fixed. Apparently the person responsible for this is on vacation or busy. I'm amazed that such a problem would occur and it wouldn't be tagged as a critical bug. On some platforms (BSD derived), using pwd calls cd, so the bug is much worse. I wasn't aware that the file command had a problem too.
Also, the nsv bug has been fixed.
You're right, it is disappointing that the Tcl team hasn't treated this as a crisis-level bug needing an immediate fix. I guess not that many people use Tcl threaded outside the AOLserver community ... on the other hand AOLserver users represent a significant slice of the Tcl community.
Has the Tcl core team or Ousterhout himself been contacted about this (perhaps with a patch)?
/Lars (who hasn't looked into AOLserver 4 at all but feels like he should)
I'll post your questions to the AOLserverCore list. I need an update, and a chance to hopefully nudge things along.
From what I understand right now Zoran Vasiljevic (a Core Team member) is developing on Darwin. On this platform the cd command is used behind the scenes in a number of commands, so the problem shows up very quickly. All BSD variants seem to have the same bug. So he discovered the bug and traced it to the passing of Tcl objects (pointers?) between threads. The fix apparently requires changing this to something threadsafe.
In his original 2003-03-26 post to the AOLserver list reporting the problem, Zoran said:
The problem is in Tcl generic/tclIOUtil.c and naive handling of static Tcl_Obj *cwdPathPtr. The pointer to this Tcl object gets shuffled around threads by simple reference, it is read (referenced) without proper locks, etc. The implementor obviously protected the most obvious write operations, but neglected any others. Also, the Rule#1 in Tcl "Do not pass Tcl_Obj's between threads" is grossly violated.
And here's a different, more round-about take on why it's probably a difficult problem:
Back in March when this came up on the AOLserver list, I didn't understand that the current working directory of a process is maintained by the kernel, not by the process itself. So I was speculating about maybe being able to fix things by simply giving every thread its own independent thread local storage (aka, thread specific data) CWD. Here's what Rob Mayoff had to say about that:
Perhaps you do not realize that a process's current working directory is tracked by the kernel, not by the process. Tcl keeps track of its CWD for speed, but ultimately it's the kernel, not the process, that resolves relative pathnames, so it's the kernel's idea of the CWD that matters.
I believe that POSIX requires that all threads in a process share a working directory. Making each thread appear to have its own working directory requires either non-standard kernel support for per-thread CWD (which Linux has, but I don't think you can get to it through the pthreads interface), or intercepting every system call that involves a pathname (open, link, symlink, unlink, rename, access, stat, lstat, chdir, chroot, chmod, chown, lchown, mknod, mkdir, rmdir, bind, connect, and probably some more that I've forgotten). You might be able to ignore some of these for AOLserver, but intercepting any of them isn't necessarily easy, and it's definitely not possible to do so portably.
It still might be the best way to fix this problem, though.
Note Rob's last line - scary! Zoran independently said much the same thing:
Eh, the cwd is the thing which is used by most path-related sys/lib calls to resolve the absolute path of the file. It is tracked in the kernel, not in the process, so in order to make this happen, you ought to intercept *all* of the sys/lib calls fiddling with paths. Now, Tcl with its virtual filesystem *might* achieve this, since it really isolates the upper layers from the OS-specifics. But, if you ask me, I think this is voodoo.
To be honest, I was also playing with this idea, but after giving it a serious thought, I've abandoned it.
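Rob's and Zoran's point, that the working directory belongs to the whole process rather than to any one thread, can be seen in a minimal Python sketch. This is only an illustration of the kernel behaviour they describe, not Tcl or AOLserver code; the helper names chdir_in_thread and demo are made up for the example.

```python
import os
import tempfile
import threading


def chdir_in_thread(path):
    """Change directory from a worker thread and wait for it to finish."""
    worker = threading.Thread(target=os.chdir, args=(path,))
    worker.start()
    worker.join()


def demo():
    """Return True if a chdir done in a worker thread is seen by the caller."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        try:
            chdir_in_thread(tmp)
            # The calling thread never called chdir itself, yet its
            # working directory has moved along with the worker's.
            changed = os.path.samefile(os.getcwd(), tmp)
        finally:
            # Restore the process-wide CWD before the directory is removed.
            os.chdir(original)
    return changed


if __name__ == "__main__":
    print("worker chdir visible in main thread:", demo())
```

Since every thread shares that single directory, a per-thread cache of the CWD can go stale the moment any other thread calls cd, which is why thread-local storage alone does not solve the problem.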
Anyway, Zoran was working on fixing the bug, and last we heard he had some sort of fix (maybe only partial, I'm not sure) as of March 27, but it wasn't in the Tcl core yet. I haven't heard anything since then.
Oh yeah, and totally off-topic: This business of CWD always being tracked by the kernel, etc., is making me think that the exokernel guys really do have the right idea, and that safely multitasking the hardware and providing nice system call abstractions should be independent features of the OS environment, not both mushed together into the one system-wide kernel.
Zoran says he has a fix for the cd bug. Here is his reply to Lars' questions:
There is, albeit I think it will be better to push 8.4.3 out, ASAP.
> Has the Tcl core team or Ousterhout himself been contacted about this
> (perhaps with a patch)?
Yes. Patch is posted to SF and the person involved is aware about it.
Cheers, Zoran
You can read the announcement, release notes and changelog too.
This release fixes a multithreading issue, which boiled down to Tcl's "cd" command not being thread-safe. While most users didn't notice (because they didn't use cd in threads), aolserver cds a lot, so we noticed :)
Also, an issue that occurred in tcl-8.4.0 and was fixed in tcl-8.4.2 involved an erroneous status code from "catch".
(details: the problem was that the return value of catch was TCL_OK (0) when it should have been TCL_RETURN (2) when encountering a return statement.) Thanks to Mark Dalrymple for tracing the problem he saw to catch; I then asked the tcl people if that was a known issue, and it was. This issue affected openacs because it checks some invocations of catch for return value of 2 in db_exec and friends. A suggestion was made that the sense of the test be reversed, and the code should check for TCL_ERROR (1) instead. This way, the important condition (error occurred) is tested for, not whether or not a return statement was encountered in the catch block. Thanks to RockShox on freenode irc's #tcl channel for that suggestion.
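To make the suggestion concrete, here is a small Python model of the two ways of testing catch's result. It is not Tcl and not the actual db_exec code: the catch function, the buggy flag and the two failed_* helpers are hypothetical, and the old test is assumed to treat anything other than TCL_RETURN as a failure.

```python
TCL_OK, TCL_ERROR, TCL_RETURN = 0, 1, 2


class TclReturn(Exception):
    """Models a Tcl `return` executed inside the caught script."""


def catch(script, buggy=False):
    """Model of Tcl's catch; buggy=True mimics the 8.4.0 behaviour."""
    try:
        script()
    except TclReturn:
        # The bug described above: TCL_OK reported instead of TCL_RETURN.
        return TCL_OK if buggy else TCL_RETURN
    except Exception:
        return TCL_ERROR
    return TCL_OK


def failed_old_test(code):
    """Assumed original sense: anything but TCL_RETURN counts as failure."""
    return code != TCL_RETURN


def failed_reversed_test(code):
    """Suggested sense: only TCL_ERROR counts as failure."""
    return code == TCL_ERROR


def returns_value():
    raise TclReturn()


def raises_error():
    raise ValueError("simulated database error")


if __name__ == "__main__":
    for buggy in (False, True):
        code = catch(returns_value, buggy=buggy)
        print(buggy, failed_old_test(code), failed_reversed_test(code))
```

In this model the reversed test gives the same answer with and without the bug, while the old test reports a failure for a script that simply returned.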
OpenACS + aolserver 4.0 on Solaris - I have been having this problem, and am puzzled as to what could be going on. Is there anyone who has managed to get OpenACS + aolserver 4 working on Solaris?
Should this wait until AOLserver4 is actually out of beta? One nice thing is that it is relatively easy to replace AOLserver in an OpenACS installation, but one issue still open is SSL. If this function is moved to a proxy server it might be a go at this point.
> it might be a go at this point.
If SSL is moved to a proxy server then you must serve all requests via SSL, or use a "smart" proxy server that knows what parts of your site need to be served via SSL. You cannot communicate this information back into OpenACS without some hacking, so I would vote that we not recommend AOLserver 4.0 until there is a working SSL implementation.
Pound has the issue with ns_write but there is a patch in the making to remove this limitation. In all other respects, I found Pound to be better. For example, Pound can add a custom header to requests forwarded to AOLserver when the request comes in as an HTTPS connection to Pound. Using this information, I have modified the security procs of OpenACS to treat these requests as if they were HTTPS connections to AOLserver.
The big win is that security management becomes transparent to OpenACS. One can still use the same security methods in OpenACS as before.
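The header-based check can be sketched roughly like this in Python. This is not the actual change to the OpenACS security procs: the function name, the header name X-SSL-Request, its value, and the trusted proxy address are all placeholders, and the peer-address check is an added assumption so that a client connecting directly cannot fake the header.

```python
def is_secure_request(headers, peer_addr, direct_ssl=False,
                      trusted_proxies=frozenset({"127.0.0.1"}),
                      marker_header="X-SSL-Request",
                      marker_value="1"):
    """Decide whether a request should be treated as HTTPS.

    headers    -- mapping of request header names to values
    peer_addr  -- address of the host that connected to the app server
    direct_ssl -- True if the app server terminated SSL itself
    """
    if direct_ssl:
        return True
    # Only believe the marker header when it comes from the proxy.
    if peer_addr not in trusted_proxies:
        return False
    # HTTP header names are case-insensitive.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(marker_header.lower()) == marker_value


if __name__ == "__main__":
    print(is_secure_request({"X-SSL-Request": "1"}, "127.0.0.1"))
    print(is_secure_request({"X-SSL-Request": "1"}, "203.0.113.9"))
```

With a check like this in one place, the rest of the security code can keep asking "is this connection secure?" without knowing whether SSL was terminated by the proxy or by the server itself.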
Also, nsopenssl should not be far off for AOLserver 4.0.
All in all, AOLserver 4.0 can be used with OpenACS under certain circumstances:
1) When the site doesn't require SSL
2) When the site uses SSL but offloads the SSL handshake to Pound and user pages don't use ns_write
3) When the site uses SSL but offloads the SSL handshake to Pound and Gustav's patch is applied to Pound.
Options 2) and 3) also require my hack to OpenACS. Should I be committing this hack to CVS?
Barry, the patch Bart is talking about is for Pound. One of the Pound maintainers posted to the AOLserver list with info about it. Basically, AOLserver happens to use older style syntax for some HTTP stuff, which Pound didn't support yet, so he's adding it.
I did run into some redirect problems but did not see anything with ns_write. Is the patch you have for pound or aolserver?