Forum OpenACS Q&A: carnage blender is live on aolserver 4 beta5

1: carnage blender is live on aolserver 4 beta5

Posted by Jonathan Ellis on 05/07/03 11:59 PM

For those waiting for aolserver 4, it's getting close. For me, it's already there -- it's far more stable for me than 3.5.1 was. Memory leaking, if any, is low, but I restart daily to reinit my cacheing, so YMMV. I've had zero crashes since I moved CB to it full-time Friday and no suspicious errors in my log.

Here's what I had to do, if you're curious:

download & compile nspostgres4 beta. I'm running oacs 3.2, so this was the only extra module I needed. (rl_returnz compiled with no changes.)
add nsdb to my init.tcl modules list -- otherwise, it doesn't get loaded in 4
specify -b ip:port in my startup; they broke it so it won't bind before switching to the non-root user otherwise, so port 80 gets permission denied

patch pthread.c so "ns_mutex destroy" doesn't have the potential of crashing the server. (aolserver developers don't feel this patch is worth applying officially but if you ever want to destroy a mutex, you pretty much need it.)

--- pthread.c   Sat May  3 19:04:52 2003
+++ pthread.c.old       Sat May  3 19:03:10 2003
@@ -181,14 +181,9 @@
 void
 NsLockFree(void *lock)
 {
-    int err = EBUSY;
+    int err;

-    /* spinwait until it's unlocked */
-    while (err == EBUSY) {
-       err = pthread_mutex_destroy((pthread_mutex_t *) lock);
-    }
-    /* {0,EBUSY} is supposed to be the only valid return codes for _destroy,
-       but we'll check anyway */
+    err = pthread_mutex_destroy((pthread_mutex_t *) lock);
     if (err != 0) {
        NsThreadFatal("NsLockFree", "pthread_mutex_destroy", err);
     }

patch tclvar.c so "nsv_incr" behaves the way it should: throw an error when the key doesn't exist. I have not yet submitted this patch.

--- tclvar.c    Sat May  3 20:17:44 2003
+++ tclvar.c.old        Sat May  3 19:53:02 2003
@@ -242,7 +242,7 @@
 NsTclNsvIncrObjCmd(ClientData arg, Tcl_Interp *interp, int objc, Tcl_Obj **objv)
 {
     Array *arrayPtr;
-    int count, current, result;
+    int count, current, result, new;
     char *value;
     Tcl_HashEntry *hPtr;

@@ -256,23 +256,21 @@
        return TCL_ERROR;
     }
     arrayPtr = LockArray(arg, interp, objv[1], 1);
-    hPtr = Tcl_FindHashEntry(&arrayPtr->vars, Tcl_GetString(objv[2]));
-    if (hPtr != NULL) {
+    hPtr = Tcl_CreateHashEntry(&arrayPtr->vars, Tcl_GetString(objv[2]), &new);
+    if (new) {
+       current = 0;
+       result = TCL_OK;
+    } else {
        value = Tcl_GetHashValue(hPtr);
        result = Tcl_GetInt(interp, value, ¤t);
-       if (result == TCL_OK) {
-           Tcl_Obj *obj = Tcl_GetObjResult(interp);
-           current += count;
-           Tcl_SetIntObj(obj, current);
-           UpdateVar(hPtr, obj);
-       }
     }
-    UnlockArray(arrayPtr);
-
-    if (hPtr == NULL) {
-       Tcl_AppendResult(interp, "no such key: ", Tcl_GetString(objv[2]), NULL);
-       result = TCL_ERROR;
+    if (result == TCL_OK) {
+       Tcl_Obj *obj = Tcl_GetObjResult(interp);
+       current += count;
+       Tcl_SetIntObj(obj, current);
+       UpdateVar(hPtr, obj);
     }
+    UnlockArray(arrayPtr);
     return result;
 }

2: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 12:01 AM

yeah, I generated those diffs "backwards." oops.

3: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Don Baccus on 05/08/03 01:29 AM

What version of Tcl are you using? Does it have the "file" command problem fixed? Or did you patch it? Or do you never use "file" commands so don't care?

Good news about the stability, cool.

4: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 03:03 AM

No, I don't use the file command. I use the same tcl I did with 3.5.1: 8.4.1.

5: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 07:20 PM

patch pthread.c so "ns_mutex destroy" doesn't have the potential of crashing the server. (aolserver developers don't feel this patch is worth applying officially but if you ever want to destroy a mutex, you pretty much need it.)

What!? "ns_mutex destroy" can crash AOLserver? That sounds like a major bug to me, and could definitely totally break some of the sites I run. Jonathan, can you tell use more about that bug, and why the AOLserver folks haven't or arent' going to fix it? Are the bug and patch on SourceForge?

7: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 07:28 PM

Don, I think you must be talking about the Tcl CWD thread-safety bug 710642. My understanding is that it does not have anything directly to do with using the Tcl "file" command, and you are not safe from the bug even if you never use any file commands at all. Zoran does have a patch for it up on SourceForge, so I guess you could patch Tcl yourself, but no further comment on the bug there since 2003-04-11.

6: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 07:37 PM

yes, check the original code -- if NsLockFree (Ns_MutexDestroy in 3.x) can't destroy the mutex b/c another thread is using it, it calls NsThreadFatal. Here's Jim's "official" explanation as to why:

"I think it's fine that NsThreadFatal kills nsd - design plan has always been that thread errors are fatal errors, i.e., process must be killed and fixed. As for the thread destory thing, idea is to use proper syncronization (e.g., Ns_ThreadJoin) to ensure when it's possible to destroy a mutex. Not doing so seems like sloppy code."

My response, which was ignored (twice) in private email, is that

there is no interface to check if another thread is using a mutex, so how can it be a programmer error if we don't provide a way to avoid it? (and a quick examination of the pthread functionality should convince you that we don't provide this because it's not possible to do so at that level.)
even if we jump through totally unnecessary bookkeeping hoops so that thread A, who wants to destroy the mutex, knows the thread id of B, who is currently using it, there is no guarantee that ns_thread wait on B, followed by destroy, would get the mutex before thread C, who is blocking on ns_mutex lock.

Really I have a hard time picturing a real-world situation where Jim's attitude makes sense.

8: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 08:19 PM

Jonathan, on AOLserver chat, Scott Goodwin just said he'll dig into the ns_mutex destroy issue more if you can send him some Tcl code that triggers the bug. Could you contact him, please?

9: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 08:28 PM

well, it's really easy. this code should work, er, crash fine.

setup:
nsv_set mutexes 0 [ns_mutex create]

thread A:
ns_mutex lock [nsv_get mutexes 0]

thread B:
ns_mutex destroy [nsv_get mutexes 0]

poof, there goes your server.

10: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 08:32 PM

Wait a minute, Jonathan, would it be sufficient to simply lock the mutex before trying to destroy it? Or am I missing something here?

If the problem exists in 3.3+ad13, I at least have never encountered it there, despite fairly heavy use of ns_mutex from multiple threads. But perhaps I coincidentally happen to not be triggering the bug.

11: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 08:38 PM

only if thread A doesn't unlock it before thread B issues the destroy. So if your code between lock and unlock is fast, and it should be, then you won't see this very often.

you could try grepping for mutex_destroy in your error log for kicks.

13: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 08:54 PM

I just checked and the same bug exists in 3.3, although NsLockFree is in pthread.cpp there.

12: Re: carnage blender is live on aolserver 4 beta5 (response to 11)

Posted by Tom Jackson on 05/08/03 08:55 PM

So, did you post this to the AOLserver list, or just send private email to Jim? Jim is not involved in the day to day development of AOLserver, and you might want a wider audience for your concerns. Discussions like this should be on a public list.

14: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 09:02 PM

yes, I posted the patch for 3.5.1 to the list -- the urls listserv stores them as are a mess, but if you search for mutex, since jan 2003, you will see it. then shmooved replied off-list and that's where the discussion stayed.

15: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 09:02 PM

Jonathan, are you saying that if thread B locks the mutex, AOLserver allows thread B to destory that mutex only if thread B first unlocks the mutex? Is that true? (I haven't tested it myself.)

16: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 09:09 PM

correct. if you try before thread B unlocks it, it calls NsThreadFatal.

17: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 09:16 PM

oops, I misread that as "if thread A locks the mutex, then..."

I haven't tried it in a single thread either, but lock followed immediately by destroy should also result in a crash in an unpatched nsd. The man page says,

pthread_mutex_destroy destroys a mutex object, freeing the resources it might hold. The mutex must be unlocked on entrance.

(with my patch, doing this will cause your thread to loop indefinitely. I don't really consider that a drawback, because in normal use if you forget to issue an unlock call you will also see an infinite block next time another thread tries to lock that mutex.)

18: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 09:35 PM

Jonathan, the following Tcl code works just fine:

set lock [ns_mutex create] 
ns_mutex lock $lock 
ns_mutex destroy $lock 
ns_log Notice "atp: Mutex destroyed: '$lock'"

Therefore, I don't think this issue is nearly as serious as I first feared. The proper thing to do is have your Tcl code always lock the mutex before destroying it. That will guarantee you never have any problems.

Now, it might well be a very good idea to change AOLserver such that if you forgeting to lock the mutex could not possibly cause AOLserver to crash. However, I think this is a nice-to-have rather than release critical feature.

19: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 09:40 PM

interesting. I guess it's not the first time the manual has been wrong.

20: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 09:46 PM

Hm, there still may be some form of race condtion even when properly locking the mutex from Tcl before destroying it, as in both the AOLserver 3.3+ad13 and 4.0beta4 C code, Ns_MutexDestroy calls NsLockFree(mPtr->lock) before calling Ns_MasterLock().

Also, the comment in Ns_MutexDestroy that, "Note this routine is almost never used as mutexes normally exists in memory until the process exits." is simply utterly wrong. Mutex destroy functionality is exposed via the Tcl API, and the Ns_MutexDestroy C function has absolutely no way to know about or implement any policy on when or why ns_mutexes are destroyed.

21: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 09:59 PM

no, that should be fine -- the Ns_MasterLock is only for the nsd internal data structure manipulation.

22: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/08/03 10:06 PM

Without looking more closely at the code, I suspect the correct fix may be to:

Add something at the beginning of Ns_MutexDestroy to first locks the mutex if it is not already locked, in order to protect from a mistake in the user's Tcl code. However, no, I don't think that is guaranteed to be a complete solution, as Ns_MutexLock does not (and should not) call Ns_MasterLock, but it should greatly reduce the window during which a failure could occur. In practice I bet it would eliminate all instances of the server crashes Jonathan saw.
Possibly add something like Jonathan's patch as well.

Um, actually, my previus post was probably somewhat in error. While I think it's true that there is a race condition in Ns_MutexDestroy (see above), I don't think calling Ns_MasterLock before NsLockFree would do anything to fix it. Ah, yep, Jonathan, you already figured that out.

23: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/08/03 10:28 PM

OK, I built an nsd (3.3 ad 13 to match andrew) w/o my patch. My man pages are indeed correct for my system [mandrake 8.0 with patches] -- I start it up, create/lock/destroy a mutex in a single nscp session, and get

nsthread(1175) error: ptread_mutex_destroy failed in NsMutexDestroy: Device or resource busy
Aborted

and nsd is dead.

[On my patched server, I was correct, it loops infinitely.]

24: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 05/09/03 09:58 PM

Jonathan, very interesting. My previous test, where

set lock [ns_mutex create]
ns_mutex lock $lock
ns_mutex destroy $lock

worked fine, was on Solaris (SunOS 5.8). I just tried it again on Linux (2.4.18, Debian 3.0), and there it crashes AOLserver every time, with the same error message you gave.

So it seems at least on Linux, there is currently no guaranteed safe, correct way to destroy a ns_mutex from Tcl. Not good...

25: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 05/09/03 10:03 PM

actually, I am experiencing no problems using mutex destroy with my patch. just don't try to lock it first. :P

(I think you are mistaken about other race conditions... the pthread code takes care of those nicely.)

26: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 09/22/03 09:20 PM

Hey Jonathan, did you ever hear back from Scott Goodwin or anywhere else about the ns_thread destroy problem? Did they accept your patch or otherwise provide some way to guarantee that a thread doing "ns_mutex lock $lock ; ns_thread destroy $lock" will not crash AOLserver on Linux?

27: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Andrew Piskorski on 09/22/03 10:13 PM

I submited a bug on SorceForge for this, ns_thread destroy can kill AOLserver on Linux.

28: Re: carnage blender is live on aolserver 4 beta5 (response to 1)

Posted by Jonathan Ellis on 09/22/03 10:34 PM

No, they didn't. I ran an experiment and it turns out having up to a million mutex handles around didn't cause problems. This program runs about instantaneously:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t a[1000000];

void main(int argc, char **argv)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        pthread_mutex_init(a + i, NULL);
    }

    printf("Done creating\n");


    for (i = 0; i < 1000000; i++) {
        pthread_mutex_destroy(a + i);
    }
}

ns_mutex is just a thin wrapper around pthread_mutex calls. So I figured having around the measly 10k mutexes I use on my site won't cause me any problems and just stopped destroying unused ones. So perhaps a politically acceptable fix for the nsd guys might be to just not provide mutex destroy anymore.