Forum OpenACS Q&A: Re: AOLserver 4.0 Install instructions

Posted by Gustaf Neumann on
It is safe to remove this Tcl_Panic. It is not part of 4.5 or NaviServer, and it was removed from the 4.0 repository in 2003: http://aolserver.cvs.sourceforge.net/aolserver/aolserver/nsd/tclobj.c?r1=1.6&r2=1.7
Posted by Tom Jackson on
Ah, so Malte's patch is against an old version of AOLserver.

Gustaf,

A while back, about nine months ago, you submitted an AOLserver patch, I think for 4.0.10, to trigger the driver thread when the queue becomes available again (after an overload).

A similar patch was added to 4.5 to register an 'at ready' callback. In driver.c there is this line:

Ns_RegisterAtReady(TriggerDriver, drvPtr);

The problem is that the corresponding NsRunAtReadyProcs is never called (it isn't used anywhere), at least in the CVS version of AOLserver. Since there have been massive code changes with the threadpools, I wonder if this is even applicable.
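To make the concern concrete, here is a minimal sketch of a register/run callback pair. It is not AOLserver's code: the names RegisterAtReady, RunAtReadyProcs, ReadyCallback and CountTrigger are hypothetical stand-ins, and locking is left out. It only illustrates that registering a callback queues it, and that nothing fires until the matching "run" function is called.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callback signature; not AOLserver's actual type. */
typedef void (ReadyProc)(void *arg);

typedef struct ReadyCallback {
    struct ReadyCallback *next;
    ReadyProc            *proc;
    void                 *arg;
} ReadyCallback;

/* Head of the list of queued callbacks (no locking in this sketch). */
static ReadyCallback *firstReady = NULL;

/* Queue a callback. Nothing is executed here. */
void RegisterAtReady(ReadyProc *proc, void *arg)
{
    ReadyCallback *cb = malloc(sizeof *cb);

    assert(cb != NULL && proc != NULL);
    cb->proc = proc;
    cb->arg = arg;
    cb->next = firstReady;
    firstReady = cb;
}

/* Run and discard every queued callback; returns how many ran. */
int RunAtReadyProcs(void)
{
    int n = 0;

    while (firstReady != NULL) {
        ReadyCallback *cb = firstReady;

        firstReady = cb->next;
        cb->proc(cb->arg);
        free(cb);
        n++;
    }
    return n;
}

/* Example callback: counts how often it fired. */
void CountTrigger(void *arg)
{
    ++*(int *)arg;
}
```

With an int counter set to 0, calling RegisterAtReady(CountTrigger, &counter) leaves the counter at 0; it only becomes 1 once RunAtReadyProcs() is called. If the run side is never invoked, the registered callback has no direct effect.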

Also, could you describe the conditions which caused the previous version of AOLserver to hang? I'd like to test 4.5 under these conditions to see if there are any remaining issues.

I sent this to the AOLserver list, but the list does not appear to be posting any messages.

Posted by Gustaf Neumann on
The situation does not happen under normal conditions, only when nsd is running out of resources. I have seen it only on openacs.org, which has quite limited resources (little memory, slow CPU). For a while, openacs.org went into a state every night where it did not respond to any request, and members of the OCT restarted it each time it hung (typically at least once every day or night). Unable to fix the problem at the configuration level, the OCT members decided at the beginning of December 2006 to upgrade openacs.org to AOLserver 4.5 in the hope that the problem would vanish. However, the upgrade did not help in this regard.

At that time I started to look at the problem in more detail and found that it happened when all connection threads were busy and more requests were still being queued. One connection thread after another went into a busy loop, until the server froze completely. After running fine for a while, first a single connection thread under high load went into a busy loop (load 1, CPU 100%) while the server was still responding. Some time later, a second and a third connection thread did the same (e.g. load 3), making the resource situation worse. Finally, when all connection threads were stuck in busy loops, the server hung and stopped replying.

Back in January 2006, Jeff Rogers had sent a patch for AOLserver 4.0 to the AOLserver list, dealing with a problem he ran into when benchmarking AOLserver/OpenACS with ab (ApacheBench). During the benchmark, AOLserver ran out of resources and hung. Although he reported different symptoms (no busy CPU), I "ported" the patch to AOLserver 4.5. Since the patch was applied on openacs.org (mid-December 2006), no hangs have occurred.

The patch was accepted for aolserver 4.0 and 4.5. For more details, see http://sourceforge.net/tracker/index.php?func=detail&group_id=3152&atid=103152&aid=1615787
http://www.webservertalk.com/archive388-2006-12-1352490.html

Tom, are you asking out of academic interest, or do you have problems with the patch, or are you fighting similar symptoms?

Posted by Gustaf Neumann on
Tom wrote:
"Ah, so Malte's patch is against an old version of AOLserver."
The problem with the panic is fixed in aolserver_v40_bp. So I would suggest using that version in Malte's script, changing
cvs -z3 -d:pserver:anonymous@aolserver.cvs.sourceforge.net:/cvsroot/aolserver co -r aolserver_v40_r10 aolserver
to
cvs -z3 -d:pserver:anonymous@aolserver.cvs.sourceforge.net:/cvsroot/aolserver co -r aolserver_v40_bp aolserver
Posted by Tom Jackson on
Gustaf,

This is not academic at all. I see the registration of the callback in driver.c. But in the source code, I don't ever see the callback called. Somewhere NsRunAtReadyProcs must be called, or the TriggerDriver signal will never execute. I can't find this, either because it isn't there, or because my checkout is wrong or I can't use grep. I'm okay with any of these, but simply registering a callback and never using it seems unlikely to solve a real problem.

Also, I have no problems with the patch, unless it isn't actually being used by anyone. There are huge differences between 4.0.x and 4.5 wrt this subject, so I don't assume that the solution could be moved to 4.5 so easily.

The code is relatively undocumented. I'm an expert in undocumented code. A few months or years later I wonder why things are like they are. Why did I do that? So if the code is doing something, I'd like to make a note so it isn't removed. Tricky stuff is worth noting.

Posted by Gustaf Neumann on
Where I come from, the word "academic" has a positive connotation. Now I understand at least what you are after. You are right: when peeking around in the code, it is surprising that registering a callback that seems to be called nowhere helped ... but it did. NsRunAtReadyProcs() is declared extern, so a module might call it, but I don't see this either. Since the problem looked like a race condition to me, my suspicion is that registering the callback has the side effect of serializing some threads. The mutex in the callback registration might have this effect.

If you have time to investigate further, I would suggest taking out the patch and trying to reproduce the bug in a clean environment. It seems to be related to threads with a larger footprint (nobody saw it except with OpenACS applications).

Posted by Malte Sussdorff on
Just wanted to let you know that I changed that in my script, though I have to admit that the memory footprint of AOLserver 4.0 is considerably lower and it does not run out of resources. Sadly, my installation of AOLserver 4.5 should have contained the patch Gustaf mentioned, so that might not be the explanation after all.
Posted by Tom Jackson on
I agree with your use of the word "academic", and it appears to apply here: I don't have a crashing server or anything. I was just somewhat surprised that 4.0 and 4.5 could share a patch like this and have it actually work.

The fact that this helped out somehow (in a big way) is important to know.

I also noticed that NsRunAtReadyProcs() is extern, but it is only declared in nsd.h, not in include/ns.h. This means, as I just learned, that the caller can't be in a module; it has to be compiled into libnsd. Btw, the only place I found any *Procs() called was in nsmain.c.

The registered callback is TriggerDriver, and the use of TriggerDriver has changed. The similar SockTrigger used to be called conditionally during Sock Close, but now TriggerDriver gets called any time the mutex on the driver structure is unlocked (except in ns_driver query, where it is called while the mutex is still held). Maybe the important point is that TriggerDriver gets called if you query the driver from Tcl! So some new Tcl diagnostic code, such as a scheduled proc, could unstick the driver thread.
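For readers unfamiliar with the idea of "triggering" a driver thread, the sketch below shows one common way to do it: the thread waits in poll() on the read end of a pipe, and any other code wakes it by writing a byte to the write end. This is an assumption about the general technique, not a copy of driver.c; the Driver struct and the functions DriverInit, WakeDriver and DriverWait are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical driver state: only the wakeup pipe is modelled. */
typedef struct Driver {
    int trigger[2];  /* [0] = read end polled by the driver, [1] = write end */
} Driver;

/* Create the wakeup pipe; returns 0 on success, -1 on error. */
int DriverInit(Driver *drv)
{
    assert(drv != NULL);
    return pipe(drv->trigger);
}

/* Wake the driver by writing one byte to the pipe; 0 on success. */
int WakeDriver(Driver *drv)
{
    assert(drv != NULL);
    return write(drv->trigger[1], "", 1) == 1 ? 0 : -1;
}

/*
 * Wait up to timeoutMs milliseconds for a wakeup.
 * Returns 1 if woken (one byte is drained), 0 on timeout, -1 on error.
 */
int DriverWait(Driver *drv, int timeoutMs)
{
    struct pollfd pfd;
    char          c;
    int           n;

    assert(drv != NULL);
    pfd.fd = drv->trigger[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll(&pfd, 1, timeoutMs);
    if (n <= 0) {
        return n;               /* timeout (0) or poll error (-1) */
    }
    if (!(pfd.revents & POLLIN)) {
        return -1;              /* unexpected event such as POLLERR */
    }
    return read(pfd.fd, &c, 1) == 1 ? 1 : -1;
}
```

In this model, a driver loop that is stuck waiting would resume as soon as anything calls WakeDriver(), which is why an otherwise unrelated caller (for example a periodic diagnostic query) could have the side effect of unsticking it.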