Forum OpenACS Development: Excessive load, probably corresponding to hung threads downloading big files

The load on a newly launched OpenACS 5.1.4 production site is very close to 0 (0.00 0.00 0.00 in top) when there are no users. More typically, however, the load sits at 1.00 or 2.00. The site is still fully responsive. I have taken the following steps to diagnose the problem, all with guidance from people on IRC:
  • At first, I couldn't see anything in ps with a high %CPU. This is because the host is running Linux kernel 2.6, where ps doesn't show a process's individual threads by default. I used the command
     ps -A a x u H -T
    to see the runaway thread. This reveals one or more nsd threads that are using 100% of CPU.
  • To determine what the threads are doing, I created a file called test1.tcl (the name test conflicts with acs-automated-testing) with the contents
    foreach item [ns_info threads] {
        ns_write "$item\n"
    }
    This reveals that the culprit threads are downloads of large files from file-storage. They apparently fail, but then linger on.
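For anyone repeating the first diagnostic step, here is a minimal sketch of filtering per-thread ps output down to the busy threads. It assumes a Linux host with procps; the function name hot_threads and the default threshold of 90 are made up for illustration. Note that ps reports %CPU averaged over the lifetime of the thread, so a thread that only recently started spinning may show a lower figure.

```shell
# Print threads whose %CPU is at or above a threshold (default 90).
# Input on stdin: lines of "PID LWP %CPU COMMAND" with no header row.
# hot_threads is a hypothetical helper name, not part of any tool.
hot_threads() {
    threshold="${1:-90}"
    awk -v t="$threshold" '$3 + 0 >= t { print $1, $2, $3, $4 }'
}

# Live usage (assumes procps ps; -L lists one line per thread):
#   ps -eLo pid,lwp,pcpu,comm --no-headers | hot_threads 90
```

The LWP column is the thread id, which is the value to hand to strace -p when looking at a single runaway thread.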
Questions:
  1. Why are these downloads failing?
  2. How do I fix it?
  3. Does this diagnostic functionality already exist?
Do you have any idea as to how frequently these downloads fail, in terms of percentage of requests?

There was a big mystery in ACS some years back where Oracle or the AOLserver Oracle driver would kick some sort of 0 byte write error occasionally.

The mystery was solved when someone finally proved that dropping the connection during the midst of the return of content would trigger the error. People still see these on Oracle sites, for this reason.

I'm sure you're running PG, of course. I make the above point to make clear that even with our modern day network, high speed connections, etc ... socket connections are dropped at times.

What's weird here is that the thread's still running. That may mean that there's no chance in hell that sockets are being dropped (oh, and if a very high percentage of downloads are failing that probably rules out socket drops too).

Or it may mean that there's some messed-up recovery code in AOLserver, our code, both, whatever.

Are you returning this content from the file system, or from PG? PG binary file return - more properly the driver hack I implemented so long ago to work around PG's poor binary file handling capability - is SLOW, which is why we recommend mapping CR binary content to the file system. ns_returnfile should take very little CPU time.

100% of these downloads fail (though I don't think there has been an attempt to leave the browser open for, say, hours). I followed Jeff's advice to do an strace of the thread and it appears to be mapping and unmapping memory continuously. There are only a few such files, so my workaround is to move them out of file storage and onto the file system and then serve them up relatively directly instead of through file-storage.
Joel,

Is file-storage putting those files in the database or the filesystem? If they are in the filesystem OpenACS should be using ns_returnfile which should behave exactly the same as a file stored under www.

Joel, do you know if the size of what's being mmapped corresponds to the size of the file being returned? If so, that might point to a bug in AOLserver (assuming the file is being returned via ns_returnfile, it would be mmapped before being written to the client).
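One way to check this from a saved strace log is sketched below. It assumes the log has the usual strace line shape, with the mapping length as the second argument of mmap or mmap2; the function names mmap_lengths and count_full_file_maps are hypothetical. The length passed to mmap need not equal the exact file size (the server could map the file in pieces), so a count of zero does not rule anything out.

```shell
# Print the length argument of every mmap/mmap2 call found on stdin.
# munmap lines are skipped because they do not contain "mmap(".
mmap_lengths() {
    sed -n 's/^.*mmap2\{0,1\}([^,]*, *\([0-9][0-9]*\),.*$/\1/p'
}

# Count the mmap calls whose length equals the byte size of file $1.
count_full_file_maps() {
    size="$(wc -c < "$1" | tr -d ' ')"
    mmap_lengths | grep -c -x "$size"
}

# Usage (paths are placeholders):
#   strace -p <thread-id> -o trace.log    # stop with Ctrl-C after a few seconds
#   count_full_file_maps /path/to/big-file < trace.log
```

A large count that keeps growing while the thread spins would be consistent with the whole file being mapped and unmapped in a loop.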