Forum OpenACS Development: Re: OpenACS Performance Tests

Collapse
Posted by Gustaf Neumann on
Eduardo,

the aolserver has the bad behavior, that it is likely that the thread termination parameters (maxconnections, threadtimeout) terminate all connection threads at the same time. This happens, when these threads were started at more or less the same time, which is likely on busy sites or during benchmarks. This mass extinction of threads is performance-wise not a good idea, especially when most of the threads have to be recreated afterwards. Termination and recration of threads are costly operations learning to noticeable delays.

To overcome this problem i have committed a small change to aolserver 4.5 head, that introduces a "spread", a randomization factor for the two mentioned thread termination parameters. The spread will lead to slightly different values for maxconnections and threadtimeout per connection thread (default is +- 20%). When spread is 0 one obtains the original behavior. You might be interested in trying this version.

Collapse
Posted by Tom Jackson on
Gustaf is right here. The AOLserver code does not take into account the size of the queue or the number of requests which could be served by a thread before allowing the thread to exit.

There is a basic logic problem in the code, but it really only shows up under test conditions. Two real world complicating factors are running very few threads (less than 10) and serving long running requests (slow client or huge data).

My experience is that external benchmark programs are less reliable than AOLserver, so it is very difficult to use their results. Most of the slowness in OpenACS will be in database queries, which will be impossible to detect using an external tool.

OTOH, if you expect hundreds of simultaneous visitors to a DB backed site, be happy with your success and look into distributing your site over a number of front end servers.

Collapse
Posted by Eduardo Santos on
Hi Gustaf and Tom,

Thank you very much for your replies. I'm sorry for the time I've taken to post, but I had some personal issue to solve.

With our tests and observations, we are writing some kind of benchmark howto document that should go somewhere in the community, maybe in the XoWiki instance. Our tests had three branches of observation:

1 - AOLServer (OpenACS?) Tunning
2 - PostgreSQL Tunning
3 - SO Tunning

I'm going to try to make a brief description about the specific and generic observations we could find:

AOLServer (OpenACS? Tunning)

Our first issue with AOLServer was the 4.5.0 parameters problem that somebody fixed with the file /acs-tcl/tcl/pools-init.tcl as said in this post in Tom's message. With a simple update we where able to solve this issue.

The other problem was with the thread creation process wich, as you just said, is a mixed XoTcl + AOLServer problem. The most important thing we've realized is that the creation and destruction process consumes a lot of I/O operations. To improve the I/O performance we've tried to change the file system, but it had no effect due to the most important thing we've find out: DON'T EVER USE VM's IN PRODUCTION SITES.

Our server was based on Xen VM's and it was impossible to have I/O performance with Virtual Machines. The whole thing about it is that there's no virtualization process able to split the blocks and inodes completely when using virtual machines, so all the I/O is shared between the VM's in the Cluster. It's a little bit different from what happens with logical partitions possible in some hardwares, such as IBM's big servers. In that case the files are completely separated and the partition works with no I/O issues.

Based on this observation, we've switched the production environment to a dedicated server with the specifications described in the first post of this thread, and most of the problems with threads creation and destruction are gone.

The next step was to adjust the configuration file. I guess the biggest challenge to everybody using OpenACS is the best relation between number of users X maxconnections X requests X maxthreads. This is the part where we are stuck right now. According to www.mail-archive.com/aolserver@listserv.aol.com/msg09549.html" title="Gustaf's Post">this Gustaf's Post on AOLServer list:

the question certainly is, what "concurrent users" means (...) note, that when talking about "concurrent users", it might be the case that one user has many requests concurrently running. To simplify things, we should talk about e.g. 10 concurrent requests.

Then you need
- at least 10 threads
- at least 10 connections
The number of database connections is more difficult to estimate, since
this in an application(oacs) matter. In oacs, most dynamic requests need
1 or 2 db connections concurrently.

I know this is not like a cookbook, but using his observation we could think about one thread, one connection and 2 db connections for each request. With these parameters, concerning that the tests are going to perform all the requests we configure at the same time, the results are most likely following this logic. When we set the maxthreads to 150, there's a little bit of degradation in memory trying to serve 100 simultaneous requests. This degradation is over when you set the parameters minthreads to 80 and maxthreads to 120. What we can get from here is that one thread is able to serve one request in the best performance adjustments.

However, when you send these setting to production, there's maybe the most difficult problem: to estimate the number of requests per user. Maybe one patch in xotcl-request-monitor could answer this question, but we are still thinking about the best way to do it. We could also see that the number of connections per user is very different from the relation 1 X 1, and this another thing we are trying to find the best relation.

PostgreSQL Tunning

All the tests we are performing until now consider DB and OpenACS in the same box. The goal is to find out the performance limit to this setting, so we can remove PostgreSQL from the machine and measure how it gets better.

There's a lot of material for PostgreSQL in the Internet, and I'm not one specialist myself, but I guess there are some specific observations that could be done.

Everybody says that Internet applications are doing most of the time SELECT operations in the database. This is a myth. If you consider number of operations maybe it can be true, but it's not when you consider execution time and resources usage. The number of TPS (Transactions Per Second) in an application using OpenACS is very large, and that's a good thing. In the old Internet you where creating content to people see. The new paradigm says the opposite: now the users create the content, and that's why I use OpenACS. There's just no better option to build social and collaborative networks in the market.

Concerning this matter, most of PostgreSQL tuning involves transaction improvements. I can't still understand completely the pools mechanism that OpenACS uses for PostgreSQL, and I guess some improvement in this area could make my tests better.

The most important thing to adjust here is the shared memory mechanism. We've seen that, if you put a too big number in PostgreSQL and OS, the memory shared between PostgreSQL and AOLServer can cause the system to crash under stress situations, and that's not a good thing. The I/O becomes a problem with a large number of INSERT and DELETE operations, mostly because the thread creation process is also heavy for the system.

The conclusion is: if you want to have the best performance, you really have to split AOLServer and PostgreSQL in different boxes. The exact point to do it (DB size, number of users) is what we are trying to find out.

OS Tuning

Maybe this is the most difficult part to be done, because Linux has a lot of options that can be changed. A better analysis concerning resource usage is necessary so we can have better numbers. I'm going to put here a list of parameters we are changing in a Debian GNU/Linux Etch system:

# echo 2 > /proc/sys/vm/overcommit_memory
# echo 27910635520 > /proc/sys/kernel/shmmax
# echo 32707776 > /proc/sys/kernel/shmall
# echo deadline > /sys/block/sda/queue/scheduler
# echo 250 32000 100 128 > /proc/sys/kernel/sem
# cat /proc/sys/fs/file-max
753884
# echo 16777216 > /proc/sys/net/core/rmem_default
# echo 16777216 > /proc/sys/net/core/wmem_default
# echo 16777216 > /proc/sys/net/core/wmem_max
# echo 16777216 > /proc/sys/net/core/rmem_max

# su - postgres
postgres@nodo406:~$ ulimit 753883

There are some other things, such as Kernel changes and TCP/IP configuration we are changing, but I don't think we have the better adjusts yet, so let's wait a little longer.

That's it. Sorry about the long post, but I guess I should give the community some qualified feedback, concerning all the help you guys always give to us.

Collapse
Posted by Gustaf Neumann on
Eduaro, short comments to some of your statements:

- The last change in /acs-tcl/tcl/pools-init.tcl was 15 months ago. You might have been affected by change in the default pool order, http://fisheye.openacs.org/browse/OpenACS/openacs-4/etc/config.tcl?r1=1.47&r2=1.48

- i don't see an kind of "XOTcl problem" in this discussion (maybe i am overly sensible in this question). The only "problem" is see is that the xotcl-core writes a message into the error log, when aolserver threads exit. They would exit as well without XOTcl being involved.

- setting maxthreads to 150 is very high. We use on our production site (more than 2200 concurrent users, up to 120 views per second) maxthreads of 100. by using background delivery, maxthreads can be normally reduced. keep in mind that ever connection thread (or thread for scheduled procs) contains a full blueprint of all used packages, these are 5.000-10.000 procs). In other words, for every thread there is a separate copy of all procs in memory. We are talking about 500.000 to maybe more than one million (!) tcl procs, when 100 threads are configured. This is not lightweight. When all threads go down (as sketched in my earlier posting), all these procs are deleted and maybe recreated immediately in new threads.

- "difficult problem ... estimate the number of requests per user": The request monitor gives you on the start page the actual number of users and the actual number of requests per minute (in the graph), and the actual views/sec and the avg. view time per user. If you have an average view time (AVT) per user of say 62 seconds, then the views per user (VPU) per minute is VPU = 60/AVT = 0.96.
Therefore, 100 users generate U*VPU requests per minute (with the sample values: 96). Getting estimates on Requests per user is just the simple part. More complex is to figure out, how many threads one will need for that. One needs the average processing time, which is unfortunately depending in reality, how many other requests are currently running (all requests share common resources and are therefore not independent).

- Anyhow, the request monitor shows you as well the number of currently running requests, and by observing this value one can estimate the number of required threads. If you have eg. 10 connection threads configured, and you observe that often 8 ore more requests are running, you are already running into limits. Once all connection threads are busy, the newly incoming requests are queued, causing further delays. Even simple requests (like e.g. a logo) can take significant time, since the response time to the user is queuing time + processing time. If the QT is 10 seconds, every request will take at least 10 seconds. If the machine was not able to process the incoming requests quickly enough with the configured threads in the past than it is questionable, when it will be able to catch up. So, queuing requests should be avoided. This is also one place, where the background delivery helps since it frees connection threads much earlier to do other work. On the other hand, if you have e.g. 30 connection threads configured, and you see normally only 2 or three concurrent requests, you are most probably wasting resources.

- i do not agree with the conclusion "for best performance, you really have to split AOLServer and PostgreSQL" to different servers. It depends on the machines. Maybe you are correct with dual or quad-core Intel processors. If the database and aolserver are on the same machine, and you have enough memory and your have enough CPU powers (e.g. 8 cores or more) and your machine has a decent memory-throughput/latency, then local communication between database and aolserver is significantly faster. There are situations, where performance is better, when database and aolserver are on the same machine, even if the server is very busy (as on our learn@wu system). Certainly, it depends on the hardware and the usage patterns.

- concerning running in VMs: i completely agree, for a busy production site, using VMs is not good idea - at least not for beginners (tuning VMs and the hosting environment can help a lot, but there is yet another layer to fiddle around) Usually you have only one cpu visible from inside the VM, so using multiple CPUs (one strength of aolserver + postgres) won't help, so scalability is limited. This is one experience we are going currently through with openacs.org.

have to rush
-gustaf

Collapse
Posted by Gustaf Neumann on
PS: with upcoming processors like POWER7 http://en.wikipedia.org/wiki/POWER7 or Rock http://en.wikipedia.org/wiki/Rock_(processor), we will get 32 simultaneous threads (SMT) per chip! Linux will see this as 32 processors. Cool machines for running OpenACS and PostgreSQL.
Collapse
Posted by Neophytos Demetriou on
True, if the only option (due to the architecture) is to scale up (as opposed to the ability to scale out).
Collapse
Posted by Gustaf Neumann on
i don't see these options as alternatives, but as directions. It is bad if one has multiple cores in a machine is not able to use it. The basic OpenACS infrastructure was built for multiple processing units. The point is simply: the micro-processor designers stopped a few years ago to increase frequencies and to add cores and SMT. These chips arrive now, Aolserver and OpenACS are ready to use them - NOW.

It would be certainly an interesting project to build something like OpenACS based on a P2P-network and DHTs (cassandra, bigtable, ... coral, cfs, oceanstore .... pier), but this will be a different project, since some basic assumption about the data storage are different.

Collapse
Posted by Neophytos Demetriou on
It is also bad if you need an expensive machine to be able to scale (read that as scale linearly). BTW, I would not classify Cassandra as a DHT in the same way that I would not classify PostgreSQL as a file system. The basic assumptions about the programming language and programming style were different for xowiki so I don't see your point about the project talk.
Collapse
Posted by Gustaf Neumann on
The costs of processor per machine went down over the last yeas dramatically. ten years ago, a dual processor machine had the cost of a mainframe (i remember alta-vista vs. google discussions). today, a quadcore processor cost a few hundert dollar. i would expect a similar price for 32 cores in a few years from now.

Sure, DHT is just the basic workhorse, most of the other examples have also quite different properties than e.g. cassandra. What is the problem with the term "project"? xotcl-core and xowiki went out to be compatible with the acs data model. There is nothing wrong in having a project developing an acs package based on p2p technology. Many people will appreciate it. From the scope of the work, it is a project.

Collapse
Posted by Neophytos Demetriou on
Yes, from the scope of the work, it is a project. It's the *different* project part that I did not get the point ;)
Collapse
Posted by Neophytos Demetriou on
Furthermore, different file systems have different properties but that does not make them ACID-compliant databases.

Now, an interesting argument would be the one that makes the case for the use of ACID properties for content-oriented applications and semi-structured data.

Collapse
Posted by Brian Fenton on
Hi everybody,

this is a very interesting discussion.

Gustaf, you said: "by using background delivery, maxthreads can be normally reduced." Can you explain a little bit more how you use background delivery? What's a good case for background delivery, and what isn't? Can you point to any examples?

many thanks
Brian

Collapse
Posted by Neophytos Demetriou on
Hi Brian, if you 've got a lot of long-running threads after a while you can no longer serve any requests. Using background delivery you can delegate the serving of large files to a single thread that manages multiple channels (sockets).

You can do the same thing for resources (css, javascript, images, graphics) using a reverse proxy. The idea is that you should not keep threads loaded with the full blueprint of your installation for these tasks.

The way you use background delivery is similar to ns_returnfile, i.e.

ad_returnfile_background 200 [ns_guesstype ${filename}] ${filename}

Though, you need to make sure that you have tthread installed and the background delivery code that Gustaf wrote.

PS. The NaviServer project went the extra mile of serving requests by loading the procs dynamically --- only those required for a given request. My understanding is that the xorb package does something similar but at a higher level of abstraction.

Collapse
Posted by Gustaf Neumann on
i agree with most of the statements. background delivery is more general than the reverse proxy, since one can do all permission checking etc. as usual in openacs, just the spooling is done via a single thread. This single thread can spool simultaneously several thousands of requests. You can't do that with threads (at least not without modifying the linux kernel). We run into scalability problems with pound (thread based) which were solved by moving to nginx (async delivery).

neophytos, i think, you meant tclthreads and not "tthread" (which sounds close to zoran's ttrace). The c-level code for background delivery is part of naviserver and aolserver 4.5 in the cvs head version. You find patches for aolserver 4.0 and plain 4.5 in the wiki.

xorb is very different from ttrace and is no replacement. ttrace can help to make threads thinner, at the cost of making introspection more complex/wrong. ttrace is not (at least was not) depending on any naviserver code, Zoran wrote it for aolserver. ttrace has XOTcl support and vice versa.

The primary argument in the discussion above: assume you have 10 connection threads configured and you want to server some larger files over some slower lines (600 MB files, delivery might take 2 minutes per request). If 10 people download these files via connection threads at the same time, the server won't be able to serve any requests for these 2 minutes, since tall connection threads are busy; incoming requests will be queued. When you have bgdelivery installed, the connection thread is only used for permission checking and file location on the disk (usually a few milliseconds). This leads to much better scalability.

Collapse
Posted by Neophytos Demetriou on
about nginx: I did that as well but nginx has/had some trouble with the COMET stuff (as opposed to pound that works fine). Still, I'm using nginx.

about tclthreads: The package on TCL's file distribution area (on sf) is called thread-2.6.5 (not sure if that's what you mean). That package also includes ttrace, yes.

about ttrace: NaviServer has that integrated or makes use of it in order to provide tracing as an option. I haven't used that due to some problems that existed between that mechanism and XOTCL back when I tried (as soon as it was released). I remember sending you a message about that but I am not sure what's the current status.

about xorb: yes, I did not mean to imply that it is a replacement but it does serialize/deserialize code from a single thread and it also makes threads thinner, right? (just asking)

Collapse
Posted by Neophytos Demetriou on
The c-level code for background delivery is part of 
naviserver and aolserver 4.5 in the cvs head version. You 
find patches for aolserver 4.0 and plain 4.5 in the wiki.
If you use NaviServer, it's already there.
Collapse
Posted by Neophytos Demetriou on
Please disregard the comment "if you use NaviServer, it's already there". The only way to actually use it is to get the cvs head version :)
Collapse
Posted by Brian Fenton on
Hi Neophytos!
Good to see you on the forums again.

Ok, I got it - it's a drop-in replacement for ns_returnfile. Very useful to know about.

thanks,
Brian

Collapse
Posted by Gustaf Neumann on
Since this thread turns out out be a kind of tutorial, here some completion of the information: ad_returnfile_background is part of xotcl-core and has the following requirements
  1. xotcl-core (sure, if it is defined there)
  2. one needs either a "new" aolserver, or an "old" with the required patches (see discussion above and Boost...), and
  3. one has to have libthread installed (tcl thread library, as described in the request-monitor wiki page)
If you have e.g. only (1), you will have as well a function named ad_returnfile_background, but this will just call ns_returnfile.

Hope this helps -gustaf neumann

Collapse
Posted by Rocael Hernández Rizzardini on
Brian, more info is here: Boost your application performance...
Collapse
Posted by Eduardo Santos on
Hi Gustaf,

Your considerations are very helpful, as they always are. I can have now a more general view about threads X requests behavior. I'll try to make comments based on what you've posted.

- When I said there was a XoTcl problem, I was talking about how the destroy process can be painful depending on the OS + Hardware set. But as you said better, it's not like a XoTcl problem, is more like an AOLServer problem. I guess we agree here.

- I guess I could finally understand what the threads are meant to do. If you have a portion of memory that includes all the procs in the system that can be shared, I can see how 150 is a too large number. Thank you very much for this explanation. I was seeing in production an error that says that maxconnections is exceeded, even with machine resources available. Your explanation shows me that there where a lot of threads in queue waiting for a large download to finish,and background delivery could be a solution for that. I'll think about more testing on that and I'll give everybody a feedback about it.

- About use PostgreSQL and AOLServer in the same box, I've seen in this post here that you run in learn@wu the IBM logical partitions. Is that right? In that case you have no degradation and the system allows you to do everything in the same box without I/O issues. If you consider that a socket connection is quite faster than a TCP/IP one, it's probably a good idea to use the same box. Otherwise, I guess the split is a better option.

That's it. Thank you for the help.

Collapse
Posted by Gustaf Neumann on
Hi Eduardo,

even the mentioned problem (were many thread terminate at the same time) is gone with the actual head version of aolserver.

i am not sure about your question in the last paragraph. Yes, our production server of learn@wu (aolserver + postgres 8.2.*) runs in a single partition of the machine, using 6 of the 8 processors. We have still the same hardware as at the time of my posting with the figures (2005, earlier in the same forum thread), but we are now up to more than 2000 concurrent users, more than 60.000 learning resources, etc. and still below 0.4 seconds per view. The second partition of the server is used for development. The logical partition do not help to make anything faster and are more or less irrelevant in the discussion. We tested configurations with postgres or aolserver on other machines, but these setups were much slower.

As said earlier, if you run out of resources (e.g. cpu power), moving the database to a different (similar) server will certainly help. If one has already enough CPU and memory bandwith, putting the server on a different machine will slow things down.

Collapse
Posted by Eduardo Santos on
Hi Gustaf,

Now I've got what you said. You see: IBM has an OS split mechanism called LPAR or logical partitions, as it's explained here. Using this feature is like if you have a VM, but without the software and hardware limitations that makes it impossible to split completely the I/O between all the machines you have. I thought that was the case for your production system, wich now you say it's not.

If you consider that a socket connection is quite faster than a TCP/IP connection, I can see that having both DB and AOLServer in the same box can be faster if you have enough available resources.

Concerning our tests, however, there's something showing up that can bring more issues to this matter. Linux Kernel has a parameter that sets the maximum amount of shared memory your applications can use in the system, wich is located at /proc/sys/kernel/shmmax. This parameter controls, for example, the amount of memory PostgreSQL can use in the shared buffers. In our test cases, we had the following set of parameters:

## AOLServer parameters
ns_section ns/threads
ns_param mutexmeter true ;# measure lock contention
# The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
ns_param stacksize [expr 1 * 8192 * 256]

ns_section ns/server/${server}
ns_param maxconnections 1000 ;# Max connections to put on queue
ns_param maxdropped 0
ns_param maxthreads 300 ;# Tune this to scale your server
ns_param minthreads 200 ;# Tune this to scale your server
ns_param threadtimeout 120 ;# Idle threads die at this rate
ns_param globalstats false ;# Enable built-in statistics
ns_param urlstats true ;# Enable URL statistics
ns_param maxurlstats 1000 ;# Max number of URL's to do stats on

ns_section ns/db/pool/pool1
# ns_param maxidle 0
# ns_param maxopen 0
ns_param connections 200
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

ns_section ns/db/pool/pool2
# ns_param maxidle 0
# ns_param maxopen 0
ns_param connections 150
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

ns_section ns/db/pool/pool3
# ns_param maxidle 0
# ns_param maxopen 0
ns_param connections 150
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

## Kernel Parameters:
/proc/sys/kernel/shmall 2097152
/proc/sys/kernel/shmmax 2156978176

## PostgreSQL Parameters:
max_connections = 1000 # (change requires restart)
shared_buffers = 2000MB # min 128kB or max_connections*16kB
work_mem = 1MB # min 64kB
max_stack_depth = 5MB # min 100kB

With these values, the memory was corrupted and caused the database to crash under the stress test, wich means that PostgreSQL process did shut down all connections and killed the postmaster process, making the server to show a message and don't shut down or come back.

With all the information I have now, I can see this was probably caused by the huge amount of threads I've put. The new threads where trying to get more and more memory, and the OS just couldn't give it. As both AOLServer and PostgreSQL where theoretically using the same memory area, it caused the DB to corrupt and to kill the process. If they where split in different servers maybe this wouldn't happen, and it made us think that different servers are always the best solution. No matter what you do wrong and what happens to the server, the AOLServer crash will not kill the DB process and DB will also not kill AOLServer too. This could be the safer choice for the case.

Collapse
Posted by Gustaf Neumann on

If you consider that a socket connection is quite faster than a TCP/IP connection ...
Well one has to be more precise. One has to distinguish between a local socket (IPC socket, unix domain socket) and a networking socket (e.g. TCP socket). Local communication is faster (at least it should be) than remote communication.

As discussed above, the minthreads/maxthreads values of the configuration are not reasonable (not sure what you were trying to simulate/measure/...). I have certain doubts, that the memory was "corrupted", but i have no problems to believe that several processes crashed on your system when it run out of memory. For running large applications one should become familiar with system monitors that measure resource consumptions. This way one can spot shortages and adjust the parameters in reasonable ways.