Forum OpenACS Development: Re: OpenACS Performance Tests

Collapse
Posted by Eduardo Santos on
Hi Gustaf,

Your considerations are very helpful, as they always are. I can have now a more general view about threads X requests behavior. I'll try to make comments based on what you've posted.

- When I said there was a XoTcl problem, I was talking about how the destroy process can be painful depending on the OS + Hardware set. But as you said better, it's not like a XoTcl problem, is more like an AOLServer problem. I guess we agree here.

- I guess I could finally understand what the threads are meant to do. If you have a portion of memory that includes all the procs in the system that can be shared, I can see how 150 is a too large number. Thank you very much for this explanation. I was seeing in production an error that says that maxconnections is exceeded, even with machine resources available. Your explanation shows me that there where a lot of threads in queue waiting for a large download to finish,and background delivery could be a solution for that. I'll think about more testing on that and I'll give everybody a feedback about it.

- About use PostgreSQL and AOLServer in the same box, I've seen in this post here that you run in learn@wu the IBM logical partitions. Is that right? In that case you have no degradation and the system allows you to do everything in the same box without I/O issues. If you consider that a socket connection is quite faster than a TCP/IP one, it's probably a good idea to use the same box. Otherwise, I guess the split is a better option.

That's it. Thank you for the help.

Collapse
Posted by Gustaf Neumann on
Hi Eduardo,

even the mentioned problem (were many thread terminate at the same time) is gone with the actual head version of aolserver.

i am not sure about your question in the last paragraph. Yes, our production server of learn@wu (aolserver + postgres 8.2.*) runs in a single partition of the machine, using 6 of the 8 processors. We have still the same hardware as at the time of my posting with the figures (2005, earlier in the same forum thread), but we are now up to more than 2000 concurrent users, more than 60.000 learning resources, etc. and still below 0.4 seconds per view. The second partition of the server is used for development. The logical partition do not help to make anything faster and are more or less irrelevant in the discussion. We tested configurations with postgres or aolserver on other machines, but these setups were much slower.

As said earlier, if you run out of resources (e.g. cpu power), moving the database to a different (similar) server will certainly help. If one has already enough CPU and memory bandwith, putting the server on a different machine will slow things down.

Collapse
Posted by Eduardo Santos on
Hi Gustaf,

Now I've got what you said. You see: IBM has an OS split mechanism called LPAR or logical partitions, as it's explained here. Using this feature is like if you have a VM, but without the software and hardware limitations that makes it impossible to split completely the I/O between all the machines you have. I thought that was the case for your production system, wich now you say it's not.

If you consider that a socket connection is quite faster than a TCP/IP connection, I can see that having both DB and AOLServer in the same box can be faster if you have enough available resources.

Concerning our tests, however, there's something showing up that can bring more issues to this matter. Linux Kernel has a parameter that sets the maximum amount of shared memory your applications can use in the system, wich is located at /proc/sys/kernel/shmmax. This parameter controls, for example, the amount of memory PostgreSQL can use in the shared buffers. In our test cases, we had the following set of parameters:

## AOLServer parameters
ns_section ns/threads
    ns_param   mutexmeter         true      ;# measure lock contention
    # The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
    ns_param   stacksize          [expr 1 * 8192 * 256]

ns_section ns/server/${server} ns_param maxconnections 1000 ;# Max connections to put on queue ns_param maxdropped 0 ns_param maxthreads 300 ;# Tune this to scale your server ns_param minthreads 200 ;# Tune this to scale your server ns_param threadtimeout 120 ;# Idle threads die at this rate ns_param globalstats false ;# Enable built-in statistics ns_param urlstats true ;# Enable URL statistics ns_param maxurlstats 1000 ;# Max number of URL's to do stats on

ns_section ns/db/pool/pool1 # ns_param maxidle 0 # ns_param maxopen 0 ns_param connections 200 ns_param verbose $debug ns_param extendedtableinfo true ns_param logsqlerrors $debug if { $database == "oracle" } { ns_param driver ora8 ns_param datasource {} ns_param user $db_name ns_param password $db_password } else { ns_param driver postgres ns_param datasource ${db_host}:${db_port}:${db_name} ns_param user $db_user ns_param password "" }

ns_section ns/db/pool/pool2 # ns_param maxidle 0 # ns_param maxopen 0 ns_param connections 150 ns_param verbose $debug ns_param extendedtableinfo true ns_param logsqlerrors $debug if { $database == "oracle" } { ns_param driver ora8 ns_param datasource {} ns_param user $db_name ns_param password $db_password } else { ns_param driver postgres ns_param datasource ${db_host}:${db_port}:${db_name} ns_param user $db_user ns_param password "" }

ns_section ns/db/pool/pool3 # ns_param maxidle 0 # ns_param maxopen 0 ns_param connections 150 ns_param verbose $debug ns_param extendedtableinfo true ns_param logsqlerrors $debug if { $database == "oracle" } { ns_param driver ora8 ns_param datasource {} ns_param user $db_name ns_param password $db_password } else { ns_param driver postgres ns_param datasource ${db_host}:${db_port}:${db_name} ns_param user $db_user ns_param password "" }

## Kernel Parameters: /proc/sys/kernel/shmall 2097152 /proc/sys/kernel/shmmax 2156978176

## PostgreSQL Parameters: max_connections = 1000 # (change requires restart) shared_buffers = 2000MB # min 128kB or max_connections*16kB work_mem = 1MB # min 64kB max_stack_depth = 5MB # min 100kB

With these values, the memory was corrupted and caused the database to crash under the stress test, wich means that PostgreSQL process did shut down all connections and killed the postmaster process, making the server to show a message and don't shut down or come back.

With all the information I have now, I can see this was probably caused by the huge amount of threads I've put. The new threads where trying to get more and more memory, and the OS just couldn't give it. As both AOLServer and PostgreSQL where theoretically using the same memory area, it caused the DB to corrupt and to kill the process. If they where split in different servers maybe this wouldn't happen, and it made us think that different servers are always the best solution. No matter what you do wrong and what happens to the server, the AOLServer crash will not kill the DB process and DB will also not kill AOLServer too. This could be the safer choice for the case.

Collapse
Posted by Gustaf Neumann on

If you consider that a socket connection is quite faster than a TCP/IP connection ...
Well one has to be more precise. One has to distinguish between a local socket (IPC socket, unix domain socket) and a networking socket (e.g. TCP socket). Local communication is faster (at least it should be) than remote communication.

As discussed above, the minthreads/maxthreads values of the configuration are not reasonable (not sure what you were trying to simulate/measure/...). I have certain doubts, that the memory was "corrupted", but i have no problems to believe that several processes crashed on your system when it run out of memory. For running large applications one should become familiar with system monitors that measure resource consumptions. This way one can spot shortages and adjust the parameters in reasonable ways.