Forum OpenACS Development: Re: OpenACS Performance Tests
I'm still running some other tests, but I've seen an odd thing: the command db_available_pools "" returned this:
pool2 pool3 pool1
It seems like the pool order is reversed, and I have no idea what is causing it. There's also something making my test results slightly wrong: the repeated-query "Operation Blocked" feature. As the requests are made simultaneously from the same machine, the system is stopping them, thinking the same query is being sent over and over. I have to find a way to disable this feature so the results can be OK.
I'm still working on your other suggestions and I'll post my results here as soon as I can.
A simple approach to fix the weird order might be to change

    ns_section ns/server/${server}/db
        ns_param pools "*"
        ns_param defaultpool pool1

to

    ns_section ns/server/${server}/db
        ns_param pools pool1,pool2,pool3
        ns_param defaultpool pool1

Concerning the request monitor: to deactivate the repeated-operation blocking, but keep the measuring, simplify the method check (in xotcl-request-monitor/tcl/throttle_mod-procs.tcl) to

    throttle ad_proc check {} {
        ...
    } {
        my get_context
        return 0
    }

I think this should help (add this definition simply to the end of the -procs.tcl file; you can simply delete it from there when you are done).
Sorry for the late answer, but I was doing some other things and left the tests aside for a bit. I'm going to have a report really soon about the effect of changing each parameter on performance.
My question now is much more related to the service itself than to the test environment. I'm facing a performance issue right now and I don't know where it comes from. Every once in a while, about every ten minutes to half an hour, I see a major performance problem. I just can't get a response, and the DB queries seem to get really slow. Trying to find out what this is about, I've found these messages in my log:
[16/Jun/2008:21:58:57][14985.1296435552][-default:720-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1317448032][-default:721-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1401497952][-default:728-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1300638048][-default:736-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1275423072][-default:730-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1174587744][-default:756-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1393092960][-default:705-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1103145312][-default:747-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1334258016][-default:724-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1342663008][-default:726-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1107347808][-default:745-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:57][14985.1292233056][-default:718-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:58][14985.1204005216][-default:744-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:58][14985.1174587744][-thread1174587744-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:58:58][14985.1174587744][-thread1174587744-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 302ms)
[16/Jun/2008:21:58:58][14985.1275423072][-thread1275423072-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:58:59][14985.1191397728][-default:759-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:59][14985.1275423072][-thread1275423072-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 497ms)
[16/Jun/2008:21:58:59][14985.1288030560][-default:733-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:59][14985.1103145312][-thread1103145312-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:58:59][14985.1145170272][-default:713-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:59][14985.1267018080][-default:742-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:58:59][14985.1199802720][-default:702-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:00][14985.1376282976][-default:735-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:00][14985.1103145312][-thread1103145312-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 667ms)
[16/Jun/2008:21:59:02][14985.1325853024][-default:737-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:03][14985.1212410208][-thread1212410208-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:03][14985.1212410208][-thread1212410208-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 233ms)
[16/Jun/2008:21:59:03][14985.1393092960][-thread1393092960-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:04][14985.1393092960][-thread1393092960-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1050ms)
[16/Jun/2008:21:59:05][14985.1321650528][-default:722-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:07][14985.1199802720][-thread1199802720-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:08][14985.1199802720][-thread1199802720-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1202ms)
[16/Jun/2008:21:59:10][14985.1170385248][-default:717-] Warning: db_exec: longdb 5 seconds nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_update_user_session_info.update_last_visit
[16/Jun/2008:21:59:11][14985.1149372768][-default:753-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:11][14985.1166182752][-default:716-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:12][14985.1111550304][-default:746-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:18][14985.1292233056][-thread1292233056-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:18][14985.1334258016][-thread1334258016-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:19][14985.1292233056][-thread1292233056-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 936ms)
[16/Jun/2008:21:59:20][14985.1334258016][-thread1334258016-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2293ms)
[16/Jun/2008:21:59:25][14985.1140967776][-default:731-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:26][14985.1132562784][-default:712-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:29][14985.1254410592][-default:738-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:29][14985.1149372768][-thread1149372768-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:30][14985.1149372768][-thread1149372768-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1293ms)
[16/Jun/2008:21:59:30][14985.1338460512][-default:725-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:31][14985.1182992736][-default:757-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:34][14985.1161980256][-default:696-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:35][14985.1271220576][-default:719-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:35][14985.1098942816][-default:749-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:36][14985.1317448032][-thread1317448032-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:36][14985.1317448032][-thread1317448032-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 327ms)
[16/Jun/2008:21:59:40][14985.1098942816][-thread1098942816-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:40][14985.1178790240][-default:741-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:41][14985.1098942816][-thread1098942816-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1219ms)
[16/Jun/2008:21:59:42][14985.1258613088][-default:708-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:42][14985.1153575264][-thread1153575264-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:44][14985.1304840544][-default:734-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:44][14985.1166182752][-thread1166182752-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:44][14985.1119955296][-default:750-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:44][14985.1153575264][-thread1153575264-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1685ms)
[16/Jun/2008:21:59:46][14985.1208207712][-default:740-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1090537824][-default:727-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1115752800][-default:697-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1136765280][-default:752-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1372080480][-default:698-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1140967776][-thread1140967776-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:46][14985.1262815584][-default:707-] Notice: exiting: timeout waiting for connection
[16/Jun/2008:21:59:46][14985.1166182752][-thread1166182752-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2535ms)
[16/Jun/2008:21:59:46][14985.1145170272][-thread1145170272-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:48][14985.1140967776][-thread1140967776-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2978ms)
[16/Jun/2008:21:59:50][14985.1145170272][-thread1145170272-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 4059ms)
[16/Jun/2008:21:59:53][14985.1401497952][-thread1401497952-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:57][14985.1304840544][-thread1304840544-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:21:59:57][14985.1401497952][-thread1401497952-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 4555ms)
[16/Jun/2008:22:00:00][14985.1271220576][-thread1271220576-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:01][14985.1304840544][-thread1304840544-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 4575ms)
[16/Jun/2008:22:00:02][14985.1288030560][-thread1288030560-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:04][14985.1271220576][-thread1271220576-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 4110ms)
[16/Jun/2008:22:00:05][14985.1372080480][-thread1372080480-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:05][14985.1178790240][-thread1178790240-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:08][14985.1288030560][-thread1288030560-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 5566ms)
[16/Jun/2008:22:00:08][14985.1372080480][-thread1372080480-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2902ms)
[16/Jun/2008:22:00:08][14985.1178790240][-thread1178790240-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2884ms)
[16/Jun/2008:22:00:14][14985.1208207712][-thread1208207712-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:17][14985.1262815584][-thread1262815584-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:17][14985.1182992736][-thread1182992736-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:17][14985.1262815584][-thread1262815584-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 624ms)
[16/Jun/2008:22:00:18][14985.1208207712][-thread1208207712-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 3466ms)
[16/Jun/2008:22:00:19][14985.1107347808][-thread1107347808-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:20][14985.1090537824][-thread1090537824-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:24][14985.1107347808][-thread1107347808-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 4952ms)
[16/Jun/2008:22:00:24][14985.1090537824][-thread1090537824-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 3785ms)
[16/Jun/2008:22:00:24][14985.1132562784][-thread1132562784-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:25][14985.1086335328][-sched-] Warning: db_exec: longdb 9 seconds nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_sweep_sessions.sessions_sweep
[16/Jun/2008:22:00:25][14985.1342663008][-thread1342663008-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:26][14985.1086335328][-sched-] Warning: sched: excessive time taken by proc 4 (11 seconds)
[16/Jun/2008:22:00:28][14985.1132562784][-thread1132562784-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 3387ms)
[16/Jun/2008:22:00:28][14985.1191397728][-thread1191397728-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:29][14985.1300638048][-thread1300638048-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:30][14985.1191397728][-thread1191397728-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 1976ms)
[16/Jun/2008:22:00:32][14985.1342663008][-thread1342663008-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 7163ms)
[16/Jun/2008:22:00:35][14985.1325853024][-thread1325853024-] Notice: destroy called, ::bgdelivery ::xotcl::THREAD->destroy (0ms)
[16/Jun/2008:22:00:37][14985.1300638048][-thread1300638048-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 7627ms)
[16/Jun/2008:22:00:37][14985.1325853024][-thread1325853024-] Notice: destroy called, ::throttle ::xotcl::THREAD->destroy (0ms, 2453ms)
All of a sudden the XOTcl objects get destroyed, and I have no idea where it comes from. Can you give me some hint about it?
What i can see from your log is that:

- connection threads time out (most likely due to your threadtimeout settings); it seems they are not fed by new requests.

- the "destroy called" messages are not serious. When a connection thread terminates (the thread is destroyed), all the objects it contains are destroyed as well. Since the deletion semantics on the C level are quite tricky, i left the notice calls in. You just see here the messages of the thread proxy objects; XOTcl itself is not destroyed.

- a few db-queries seem to be quite slow:
  5 seconds nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_update_user_session_info.update_last_visit
  9 seconds nsdb0 dml dbqd.acs-tcl.tcl.security-procs.sec_sweep_sessions.sessions_sweep
  sched: excessive time taken by proc 4 (11 seconds)

Without more information, it is hard to guess what happens. Does the "performance issue" happen during/after the benchmark or during normal operations? To see what's happening in your database, use e.g.
http://search.cpan.org/dist/pgtop/pgtop
http://www.rot13.org/~dpavlin/sysadm.html
For a deeper understanding of postgres semantics, in particular with checkpoints, see http://www.westnet.com/~gsmith/content/postgresql/
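A quick way to see the same thing from inside OpenACS is to query pg_stat_activity directly; a minimal sketch (assuming the PostgreSQL 8.x column names procpid/waiting/current_query and an available db pool), run e.g. from the developer-support shell:

    # lists what every PostgreSQL backend is currently doing
    db_foreach check_activity {
        select procpid, waiting, current_query
        from   pg_stat_activity
    } {
        ns_log Notice "pg backend $procpid waiting=$waiting: $current_query"
    }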
Thank you for your quick answer. I have already installed ptop and some other tools on my machine to monitor PostgreSQL. From the PostgreSQL analysis, it seems that every time these messages show up in my log, all the DB queries get very slow and ptop shows me that they are waiting to be parsed (they are in a waiting state). From this analysis, my first thought is that this object-destroy call is bringing the performance down.
Then I'm asking myself: why does it happen? When I run the benchmark, the system uses all available DB connections and threads. After that, it destroys some of them, which is what the log messages above show. However, while the threads are being destroyed, the system gets so slow that we just can't navigate.
So, here are some thoughts I can take from this:
1 - Is the thread destroy process a performance problem?
2 - Why do the DB connections stay in a waiting state while this destroy is happening, if I have memory, processor and connections available?
I'll try to take a closer look at the PostgreSQL performance info when this happens and try to spot something I haven't seen yet. Thank you for your help.
The primary question is to figure out what resources are running out. Candidates:
- cpu
- amount of memory
- memory bandwidth
- i/o

and who is causing it:
- aolserver (e.g. lock contention)
- postgres (e.g. checkpoint handling, vacuuming, some complex queries, ...)

Is the whole machine in a bad state (e.g. load etc., how does it react on the console), or just the aolserver or the database? How many cpus do you have on your system (fgrep processor /proc/cpuinfo)?

In response to your questions:

- i have never seen thread destroy as a performance problem, but under "normal conditions" not all threads end at more or less the same time. Normally, the problem with the thread destroys is just that they might be recreated quite soon afterwards; thread creation is slow when the blueprint is large (when you have many packages installed). However, if one has a large blueprint, the thread cleanup has to free all of its contents, which might be a couple of million individual free operations. This might entail quite a large number of memory locks as well.

- there are also many possible reasons for waiting operations. Do you see error messages concerning DEADLOCKS in your database? OpenACS 5.4 uses fewer locks (which were introduced to overcome some problems in PostgreSQL 8.0 and 8.1).

-gustaf neumann
the aolserver has the bad behavior that the thread termination parameters (maxconnections, threadtimeout) are likely to terminate all connection threads at the same time. This happens when these threads were started at more or less the same time, which is likely on busy sites or during benchmarks. This mass extinction of threads is not a good idea performance-wise, especially when most of the threads have to be recreated afterwards. Termination and recreation of threads are costly operations leading to noticeable delays.
To overcome this problem i have committed a small change to aolserver 4.5 head that introduces a "spread", a randomization factor for the two mentioned thread termination parameters. The spread leads to slightly different values for maxconnections and threadtimeout per connection thread (the default is +/- 20%). When spread is 0 one obtains the original behavior. You might be interested in trying this version.
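Roughly, the configuration would then look as follows (a sketch only: the parameter name "spread" is taken from the description above, but the exact section, spelling and value range are assumptions that should be checked against the sample config of the aolserver 4.5 head version):

    ns_section ns/server/${server}
        # assumed placement: same section as the thread parameters it randomizes;
        # a value of 0 would restore the original (non-randomized) behavior
        ns_param maxthreads    100
        ns_param threadtimeout 120
        ns_param spread        20    ;# +/- 20% randomization (assumption)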
There is a basic logic problem in the code, but it really only shows up under test conditions. Two real-world complicating factors are running very few threads (fewer than 10) and serving long-running requests (slow client or huge data).
My experience is that external benchmark programs are less reliable than AOLserver, so it is very difficult to use their results. Most of the slowness in OpenACS will be in database queries, which will be impossible to detect using an external tool.
OTOH, if you expect hundreds of simultaneous visitors to a DB backed site, be happy with your success and look into distributing your site over a number of front end servers.
Thank you very much for your replies. I'm sorry for the time I've taken to post, but I had some personal issues to solve.
With our tests and observations, we are writing a kind of benchmark howto document that should go somewhere in the community, maybe in the XoWiki instance. Our tests had three branches of observation:
1 - AOLServer (OpenACS?) Tuning
2 - PostgreSQL Tuning
3 - OS Tuning
I'm going to try to give a brief description of the specific and general observations we were able to make:
<h4> AOLServer (OpenACS?) Tuning </h4>
Our first issue with AOLServer was the 4.5.0 parameters problem that somebody fixed with the file /acs-tcl/tcl/pools-init.tcl, as mentioned in Tom's message in this post. With a simple update we were able to solve this issue.
The other problem was with the thread creation process which, as you just said, is a mixed XOTcl + AOLServer problem. The most important thing we've realized is that the creation and destruction process consumes a lot of I/O operations. To improve the I/O performance we tried to change the file system, but it had no effect, due to the most important thing we've found out: DON'T EVER USE VMs IN PRODUCTION SITES.
Our server was based on Xen VMs, and it was impossible to get I/O performance out of virtual machines. The whole thing about it is that there's no virtualization process able to split the blocks and inodes completely when using virtual machines, so all the I/O is shared between the VMs in the cluster. It's a little bit different from what happens with the logical partitions available on some hardware, such as IBM's big servers. In that case the files are completely separated and each partition works with no I/O issues.
Based on this observation, we switched the production environment to a dedicated server with the specifications described in the first post of this thread, and most of the problems with thread creation and destruction are gone.
The next step was to adjust the configuration file. I guess the biggest challenge for everybody using OpenACS is finding the best relation between number of users X maxconnections X requests X maxthreads. This is the part where we are stuck right now. According to this post of Gustaf's on the AOLServer list:
the question certainly is, what "concurrent users" means (...) note, that when talking about "concurrent users", it might be the case that one user has many requests concurrently running. To simplify things, we should talk about e.g. 10 concurrent requests.
Then you need
- at least 10 threads
- at least 10 connections
The number of database connections is more difficult to estimate, since
this is an application (oacs) matter. In oacs, most dynamic requests need
1 or 2 db connections concurrently.
I know this is not like a cookbook, but using his observation we could think of one thread, one connection and 2 db connections for each request. With these parameters, given that the tests issue all the configured requests at the same time, the results mostly follow this logic. When we set maxthreads to 150, there's a little bit of memory degradation when trying to serve 100 simultaneous requests. This degradation goes away when you set minthreads to 80 and maxthreads to 120. What we can take from here is that, with the best performance adjustments, one thread is able to serve one request.
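For reference, a sketch of how these values go into the config file (the parameter names are the same ones used in the full test configuration quoted later in this thread; only the thread numbers reflect the 80/120 setting described above):

    ns_section ns/server/${server}
        ns_param maxconnections 1000   ;# Max connections to put on queue
        ns_param minthreads     80     ;# Tune this to scale your server
        ns_param maxthreads     120    ;# Tune this to scale your server
        ns_param threadtimeout  120    ;# Idle threads die at this rate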
However, when you move these settings to production, there's maybe the most difficult problem: estimating the number of requests per user. Maybe a patch to xotcl-request-monitor could answer this question, but we are still thinking about the best way to do it. We could also see that the number of connections per user is very different from a 1 X 1 relation, and this is another relation for which we are trying to find the best value.
<h4> PostgreSQL Tuning </h4>
All the tests we have performed so far keep the DB and OpenACS on the same box. The goal is to find the performance limit of this setup, so we can then move PostgreSQL off the machine and measure how much better it gets.
There's a lot of material about PostgreSQL on the Internet, and I'm not a specialist myself, but I guess there are some specific observations that can be made.
Everybody says that Internet applications spend most of their time doing SELECT operations in the database. This is a myth. If you consider the number of operations, maybe it is true, but not if you consider execution time and resource usage. The number of TPS (Transactions Per Second) in an application using OpenACS is very large, and that's a good thing. In the old Internet you were creating content for people to see. The new paradigm says the opposite: now the users create the content, and that's why I use OpenACS. There's just no better option on the market for building social and collaborative networks.
Concerning this matter, most of the PostgreSQL tuning involves transaction improvements. I still can't completely understand the pool mechanism that OpenACS uses for PostgreSQL, and I guess some improvement in this area could make my tests better.
The most important thing to adjust here is the shared memory mechanism. We've seen that, if you set too large a number in PostgreSQL and the OS, the memory shared between PostgreSQL and AOLServer can cause the system to crash under stress, and that's not a good thing. The I/O becomes a problem with a large number of INSERT and DELETE operations, mostly because the thread creation process is also heavy for the system.
The conclusion is: if you want the best performance, you really have to split AOLServer and PostgreSQL onto different boxes. The exact point at which to do it (DB size, number of users) is what we are trying to find out.
<h4> OS Tuning </h4>
Maybe this is the most difficult part, because Linux has a lot of options that can be changed. A better analysis of resource usage is necessary so we can get better numbers. Here is a list of parameters we are changing on a Debian GNU/Linux Etch system:
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 27910635520 > /proc/sys/kernel/shmmax
# echo 32707776 > /proc/sys/kernel/shmall
# echo deadline > /sys/block/sda/queue/scheduler
# echo 250 32000 100 128 > /proc/sys/kernel/sem
# cat /proc/sys/fs/file-max
753884
# echo 16777216 > /proc/sys/net/core/rmem_default
# echo 16777216 > /proc/sys/net/core/wmem_default
# echo 16777216 > /proc/sys/net/core/wmem_max
# echo 16777216 > /proc/sys/net/core/rmem_max
# su - postgres
postgres@nodo406:~$ ulimit
753883
There are some other things we are changing as well, such as kernel settings and TCP/IP configuration, but I don't think we have the best adjustments yet, so let's wait a little longer.
That's it. Sorry about the long post, but I guess I should give the community some qualified feedback, considering all the help you guys always give us.
- The last change in /acs-tcl/tcl/pools-init.tcl was 15 months ago. You might have been affected by a change in the default pool order, http://fisheye.openacs.org/browse/OpenACS/openacs-4/etc/config.tcl?r1=1.47&r2=1.48
- i don't see any kind of "XOTcl problem" in this discussion (maybe i am overly sensitive on this question). The only "problem" i see is that xotcl-core writes a message into the error log when aolserver threads exit. They would exit as well without XOTcl being involved.
- setting maxthreads to 150 is very high. We use on our production site (more than 2200 concurrent users, up to 120 views per second) a maxthreads of 100. By using background delivery, maxthreads can normally be reduced. Keep in mind that every connection thread (or thread for scheduled procs) contains a full blueprint of all used packages (these are 5,000-10,000 procs). In other words, for every thread there is a separate copy of all procs in memory. We are talking about 500,000 to maybe more than one million (!) tcl procs when 100 threads are configured. This is not lightweight. When all threads go down (as sketched in my earlier posting), all these procs are deleted and maybe recreated immediately in new threads.
- "difficult problem ... estimate the number of requests per user": The request monitor gives you on the start page the actual number of users and the actual number of requests per minute (in the graph), and the actual views/sec and the avg. view time per user. If you have an average view time (AVT) per user of say 62 seconds, then the views per user (VPU) per minute is VPU = 60/AVT = 0.96.
Therefore, U = 100 users generate U*VPU requests per minute (with the sample values: 96; see the short sketch after this list). Getting estimates of requests per user is just the simple part. More complex is figuring out how many threads one will need for that. One needs the average processing time, which in reality unfortunately depends on how many other requests are currently running (all requests share common resources and are therefore not independent).
- Anyhow, the request monitor also shows you the number of currently running requests, and by observing this value one can estimate the number of required threads. If you have e.g. 10 connection threads configured, and you observe that often 8 or more requests are running, you are already running into limits. Once all connection threads are busy, newly incoming requests are queued, causing further delays. Even simple requests (like e.g. a logo) can take significant time, since the response time for the user is queuing time + processing time. If the queuing time is 10 seconds, every request will take at least 10 seconds. If the machine was not able to process the incoming requests quickly enough with the configured threads in the past, then it is questionable when it will be able to catch up. So, queuing requests should be avoided. This is also one place where background delivery helps, since it frees connection threads much earlier to do other work. On the other hand, if you have e.g. 30 connection threads configured, and you normally see only two or three concurrent requests, you are most probably wasting resources.
- i do not agree with the conclusion "for best performance, you really have to split AOLServer and PostgreSQL" onto different servers. It depends on the machines. Maybe you are correct with dual or quad-core Intel processors. If the database and aolserver are on the same machine, and you have enough memory and enough CPU power (e.g. 8 cores or more) and your machine has a decent memory throughput/latency, then local communication between database and aolserver is significantly faster. There are situations where performance is better when database and aolserver are on the same machine, even if the server is very busy (as on our learn@wu system). Certainly, it depends on the hardware and the usage patterns.
- concerning running in VMs: i completely agree, for a busy production site using VMs is not a good idea - at least not for beginners (tuning the VMs and the hosting environment can help a lot, but there is yet another layer to fiddle around with). Usually you have only one cpu visible from inside the VM, so using multiple CPUs (one strength of aolserver + postgres) won't help, and scalability is limited. This is one experience we are currently going through with openacs.org.
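To make the arithmetic from the "requests per user" item above concrete, here is a small Tcl sketch (the 0.5 s average processing time is an assumed figure for illustration, not a measured value):

    set users 100
    set avt   62.0                       ;# average view time per user (seconds)
    set vpu   [expr {60.0 / $avt}]       ;# views per user per minute, ~0.96
    set rpm   [expr {$users * $vpu}]     ;# requests per minute, ~96

    # rough thread estimate: average concurrency = arrival rate * processing time
    set avg_processing 0.5               ;# assumed average processing time (s)
    set concurrent [expr {($rpm / 60.0) * $avg_processing}]
    ns_log Notice "requests/min: $rpm, average concurrent requests: $concurrent"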
have to rush
-gustaf
It would certainly be an interesting project to build something like OpenACS based on a P2P network and DHTs (cassandra, bigtable, ... coral, cfs, oceanstore, ... pier), but this would be a different project, since some basic assumptions about the data storage are different.
Sure, DHT is just the basic workhorse; most of the other examples also have quite different properties than e.g. cassandra. What is the problem with the term "project"? xotcl-core and xowiki set out to be compatible with the acs data model. There is nothing wrong in having a project developing an acs package based on p2p technology. Many people will appreciate it. From the scope of the work, it is a project.
Now, an interesting argument would be the one that makes the case for the use of ACID properties for content-oriented applications and semi-structured data.
this is a very interesting discussion.
Gustaf, you said: "by using background delivery, maxthreads can be normally reduced." Can you explain a little bit more how you use background delivery? What's a good case for background delivery, and what isn't? Can you point to any examples?
many thanks
Brian
You can do the same thing for resources (css, javascript, images, graphics) using a reverse proxy. The idea is that you should not tie up threads loaded with the full blueprint of your installation for these tasks.
The way you use background delivery is similar to ns_returnfile, i.e.
ad_returnfile_background 200 [ns_guesstype ${filename}] ${filename}
Though, you need to make sure that you have tthread installed and the background delivery code that Gustaf wrote.
PS. The NaviServer project went the extra mile of serving requests by loading the procs dynamically --- only those required for a given request. My understanding is that the xorb package does something similar but at a higher level of abstraction.
neophytos, i think you meant tclthreads and not "tthread" (which sounds close to zoran's ttrace). The c-level code for background delivery is part of naviserver and of aolserver 4.5 in the cvs head version. You find patches for aolserver 4.0 and plain 4.5 in the wiki.
xorb is very different from ttrace and is no replacement. ttrace can help to make threads thinner, at the cost of making introspection more complex/wrong. ttrace does not (at least did not) depend on any naviserver code; Zoran wrote it for aolserver. ttrace has XOTcl support and vice versa.
The primary argument in the discussion above: assume you have 10 connection threads configured and you want to serve some larger files over slower lines (600 MB files, delivery might take 2 minutes per request). If 10 people download these files via connection threads at the same time, the server won't be able to serve any other requests for these 2 minutes, since all connection threads are busy; incoming requests will be queued. When you have bgdelivery installed, the connection thread is only used for permission checking and locating the file on disk (usually a few milliseconds). This leads to much better scalability.
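As an illustration of that pattern, here is a minimal sketch of a download page (the file-path lookup is a hypothetical placeholder; permission::require_permission and ns_guesstype are standard OpenACS/AOLServer calls, and ad_returnfile_background comes from xotcl-core as discussed below):

    ad_page_contract {
        Delivers a large file without keeping a connection thread busy
        for the whole transfer time.
    } {
        file_id:naturalnum
    }

    # permission check and file lookup still happen in the connection thread
    permission::require_permission -object_id $file_id -privilege read
    set filename [my_lookup_file_path $file_id]   ;# hypothetical helper

    # hand the actual delivery over to the background delivery thread
    ad_returnfile_background 200 [ns_guesstype $filename] $filename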
about tclthreads: The package on TCL's file distribution area (on sf) is called thread-2.6.5 (not sure if that's what you mean). That package also includes ttrace, yes.
about ttrace: NaviServer has it integrated, or makes use of it, to provide tracing as an option. I haven't used it due to some problems that existed between that mechanism and XOTcl back when I tried it (as soon as it was released). I remember sending you a message about that, but I am not sure what the current status is.
about xorb: yes, I did not mean to imply that it is a replacement but it does serialize/deserialize code from a single thread and it also makes threads thinner, right? (just asking)
"The c-level code for background delivery is part of naviserver and aolserver 4.5 in the cvs head version. You find patches for aolserver 4.0 and plain 4.5 in the wiki." If you use NaviServer, it's already there.
Good to see you on the forums again.
Ok, I got it - it's a drop-in replacement for ns_returnfile. Very useful to know about.
thanks,
Brian
ad_returnfile_background is part of xotcl-core and has the following requirements:
- xotcl-core (sure, if it is defined there)
- one needs either a "new" aolserver, or an "old" one with the required patches (see discussion above and Boost...), and
- one has to have libthread installed (tcl thread library, as described in the request-monitor wiki page)
Without these requirements one can still call ad_returnfile_background, but this will just call ns_returnfile.
Hope this helps -gustaf neumann
Your considerations are very helpful, as they always are. I now have a more general view of the threads X requests behavior. I'll try to comment based on what you've posted.
- When I said there was an XOTcl problem, I was talking about how the destroy process can be painful depending on the OS + hardware combination. But as you put it better, it's not really an XOTcl problem, it's more of an AOLServer problem. I guess we agree here.
- I guess I could finally understand what the threads are meant to do. If each thread holds its own copy of all the procs in the system, I can see how 150 is too large a number. Thank you very much for this explanation. I was seeing in production an error saying that maxconnections was exceeded, even with machine resources available. Your explanation shows me that there were a lot of threads queued waiting for a large download to finish, and background delivery could be a solution for that. I'll think about more testing on that and I'll give everybody feedback about it.
- About using PostgreSQL and AOLServer in the same box: I had seen in this post here that you run learn@wu on IBM logical partitions. Is that right? In that case you have no degradation, and the system allows you to do everything in the same box without I/O issues. If you consider that a local socket connection is quite a bit faster than a TCP/IP one, it's probably a good idea to use the same box. Otherwise, I guess the split is the better option.
That's it. Thank you for the help.
even the mentioned problem (where many threads terminate at the same time) is gone with the current head version of aolserver.
i am not sure about your question in the last paragraph. Yes, our production server for learn@wu (aolserver + postgres 8.2.*) runs in a single partition of the machine, using 6 of the 8 processors. We still have the same hardware as at the time of my posting with the figures (2005, earlier in the same forum thread), but we are now up to more than 2000 concurrent users, more than 60,000 learning resources, etc., and still below 0.4 seconds per view. The second partition of the server is used for development. The logical partitions do not help to make anything faster and are more or less irrelevant to this discussion. We tested configurations with postgres or aolserver on other machines, but these setups were much slower.
As said earlier, if you run out of resources (e.g. cpu power), moving the database to a different (similar) server will certainly help. If one already has enough CPU and memory bandwidth, putting the server on a different machine will slow things down.
Now I've got what you said. You see, IBM has an OS split mechanism called LPAR, or logical partitions, as explained here. Using this feature is like having a VM, but without the software and hardware limitations that make it impossible to split the I/O completely between all the machines you have. I thought that was the case for your production system, which you now say it's not.
If you consider that a local socket connection is quite a bit faster than a TCP/IP connection, I can see that having both the DB and AOLServer in the same box can be faster if you have enough available resources.
Concerning our tests, however, something is showing up that can bring more issues into this matter. The Linux kernel has a parameter that sets the maximum amount of shared memory applications can use on the system, located at /proc/sys/kernel/shmmax. This parameter controls, for example, the amount of memory PostgreSQL can use for its shared buffers. In our test cases, we had the following set of parameters:
## AOLServer parameters
ns_section ns/threads
    ns_param mutexmeter true               ;# measure lock contention
    # The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
    ns_param stacksize [expr 1 * 8192 * 256]

ns_section ns/server/${server}
    ns_param maxconnections 1000           ;# Max connections to put on queue
    ns_param maxdropped 0
    ns_param maxthreads 300                ;# Tune this to scale your server
    ns_param minthreads 200                ;# Tune this to scale your server
    ns_param threadtimeout 120             ;# Idle threads die at this rate
    ns_param globalstats false             ;# Enable built-in statistics
    ns_param urlstats true                 ;# Enable URL statistics
    ns_param maxurlstats 1000              ;# Max number of URL's to do stats on
ns_section ns/db/pool/pool1
    # ns_param maxidle 0
    # ns_param maxopen 0
    ns_param connections 200
    ns_param verbose $debug
    ns_param extendedtableinfo true
    ns_param logsqlerrors $debug
    if { $database == "oracle" } {
        ns_param driver ora8
        ns_param datasource {}
        ns_param user $db_name
        ns_param password $db_password
    } else {
        ns_param driver postgres
        ns_param datasource ${db_host}:${db_port}:${db_name}
        ns_param user $db_user
        ns_param password ""
    }

ns_section ns/db/pool/pool2
    # ns_param maxidle 0
    # ns_param maxopen 0
    ns_param connections 150
    ns_param verbose $debug
    ns_param extendedtableinfo true
    ns_param logsqlerrors $debug
    if { $database == "oracle" } {
        ns_param driver ora8
        ns_param datasource {}
        ns_param user $db_name
        ns_param password $db_password
    } else {
        ns_param driver postgres
        ns_param datasource ${db_host}:${db_port}:${db_name}
        ns_param user $db_user
        ns_param password ""
    }

ns_section ns/db/pool/pool3
    # ns_param maxidle 0
    # ns_param maxopen 0
    ns_param connections 150
    ns_param verbose $debug
    ns_param extendedtableinfo true
    ns_param logsqlerrors $debug
    if { $database == "oracle" } {
        ns_param driver ora8
        ns_param datasource {}
        ns_param user $db_name
        ns_param password $db_password
    } else {
        ns_param driver postgres
        ns_param datasource ${db_host}:${db_port}:${db_name}
        ns_param user $db_user
        ns_param password ""
    }
## Kernel Parameters:
/proc/sys/kernel/shmall    2097152
/proc/sys/kernel/shmmax    2156978176
## PostgreSQL Parameters:
max_connections = 1000        # (change requires restart)
shared_buffers = 2000MB       # min 128kB or max_connections*16kB
work_mem = 1MB                # min 64kB
max_stack_depth = 5MB         # min 100kB
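As a quick sanity check of these two numbers (a sketch; note that PostgreSQL allocates shared memory for more than just shared_buffers, e.g. lock tables and WAL buffers, so the real headroom is even smaller):

    set shared_buffers [expr {2000 * 1024 * 1024}]   ;# 2000MB = 2097152000 bytes
    set shmmax         2156978176                    ;# /proc/sys/kernel/shmmax
    puts "headroom: [expr {$shmmax - $shared_buffers}] bytes"   ;# only ~57 MB left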
With these values, the memory was corrupted and the database crashed under the stress test, which means that the PostgreSQL process shut down all connections and killed the postmaster process, leaving the server showing a message and neither shutting down nor coming back.
With all the information I have now, I can see this was probably caused by the huge number of threads I configured. The new threads were trying to get more and more memory, and the OS just couldn't give it. As both AOLServer and PostgreSQL were theoretically using the same memory area, this caused the DB to corrupt and kill the process. If they were split onto different servers maybe this wouldn't happen, and it made us think that different servers are always the best solution. No matter what you do wrong and what happens to the server, an AOLServer crash will not kill the DB process, and the DB will not kill AOLServer either. This could be the safer choice for this case.
Well one has to be more precise. One has to distinguish between a local socket (IPC socket, unix domain socket) and a networking socket (e.g. TCP socket). Local communication is faster (at least it should be) than remote communication.
If you consider that a socket connection is quite faster than a TCP/IP connection ...
As discussed above, the minthreads/maxthreads values of the configuration are not reasonable (not sure what you were trying to simulate/measure/...). I have certain doubts that the memory was "corrupted", but i have no problem believing that several processes crashed on your system when it ran out of memory. For running large applications one should become familiar with system monitors that measure resource consumption. This way one can spot shortages and adjust the parameters in reasonable ways.