Forum OpenACS Development: Re: NaviServer "breaks" under high load
First of all, unless you are running the application on a Raspberry Pi, ~60 users is no load to be concerned about on a reasonable machine; we have servers with several thousand active users (clicking within a time window of 2 minutes). ... But users are not users: when these users fire, e.g., one request per second that takes a minute each, every server will run out of resources soon. So, in order to assess the situation of this server properly, more details are needed.
Concerning the configuration parameters:
- maxconnections: having maxthreads higher than maxconnections never makes sense; NaviServer should warn you about this on startup.
- maxthreads: 128 is quite a large value! On the setup with 2000 concurrent users, we have minthreads and maxthreads of the default pool set to 25. Make sure to have enough DB connections configured in your database setup. How many cores are available in this setup?
- connsperthread: unless you have major memory leaks in the application, increase connsperthread to 10000; this is not relevant for the issue, though.
- database: you might run out of DB connections (threads will hang around idle, waiting for a DB connection), or the queries are simply too slow or too numerous.
- slow clients: when NaviServer is not properly configured, slow client connections can tie up connection threads for a long time during uploads and downloads. This will not happen when spooler and writer threads are properly configured.
- setup without multiple thread pools: if a site admin knows that there are several very slow requests (each taking multiple seconds), the recommended setup is to use multiple connection thread pools, such that the slow requests are directed to dedicated pools without harming other users (see the sketch after this list).
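A minimal config.tcl sketch of such a setup (the pool name "slow" and the mapped URL are invented here; check the sample configuration of your NaviServer version for the exact parameters):

    ns_section ns/server/${server}/pools
    ns_param slow "pool for slow requests"

    ns_section ns/server/${server}/pool/slow
    ns_param minthreads 2
    ns_param maxthreads 5
    # requests matching this pattern are served from the "slow" pool
    ns_param map "GET /some-slow-report"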
The avgwaittime is the average time for obtaining a db-handle, and avgsqltime is the average time of an SQL query.
The same page also provides information about the configured pools, e.g. for the default connection thread pool.
When the queue time goes up, you should become alert. A high filter time is an indicator of permission problems; the average runtime gives hints about the need for splitting connection pools.
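The raw counters behind that page can also be queried directly, e.g. from ds/shell; a minimal sketch, assuming a NaviServer version whose "ns_server stats" output contains these keys:

    # aggregated counters of the default connection thread pool;
    # nsstats computes its averages from values like these
    set s [ns_server stats]
    foreach key {queued queuetime filtertime runtime} {
        ns_log notice "default pool $key: [dict get $s $key]"
    }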
Below you see information about multiple connection pools, where e.g. the "monitor" pool is used on this site for monitoring applications (like munin) and for admins to adjust the setup at runtime when necessary (ds/shell, etc.).
Hope this helps!
~60 users is no load to be concerned about
I've read about your impressive WU-Wien installation...
However, with ~2,500 "total users" (using ]po[ at least once a week for hour logging, as opposed to the ~60 "active users" within the 2-minute interval), the number of "activities" (projects, tickets and tasks) also goes up more or less linearly. Most pages run algorithms of at least n*log(n) complexity, so the overall load rises at least with n^3*log(n) in the number of users, probably even faster.
Right, this didn't make sense...
OK. I've now set maxthreads = minthreads = 32 and highwatermark=100 in order to disable dynamic thread creation.
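In config.tcl terms, this corresponds to something like the following (a sketch, not the literal file):

    ns_section ns/server/${server}
    # equal min/max: the number of connection threads stays constant
    ns_param minthreads 32
    ns_param maxthreads 32
    # create additional threads only when the queue is completely full
    ns_param highwatermark 100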
The server has 8 physical cores. The "hang" occurred while threads were being re-spawned, if I understood it right. So just disabling dynamic thread creation might fix the situation.
We've got 15, 5 and 5 for pools 1, 2 and 3 respectively.
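I.e. along these lines in the config.tcl (pool names as in the standard OpenACS configuration):

    ns_section ns/db/pool/pool1
    ns_param connections 15
    ns_section ns/db/pool/pool2
    ns_param connections 5
    ns_section ns/db/pool/pool3
    ns_param connections 5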
I'll have a detailed look at DB-connection stats next Friday.
It seems "hangs" have occurred so far only Fridays, because users seem to log hours for the week on Friday. Logging hours initiates some complex cache maintenance PG-triggers: Logged hours will create "cost items" which are attached to the respective activity. These items are rolled-up (aggregated) the project-activity-hierarchy towards the main project as part of a cache for the most important financial figures per project.
The server keeps on "hanging".
I've found that my customer has added "DB queries in a loop" to the ]po[ page for timesheet logging, so this page now takes 3000 ms instead of 90 ms. That explains a bit of the pain.
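Roughly, the anti-pattern looks like this (table and variable names are invented; OpenACS db API):

    # one PostgreSQL round-trip per project: N queries instead of one
    foreach project_id $project_ids {
        lappend costs [db_string get_cost "
            select sum(amount) from im_costs where project_id = :project_id
        " -default 0]
    }
    # the fix would be a single aggregate query with "group by project_id"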
It would be kind of OK if the server were just slow. However, this does not explain the "hanging" to me.
=> Is it possible that the system hangs because of "recursive" queries/sub-queries?
I've installed nsstats 1.7 on my server, but I don't see the "db-pools" option. And the "Process" option is giving me an error: "can't read stats(tracetime): no such element in array". Any idea how I can fix this? We're running:
    NaviServer/4.99.8
    Built: Nov 19 2015 at 22:18:45
    Tcl version: 8.5
I have just noticed the server "hanging" without any activity, with one user on the system apart from myself. So "overload" is not the right term...
When you run such an old server (4.99.8 is 3.5 years old), it is likely that you have just a single connection thread pool. With queries in the range of 3 seconds and a single connection thread pool configured, it is easy to block this pool completely, leading to a queuing situation (which in the user's perception is a "hang"). The feature of dynamic connection thread pool mapping was introduced with NaviServer 4.99.15 early last year. With it, one can map slow requests dynamically to a pool of their own, where such requests might pile up, but they don't block other traffic.
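With a recent version, a slow page can then be moved to its own pool at runtime (e.g. from ds/shell); a sketch, assuming a pool named "slow" is configured and using an invented URL:

    # NaviServer 4.99.15 or newer: dynamic connection thread pool mapping
    ns_server -pool slow map "GET /intranet-timesheet2/hours/new"
    # list the configured connection thread pools
    ns_server pools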
My recommendation is to update NaviServer to a recent version and use dynamic thread pool mapping. Concerning the exceptions from nsstats: the NaviServer modules are released in concert with matching versions in the *modules* directories. There is some tolerance in backward compatibility, but apparently it does not reach back that far.
If the mapping is not sufficient and there are further configuration issues, the newer, more detailed statistics can provide more insight.
My last set of changes seemed to have worked, but the problems re-surfaced.
Do I have any other option to get statistics on the DB pools?
Yesterday I increased the DB connections in the config.tcl (8 physical cores).
So again, the server seems to be running fine today.
Thanks for your help!
other option to get statistics on the DB pools?
According to the NEWS file of NaviServer, "ns_db stats" was introduced in NaviServer 4.99.9 (Jan 2016).
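So after an update to 4.99.9 or newer, DB-pool statistics are available without nsstats as well, e.g. from ds/shell (a sketch; the exact output format may differ between versions):

    # names of the configured database pools
    ns_log notice "pools: [ns_db pools]"
    # per-pool counters (see the NaviServer documentation for the exact fields)
    ns_log notice "db stats: [ns_db stats]"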