Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.

Hi Marty,

wondering ... is there a setting to only map a request to the slow pool if it takes over 30 or 40 seconds?

Well, there should be absolutely NO request processed in the foreground taking longer than, say, 10 seconds. Currently, the various thresholds in the request monitor (mapping to the slow pool, warnings for queue times and filter times, ..., mapping to color schemes) are hard-coded, based on best practice from several large systems, but it is certainly possible to make these configurable in the future. The threshold for slow-pool mapping is 3 seconds. Note that if your server has many such requests (which can't be made faster), one should define a larger "slow" pool. Requests in the "slow" pool are processed with the same efficiency as requests in a "fast" pool ("slow" and "fast" are just names); you can give the slow pool as many threads as you like, as sketched below.
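For illustration, such a pool is declared in the NaviServer configuration file roughly as follows (a sketch only; the description and thread counts are examples to adapt):

   ns_section ns/server/${server}/pools
   ns_param   slow   "Slow requests"

   ns_section ns/server/${server}/pool/slow
   ns_param   minthreads 15   ;# keep minthreads == maxthreads
   ns_param   maxthreads 15   ;# size to the number of concurrent slow requests you expect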

I was wondering if you see any problems with the configuration and if you had suggestions for changes.

You have currently:

            min max
   default  200 200
   monitor    3   5
   image      3   5
   slow      15  15
   fast       6   6

200 min threads and 200 max threads for the default pool is a lot, but certainly, the right values depend on the application and on the machine you are using. When your jobs are CPU-intensive, concurrent jobs can slow each other down more than the added concurrency helps. Many processors reduce their clock rate under load (when many cores are busy, the temperature in the CPU goes up and the clock rate is throttled). Choosing the right CPU type is not so easy; often deactivating hyper-threading speeds things up (don't believe your vendor), etc. ... but this is a long story and leads away from your question.

When the DB and nsd run on the same machine, don't forget that the DB also needs a lot of resources (CPU and memory). Every (active) connection thread requires a DB connection, so the resources available per SQL query (Postgres process) shrink, the SQL queries become slower, etc.

My recommendation is: if you have multiple connection pools, and the default pool is responsible for most of the hard work, don't make it much larger than the number of cores you have. When you have long-running queries, check first whether you can handle these via spooling or background processing (see the sketch below).
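One option for background processing (a sketch only; the queue name "background" and the proc ::myapp::generate_report are made up for illustration) is NaviServer's ns_job:

   # Create the job queue once at server startup (e.g. in an -init.tcl file),
   # with up to 2 worker threads.
   ns_job create -desc "long running work" background 2

   # In the request: queue the expensive work detached, so the connection
   # thread is released instead of being blocked for the full duration.
   ns_job queue -detached background {
       ::myapp::generate_report
   }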

From your configuration, I see a pool "fast", which is probably unused (and can be removed), since there is no mapping defined. I would also recommend configuring the "image" pool with minthreads = maxthreads. You should see from the per-pool statistics how many threads you need. Here is the output from our "fast" pool, which is used for images (with the given load, 6 threads can serve 210 requests per second in this pool); a mapping sketch follows after the statistics.

Connection Threads: min 6 max 6 current 6 idle 6 stopping 0 waiting 0 started 23 maxconnections 600
Request Handling:   requests 194.5K, queued 149 (0.08%), spooled 124K (63.55%), dropped 0 possible-max-reqs 210.5rps
Request Timing: avg queue time 239.75µs, avg filter time 4.69ms, avg run time 22.1ms avg trace time 1.72ms
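For the "image" pool, the static mapping and sizing would look roughly like this in the configuration file (a sketch; the extensions and thread counts are just examples):

   ns_section ns/server/${server}/pool/image
   ns_param   minthreads 6
   ns_param   maxthreads 6
   ns_param   map "GET /*.png"
   ns_param   map "GET /*.jpg"
   ns_param   map "GET /*.gif"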

We have on our production environment currently:

            min max
   default   60  60
   bots       3   3
   monitor    2   4
   fast       6   6
   slow      15  15

What i can see from your figures is that the filter time is very high, already for 150 VUs:

Request Timing: avg queue time 161.33µs, avg filter time 78.53ms, avg run time 212.77ms avg trace time 972.63µs

Typically, the filters are responsible for authorization and setup ... but seeing 78ms there, and 1.7s for 300 VUs, makes me wonder what's going on. Maybe you are running out of DB connections (as configured in the NaviServer configuration file), or you are running out of SQL connections on the DB side of your configuration.

Check out the section DB-Pools in nsstats. It should look like the snippet below.

pool1:  statements 99M gethandles 2.8M handles 170 connected 64 used 135 waittime 3.872729 sqltime 82712.668212 avgwaittime 1.4µs avgsqltime 835.8µs
pool2:  statements 0 gethandles 2 handles 2 connected 0 used 1 waittime 0.002325 sqltime 0.000000 avgwaittime 1.2ms
pool3:  statements 0 gethandles 2 handles 2 connected 0 used 1 waittime 0.002555 sqltime 0.000000 avgwaittime 1.3ms

If avgwaittime is high, that is your problem. Newer OpenACS versions don't use pool2 and pool3 (which were used for nested SQL calls and were troublesome for rollbacks).
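If the handles turn out to be the bottleneck, the two knobs to look at are the NaviServer db pool and the limit on the DB side (a sketch with example numbers; max_connections must cover the sum of all pools plus other clients):

   # NaviServer configuration file
   ns_section ns/db/pool/pool1
   ns_param   connections 70   ;# handles available to the connection threads

   # postgresql.conf on the DB side
   #   max_connections = 100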

Hope this helps!

-g

Thanks Gustaf,

You were right, the avgwaittime on pool1 was high:

pool1:	statements 1.1M gethandles 8.6K handles 15 connected 0 used 15 waittime 14693.460994 sqltime 1724.604281 avgwaittime 1.7s avgsqltime 1.5ms

The pool1 connections were still at 15. I increased them to 70 and brought our maxthreads down to 60.

Here is my test with this configuration

# 48 cores   200 Gig RAM
#   maxconnections 2000
#   maxthreads 60
#   minthreads 60
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true
#    image pool 6/6
# ns_section ns/db/pool/pool1
	ns_param        connections        70

150 VUs
Connection Threads: min 60 max 60 current 60 idle 33 stopping 0 waiting 0
Request Handling:   requests 2.4K, queued 0 (0.00%), spooled 2K (64.39%)
Request Timing:     avg queue time 167.94µs, avg filter time 4.87ms, avg run time 231.84ms avg trace time 996.02µs

pool1: statements 288.5K gethandles 2.6K handles 70 connected 5 used 34 waittime 0.477450 sqltime 447.052551 avgwaittime 186.3µs avgsqltime 1.5ms

300 VUs
Connection Threads: min 60 max 60 current 60 idle 18 stopping 0 waiting 0
Request Handling:   requests 7.4K, queued 2K (21.37%), spooled 5K (64.86%)
Request Timing:     avg queue time 87.8ms, avg filter time 8.85ms, avg run time 333.63ms avg trace time 3.23ms

pool1: statements 923.4K gethandles 7.6K handles 70 connected 17 used 61 waittime 0.998778 sqltime 1997.236181 avgwaittime 132.1µs avgsqltime 2.2ms

150 VUs
Connection Threads: min 60 max 60 current 60 idle 7 stopping 0 waiting 0
Request Handling:   requests 11.3K, queued 3K (29.84%), spooled 7K (66.01%)
Request Timing:     avg queue time 324.46ms, avg filter time 9.84ms, avg run time 416.1ms avg trace time 3.87ms

pool1: statements 1.5M gethandles 11.5K handles 70 connected 7 used 62 waittime 1.052769 sqltime 3914.273814 avgwaittime 91.8µs avgsqltime 2.5ms

70 VUs
Connection Threads: min 60 max 60 current 60 idle 57 stopping 0 waiting 0
Request Handling:   requests 13K, queued 3K (25.79%), spooled 9K (66.52%)
Request Timing:     avg queue time 280.5ms, avg filter time 9.11ms, avg run time 414.92ms avg trace time 3.51ms

pool1: statements 1.8M gethandles 13.2K handles 70 connected 57 used 62 waittime 1.054629 sqltime 4429.716496 avgwaittime 79.6µs avgsqltime 2.4ms

K6 Summary

running (4m09.2s), 000/300 VUs, 2225 complete and 0 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

✗ status is 200
  ↳ 99% — ✓ 8889 / ✗ 1
✗ page succeeded
  ↳ 99% — ✓ 8886 / ✗ 4

checks.........................: 99.97% ✓ 17775 ✗ 5
data_received..................: 2.1 GB  8.3 MB/s
data_sent......................: 11 MB   44 kB/s
http_req_blocked...............: avg=2.35ms   min=1.27µs  med=4.83µs   max=1.85s    p(90)=4.9ms    p(95)=7.77ms
http_req_connecting............: avg=762.21µs min=0s      med=0s       max=1.84s    p(90)=340.97µs p(95)=392.74µs
✓ http_req_duration............: avg=699.68ms min=4.73ms  med=43.58ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
  { expected_response:true }...: avg=699.74ms min=4.73ms  med=43.59ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
http_req_failed................: 0.00%  ✓ 1     ✗ 13342
http_req_receiving.............: avg=927.2µs  min=42.66µs med=167.63µs max=1.63s    p(90)=1.68ms   p(95)=2.29ms
http_req_sending...............: avg=32.23µs  min=7.95µs  med=28.46µs  max=472.34µs p(90)=58.33µs  p(95)=66.47µs
http_req_tls_handshaking.......: avg=1.56ms   min=0s      med=0s       max=37.52ms  p(90)=4.48ms   p(95)=7.27ms
http_req_waiting...............: avg=698.72ms min=4.45ms  med=43.41ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
http_reqs......................: 13343  53.54756/s
iteration_duration.............: avg=14.18s   min=1.74µs  med=12.87s   max=22.24s   p(90)=18.83s   p(95)=19.99s
iterations.....................: 2225   8.929275/s
vus............................: 1      min=1   max=300
vus_max........................: 300    min=300 max=300

These numbers look a lot better! If you see anything that seems amiss let me know. We appreciate all of your help and expertise.

Sincerely,
Marty