Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.

It was our understanding that the new xotcl may take care of some of the long runners automatically, without us defining a specific thread pool and telling xotcl about it. Is this not a good assumption?

Well, newer versions of the xotcl-request monitor can automate some of these steps, but they need configuration. The xotcl-request monitor can map requests to a "slow" pool when such a pool has been defined by the site admin. The site admin also has to configure this connection pool (e.g. min/max threads, rejectoverrun, queue size, ...). From the snippet of the configuration files, it seems that you have no "slow" pool defined, so no automatic mapping will happen.
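
For reference, a minimal sketch of such a "slow" pool definition in the NaviServer configuration file could look like the following (the thread and queue numbers are placeholders, not a recommendation; tune them to your hardware and load):

ns_section ns/server/${server}/pools
ns_param        slow            "Slow requests"

ns_section ns/server/${server}/pool/slow
ns_param        minthreads      15      ;# minimal number of connection threads in this pool
ns_param        maxthreads      15      ;# maximal number of connection threads in this pool
ns_param        maxconnections  100     ;# number of allocated connection structures (queue size) for this pool
ns_param        rejectoverrun   true    ;# reply with "503 service unavailable" on queue overruns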

From the output I see that the default pool has minthreads 2 and maxthreads 10. I would recommend defining 25/25 on your testing instance, based on:

live 24 cores and 80 GB of RAM, staging 16 cores and 200 GB
As a rough estimate, one can run about 96 connection threads with 24 cores (assuming typical OpenACS loads) and about 64 with 16 cores.
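
For illustration, setting 25/25 would be a change of the two thread parameters in the server section of the configuration file (a sketch; adjust the numbers based on the estimate above):

ns_section ns/server/${server}
ns_param        minthreads      25      ;# minimal number of connection threads (kept running)
ns_param        maxthreads      25      ;# maximal number of connection threads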

When looking at the 140 VUs output I see

Request Handling:	requests 1.7K, queued 1K (61.48%), spooled 1K (62.24%)

The number of queued requests is way too high. On our production system (typically between 600 and 3000 concurrent users) we typically have less than 0.01% queued requests in the default pool.

Request Timing:	avg queue time 660.56ms, avg filter time 5.44ms, avg run time 218.25ms avg trace time 1.31ms

You can see here that the avg queue time (660 ms, the time a request waits before being served) is much longer than the processing time (avg run time). On our production system (currently 650 users) we have now

avg queue time 465.04µs, avg filter time 4.18ms, avg run time 104.08ms avg trace time 4.27ms

the output

[13/Sep/2021:16:00:16][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
tells you that the queue is filled up (100 entries). The queue size is defined per pool via "maxconnections"; on our live system this value is set to 600. One can fine-tune the behavior of queue overruns via "rejectoverrun". A larger queue makes the system more forgiving when there are occasional large request bursts.
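
For illustration, these two parameters are set in the server (default pool) section of the configuration file; a sketch using the value from our live system (tune the queue size to your expected bursts):

ns_section ns/server/${server}
ns_param        maxconnections  600     ;# number of allocated connection structures (queue size)
ns_param        rejectoverrun   true    ;# if true, send "503 service unavailable" on queue overruns (request is not queued)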

While the values above are average values, the values below (long-calls) are from individual requests:

14.371059	0.003934	1.550067	15.925060	2021 Sep 13 16:00:30

The columns are: queue time, filter time, run time, total time, followed by the timestamp of the request.
The request above waited about 14 seconds in the queue before being served, whereas the actual run time was about 1.5 seconds. When you get rid of the queuing time, the response time improves by roughly a factor of 10 (about 15.9 s total vs. 1.55 s run time). Increasing the number of threads as indicated above should therefore help significantly.

For this test, setting up the "slow" pool will not improve the benchmark times, since the purpose of the pools is to make sure that "slow" requests do not starve the "fast" requests on your server. This is important for keeping the server responsive when the majority of requests are "fast" and only a few are "slow": you do not want someone firing e.g. 100 "slow" requests to bring your server to its knees.

Increasing maxthreads from 10 (the default) to e.g. 50 should increase the throughput by a factor of 5.

Hope this helps,
-gn

Thanks Gustaf,

We appreciate your expertise in helping us tune our system. It has helped a lot!

I was successful in setting up the "slow" pool. I was wondering whether there is a setting to map a request to the slow pool only if it takes more than, say, 30 or 40 seconds. I found that it started to put our effort-reporting page on the slow pool when the system was a little stressed, so I removed the slow pool for the following tests for now.

I have been playing around with the configuration, and we have decided to go up to 48 cores on our staging system with the 200 GB of RAM.

Here is what I found to be somewhat optimal for the 48 cores.

We are not running out of available connections anymore, and the number of queued requests is a lot lower, as are the avg queue times. The k6 tests show that all HTTP requests return status 200 as well.

I was wondering if you see any problems with the configuration and if you had suggestions for changes.

Again, thanks for your help

Marty
ns_section ns/server/${server}
ns_param        maxconnections  2000    ;# 100; number of allocated connection structures
ns_param        maxthreads      200     ;# 10; maximal number of connection threads
ns_param        minthreads      200     ;# 1; minimal number of connection threads

ns_param        connsperthread  10000   ;# 10000; number of connections (requests) handled per thread
                                        ;# Setting connsperthread to > 0 will cause the thread to
                                        ;# graciously exit, after processing that many
                                        ;# requests, thus initiating kind-of Tcl-level
                                        ;# garbage collection.

# ns_param      threadtimeout   120     ;# 120; timeout for idle threads.
                                        ;# In case, minthreads < maxthreads, threads
                                        ;# are shutdown after this idle time until
                                        ;# minthreads are reached.

# ns_param      lowwatermark    10      ;# 10; create additional threads above this queue-full percentage
ns_param        highwatermark   100     ;# 80; allow concurrent creates above this queue-is-full percentage
                                        ;# 100 means to disable concurrent creates
ns_param        rejectoverrun   true    ;# If set to true, NaviServer sends on queue overruns an error "503 service unavailable".
                                        ;#  The request is not added to the waiting list.

# Compress response character data: ns_return, ADP etc.
#
ns_param        compressenable  off     ;# false, use "ns_conn compress" to override

ns_section ns/server/${server}/pools
ns_param        monitor         "Monitoring actions to check the health of the system"
ns_param        image           "Image file requests"
#ns_param        slow            "Slow requests"
ns_param        fast            "Fast requests"
ns_section ns/server/${server}/pool/monitor
ns_param        minthreads 3
ns_param        maxthreads 5

ns_param        map "GET /admin/nsstats"
ns_param        map "GET /SYSTEM"
ns_param        map "GET /ds"
ns_param        map "POST /ds"
ns_section ns/server/${server}/pool/image
ns_param        minthreads 3
ns_param        maxthreads 5

ns_param        map "GET /*.png"
ns_param        map "GET /*.jpg"
ns_param        map "GET /*.PNG"
ns_param        map "GET /*.JPG"
ns_param        map "GET /*.JPEG"
ns_param        map "GET /*.jpeg"
ns_param        map "GET /*.mp4"
ns_param        map "GET /*.MP4"
ns_param        map "GET /*.pdf"
ns_param        map "GET /*.PDF"
ns_param        map "GET /*.gif"
ns_param        map "GET /*.GIF"

#ns_section ns/server/${server}/pool/slow
#ns_param        minthreads 15
#ns_param        maxthreads 15

ns_section ns/server/${server}/pool/fast
ns_param        minthreads 6
ns_param        maxthreads 6

# Test with 48 CPUs, 200 GB RAM
#   maxconnections 2000
#   maxthreads 200
#   minthreads 200 
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true



150 VUs
Connection Threads:	min 200 max 200 current 200 idle 151 stopping 0 waiting 0
Request Handling:	requests 1.8K, queued 0 (0.00%), spooled 1K (63.55%)
Request Timing:	avg queue time 161.33µs, avg filter time 78.53ms, avg run time 212.77ms avg trace time 972.63µs

300 VUs
Connection Threads:	min 200 max 200 current 200 idle 157 stopping 0 waiting 0
Request Handling:	requests 5.4K, queued 508 (9.46%), spooled 4K (66.81%)
Request Timing:	avg queue time 295.87ms, avg filter time 1.74s, avg run time 251.96ms avg trace time 1.1ms

150 VUs
Connection Threads:	min 200 max 200 current 200 idle 189 stopping 0 waiting 0
Request Handling:	requests 7.1K, queued 627 (8.80%), spooled 5K (65.20%)
Request Timing:	avg queue time 260.7ms, avg filter time 2s, avg run time 255.49ms avg trace time 1.13ms

70 VUs
Connection Threads:	min 200 max 200 current 200 idle 141 stopping 0 waiting 0
Request Handling:	requests 7.9K, queued 627 (7.98%), spooled 5K (65.70%)
Request Timing:	avg queue time 236.39ms, avg filter time 1.94s, avg run time 264.67ms avg trace time 1.12ms

10 VUs
Connection Threads:	min 200 max 200 current 200 idle 199 stopping 0 waiting 0
Request Handling:	requests 8.4K, queued 627 (7.48%), spooled 6K (66.52%)
Request Timing:	avg queue time 221.7ms, avg filter time 1.85s, avg run time 275.66ms avg trace time 1.13ms


K6 Summary
running (4m09.3s), 000/300 VUs, 1359 complete and 38 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

     ✓ status is 200
     ✗ page succeeded
      ↳  99% — ✓ 5542 / ✗ 8

     █ setup

     █ teardown

     checks.........................: 99.92% ✓ 11092     ✗ 8
     data_received..................: 1.3 GB 5.2 MB/s
     data_sent......................: 7.0 MB 28 kB/s
     http_req_blocked...............: avg=1.46ms   min=1.27µs  med=6.61µs   max=51.42ms p(90)=4.1ms    p(95)=4.87ms
     http_req_connecting............: avg=147.33µs min=0s      med=0s       max=48.09ms p(90)=376.22µs p(95)=442.16µs
   ✗ http_req_duration..............: avg=2.32s    min=4.86ms  med=224.41ms max=24.15s  p(90)=8.5s     p(95)=13.05s
       { expected_response:true }...: avg=2.32s    min=4.86ms  med=224.41ms max=24.15s  p(90)=8.5s     p(95)=13.05s
     http_req_failed................: 0.00%  ✓ 0         ✗ 8355
     http_req_receiving.............: avg=689.98µs min=39.82µs med=179.82µs max=70.57ms p(90)=1.71ms   p(95)=1.89ms
     http_req_sending...............: avg=39.42µs  min=8.56µs  med=35.11µs  max=15.19ms p(90)=66.52µs  p(95)=76.12µs
     http_req_tls_handshaking.......: avg=1.28ms   min=0s      med=0s       max=42.24ms p(90)=3.62ms   p(95)=4.51ms
     http_req_waiting...............: avg=2.32s    min=4.66ms  med=223.99ms max=24.15s  p(90)=8.5s     p(95)=13.05s
     http_reqs......................: 8355   33.512465/s
     iteration_duration.............: avg=23.56s   min=1.64µs  med=22.73s   max=51.14s  p(90)=36.62s   p(95)=38.12s
     iterations.....................: 1359   5.45104/s
     vus............................: 1      min=1       max=300
     vus_max........................: 300    min=300     max=300