Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.
Sorry for the late reply. I was out on vacation.
We appreciate your help and expertise. Your point is well taken that we should fine-tune our configuration as much as we can. I will describe our setup and the testing on our end so we can make sense of what we are seeing. I work on the same NaviServer instance as Tony Kirkham, and since you have worked with him on parts of our setup, that may give you some context on what we are running.
We did, at one time, have an extra connection thread pool set up for slow runners. However, after recently updating to the newest NaviServer and xotcl-core, we do not have one at the moment. It was our understanding that the new xotcl-core may take care of some of the long runners automatically, without us defining a specific thread pool and telling it about that pool. Is this not a good assumption? How would we go about telling the new core where our long-running pool(s) reside? Our current pool config is below.
ns_section ns/server/${server}/pools
ns_param monitor "Monitoring actions to check the health of the system"
ns_param image "Image file requests"

ns_section ns/server/${server}/pool/monitor
ns_param minthreads 3
ns_param maxthreads 5
ns_param map "GET /admin/nsstats"
ns_param map "GET /SYSTEM"
ns_param map "GET /ds"
ns_param map "POST /ds"

ns_section ns/server/${server}/pool/image
ns_param minthreads 3
ns_param maxthreads 5
ns_param map "GET /*.png"
ns_param map "GET /*.jpg"
ns_param map "GET /*.PNG"
ns_param map "GET /*.JPG"
ns_param map "GET /*.JPEG"
ns_param map "GET /*.jpeg"
ns_param map "GET /*.mp4"
ns_param map "GET /*.MP4"
ns_param map "GET /*.pdf"
ns_param map "GET /*.PDF"
ns_param map "GET /*.gif"
ns_param map "GET /*.GIF"
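On the long-runner question: if explicit configuration is still needed, would something along these lines be the right way to declare a pool for them? The pool name "slow" and the parameter values are just our guess; we have not verified how (or whether) the new core would pick such a pool up.

ns_section ns/server/${server}/pools
ns_param slow "Pool for long-running requests"

ns_section ns/server/${server}/pool/slow
# Guessed values; we assume the request monitor (or explicit map
# entries like the ones in our image pool) would route long runners here.
ns_param minthreads 2
ns_param maxthreads 5

Our server setup: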
We have one VM with 24 cores and 80 GB of RAM, on which our live system resides. We also have a staging VM with 16 cores and 200 GB of RAM.
The tests below were run against our staging server. I have been using a load-testing tool called k6 (see: https://k6.io/) to establish a baseline for our server and to run a quick 4-minute test.
The base test had the following ramp-up options:

export const options = {
  stages: [
    { target: 25, duration: '30s' },  // Ramp up to 25 concurrent users over 30 seconds
    { target: 100, duration: '30s' }, // Then ramp up to 100 concurrent users over the next 30 seconds
    { target: 300, duration: '1m' },  // Then ramp up to 300 concurrent users over the next minute
    { target: 100, duration: '1m' },  // Then ramp down to 100 concurrent users over the next minute
    { target: 0, duration: '1m' },    // Finally ramp down to 0 concurrent users over the last minute
  ]
};
Each k6 session does the following (a sketch of the script follows the list):
1. Go to the login page
2. Verify that it got a status 200 from the login page
3. Sleep 1 second
4. Submit the login form
5. Verify that it got a status 200
6. Sleep 3 seconds
7. Go to a quick-loading page (under ~1 second)
8. Verify that it got status 200
9. Sleep 3 seconds
10. Go to our highly dynamic, server-side rendered, commonly used effort-reporting page (normally takes a full 2 seconds to load)
11. Verify that it got status 200
12. Sleep 3 seconds
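For reference, the script is roughly shaped like this; the URLs, paths, and form field names below are placeholders, not our real ones:

import http from 'k6/http';
import { check, sleep } from 'k6';

export default function () {
  // Steps 1-3: fetch the login page, check the status, pause
  let res = http.get('https://staging.example.com/register/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);

  // Steps 4-6: submit the login form (field names are placeholders)
  res = http.post('https://staging.example.com/register/', {
    email: 'user@example.com',
    password: 'secret',
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // Steps 7-9: the quick-loading page
  res = http.get('https://staging.example.com/quick-page');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // Steps 10-12: the slow, highly dynamic effort-reporting page
  res = http.get('https://staging.example.com/effort-report');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);
}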
Test #1 on our Staging server
Nsstats over time
------------------------

No load on the system, 3:56pm:
Connection Threads: min 2 max 10 current 2 idle 1 stopping 0 waiting 0
Request Handling: requests 7, queued 0 (0.00%), spooled 6 (85.71%)
Request Timing: avg queue time 98.57µs, avg filter time 183.63ms, avg run time 12.91ms, avg trace time 3.92ms
Active Requests: cns6 172.31.6.115 running GET /z1 0.014718 0 .

Starting the test, ~3:58pm, 140 VUs (virtual users):
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 73
Request Handling: requests 1.7K, queued 1K (61.48%), spooled 1K (62.24%)
Request Timing: avg queue time 660.56ms, avg filter time 5.44ms, avg run time 218.25ms, avg trace time 1.31ms
Active Requests: cns1639 172.31.207.40 running GET /z2 5.408894 0 .

~3:59pm, 300 VUs:
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 90
Request Handling: requests 3.6K, queued 3K (77.81%), spooled 2K (64.06%)
Request Timing: avg queue time 1.39s, avg filter time 5.51ms, avg run time 224.59ms, avg trace time 1.6ms
Active Requests: cns3397 172.31.207.40 running GET /z3 0.815387 0 .

~4:00pm, 200 VUs:
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 75
Request Handling: requests 4.9K, queued 4K (82.60%), spooled 3K (66.17%)
Request Timing: avg queue time 1.95s, avg filter time 5.77ms, avg run time 268.55ms, avg trace time 1.71ms
Active Requests: cns4809 172.31.207.40 running GET /z4 0.932554 0 .

4:01pm, 49 VUs:
Connection Threads: min 2 max 10 current 10 idle 3 stopping 0 waiting 0
Request Handling: requests 6.1K, queued 5K (81.76%), spooled 4K (66.21%)
Request Timing: avg queue time 2.03s, avg filter time 5.84ms, avg run time 289.01ms, avg trace time 1.74ms
Active Requests: cns6038 172.31.207.40 running GET /z5 7.99834 0 .

K6 summary
------------------

running (4m09.0s), 000/300 VUs, 964 complete and 86 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

✗ status is 200
  ↳ 99% — ✓ 4051 / ✗ 7
✗ page succeeded
  ↳ 99% — ✓ 4041 / ✗ 17

█ setup
█ teardown

checks.........................: 99.70% ✓ 8092 ✗ 24   << Note: We would like to be at 100% on our checks for status 200 and page succeeded.
data_received..................: 922 MB 3.7 MB/s
data_sent......................: 5.2 MB 21 kB/s
http_req_blocked...............: avg=2.6ms min=1.48µs med=7.01µs max=41.1ms p(90)=8.31ms p(95)=12.44ms
http_req_connecting............: avg=180.19µs min=0s med=0s max=34.22ms p(90)=419.21µs p(95)=465.91µs
✗ http_req_duration..............: avg=3.89s min=441.05µs med=1.15s max=35.08s p(90)=11.28s p(95)=17.46s   << Note: It would be nice to be under 8 seconds at the 95th percentile.
  { expected_response:true }...: avg=3.89s min=4.6ms med=1.15s max=35.08s p(90)=11.28s p(95)=17.4s
http_req_failed................: 0.11% ✓ 7 ✗ 6143   << Note: We would like to have 0% failed http requests.
http_req_receiving.............: avg=1.16ms min=51.84µs med=178.08µs max=41.42ms p(90)=2.66ms p(95)=4.52ms
http_req_sending...............: avg=42.78µs min=8.58µs med=40.67µs max=513.41µs p(90)=76.4µs p(95)=88.37µs
http_req_tls_handshaking.......: avg=2.38ms min=0s med=0s max=39.85ms p(90)=7.75ms p(95)=11.64ms
http_req_waiting...............: avg=3.89s min=294.73µs med=1.15s max=35.08s p(90)=11.28s p(95)=17.46s
http_reqs......................: 6150 24.695368/s
iteration_duration.............: avg=32.34s min=2.31µs med=31.28s max=1m6s p(90)=54.78s p(95)=56.5s
iterations.....................: 964 3.870949/s
vus............................: 1 min=1 max=300
vus_max........................: 300 min=300 max=300

The error.log shows we ran out of available connections during this test:
--------------------
...
[13/Sep/2021:16:00:16][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:17][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:19][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:20][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:21][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
...

Some of the longer calls from the request monitor during this time, from request-monitor/long-calls?lines=1000 (the first four columns appear to be queue time, filter time, run time, and total time, followed by the timestamp):
...
14.371059 0.003934 1.550067 15.925060 2021 Sep 13 16:00:30
14.290415 0.006291 1.668398 15.965104 2021 Sep 13 16:00:30
14.062884 0.003386 0.007183 14.073453 2021 Sep 13 16:00:29
14.189928 0.003912 1.581149 15.774989 2021 Sep 13 16:00:29
14.176448 0.004008 0.018152 14.198608 2021 Sep 13 16:00:29
14.314284 0.003587 0.010584 14.328455 2021 Sep 13 16:00:29
14.265956 0.004304 1.653035 15.923295 2021 Sep 13 16:00:29
14.447920 0.005010 1.477352 15.930282 2021 Sep 13 16:00:29
14.803571 0.004733 1.380080 16.188384 2021 Sep 13 16:00:29
14.458517 0.004058 1.835248 16.297823 2021 Sep 13 16:00:29
14.572207 0.006749 1.628151 16.207107 2021 Sep 13 16:00:29
14.449219 0.003139 0.006589 14.458947 2021 Sep 13 16:00:28
14.402088 0.007025 1.514557 15.923670 2021 Sep 13 16:00:28
14.547796 0.005566 1.703462 16.256824 2021 Sep 13 16:00:28
14.278776 0.012663 1.880569 16.172008 2021 Sep 13 16:00:28
14.470381 0.019008 1.793124 16.282513 2021 Sep 13 16:00:28
14.950281 0.003875 1.816067 16.770223 2021 Sep 13 16:00:28
14.547664 0.004792 1.457710 16.010166 2021 Sep 13 16:00:28
14.418724 0.019254 0.008703 14.446681 2021 Sep 13 16:00:28
...
From the data above, it looks like we are running out of connection threads in our default pool (current 10, idle 0, waiting up to 90), which I believe causes requests to sit in the queue until another thread becomes available; that would match the average queue time climbing to over 2 seconds.
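If that reading is right, is raising the thread limits on the default pool the first thing we should try? Something along these lines, where the numbers are just guesses on our part, not tested values:

ns_section ns/server/${server}
# Guessed values for discussion; we currently run with maxthreads 10.
ns_param minthreads 10
ns_param maxthreads 60
ns_param maxconnections 600

Or is it better practice to keep maxthreads modest and instead move the slow pages into their own pool, as asked above?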
Note: When we put 5 servers behind an nginx server, with the database on a separate VM, our page-success checks were at 100% and our 95th-percentile wait times were around 7 seconds.
Any suggestions that would help us fine-tune our configuration for a single server would be greatly appreciated.
Thanks for your help, Marty