Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.
Sorry for the late reply. I was out on vacation.
We appreciate your help and expertise. Your point is well taken that we should fine-tune our configuration as much as we can. I will describe our setup and the testing on our end so we can make sense of what we are seeing. I work on the same NaviServer instance as Tony Kirkham, and since you have worked with him on parts of our setup, that may give you some context on what we are running.
We did, at one time, have an extra connection thread pool set up for slow runners. However, after recently updating to the newest NaviServer and xotcl-core, we do not have one at the moment. It was our understanding that the new xotcl-core may take care of some of the long runners automatically, without us defining a specific thread pool and telling it about that pool. Is this not a good assumption? How would we go about telling the new core where our long-running pool(s) reside? Our current pool config is below.
ns_section ns/server/${server}/pools
ns_param monitor "Monitoring actions to check the health of the system"
ns_param image "Image file requests"

ns_section ns/server/${server}/pool/monitor
ns_param minthreads 3
ns_param maxthreads 5
ns_param map "GET /admin/nsstats"
ns_param map "GET /SYSTEM"
ns_param map "GET /ds"
ns_param map "POST /ds"

ns_section ns/server/${server}/pool/image
ns_param minthreads 3
ns_param maxthreads 5
ns_param map "GET /*.png"
ns_param map "GET /*.jpg"
ns_param map "GET /*.PNG"
ns_param map "GET /*.JPG"
ns_param map "GET /*.JPEG"
ns_param map "GET /*.jpeg"
ns_param map "GET /*.mp4"
ns_param map "GET /*.MP4"
ns_param map "GET /*.pdf"
ns_param map "GET /*.PDF"
ns_param map "GET /*.gif"
ns_param map "GET /*.GIF"
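On the long-runner question: if explicit configuration is still needed, would something along these lines be the right way to declare a pool for them? The pool name "slow" and the parameter values are just our guess; we have not verified how (or whether) the new core would pick such a pool up.

ns_section ns/server/${server}/pools
ns_param slow "Pool for long-running requests"

ns_section ns/server/${server}/pool/slow
# Guessed values; we assume the request monitor (or explicit map
# entries like the ones in our image pool) would route long runners here.
ns_param minthreads 2
ns_param maxthreads 5

Our server setup: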
We have one VM with 24 cores and 80 GB of RAM, on which our live system resides. We also have a staging VM with 16 cores and 200 GB of RAM.
The tests below were run against our staging server. I have been using a load-testing tool called k6 (see: https://k6.io/) to establish a baseline for our server and to run a quick 4-minute test.
The base test had the following ramp-up options:

export const options = {
  stages: [
    { target: 25, duration: '30s' },  // Ramp up to 25 concurrent users over 30 seconds
    { target: 100, duration: '30s' }, // Then ramp up to 100 concurrent users over the next 30 seconds
    { target: 300, duration: '1m' },  // Then ramp up to 300 concurrent users over the next minute
    { target: 100, duration: '1m' },  // Then ramp down to 100 concurrent users over the next minute
    { target: 0, duration: '1m' },    // Finally ramp down to 0 concurrent users over the last minute
  ]
};
Each k6 session does the following (a sketch of the script follows the list):
1. Go to the login page
2. Verify that it got a status 200 from the login page
3. Sleep 1 second
4. Submit the login form
5. Verify that it got a status 200
6. Sleep 3 seconds
7. Go to a quick-loading page (under ~1 second)
8. Verify that it got status 200
9. Sleep 3 seconds
10. Go to our highly dynamic, server-side rendered, commonly used effort-reporting page (normally takes a full 2 seconds to load)
11. Verify that it got status 200
12. Sleep 3 seconds
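For reference, the script is roughly shaped like this; the URLs, paths, and form field names below are placeholders, not our real ones:

import http from 'k6/http';
import { check, sleep } from 'k6';

export default function () {
  // Steps 1-3: fetch the login page, check the status, pause
  let res = http.get('https://staging.example.com/register/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);

  // Steps 4-6: submit the login form (field names are placeholders)
  res = http.post('https://staging.example.com/register/', {
    email: 'user@example.com',
    password: 'secret',
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // Steps 7-9: the quick-loading page
  res = http.get('https://staging.example.com/quick-page');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // Steps 10-12: the slow, highly dynamic effort-reporting page
  res = http.get('https://staging.example.com/effort-report');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);
}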
Test #1 on our Staging server
Nsstats over time
------------------------

No load on the system, 3:56pm:
Connection Threads: min 2 max 10 current 2 idle 1 stopping 0 waiting 0
Request Handling: requests 7, queued 0 (0.00%), spooled 6 (85.71%)
Request Timing: avg queue time 98.57µs, avg filter time 183.63ms, avg run time 12.91ms, avg trace time 3.92ms
Active Requests: cns6 172.31.6.115 running GET /z1 0.014718 0 .

Starting the test, ~3:58pm, 140 VUs (virtual users):
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 73
Request Handling: requests 1.7K, queued 1K (61.48%), spooled 1K (62.24%)
Request Timing: avg queue time 660.56ms, avg filter time 5.44ms, avg run time 218.25ms, avg trace time 1.31ms
Active Requests: cns1639 172.31.207.40 running GET /z2 5.408894 0 .

~3:59pm, 300 VUs:
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 90
Request Handling: requests 3.6K, queued 3K (77.81%), spooled 2K (64.06%)
Request Timing: avg queue time 1.39s, avg filter time 5.51ms, avg run time 224.59ms, avg trace time 1.6ms
Active Requests: cns3397 172.31.207.40 running GET /z3 0.815387 0 .

~4:00pm, 200 VUs:
Connection Threads: min 2 max 10 current 10 idle 0 stopping 0 waiting 75
Request Handling: requests 4.9K, queued 4K (82.60%), spooled 3K (66.17%)
Request Timing: avg queue time 1.95s, avg filter time 5.77ms, avg run time 268.55ms, avg trace time 1.71ms
Active Requests: cns4809 172.31.207.40 running GET /z4 0.932554 0 .

4:01pm, 49 VUs:
Connection Threads: min 2 max 10 current 10 idle 3 stopping 0 waiting 0
Request Handling: requests 6.1K, queued 5K (81.76%), spooled 4K (66.21%)
Request Timing: avg queue time 2.03s, avg filter time 5.84ms, avg run time 289.01ms, avg trace time 1.74ms
Active Requests: cns6038 172.31.207.40 running GET /z5 7.99834 0 .

K6 summary
------------------

running (4m09.0s), 000/300 VUs, 964 complete and 86 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

✗ status is 200
  ↳ 99% — ✓ 4051 / ✗ 7
✗ page succeeded
  ↳ 99% — ✓ 4041 / ✗ 17

█ setup
█ teardown

checks.........................: 99.70% ✓ 8092 ✗ 24   << Note: We would like to be at 100% on our checks for status 200 and page succeeded.
data_received..................: 922 MB 3.7 MB/s
data_sent......................: 5.2 MB 21 kB/s
http_req_blocked...............: avg=2.6ms min=1.48µs med=7.01µs max=41.1ms p(90)=8.31ms p(95)=12.44ms
http_req_connecting............: avg=180.19µs min=0s med=0s max=34.22ms p(90)=419.21µs p(95)=465.91µs
✗ http_req_duration..............: avg=3.89s min=441.05µs med=1.15s max=35.08s p(90)=11.28s p(95)=17.46s   << Note: It would be nice to be under 8 seconds at the 95th percentile.
  { expected_response:true }...: avg=3.89s min=4.6ms med=1.15s max=35.08s p(90)=11.28s p(95)=17.4s
http_req_failed................: 0.11% ✓ 7 ✗ 6143   << Note: We would like to have 0% failed http requests.
http_req_receiving.............: avg=1.16ms min=51.84µs med=178.08µs max=41.42ms p(90)=2.66ms p(95)=4.52ms
http_req_sending...............: avg=42.78µs min=8.58µs med=40.67µs max=513.41µs p(90)=76.4µs p(95)=88.37µs
http_req_tls_handshaking.......: avg=2.38ms min=0s med=0s max=39.85ms p(90)=7.75ms p(95)=11.64ms
http_req_waiting...............: avg=3.89s min=294.73µs med=1.15s max=35.08s p(90)=11.28s p(95)=17.46s
http_reqs......................: 6150 24.695368/s
iteration_duration.............: avg=32.34s min=2.31µs med=31.28s max=1m6s p(90)=54.78s p(95)=56.5s
iterations.....................: 964 3.870949/s
vus............................: 1 min=1 max=300
vus_max........................: 300 min=300 max=300

The error.log shows we ran out of available connections during this test:
--------------------
...
[13/Sep/2021:16:00:16][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:17][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:19][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:20][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:21][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
...

Some of the longer calls from the request monitor during this time, from request-monitor/long-calls?lines=1000 (the first four columns appear to be queue time, filter time, run time, and total time, followed by the timestamp):
...
14.371059 0.003934 1.550067 15.925060 2021 Sep 13 16:00:30
14.290415 0.006291 1.668398 15.965104 2021 Sep 13 16:00:30
14.062884 0.003386 0.007183 14.073453 2021 Sep 13 16:00:29
14.189928 0.003912 1.581149 15.774989 2021 Sep 13 16:00:29
14.176448 0.004008 0.018152 14.198608 2021 Sep 13 16:00:29
14.314284 0.003587 0.010584 14.328455 2021 Sep 13 16:00:29
14.265956 0.004304 1.653035 15.923295 2021 Sep 13 16:00:29
14.447920 0.005010 1.477352 15.930282 2021 Sep 13 16:00:29
14.803571 0.004733 1.380080 16.188384 2021 Sep 13 16:00:29
14.458517 0.004058 1.835248 16.297823 2021 Sep 13 16:00:29
14.572207 0.006749 1.628151 16.207107 2021 Sep 13 16:00:29
14.449219 0.003139 0.006589 14.458947 2021 Sep 13 16:00:28
14.402088 0.007025 1.514557 15.923670 2021 Sep 13 16:00:28
14.547796 0.005566 1.703462 16.256824 2021 Sep 13 16:00:28
14.278776 0.012663 1.880569 16.172008 2021 Sep 13 16:00:28
14.470381 0.019008 1.793124 16.282513 2021 Sep 13 16:00:28
14.950281 0.003875 1.816067 16.770223 2021 Sep 13 16:00:28
14.547664 0.004792 1.457710 16.010166 2021 Sep 13 16:00:28
14.418724 0.019254 0.008703 14.446681 2021 Sep 13 16:00:28
...
From the data above, it looks like we are running out of connection threads in our default pool (current 10, idle 0, waiting up to 90), which I believe causes requests to sit in the queue until another thread becomes available; that would match the average queue time climbing to over 2 seconds.
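If that reading is right, is raising the thread limits on the default pool the first thing we should try? Something along these lines, where the numbers are just guesses on our part, not tested values:

ns_section ns/server/${server}
# Guessed values for discussion; we currently run with maxthreads 10.
ns_param minthreads 10
ns_param maxthreads 60
ns_param maxconnections 600

Or is it better practice to keep maxthreads modest and instead move the slow pages into their own pool, as asked above?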
Note: When we put 5 servers behind an nginx server, with the database on a separate VM, our page-success checks were at 100% and our 95th-percentile wait times were around 7 seconds.
Any suggestions that would help us fine-tune our configuration for a single server would be greatly appreciated.
Thanks for your help, Marty