Forum OpenACS Q&A: OpenACS clustering setup and how it relates to xotcl-core.

We are looking to get some feedback and direction on the implementation of naviserver clustering.

Our growing company is very interested in implementing NaviServer clustering behind an nginx load-balancing server. We have been testing this setup for a little over a week now, and the nginx load balancing between the NaviServers appears to work great. The main reason for looking into clustering is to scale up our capacity for NaviServer connections, because during high loads we have at times run out of available connections.

Not surprisingly, in our initial testing we have found that having multiple NaviServers behind nginx (served round robin) gives us much more connection capacity during our load tests. We have also found that putting the database on a separate VM from the NaviServer instances provides a great speed-up during extreme loads.

In our testing we also use Docker to manage the nginx and backend NaviServer containers. We are using an up-to-date oacs-5.10 development version and the latest xotcl-core.

In our testing we have set up the naviserver parameters to allow naviserver clustering under the kernel package (CanonicalServer, ClusterAuthorizedIP, ClusterEnabledP, ClusterPeerIP, EnableLoggingP) in order to properly implement cache flushing across the naviserver cluster.
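
For reference, the kernel parameters we set look roughly like the following (the host/port values here are placeholders, not our real addresses):

ClusterEnabledP      1
CanonicalServer      10.0.0.10:8000
ClusterPeerIP        10.0.0.11:8000 10.0.0.12:8000
ClusterAuthorizedIP  10.0.0.*
EnableLoggingP       1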

In testing this configuration we have found that the code does call the cluster servers for anything cached via util_memoize, by calling the pages/SYSTEM/util_memoized_statement.tcl script on each server. However, for things like cached parameters it does not call the cluster servers. See the example below:

Broader overall Question #1:

How much of the cache flushing over the cluster has actually been implemented? We are asking this along with the following two more specific questions.

Question #2:

In your opinion, what would be the pluses and minuses of us making parameters, user info, etc., flushable across all servers in the cluster?

Example: under acs-subsite/www/shared/parameters.tcl

- calls parameter-procs.tcl -> parameter::set_value package_id $package_id -parameter $name
  - acs-content-repository-callback-procs.tcl -> callback subsite::parameter_changed
    - if {$package_key eq "acs-content-repository" && $parameter eq "CRFileLocationRoot" && $value ne ""} then
      - a) nsv_unset CR_LOCATIONS CR_FILES
      - b) nsv_set CR_LOCATIONS CR_FILES "$file_location"
  - acs.tcl -> ad_parameter_cache -delete -set $value $package_id $parameter
    - nsv_unset ad_param_$key $parameter_name
    - NOTE: We could call code here to make each server in the cluster do the same nsv_unset...

We also noticed that xotcl-core looks like it has some caching set up for site nodes when UseXotclSiteNodes is set to true. It also looks like there is a method for handling acs::clusterwide flushing of the generic cache in the base generic caching class (::acs::Cache), although there is a comment saying that it may not be fully implemented yet.

Question #3:

In your opinion, if we wanted to implement server clustering in the next 2 months, would it be easier and faster to implement cluster cache flushing across most caches using the util_memoize_flush way of doing things,

Or

Would we be better off implementing the xotcl caching for parameters, util_memoized functions etc?

If the latter, do you have any tips on how to start caching/flushing across servers in the cluster this way for most of our caches?

Thanks for your assistance and direction, Marty

Dear Marty,

You are raising many interesting questions! The short answer is that the cluster support is only partly implemented - simply because with a high number of cores and a suitable NaviServer configuration one gets very high scalability (up to 1,000 permission-checked OpenACS requests per second), getting all of OpenACS with all its packages fit for clusters is some work, and using a cluster makes debugging more complex.

If one has many long-running connection requests, one should use multiple connection thread pools (have you set this up?), and one should consider reducing the blocking time of the requests (what times are you experiencing?). Are you using the xotcl request monitor? E.g., the long-calls page gives you a good overview.

found that having the database on a separate VM than the naviserver instances

I do not think this is in general the case; it depends on the memory of the VM, the number of cores given to the VM, etc. The local communication within a machine is always faster in terms of latency and throughput than going via TCP/IP. When the VM is under-configured for the required load (it has e.g. only 2 cores, is running out of memory, has not enough shared buffers, ...), then maybe.

how much of the cache flushing over the cluster has actually been implemented

Caching happens on many levels: per-request caching, per-thread caching, caching via ns_cache, and caching via nsv. The first one is not relevant for cluster considerations; per-thread caching is intentionally lock-free, but requires measures like a restart (it is used for things that "never change"). For the other ones, there is the special handling provided by util_memoize and the clusterwide operations. Using util_memoize for everything is not good, since it does not scale: there are many long locking operations that limit concurrency (check in nsstats the lock times and sort by "total wait"; it is not unlikely that the 'util_memoize' cache is at the top). So the caching work of the last years was devoted to defining multiple caches and cache partitioning.
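
To illustrate the idea of dedicated caches, a minimal sketch (the cache name, key, and query below are made up for illustration): a hot code path gets its own ns_cache instead of being funneled through the global util_memoize cache, so lock contention stays local to that cache.

# create a dedicated cache (10 MB) once at startup; the name is made up
ns_cache_create demo_config 10000000

# look up / compute with a 300 s expiry; contention is limited to this
# cache's own mutex instead of the shared util_memoize cache
set value [ns_cache_eval -expires 300 demo_config $key {
    db_string get_value "select value from my_config where key = :key"
}]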

The clusterwide support should work for you out of the box. Historically, it was developed in xotcl-core, but it has been moved partly into acs-tcl since it is useful everywhere (some more generally useful pieces will be moved in future versions from xo:: to the core locations). Slightly simplified, it works like the following: instead of writing

nsv_unset $array $key

one can use

::acs::clusterwide nsv_unset $array $key

to execute a command on all machines of a cluster. So prefixing the cache clearing operations with ::acs::clusterwide should do the job.
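
Applied to the ad_parameter_cache example from your posting, this would be - as a sketch of the idea rather than a tested patch - a one-line change to the flush operation:

::acs::clusterwide nsv_unset ad_param_$key $parameter_name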

Be aware that inter-server communication for cache clearing operations can become quite heavy, and that the communication as done right now is not how it should be done. ... many things can be done there, but first it has to be clear that it is worth the effort.

The first step is usually to figure out the local bottlenecks and the configuration; this is usually much less work than going towards a clustered solution.

Hope this answers some of your questions
all the best
-g

Thanks Gustaf,

Sorry for the late reply. I was out on vacation.

We appreciate your help and expertise. Your point is well taken that we should try to fine tune our configuration as much as we can. I will try to describe our setup and testing on our end so we can try to make sense of what we see. I work on the same naviserver instance as Tony Kirkham. I know you have worked with him on some of our setup so that may help you, to some degree, know what type of setup we have.

We did, at one time, have an extra connection thread pool set up for slow runners. However, after recently updating to the newest NaviServer with XOTcl, we do not currently have one. It was our understanding that the new xotcl code may take care of some of the long runners automatically, without us defining a specific thread pool and telling it about the pool. Is this not a good assumption? How would we go about telling the new core where our new long-running pool(s) reside? Our current pool config is below.

ns_section ns/server/${server}/pools
ns_param        monitor         "Monitoring actions to check the health of the system"
ns_param        image           "Image file requests"

ns_section ns/server/${server}/pool/monitor
ns_param        minthreads 3
ns_param        maxthreads 5
ns_param        map "GET /admin/nsstats"
ns_param        map "GET /SYSTEM"
ns_param        map "GET /ds"
ns_param        map "POST /ds"

ns_section ns/server/${server}/pool/image
ns_param        minthreads 3
ns_param        maxthreads 5
ns_param        map "GET /*.png"
ns_param        map "GET /*.jpg"
ns_param        map "GET /*.PNG"
ns_param        map "GET /*.JPG"
ns_param        map "GET /*.JPEG"
ns_param        map "GET /*.jpeg"
ns_param        map "GET /*.mp4"
ns_param        map "GET /*.MP4"
ns_param        map "GET /*.pdf"
ns_param        map "GET /*.PDF"
ns_param        map "GET /*.gif"
ns_param        map "GET /*.GIF"
Our Server setup:

We have one VM with 24 cores and 80 GB of RAM, on which our live system resides. We also have a staging VM with 16 cores and 200 GB of RAM.

The tests below will be run against our staging server.

I have been using a load testing tool called k6 (see https://k6.io/) to establish a baseline for our server and to run a quick 4-minute test.

The base test had the following ramp up options:
export const options = {
    stages: [
      { target: 25, duration: '30s' },			// Ramp up to 25 concurrent users over the 30 seconds 
      { target: 100, duration: '30s' },			// Then ramp up to 100 concurrent users for the next 30 seconds
      { target: 300, duration: '1m' },			// Then ramp up to 300 concurrent users for the next minute
      { target: 100, duration: '1m' },			// Then ramp down to 100 concurrent users for the next minute
      { target: 0, duration: '1m' },				// Finally ramp down to 0 concurrent users for the next minute
    ]
  };
Each k6 session does the following things (a sketch of the corresponding k6 script is shown after this list):
1.	Go to login page
2.	Verify that it got a status 200 from login page
3.	Sleep 1 second
4.	Submit the login form
5.	Verify that it got a status 200
6.	Sleep 3 seconds
7.	Go to a quick loading page (~ under 1 second)
8.	Verify that it got status 200
9.	Sleep 3 seconds
10.	Go to our highly dynamic server side rendered commonly used effort reporting page (normally takes a full 2 seconds to load)
11.	Verify that it got status 200
12.	Sleep 3 seconds
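
For reference, the body of our test script looks roughly like the sketch below; the URLs, form field names, and credentials are placeholders for our real ones:

import http from 'k6/http';
import { check, sleep } from 'k6';

export default function () {
  // 1-3: load the login page and verify it
  let res = http.get('https://staging.example.com/register/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);

  // 4-6: submit the login form (field names are placeholders)
  res = http.post('https://staging.example.com/register/', {
    email: 'loadtest@example.com', password: 'xxxxx',
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // 7-9: a quick-loading page (~ under 1 second)
  res = http.get('https://staging.example.com/quick-page');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);

  // 10-12: the highly dynamic effort reporting page (~2 seconds)
  res = http.get('https://staging.example.com/effort-report');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(3);
}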

Test #1 on our Staging server

Nsstats over time
------------------------
No Load on system:  3:56pm
Connection Threads:	min 2 max 10 current 2 idle 1 stopping 0 waiting 0
Request Handling:	requests 7, queued 0 (0.00%), spooled 6 (85.71%)
Request Timing:	avg queue time 98.57µs, avg filter time 183.63ms, avg run time 12.91ms avg trace time 3.92ms
Active Requests:	cns6 172.31.6.115 running GET /z1 0.014718 0 .

Starting the test.

~3:58pm
140 VUs (Virtual Users)
Connection Threads:	min 2 max 10 current 10 idle 0 stopping 0 waiting 73
Request Handling:	requests 1.7K, queued 1K (61.48%), spooled 1K (62.24%)
Request Timing:	avg queue time 660.56ms, avg filter time 5.44ms, avg run time 218.25ms avg trace time 1.31ms
Active Requests:	cns1639 172.31.207.40 running GET /z2 5.408894 0 .

~3:59pm
300 VUs
Connection Threads:	min 2 max 10 current 10 idle 0 stopping 0 waiting 90
Request Handling:	requests 3.6K, queued 3K (77.81%), spooled 2K (64.06%)
Request Timing:	avg queue time 1.39s, avg filter time 5.51ms, avg run time 224.59ms avg trace time 1.6ms
Active Requests:	cns3397 172.31.207.40 running GET /z3 0.815387 0 .

~4:00pm
200 VUs
Connection Threads:	min 2 max 10 current 10 idle 0 stopping 0 waiting 75
Request Handling:	requests 4.9K, queued 4K (82.60%), spooled 3K (66.17%)
Request Timing:	avg queue time 1.95s, avg filter time 5.77ms, avg run time 268.55ms avg trace time 1.71ms
Active Requests:	cns4809 172.31.207.40 running GET /z4 0.932554 0 .

4:01pm
49 VUs
Connection Threads:	min 2 max 10 current 10 idle 3 stopping 0 waiting 0
Request Handling:	requests 6.1K, queued 5K (81.76%), spooled 4K (66.21%)
Request Timing:	avg queue time 2.03s, avg filter time 5.84ms, avg run time 289.01ms avg trace time 1.74ms
Active Requests:	cns6038 172.31.207.40 running GET /z5 7.99834 0 .

K6 summary 
------------------
running (4m09.0s), 000/300 VUs, 964 complete and 86 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

     ✗ status is 200
      ↳  99% — ✓ 4051 / ✗ 7
     ✗ page succeeded
      ↳  99% — ✓ 4041 / ✗ 17

     █ setup

     █ teardown

     checks.........................: 99.70% ✓ 8092      ✗ 24         << Note:  We would like to be 100% on our page checks for status 200 and page succeeded.
     data_received..................: 922 MB 3.7 MB/s
     data_sent......................: 5.2 MB 21 kB/s
     http_req_blocked...............: avg=2.6ms    min=1.48µs   med=7.01µs   max=41.1ms   p(90)=8.31ms   p(95)=12.44ms
     http_req_connecting............: avg=180.19µs min=0s       med=0s       max=34.22ms  p(90)=419.21µs p(95)=465.91µs
   ✗ http_req_duration..............: avg=3.89s    min=441.05µs med=1.15s    max=35.08s   p(90)=11.28s   p(95)=17.46s    << Note:  It would be nice to be under 8 seconds in the 95th percentile.
       { expected_response:true }...: avg=3.89s    min=4.6ms    med=1.15s    max=35.08s   p(90)=11.28s   p(95)=17.4s
     http_req_failed................: 0.11%  ✓ 7         ✗ 6143     << Note:  We would like to have 0% failed http requests.
     http_req_receiving.............: avg=1.16ms   min=51.84µs  med=178.08µs max=41.42ms  p(90)=2.66ms   p(95)=4.52ms
     http_req_sending...............: avg=42.78µs  min=8.58µs   med=40.67µs  max=513.41µs p(90)=76.4µs   p(95)=88.37µs
     http_req_tls_handshaking.......: avg=2.38ms   min=0s       med=0s       max=39.85ms  p(90)=7.75ms   p(95)=11.64ms
     http_req_waiting...............: avg=3.89s    min=294.73µs med=1.15s    max=35.08s   p(90)=11.28s   p(95)=17.46s
     http_reqs......................: 6150   24.695368/s
     iteration_duration.............: avg=32.34s   min=2.31µs   med=31.28s   max=1m6s     p(90)=54.78s   p(95)=56.5s
     iterations.....................: 964    3.870949/s
     vus............................: 1      min=1       max=300
     vus_max........................: 300    min=300     max=300

Error.log file shows we are out of available connections during this test.
--------------------
...
[13/Sep/2021:16:00:16][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:17][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:18][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:19][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:20][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
[13/Sep/2021:16:00:21][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
...
Some of the longer calls from request monitor during this time. request-monitor/long-calls?lines=1000
...
14.371059	0.003934	1.550067	15.925060	2021 Sep 13 16:00:30
14.290415	0.006291	1.668398	15.965104	2021 Sep 13 16:00:30
14.062884	0.003386	0.007183	14.073453	2021 Sep 13 16:00:29
14.189928	0.003912	1.581149	15.774989	2021 Sep 13 16:00:29
14.176448	0.004008	0.018152	14.198608	2021 Sep 13 16:00:29
14.314284	0.003587	0.010584	14.328455	2021 Sep 13 16:00:29
14.265956	0.004304	1.653035	15.923295	2021 Sep 13 16:00:29
14.447920	0.005010	1.477352	15.930282	2021 Sep 13 16:00:29
14.803571	0.004733	1.380080	16.188384	2021 Sep 13 16:00:29
14.458517	0.004058	1.835248	16.297823	2021 Sep 13 16:00:29
14.572207	0.006749	1.628151	16.207107	2021 Sep 13 16:00:29
14.449219	0.003139	0.006589	14.458947	2021 Sep 13 16:00:28
14.402088	0.007025	1.514557	15.923670	2021 Sep 13 16:00:28
14.547796	0.005566	1.703462	16.256824	2021 Sep 13 16:00:28
14.278776	0.012663	1.880569	16.172008	2021 Sep 13 16:00:28
14.470381	0.019008	1.793124	16.282513	2021 Sep 13 16:00:28
14.950281	0.003875	1.816067	16.770223	2021 Sep 13 16:00:28
14.547664	0.004792	1.457710	16.010166	2021 Sep 13 16:00:28
14.418724	0.019254	0.008703	14.446681	2021 Sep 13 16:00:28
...

From the data above, it looks like we are running out of connections in our connection pool, which I believe causes requests to be put onto the queue to wait until another connection is available.

Note: When we put 5 servers behind an nginx server and the DB on a separate VM, our status checks for page success were 100% and our wait times in the 95th percentile were around 7 seconds.

Any suggestions you have that would help us fine tune our configuration for one server would be greatly appreciated.

Thanks for your help, Marty
It was our understanding that the new xotcl may take care of some of the long runners automatically without us defining a specific thread pool and telling xotcl about it. Is this not a good assumption

Well, newer versions of the xotcl-request monitor can automate some steps, but it needs configuration. The xotcl-request monitor can map requests to a "slow" pool when such a pool is defined by the site admin. The site admin also has to configure this connection pool (e.g. min/max threads, rejectoverrun, queue size, ...). From the snippet of the configuration file, it seems that you have no "slow" pool defined, so no automatic mapping will happen.
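
A minimal sketch of such a "slow" pool definition, following the pool syntax from your own configuration snippet (the thread counts are only a starting point):

ns_section ns/server/${server}/pools
ns_param        slow            "Slow requests"

ns_section ns/server/${server}/pool/slow
ns_param        minthreads 15
ns_param        maxthreads 15

Once such a pool exists, the request monitor can move long-running requests there automatically; no per-URL map entries are needed for that.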

From the output I see that the default pool has minthreads 2 and maxthreads 10. I would recommend defining 25/25 on your testing instance, based on:

live: 24 cores and 80 GB of RAM; staging: 16 cores and 200 GB
As a rough estimate, one can run about 96 connection threads on 24 cores (assuming typical OpenACS loads) and about 64 on 16 cores.

When looking at the 140 VUs output, I see

Request Handling:	requests 1.7K, queued 1K (61.48%), spooled 1K (62.24%)

The number of queued requests is way too high. On our production system (typically between 600 and 3000 concurrent users) we typically have less than 0.01% queued requests in the default pool.

Request Timing:	avg queue time 660.56ms, avg filter time 5.44ms, avg run time 218.25ms avg trace time 1.31ms

You see here that the avg queue time (660 ms, the time waiting to be served) is much longer than the processing time (avg run time). On our production system (currently 650 users) we have now

avg queue time 465.04µs, avg filter time 4.18ms, avg run time 104.08ms avg trace time 4.27ms

the output

[13/Sep/2021:16:00:16][1.7f1a15c5a700][-driver:nsssl_v4:0-] Notice: [east_intrastage pool ] All available connections are used, waiting 90 idle 0 current 10
tells you that the queue is filled up (100 entries). The queue size is defined per pool via "maxconnections"; we have this value set to 600 on our live system. One can fine-tune the behavior on queue overruns via "rejectoverrun". A larger queue makes the system more forgiving when there are occasional large request bursts.
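
In the configuration file this corresponds roughly to the following (600 mirrors the value mentioned above):

ns_section ns/server/${server}
ns_param        maxconnections  600     ;# connection structures, i.e. the queue size per pool
ns_param        rejectoverrun   true    ;# send "503 service unavailable" on queue overruns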

While the values above are average values, the values below (long-calls) are from individual requests:

14.371059	0.003934	1.550067	15.925060	2021 Sep 13 16:00:30

The columns are: Queuetime Filtertime Runtime Totaltime.
The requests above were waiting 14 seconds in the queue before being served, whereas the actual run time was about 1.5 seconds. When you get rid of the queueing time, the performance will improve by a factor of 10. Increasing the number of threads as indicated above should help significantly.

For this test, setting up the "slow" pool will not improve the benchmark times; the pools make sure that "slow" requests do not kill the "fast" requests on your server. This is important to ensure liveliness of the server when the majority of requests is "fast" and only a few are "slow", and you do not want someone firing e.g. 100 "slow" requests to bring your server to its knees.

Increasing maxthreads from 10 (the default) to e.g. 50 should increase the throughput by a factor of 5.

Hope, this helps
-gn

Thanks Gustaf,

We appreciate your expertise in helping us tune our system. It has helped a lot!

I was successful in setting up the "slow" pool. I was wondering if there is a setting to only map a request to the slow pool if it takes longer than, say, 30 or 40 seconds. I found that it started to put our effort reporting page into the slow pool when the system was a little stressed, so I removed the slow pool for the following tests for now.

I have been playing around with the configuration and we have decided to go up to 48 cores on our staging system with the 200 Gig of RAM.

Here is what I found to be somewhat optimal for the 48 cores.

We are not running out of available connections anymore, and the queued requests are a lot lower, as are the avg queue times. The k6 tests show that all HTTP requests return status 200 as well.

I was wondering if you see any problems with the configuration and if you had suggestions for changes.

Again, thanks for your help

Marty
ns_section ns/server/${server}
ns_param        maxconnections  2000    ;# 100; number of allocated connection structures
ns_param        maxthreads      200     ;# 10; maximal number of connection threads
ns_param        minthreads      200     ;# 1; minimal number of connection threads

ns_param        connsperthread  10000   ;# 10000; number of connections (requests) handled per thread
                                        ;# Setting connsperthread to > 0 will cause the thread to
                                        ;# graciously exit, after processing that many
                                        ;# requests, thus initiating kind-of Tcl-level
                                        ;# garbage collection.

# ns_param      threadtimeout   120     ;# 120; timeout for idle threads.
                                        ;# In case, minthreads < maxthreads, threads
                                        ;# are shutdown after this idle time until
                                        ;# minthreads are reached.

# ns_param      lowwatermark    10      ;# 10; create additional threads above this queue-full percentage
ns_param        highwatermark   100     ;# 80; allow concurrent creates above this queue-full percentage
                                        ;# 100 means to disable concurrent creates
ns_param        rejectoverrun   true    ;# if set to true, NaviServer sends a "503 service unavailable" error on queue overruns;
                                        ;# the request is not added to the waiting list.

# Compress response character data: ns_return, ADP etc.
#
ns_param        compressenable  off     ;# false, use "ns_conn compress" to override

ns_section ns/server/${server}/pools
ns_param        monitor         "Monitoring actions to check the health of the system"
ns_param        image           "Image file requests"
#ns_param        slow            "Slow requests"
ns_param        fast            "Fast requests"
ns_section ns/server/${server}/pool/monitor
ns_param        minthreads 3
ns_param        maxthreads 5

ns_param        map "GET /admin/nsstats"
ns_param        map "GET /SYSTEM"
ns_param        map "GET /ds"
ns_param        map "POST /ds"
ns_section ns/server/${server}/pool/image
ns_param        minthreads 3
ns_param        maxthreads 5

ns_param        map "GET /*.png"
ns_param        map "GET /*.jpg"
ns_param        map "GET /*.PNG"
ns_param        map "GET /*.JPG"
ns_param        map "GET /*.JPEG"
ns_param        map "GET /*.jpeg"
ns_param        map "GET /*.mp4"
ns_param        map "GET /*.MP4"
ns_param        map "GET /*.pdf"
ns_param        map "GET /*.PDF"
ns_param        map "GET /*.gif"
ns_param        map "GET /*.GIF"

#ns_section ns/server/${server}/pool/slow
#ns_param        minthreads 15
#ns_param        maxthreads 15

ns_section ns/server/${server}/pool/fast
ns_param        minthreads 6
ns_param        maxthreads 6

# Test  w 48 CPU's   200 Gig RAM
#   maxconnections 2000
#   maxthreads 200
#   minthreads 200 
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true



150 VUs
Connection Threads:	min 200 max 200 current 200 idle 151 stopping 0 waiting 0
Request Handling:	requests 1.8K, queued 0 (0.00%), spooled 1K (63.55%)
Request Timing:	avg queue time 161.33µs, avg filter time 78.53ms, avg run time 212.77ms avg trace time 972.63µs

300 VUs
Connection Threads:	min 200 max 200 current 200 idle 157 stopping 0 waiting 0
Request Handling:	requests 5.4K, queued 508 (9.46%), spooled 4K (66.81%)
Request Timing:	avg queue time 295.87ms, avg filter time 1.74s, avg run time 251.96ms avg trace time 1.1ms

150 VUs
Connection Threads:	min 200 max 200 current 200 idle 189 stopping 0 waiting 0
Request Handling:	requests 7.1K, queued 627 (8.80%), spooled 5K (65.20%)
Request Timing:	avg queue time 260.7ms, avg filter time 2s, avg run time 255.49ms avg trace time 1.13ms

70 VUs
Connection Threads:	min 200 max 200 current 200 idle 141 stopping 0 waiting 0
Request Handling:	requests 7.9K, queued 627 (7.98%), spooled 5K (65.70%)
Request Timing:	avg queue time 236.39ms, avg filter time 1.94s, avg run time 264.67ms avg trace time 1.12ms

10 VUs
Connection Threads:	min 200 max 200 current 200 idle 199 stopping 0 waiting 0
Request Handling:	requests 8.4K, queued 627 (7.48%), spooled 6K (66.52%)
Request Timing:	avg queue time 221.7ms, avg filter time 1.85s, avg run time 275.66ms avg trace time 1.13ms


K6 Summary
running (4m09.3s), 000/300 VUs, 1359 complete and 38 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

     ✓ status is 200
     ✗ page succeeded
      ↳  99% — ✓ 5542 / ✗ 8

     █ setup

     █ teardown

     checks.........................: 99.92% ✓ 11092     ✗ 8
     data_received..................: 1.3 GB 5.2 MB/s
     data_sent......................: 7.0 MB 28 kB/s
     http_req_blocked...............: avg=1.46ms   min=1.27µs  med=6.61µs   max=51.42ms p(90)=4.1ms    p(95)=4.87ms
     http_req_connecting............: avg=147.33µs min=0s      med=0s       max=48.09ms p(90)=376.22µs p(95)=442.16µs
   ✗ http_req_duration..............: avg=2.32s    min=4.86ms  med=224.41ms max=24.15s  p(90)=8.5s     p(95)=13.05s
       { expected_response:true }...: avg=2.32s    min=4.86ms  med=224.41ms max=24.15s  p(90)=8.5s     p(95)=13.05s
     http_req_failed................: 0.00%  ✓ 0         ✗ 8355
     http_req_receiving.............: avg=689.98µs min=39.82µs med=179.82µs max=70.57ms p(90)=1.71ms   p(95)=1.89ms
     http_req_sending...............: avg=39.42µs  min=8.56µs  med=35.11µs  max=15.19ms p(90)=66.52µs  p(95)=76.12µs
     http_req_tls_handshaking.......: avg=1.28ms   min=0s      med=0s       max=42.24ms p(90)=3.62ms   p(95)=4.51ms
     http_req_waiting...............: avg=2.32s    min=4.66ms  med=223.99ms max=24.15s  p(90)=8.5s     p(95)=13.05s
     http_reqs......................: 8355   33.512465/s
     iteration_duration.............: avg=23.56s   min=1.64µs  med=22.73s   max=51.14s  p(90)=36.62s   p(95)=38.12s
     iterations.....................: 1359   5.45104/s
     vus............................: 1      min=1       max=300
     vus_max........................: 300    min=300     max=300

Hi Marty,

wondering ... there is a setting to only map a request to the slow pool if it is only over 30 or 40 seconds

Well, there should be absolutely NO request processed in the foreground taking longer than, say, 10 secs. Currently, the various thresholds are hard-coded in the request monitor (mapping to the slow pool, warnings for queue times and filter times, ..., mapping to color schemes) based on best practice from several large systems, but it is certainly possible to make these configurable in the future. The threshold for slow-pool mapping is 3 seconds. Note that in case your server has many such requests (which can't be made faster), one should define a larger "slow" pool. Requests in the "slow" pool are processed with the same efficiency as requests in a "fast" pool ("slow" and "fast" are just names); one can give the slow pool as many threads as you like.

I was wondering if you see any problems with the configuration and if you had suggestions for changes.

You have currently:

            min max
   default  200 200
   monitor    3   5
   image      3   5
   slow      15  15
   fast       6   6

200 min threads and 200 max threads for the default pool is a lot, but certainly, the right values depend on the application and on the machine you are using. When your jobs are CPU-intense, it can happen that concurrent jobs slow each other down more than the additional threads help. Many processors have a degressive clock rate (when many CPUs are busy, the temperature in the CPU goes up and the clock rate is reduced). Choosing the right CPU type is not so easy; often deactivating hyper-threading can speed things up (don't believe your vendor), etc. ... but this is a long story and brings us away from your question.

When the DB and nsd are on the same machine, don't forget that the DB needs a lot of resources as well (CPU and memory). Every (active) connection thread will require a DB connection, so the resources available per SQL query (postgres process) will be less, the SQL queries become slower, etc.

My recommendation is: if you have multiple connection pools, and the default pool is responsible for most of the hard work, don't make it much larger than the number of cores you have. When you have long-running requests, check whether you can handle these via spooling or background processing instead.
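
To illustrate the background-processing option, a minimal sketch (the queue name and the report proc are made up; the job queue has to be created once at startup):

# once at server startup, e.g. in a *-init.tcl file
ns_job create background 4          ;# queue named "background" with up to 4 worker threads

# in the connection thread: hand the heavy work off and return immediately
ns_job queue -detached background {
    generate_effort_report_csv      ;# hypothetical long-running proc
}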

From your configuration, I see a pool "fast", which is probably unused (and can be removed), since there is no mapping defined for it. I would also recommend setting minthreads=maxthreads for the "image" pool. You should see from the per-pool statistics how many threads you need. Here is the output from our "fast" pool, which is used for images (with the given load, 6 threads can serve 210 requests per second in this pool).

Connection Threads: min 6 max 6 current 6 idle 6 stopping 0 waiting 0 started 23 maxconnections 600
Request Handling:   requests 194.5K, queued 149 (0.08%), spooled 124K (63.55%), dropped 0 possible-max-reqs 210.5rps
Request Timing: avg queue time 239.75µs, avg filter time 4.69ms, avg run time 22.1ms avg trace time 1.72ms

We have on our production environment currently:

            min max
   default   60  60
   bots       3   3
   monitor    2   4
   fast       6   6
   slow      15  15

What I can see from your figures is that the filter time is very high, already for 150 VUs:

Request Timing: avg queue time 161.33µs, avg filter time 78.53ms, avg run time 212.77ms avg trace time 972.63µs

Typically, the filters are responsible for authorization and setup ... but seeing 78 ms there, and 1.7 s for 300 VUs, makes me wonder what's going on. Maybe you are running out of DB connections (as configured in the NaviServer configuration file), or you are running out of SQL connections on the DB side.

Check out the section DB-Pools in nsstats. It should look like the snippet below.

pool1:  statements 99M gethandles 2.8M handles 170 connected 64 used 135 waittime 3.872729 sqltime 82712.668212 avgwaittime 1.4µs avgsqltime 835.8µs
pool2:  statements 0 gethandles 2 handles 2 connected 0 used 1 waittime 0.002325 sqltime 0.000000 avgwaittime 1.2ms
pool3:  statements 0 gethandles 2 handles 2 connected 0 used 1 waittime 0.002555 sqltime 0.000000 avgwaittime 1.3ms

If avgwaittime is high, you have found the problem. Newer OpenACS versions don't use pool2 and pool3 (these were for nested SQL calls, which were troublesome for rollbacks).
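
If the number of DB handles turns out to be the limit, it is raised per pool in the NaviServer configuration file, e.g. (the value is just an example):

ns_section ns/db/pool/pool1
ns_param        connections        64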

Hope this helps!

-g

Thanks Gustaf,

You were right, the avgwaittime on pool1 was high:

pool1:	statements 1.1M gethandles 8.6K handles 15 connected 0 used 15 waittime 14693.460994 sqltime 1724.604281 avgwaittime 1.7s avgsqltime 1.5ms

The pool1 connections were still at 15. I increased them to 70 and brought our maxthreads down to 60.

Here is my test with this configuration

# 48 cores   200 Gig RAM
#   maxconnections 2000
#   maxthreads 60
#   minthreads 60
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true
#    image pool 6/6
# ns_section ns/db/pool/pool1
	ns_param        connections        70

150 VUs
Connection Threads: min 60 max 60 current 60 idle 33 stopping 0 waiting 0
Request Handling:   requests 2.4K, queued 0 (0.00%), spooled 2K (64.39%)
Request Timing: avg queue time 167.94µs, avg filter time 4.87ms, avg run time 231.84ms avg trace time 996.02µs

pool1: statements 288.5K gethandles 2.6K handles 70 connected 5 used 34 waittime 0.477450 sqltime 447.052551 avgwaittime 186.3µs avgsqltime 1.5ms

300 VUs
Connection Threads: min 60 max 60 current 60 idle 18 stopping 0 waiting 0
Request Handling:   requests 7.4K, queued 2K (21.37%), spooled 5K (64.86%)
Request Timing: avg queue time 87.8ms, avg filter time 8.85ms, avg run time 333.63ms avg trace time 3.23ms

pool1: statements 923.4K gethandles 7.6K handles 70 connected 17 used 61 waittime 0.998778 sqltime 1997.236181 avgwaittime 132.1µs avgsqltime 2.2ms

150 VUs
Connection Threads: min 60 max 60 current 60 idle 7 stopping 0 waiting 0
Request Handling:   requests 11.3K, queued 3K (29.84%), spooled 7K (66.01%)
Request Timing: avg queue time 324.46ms, avg filter time 9.84ms, avg run time 416.1ms avg trace time 3.87ms

pool1: statements 1.5M gethandles 11.5K handles 70 connected 7 used 62 waittime 1.052769 sqltime 3914.273814 avgwaittime 91.8µs avgsqltime 2.5ms

70 VUs
Connection Threads: min 60 max 60 current 60 idle 57 stopping 0 waiting 0
Request Handling:   requests 13K, queued 3K (25.79%), spooled 9K (66.52%)
Request Timing: avg queue time 280.5ms, avg filter time 9.11ms, avg run time 414.92ms avg trace time 3.51ms

pool1: statements 1.8M gethandles 13.2K handles 70 connected 57 used 62 waittime 1.054629 sqltime 4429.716496 avgwaittime 79.6µs avgsqltime 2.4ms

K6 Summary
running (4m09.2s), 000/300 VUs, 2225 complete and 0 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s

     ✗ status is 200
      ↳  99% — ✓ 8889 / ✗ 1
     ✗ page succeeded
      ↳  99% — ✓ 8886 / ✗ 4

     checks.........................: 99.97% ✓ 17775     ✗ 5
     data_received..................: 2.1 GB 8.3 MB/s
     data_sent......................: 11 MB  44 kB/s
     http_req_blocked...............: avg=2.35ms   min=1.27µs  med=4.83µs   max=1.85s    p(90)=4.9ms    p(95)=7.77ms
     http_req_connecting............: avg=762.21µs min=0s      med=0s       max=1.84s    p(90)=340.97µs p(95)=392.74µs
   ✓ http_req_duration..............: avg=699.68ms min=4.73ms  med=43.58ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
       { expected_response:true }...: avg=699.74ms min=4.73ms  med=43.59ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
     http_req_failed................: 0.00%  ✓ 1         ✗ 13342
     http_req_receiving.............: avg=927.2µs  min=42.66µs med=167.63µs max=1.63s    p(90)=1.68ms   p(95)=2.29ms
     http_req_sending...............: avg=32.23µs  min=7.95µs  med=28.46µs  max=472.34µs p(90)=58.33µs  p(95)=66.47µs
     http_req_tls_handshaking.......: avg=1.56ms   min=0s      med=0s       max=37.52ms  p(90)=4.48ms   p(95)=7.27ms
     http_req_waiting...............: avg=698.72ms min=4.45ms  med=43.41ms  max=5.8s     p(90)=2.63s    p(95)=3.23s
     http_reqs......................: 13343  53.54756/s
     iteration_duration.............: avg=14.18s   min=1.74µs  med=12.87s   max=22.24s   p(90)=18.83s   p(95)=19.99s
     iterations.....................: 2225   8.929275/s
     vus............................: 1      min=1       max=300
     vus_max........................: 300    min=300     max=300

These numbers look a lot better! If you see anything that seems amiss let me know. We appreciate all of your help and expertise.

Sincerely,
Marty

You mentioned previously:
It would be nice to be under 8 seconds in the 95th percentile.
...
NGINX setup .... 95th percentile were around 7 seconds.
Now the http_req_duration has p(95)=3.23s.

I see no queued requests for 150 VUs, but queuing starts with 300 VUs. With 300 VUs, the filter time has nearly doubled and the run time is 50% higher, so there is a resource bottleneck somewhere.

The avgwaittime seems stable now (enough connections), but it is interestingly much higher than in the numbers posted from our system ... and the avgsqltime goes up quickly from 150 to 300 VUs.

I would also recommend setting log_min_duration_statement to e.g. 1 sec and checking the SQL log file for slow SQL calls to identify potential DB bottlenecks.
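
For PostgreSQL this is a single line in postgresql.conf (the value is in milliseconds), followed by a configuration reload (e.g. SELECT pg_reload_conf();):

log_min_duration_statement = 1000    # log every statement running longer than 1 second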

Do you experience lock contention during the tests? Check in nsstats [1].

all the best
-gn

[1] admin/nsstats?@page=locks&reversesort=1&col=8

Thanks Gustaf,

I did set the log_min_duration_statement to 1 second and found some queries to optimize. I then ran our K6 load test and we are down to 1.01 seconds in the 95th percentile! :)

I very much appreciate your master class on tuning naviserver.

Now that we have optimized one instance, I re-ran the nginx-with-3-NaviServers test and found that we are getting the same times with nginx serving 3 NaviServers, using the following config on each of the 3 NaviServers. However, I had not put the clustering in place yet for that test.

# 48 CPU's   200 Gig RAM
# Nginx with 3 Naviservers
#   maxconnections 1000
#   maxthreads 20
#   minthreads 20 
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true 
#    image pool 6/6 
# ns_section ns/db/pool/pool1
	ns_param        connections        23  
#  DB on Same VM   

There were three main reasons for us looking into nginx at the beginning of this.

  1. To get over the 'All available connections are used up' errors that we would see from time to time. This is now solved! thank you :)
  2. If one naviserver fails (core dumps with signal 11) and restarts, nginx just marks that server as down and skips it in the round robin and we see no downtime
  3.	Nginx can be set up to allow non-schema-change upgrades to happen in the middle of the working day without us restarting the server, and therefore with no downtime for our users. See: https://www.nginx.com/faq/how-does-zero-downtime-configuration-testingreload-in-nginx-plus-work/. Our developers and users would like this because we could make a bug fix and push it out during the day without restarting NaviServer, which takes about 1 minute.

Let me explain number 2 a bit more, in case you have seen this error before. (If you want me to open a new discussion thread, I can.) We have been getting a signal 11 core dump from time to time for some years now. It was not too big of a deal because NaviServer would restart by itself and we would only be down for 30 seconds or so. However, over the past year we have noticed it happens more frequently (probably 3 or 4 times a week), and we have not been able to track it down. In the log file the core dump looks like the following:

...
[16/Sep/2021:08:59:39][1.7f36daffd700][-socks-] Fatal: received [16/Sep/2021:08:59:39][1.7f36e3fff700][-driver:nsssl_v4:0-] Notice: ... sockAccept accepted 2 connections
[16/Sep/2021:08:59:39][1.7f36e3fff700][-driver:nsssl_v4:0-] Notice: ... sockAccept accepted 2 connections
[16/Sep/2021:08:59:39][1.7f36daffd700][-socks-] Fatal: received fatal signal 11
ORA-24550: signal received: [si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x1]
kpedbg_dmp_stack()+396 -kpeDbgCrash()+204 -kpeDbgSignalHandler()+113 -skgesig_sigactionHandler()+258 -__sighandler() -gsignal()+203 -abort()+299[16/Sep/2021:08:59:48][1.7efe2e4f6800][-main:conf-] Notice: nsmain: NaviServer/4.99.21 (tar-4.99.21) starting
...

As you can see, it is an Oracle error that appears to come from the Oracle driver. Our main DB is PostgreSQL (99%), but we also connect to an Oracle DB for certain queries.

Again I can open another discussion for this item if you like. Just wanted you to know the three reasons we were looking into nginx.

Thanks again, for your willingness to share your expertise and for helping us scale/tune our server!

Sincerely, Marty
and we are down to 1.01 seconds in the 95th percentile!
Sounds good. Hopefully this holds not only for this synthetic test, but also for the real application.

The messages "all connection are used" is a condition, which can be used as well with all other servers, but naviserver tells the user what is going on.

NaviServer crashes are rare (on our production site, very few times a year; I see the last one from Jan 18, most times in explainable cases). In case you see many of these (3 cores per week qualifies as many), you should compile NaviServer and Tcl with -g and keep the cores for determining the cause.
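
To actually keep the cores, the environment has to allow them; a minimal sketch for a Linux host (the core file path is just an example):

# allow core files for the shell/process that starts nsd
ulimit -c unlimited

# write cores to a known location
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p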

For the snippet you are showing, it would be easy for me to say "it's a problem of the Oracle driver" ... but probably it is not. Oracle just tells the system that there was an abort() happening. It would certainly be interesting where this abort() is coming from (via the core).

bug fix and push it out during the day without restarting naviserver which can take about 1 minute to restart

When using OpenACS, you can use the reload feature of the package manager to reload packages without a restart. Why are you not using this?

I am not saying that horizontal scaling is a bad idea, but - depending on your application - it might be a longer way, especially for cache coherency. One has in NaviServer per-request caches, per-thread caches, and caches via ns_cache* and nsv, where ns_cache supports transaction semantics (if a complex transaction with API calls is performed, where the API calls cache some content, but the transaction fails, the cached content needs to follow the first three letters of the ACID properties as well). Getting this correct for all packages is a longer way to go. The easiest approach would be to deactivate all caching in OpenACS, but this has some performance impact as well.

Thanks Gustaf,

I will look into compiling naviserver with the -g option to see if we can generate a core file that can generate a backtrace with gdb.

We had started a standard some time ago to always restart the server after an upgrade because of some oddities some developers were seeing. We believe the issue would arise when a developer instantiated a proc in the shell (for debugging purposes). After talking to Tony about it, we think that you are right and we should just use the "Reload" option for daytime releases.

Your point is well taken about caching. It would be interesting to disable caching for testing purposes. Is there a parameter that does this, or a compile option? Or would we need to disable it for each type of cache by modifying the code?

Thanks for your help, Marty
Just a question about compiling with debug info:

I can run the install-ns.sh script to build NaviServer, but I do not see that it has a -g option. I can see that "configure" is run inside the script with --enable-symbols though. Does this mean it has the debug symbols by default?


if [ $with_mongo = "1" ] ; then
    echo "------------------------ WITH MONGO"

    ./configure --enable-threads --enable-symbols \
                --prefix=${ns_install_dir} --exec-prefix=${ns_install_dir} --with-tcl=${ns_install_dir}/lib \
                --with-nsf=../../ \
                --with-mongoc=/usr/local/include/libmongoc-1.0/,/usr/local/lib/ \
                --with-bson=/usr/local/include/libbson-1.0,/usr/local/lib/
else
    ./configure --enable-threads --enable-symbols \
                --prefix=${ns_install_dir} --exec-prefix=${ns_install_dir} --with-tcl=${ns_install_dir}/lib
fi

${make}
${make} install

Marty
When I look at the gcc compile lines it does look like there is a -g in there, so I assume it does put them in by default.
gcc -DPACKAGE_NAME=\"tdom\" -DPACKAGE_TARNAME=\"tdom\" -DPACKAGE_VERSION=\"0.9.1\" -DPACKAGE_STRING=\"tdom\ 0.9.1\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DHAVE_GETRANDOM=1 -DXML_DEV_URANDOM=1 -DXML_POOR_ENTROPY=1 -DBUILD_tdom=/\*\*/ -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_MEMMOVE=1 -DHAVE_BCOPY=1 -DXML_POOR_ENTROPY=1 -DUSE_THREAD_ALLOC=1 -D_REENTRANT=1 -D_THREAD_SAFE=1 -DTCL_THREADS=1 -DUSE_TCL_STUBS=1 -DUSE_TCLOO_STUBS=1 -DMODULE_SCOPE=extern\ __attribute__\(\(__visibility__\(\"hidden\"\)\)\) -DHAVE_HIDDEN=1 -DHAVE_CAST_TO_UNION=1 -D_LARGEFILE64_SOURCE=1 -DTCL_WIDE_INT_IS_LONG=1 -DUSE_TCL_STUBS=1 -DXML_DTD=1 -DXML_NS=1 -DTDOM_NO_UNKNOWN_CMD=1 -DUSE_NORMAL_ALLOCATOR=1  -I../expat -I../generic -I"/usr/local/ns/include" -I.    -g -O2 -pipe -O2 -fomit-frame-pointer -DNDEBUG -Wall -fPIC  -c `echo ../generic/domxpath.c` -o domxpath.o

The line you are showing is from compiling tdom. The install-ns script does not try to enforce the same compiler flags for all packages; one has to take care of this separately (already the Tcl packages have different flag definitions).

Assuming you have a recent install-ns configured for naviserver 4.99.22 and tcl8.6.11, the following commands make sure that Tcl + NaviServer are compiled with "-g":

cd  /usr/local/src/naviserver-4.99.22
make clean all install "CFLAGS_OPTIMIZE=-O2 -g"

cd /usr/local/src/tcl8.6.11/unix
make clean all "CFLAGS_OPTIMIZE=-O2 -g"
(cd  /usr/local/src/tcl8.6.11/unix/pkgs/thread2.8.6; make clean all "CFLAGS_DEFAULT=-O2 -g -DNS_AOLSERVER=1 -I/usr/local/ns/include")
make install

Is there a parameter that does this or a compile option?

No ... and given the different caching mechanisms, it would not be feasible (how should a call to nsv_* know by itself whether it is used for caching or not). For caching via ns_cache_* it would be relatively simple. Depending on the application, deactivating caching is still the lower-hanging fruit...

-g

I've added for you on Bitbucket an experimental global parameter "nocache" for NaviServer, which deactivates on the C level all caching via ns_cache_* (but maintains some statistics). The default is false; activate it in the section ns/parameters if desired.

By this you can estimate the performance impact when caching via ns_cache is deactivated for your site, which is relevant when considering horizontal scaling. Of course, the other forms of caching are not addressed by this.

Marty: In case you get reasonable performance with the experimental "nocache" configuration file parameter, I could help you with more nocache support in OpenACS.

Small update: the parameter for controlling the caching mode is now called "cachingmode", since this leaves more room for potential other future caching modes (e.g. some improved cluster mode or special caching nodes)

ns_section ns/parameters {
   ...
   ns_param        cachingmode none   ;# experimental caching mode
   ...
}

The acs-cache-procs are updated such that the per-thread-cache works like a per-request-cache when the "cachingmode" is configured to "none".

Thanks Gustaf,

Sorry for the late reply; I had been pulled away to migrate some servers. I did compile the newest version back on the 22nd and set the nocache parameter. We did get acceptable speeds without caching in this particular test. I had not set up the cluster parameters yet, though. Below are the results of the tests with and without nocache.

Note: These tests were run with the optimized queries that we fixed in our effort reporting package. When I ran the tests with the non-optimized slower queries, the difference between the cached and non-cached tests was almost 1 second in the 95th percentile.

with nocache = false

# Test #23   48 CPU's   200 Gig RAM
# Nginx with 3 Naviservers
#   maxconnections 1000
#   maxthreads 20  
#   minthreads 20  
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true 
#    image pool 6/6    
# ns_section ns/db/pool/pool1
    ns_param        connections        23    
#  DB on Same VM
# ns param nocache = false
 
150 VUs
Connection Threads: min 20 max 20 current 20 idle 16 stopping 0 waiting 0
Request Handling:   requests 858, queued 0 (0.00%), spooled 554 (64.57%)
Request Timing: avg queue time 105.18µs, avg filter time 4.24ms, avg run time 161.33ms avg trace time 1.03ms
 
pool1:  statements 107.5K gethandles 970 handles 23 connected 7 used 13 waittime 0.165362 sqltime 99.706249 avgwaittime 170.5µs avgsqltime 927.5µs
 
300 VUs
Connection Threads: min 3 max 5 current 3 idle 2 stopping 0 waiting 0
Request Handling:   requests 3, queued 0 (0.00%), spooled 2 (66.67%)
Request Timing: avg queue time 74.33µs, avg filter time 490.51ms, avg run time 19.45ms avg trace time 7.14ms
 
pool1:  statements 370.2K gethandles 2.9K handles 23 connected 7 used 20 waittime 0.246876 sqltime 371.667285 avgwaittime 85.4µs avgsqltime 1ms
 
150 VUs
Connection Threads: min 20 max 20 current 20 idle 17 stopping 0 waiting 0
Request Handling:   requests 4.7K, queued 0 (0.00%), spooled 3K (66.18%)
Request Timing: avg queue time 130.8µs, avg filter time 4.27ms, avg run time 201.81ms avg trace time 1.15ms
 
pool1:  statements 672.1K gethandles 4.9K handles 23 connected 15 used 20 waittime 0.248934 sqltime 686.242544 avgwaittime 51.2µs avgsqltime 1ms
 
70 VUs
Connection Threads: min 3 max 5 current 3 idle 2 stopping 0 waiting 0
Request Handling:   requests 5, queued 0 (0.00%), spooled 4 (80.00%)
Request Timing: avg queue time 73.2µs, avg filter time 296.22ms, avg run time 16.18ms avg trace time 4.63ms
 
pool1:  statements 764.7K gethandles 5.4K handles 23 connected 15 used 20 waittime 0.249513 sqltime 773.483103 avgwaittime 46µs avgsqltime 1ms
 
K6 summary 
running (4m09.2s), 000/300 VUs, 2757 complete and 0 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s
 
     ✗ status is 200
      ↳  99% — ✓ 11022 / ✗ 2
     ✗ page succeeded
      ↳  99% — ✓ 11022 / ✗ 2
 
     █ setup
 
     █ teardown
 
     checks.........................: 99.98% ✓ 22044     ✗ 4
     data_received..................: 2.5 GB 10 MB/s
     data_sent......................: 9.7 MB 39 kB/s
     http_req_blocked...............: avg=80.69µs  min=0s     med=4.3µs    max=32.55ms  p(90)=6.31µs   p(95)=7.37µs
     http_req_connecting............: avg=11.55µs  min=0s     med=0s       max=28.93ms  p(90)=0s       p(95)=0s
   ✓ http_req_duration..............: avg=207.03ms min=0s     med=34.16ms  max=1.45s    p(90)=946.89ms p(95)=1s
       { expected_response:true }...: avg=207.05ms min=5.21ms med=34.17ms  max=1.45s    p(90)=946.89ms p(95)=1s
     http_req_failed................: 0.01%  ✓ 2         ✗ 16535
     http_req_receiving.............: avg=519.41µs min=0s     med=115.54µs max=63.53ms  p(90)=1.37ms   p(95)=1.56ms
     http_req_sending...............: avg=26.43µs  min=0s     med=23.84µs  max=855.82µs p(90)=43.09µs  p(95)=50.97µs
     http_req_tls_handshaking.......: avg=63µs     min=0s     med=0s       max=29.88ms  p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=206.48ms min=0s     med=33.99ms  max=1.45s    p(90)=945.49ms p(95)=1s
     http_reqs......................: 16537  66.369066/s
     iteration_duration.............: avg=11.23s   min=1.61µs med=11.24s   max=13.28s   p(90)=11.35s   p(95)=11.39s
     iterations.....................: 2757   11.064855/s
     vus............................: 1      min=1       max=300
     vus_max........................: 300    min=300     max=300
 

nocache = true

# Test #22   48 CPU's   200 Gig RAM
# Nginx with 3 Naviservers
#   maxconnections 1000
#   maxthreads 20  
#   minthreads 20 
#   connsperthread 10000
#   highwatermark 100
#   compressenable  off
#   rejectoverrun true 
#    image pool 6/6    
# ns_section ns/db/pool/pool1
    ns_param        connections        23     
#  DB on Same VM
# ns param nocache = true
 
150 VUs
Connection Threads: min 20 max 20 current 20 idle 15 stopping 0 waiting 0
Request Handling:   requests 836, queued 0 (0.00%), spooled 539 (64.47%)
Request Timing: avg queue time 100.37µs, avg filter time 6.93ms, avg run time 157.66ms avg trace time 1.25ms
 
pool1:  statements 117.6K gethandles 938 handles 23 connected 6 used 13 waittime 0.170185 sqltime 96.248608 avgwaittime 181.4µs avgsqltime 818.6µs
 
300 VUs
Connection Threads: min 20 max 20 current 20 idle 10 stopping 0 waiting 0
Request Handling:   requests 2.8K, queued 9 (0.33%), spooled 2K (64.85%)
Request Timing: avg queue time 166.06µs, avg filter time 7.66ms, avg run time 187.04ms avg trace time 1.45ms
 
pool1:  statements 407.8K gethandles 2.9K handles 23 connected 8 used 21 waittime 0.267667 sqltime 375.288688 avgwaittime 93.4µs avgsqltime 920.3µs
 
150 VUs
Connection Threads: min 20 max 20 current 20 idle 12 stopping 0 waiting 0
Request Handling:   requests 4.7K, queued 10 (0.21%), spooled 3K (65.63%)
Request Timing: avg queue time 167.3µs, avg filter time 7.76ms, avg run time 202.97ms avg trace time 1.51ms
 
pool1:  statements 739K gethandles 4.8K handles 23 connected 12 used 21 waittime 0.269563 sqltime 694.840813 avgwaittime 56.1µs avgsqltime 940.3µs
 
70 VUs
Connection Threads: min 20 max 20 current 20 idle 18 stopping 0 waiting 0
Request Handling:   requests 5.3K, queued 10 (0.19%), spooled 3K (65.96%)
Request Timing: avg queue time 160.41µs, avg filter time 7.7ms, avg run time 205.11ms avg trace time 1.5ms
 
pool1:  statements 846.7K gethandles 5.4K handles 23 connected 17 used 21 waittime 0.270101 sqltime 787.473052 avgwaittime 50.2µs avgsqltime 930.1µs
 
 
K6 Summary
running (4m10.2s), 000/300 VUs, 2736 complete and 0 interrupted iterations
default ✓ [======================================] 000/300 VUs  4m0s
 
     ✓ status is 200
     ✓ page succeeded
 
     █ setup
 
     █ teardown
 
     checks.........................: 100.00% ✓ 21888     ✗ 0
     data_received..................: 2.5 GB  10 MB/s
     data_sent......................: 9.6 MB  38 kB/s
     http_req_blocked...............: avg=81.66µs  min=1.6µs   med=4.39µs   max=39.67ms  p(90)=6.49µs   p(95)=7.68µs
     http_req_connecting............: avg=12.47µs  min=0s      med=0s       max=35.3ms   p(90)=0s       p(95)=0s
   ✓ http_req_duration..............: avg=218.86ms min=7.34ms  med=37.47ms  max=1.55s    p(90)=978.96ms p(95)=1.04s
       { expected_response:true }...: avg=218.86ms min=7.34ms  med=37.47ms  max=1.55s    p(90)=978.96ms p(95)=1.04s
     http_req_failed................: 0.00%   ✓ 0         ✗ 16418
     http_req_receiving.............: avg=526.47µs min=33.71µs med=121.17µs max=22.53ms  p(90)=1.39ms   p(95)=1.58ms
     http_req_sending...............: avg=27.35µs  min=9.2µs   med=24.76µs  max=188.41µs p(90)=45.25µs  p(95)=53.62µs
     http_req_tls_handshaking.......: avg=62.98µs  min=0s      med=0s       max=30.12ms  p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=218.3ms  min=7.26ms  med=37.27ms  max=1.55s    p(90)=977.52ms p(95)=1.04s
     http_reqs......................: 16418   65.62951/s
     iteration_duration.............: avg=11.3s    min=1.7µs   med=11.31s   max=13.26s   p(90)=11.45s   p(95)=11.49s
     iterations.....................: 2736    10.936919/s
     vus............................: 1       min=1       max=300
     vus_max........................: 300     min=300     max=300

Also, thanks for adding the cachingmode. It would be good to run some tests on that as well. We would appreciate any help and insights you can give on how to set up a special caching mode for clustering. One thing we want to test is adding ::acs::clusterwide to the nsv_unset in ad_parameter_cache. Are there other places you would recommend adding ::acs::clusterwide, for our testing of the cluster? Or any other directions you might suggest?

Thanks for all your help, -Marty
Hi Gustaf,

We got a signal 11 core today on our live system. I was able to connect up to it with gdb and get a backtrace. Here is the backtrace.

Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/ns/bin/nsd...
...
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type RET for more, q to quit, c to continue without paging--
Core was generated by `/usr/local/ns/bin/nsd -u root -g web -i -t /web/etc/config.tcl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI_abort () at abort.c:107
107     abort.c: No such file or directory.
[Current thread is 1 (Thread 0x7fcc74ff9700 (LWP 37))]
(gdb) bt
#0  __GI_abort () at abort.c:107
#1  0x00007fccba4b3064 in skgdbgcra () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#2  0x00007fccba480693 in skgesigCrash () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#3  0x00007fccba4809de in skgesig_sigactionHandler () from /opt/oracle/instantclient_21_1/libclntsh.so.21.1
#4  signal handler called
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007fccc2292859 in __GI_abort () at abort.c:79
#7  0x00007fccc24cf27f in Panic (fmt=optimized out) at log.c:943
#8  0x00007fccc215bf8d in Tcl_PanicVA () from /usr/local/ns/lib/libtcl8.6.so
#9  0x00007fccc215c0ff in Tcl_Panic () from /usr/local/ns/lib/libtcl8.6.so
#10 0x00007fccc25143ca in Abort (signal=optimized out) at unix.c:1119
#11 signal handler called
#12 0x00007fccc24b825c in NsTclConnChanProc (UNUSED_sock=optimized out, arg=0x7fcbf581d460, why=1) at connchan.c:694
#13 0x00007fccc24e279d in SockCallbackThread (UNUSED_arg=optimized out) at sockcallback.c:531
#14 0x00007fccc2210fd4 in NsThreadMain (arg=optimized out) at thread.c:232
#15 0x00007fccc221169f in ThreadMain (arg=optimized out) at pthread.c:870
#16 0x00007fccc1b72609 in start_thread (arg=optimized out) at pthread_create.c:477
#17 0x00007fccc238f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Any ideas on what could be causing the signal 11 issue? If you want me to open a new discussion on it, I can. Thanks! Marty
Concerning the segfault: what version of NaviServer are you using? From the line numbers, it can't be a recent NaviServer. It might come from 4.99.19 (released 2020-01-06) or earlier, but note that the released version of NaviServer is already 4.99.22 (released 2021-08-25). The newer versions have this code rewritten, so there is some likelihood that the segfault will go away with the currently released version.
Thanks Gustaf,

We will work on getting on this latest release for our live system.

Hi Gustaf,

In looking closer at the Docker image that I build with NaviServer, it appears that we are built with naviserver-4.99.22. In my /usr/local/src/ directory, the 'wget' from 'install-ns.sh build' pulled down naviserver-4.99.22.tar.gz, unpacked it, and compiled it.

Once this was done I ran the commands you gave me to clean and compile again with debug symbols

cd  /usr/local/src/naviserver-4.99.22
make clean all install "CFLAGS_OPTIMIZE=-O2 -g"
 
cd /usr/local/src/tcl8.6.11/unix
make clean all "CFLAGS_OPTIMIZE=-O2 -g"
(cd  /usr/local/src/tcl8.6.11/unix/pkgs/thread2.8.6; make clean all "CFLAGS_DEFAULT=-O2 -g -DNS_AOLSERVER=1 -I/usr/local/ns/include")
make install

Also, when we hit our live site with the Chrome browser, it reports in the HTTP headers that it is using Server: NaviServer/4.99.22.

In our Docker image (which has a starting base of Ubuntu 20.04) I also installed the gdb debugger, and that is what I used to get the backtrace, along with the nsd that was compiled and installed at /usr/local/ns/bin/nsd:

 gdb /usr/local/ns/bin/nsd /usr/local/ns/core_1/core.1

Was this the right nsd executable to run with the core file? Any idea how we could have gotten the wrong debug symbols?

What you did looks correct to me... but maybe there was an old instance still running, ... or can it be that you picked up an old core file?

My doubts come from the fact that the traceback shows that the signal is generated from connchan.c in line 694

#12 0x00007fccc24b825c in NsTclConnChanProc (UNUSED_sock=optimized out, arg=0x7fcbf581d460, why=1) at connchan.c:694
but this can't be from 4.99.22 (see [1]); it rather looks to be from 4.99.19 [2] or older. I've double-checked: the tar file delivered from SourceForge is correct.

[1] https://bitbucket.org/naviserver/naviserver/src/naviserver-4.99.22/nsd/connchan.c#lines-694
[2] https://bitbucket.org/naviserver/naviserver/src/naviserver-4.99.19/nsd/connchan.c#lines-694

I double-checked our connchan.c and it is indeed the right file for 4.99.22. Line 694 is the "return success" in the NsTclConnChanProc function on our system too.

I believe I have the right executable and core file for it

I found that I can do a 'bt full' with gdb and it gives you the variables as well. It looks like servPtr is set to a memory location that it does not own in the NsTclConnChanProc function.

It looks like gdb is not returning the exact line number of the NsTclConnChanProc function where servPtr had the problem, but it does look like all of the other information is correct.

#10 0x00007fccc25143ca in Abort (signal=optimized out) at unix.c:1119
No locals.
#11 signal handler called
No locals.
#12 0x00007fccc24b825c in NsTclConnChanProc (UNUSED_sock=optimized out, arg=0x7fcbf581d460, why=1) at connchan.c:694
        whenBuffer = "r\000\314\177\000"
        servPtr = error reading variable servPtr (Cannot access memory at address 0x646e65732d203140)
        cbPtr = 0x7fcbf581d460
        success = true
#13 0x00007fccc24e279d in SockCallbackThread (UNUSED_arg=optimized out) at sockcallback.c:531
        stop = optimized out
        pollTimeout = optimized out
        nfds = optimized out
        now = {sec = 1632772110, usec = 116003}
        diff = {sec = 0, usec = 0}
        c = 0 '\000'
        when = {32716, 2, 20}
        events = {1, 4, 2}
        n = optimized out
        i = 0
        isNew = 0
        maxPollfds = 100
        cbPtr = 0x7fcbf539f940
        nextPtr = optimized out
        hPtr = optimized out
        search = {tablePtr = 0x7fccc2565640 activeCallbacks, nextIndex = 12, nextEntryPtr = 0x0}
        pfds = 0x7fcc640012b0
#14 0x00007fccc2210fd4 in NsThreadMain (arg=optimized out) at thread.c:232
        thrPtr = 0x7fcc94f231e0
#15 0x00007fccc221169f in ThreadMain (arg=optimized out) at pthread.c:870
--Type RET for more, q to quit, c to continue without paging--1
No locals.
#16 0x00007fccc1b72609 in start_thread (arg=optimized out) at pthread_create.c:477
        ret = optimized out
        pd = optimized out
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140516112963328, 1098100496289630482, 140517183382942, 140517183382943, 140517183383088, 140516112961280, -1106243481039927022,
                -1106548523360057070}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#17 0x00007fccc238f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It looks like gdb is not returning the exact line number of the NsTclConnChanProc

Probably. Maybe the function pointer and optimization contribute to the confusion. One should turn off optimization and turn on assertions, so it becomes easier to identify the root cause.

Since the flag "-DNDEBUG" is inherited from Tcl, the easiest thing is to remove the flag manually from the Makefile, e.g. with sed, as shown below. Note that I have also adjusted the optimization flags.

All the best
-g

cd  /usr/local/src/naviserver-4.99.22
sed -i 's/-DNDEBUG//'  include/Makefile.global
make clean all install "CFLAGS_OPTIMIZE=-O0 -g" 
Hi Marty,

concerning your original question about clustering, I've done some work over the weekend. If one is using a recent version of NaviServer (at least 4.99.20), the parameter::* interface should now be cluster-proof (added acs::clusterwide for cleaning and callbacks), and it does no caching when the cachingmode is set to "none". I've also cleaned up xo::clusterwide vs. acs::clusterwide (the former is gone). When running the regression test with cachingmode "none", some errors show up, but these are probably just due to the tests checking whether caching works (not sure about the client-property errors, but I am running out of time right now).

Concerning cluster modes, I see the following directions:

Classical OpenACS clustering: The basic idea is to set up intra-server talk to achieve cache coherency. When some data is flushed in the cache of one server, it communicates this to the other servers in the cluster (via HTTP requests). This is a multi-master mode, where every server talks to the same database. The NaviServers are symmetrical; one of them is responsible for scheduled procedures. When files are stored in the content repository, the content repository directory has to be shared between the instances. OpenACS 5.10 contains experimental support for storing content repository files in GridFS (which scales horizontally, based on MongoDB). Some older packages that bypass the content-repository API still have to be adjusted to use the API.

However, the classical OpenACS clustering mode was never fully implemented, but it might still work for certain applications. One could extend this clustering mode and make it "work as designed". Several cluster-flush operations are missing, some callbacks probably have to be executed cluster-wide as well (e.g., parameter changes), etc.

Known weaknesses: there is a potential scalability issue - when, e.g., a single request requires 100 flush operations (not unusual) and 10 nodes are defined in the cluster, a single incoming request will cause 1000 additional HTTP requests for flushing. This can probably be improved via batching (bundling multiple flush requests into one HTTP request; see the sketch below). Since these are HTTP requests, they will need very short timeouts; otherwise they will block for a while when some nodes are down or slow, when some connections "hang" in the middle of a request, etc. After a network separation, a full cache flush is required (but this is not implemented).
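Just to illustrate the batching idea - purely a sketch, none of these procs exist, and the flush endpoint URL is made up:

    # hypothetical: collect flush commands during a request instead of sending
    # one HTTP request per flush operation ...
    proc cluster::queue_flush {args} {
        lappend ::cluster_flush_queue $args
    }
    # ... and send the whole queue to each peer in a single POST at the end
    proc cluster::send_batch {peer} {
        ns_http run -method POST \
            -body [join $::cluster_flush_queue \n] \
            http://$peer/SYSTEM/cluster-flush
        unset -nocomplain ::cluster_flush_queue
    }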

Classical OpenACS clustering via UDP: This can reduce the clusterwide operation latency (no TCP setup times); one could consider multicast operations in the cluster to reduce the intra-server traffic substantially - the basic architecture will be the same.

Quick test for intra-server talk to a server on the same host (my notebook, localhost, mean value from 1000 requests)

   TCP 0.273ms
   UDP 0.171ms
   UDP 0.059ms (no reply)
So, flushing operations can be made faster by a factor of 4 or more by using UDP instead of TCP.
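For the TCP case, such a number can be obtained, e.g., with a trivial micro-benchmark along these lines (the peer URL is just a placeholder):

    # 1000 intra-server HTTP round trips against a peer on the same host
    time {ns_http run http://127.0.0.1:8101/} 1000
    ;# returns something like "273 microseconds per iteration"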

Clustering with no-caching: if nothing is cached (except maybe per-request caches), there will be no problem with cache-coherency, but there will be more load on the DB. The big advantage of no-caching is that the intra-server talk to preserve cache coherency is zero. For several types of applications, this will be the simplest setup.

Clustering via caching server: One could set up a system where the caching is done via a (central) caching server (e.g., REDIS, memcached, or a special-purpose NaviServer) instead of NaviServer native caching. If there is a central cache, no intra-server talk is needed to achieve cache coherency. Disadvantages: NaviServer native caching is significantly more efficient than memcached/REDIS, and it supports cache transactions (rollback/commit of cache settings). One could get this back by using one NaviServer instance as a pure caching instance ... but this requires deeper analysis. Some figures:

   NaviServer native cache performance: 0.33us == 0.00033ms
       time {ns_cache_eval ns:memoize 1 {set x 1}} 100000 

   REDIS
      https://k6.io/blog/benchmarking-redis-with-k6/

        iteration_duration...: avg=782.57µs min=67.35µs med=732.92µs max=15.86ms p(90)=1.1ms p(95)=1.3ms
        iterations...........: 506755 50660.636169/s
        redis_latency........: avg=764.8µs  min=0s      med=1ms     max=16ms    p(90)=1ms   p(95)=1ms

      100VUs
        iteration_duration...: avg=4.39ms min=46.8µs med=3.32ms max=87.24ms p(90)=9.5ms p(95)=12.51ms

Assuming 1ms latency for an external cache, NaviServer's internal cache is about 3000x faster than going to an external cache. So, as long as the intra-server talk is not overwhelming, one should get the best performance from OpenACS clustering via UDP - unless more options pop up... but getting the majority of the OpenACS packages "cluster-safe" will take some time and debugging.

Thanks Gustaf!

I am very interested in testing out the "Clustering with no-caching" (cachingmode=none) option. From what you have mentioned above it sounds like this option exists in 4.99.22 now. Is this correct? If so, I could pull down the latest 4.99.22, run a k6 test with cachingmode=none and the clustering parameters set, to establish a baseline, and then maybe compare that to running with no cachingmode set.
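My guess on how to set this in our config.tcl is below - please correct me if the section or parameter name is different:

    # guess on our side where the caching mode is configured
    ns_section ns/server/${server}/acs
    ns_param   cachingmode   none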

The other options also would be interesting to test out.

Thanks for taking the time to implement the cachingmode=none

Sincerely, Marty
it sounds like this option currently exists in 4.99.22 now. Is this correct?

No-caching (cachingmode=none) requires the head version of NaviServer from the Bitbucket repository. To get this, activate "version_ns=HEAD" in install-ns.sh.
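E.g., depending on how the variable is set in your copy of the script, uncomment or edit the assignment and rebuild as before:

    # in install-ns.sh: build from the repository head instead of the release tarball
    version_ns=HEAD
    # then rebuild, e.g.
    bash install-ns.sh build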

The other options also would be interesting to test out.
I'll keep you updated when this progresses. For us the winter semester starts right now, so this is a very busy time.

Thanks for taking the time to implement the cachingmode=none
Note that you also need the newest OpenACS code from the oacs-5-10 branch. Let me know how this works out for you. I would not be surprised if there is some more caching that should be bypassed, but such cases should be relatively easy to fix.
To ease estimating how much the deactivation of ns_cache might cost for a particular OpenACS instance, I've added a summary line to nsstats (on the "Cache" statistics page, committed on Bitbucket). E.g., on OpenACS the updated statistics currently say:

ns_cache operations saved since the start of the server 18m 39s on 286.69K requests (3.9ms per request on average)

3.9 ms is the average value over all requests, including static pages, for which the impact is most likely lower than for content pages with caching of DB requests. Some particular caches, which are not used on every request, might yield much larger savings on a hit, so there will be some spread.

Thanks for adding the nsstats feature for average time saved with caching. I will have to check it out

I found that the proc ad_canonical_server_p in server-cluster-procs.tcl did not work for us, as we are using https (port 443). Included below is a diff of how I made it work. Feel free to modify as needed. Note: we use nsssl_v4 in our config.tcl.

Thanks, Marty
diff --git a/packages/acs-tcl/tcl/server-cluster-procs.tcl b/packages/acs-tcl/tcl/server-cluster-procs.tcl
index 5cafb058d..882bd562a 100644
--- a/packages/acs-tcl/tcl/server-cluster-procs.tcl
+++ b/packages/acs-tcl/tcl/server-cluster-procs.tcl
@@ -120,13 +120,26 @@ ad_proc -private ad_canonical_server_p {} {
         set canonical_ip $canonical_server
     }

-    set driver_section [ns_driversection -driver nssock]
+    if { $canonical_port eq "443" } {
+        set driver_section [ns_driversection -driver nsssl_v6]
+    } else {
+        set driver_section [ns_driversection -driver nssock_v6]
+    }
+
+    if { $driver_section eq "" && $canonical_port eq "443" } {
+        set driver_section [ns_driversection -driver nsssl_v4]
+    } elseif { $driver_section eq "" } {
+        set driver_section [ns_driversection -driver nssock_v4]
+    }
+
     if { [ns_config $driver_section Address] == $canonical_ip
-         && [ns_config $driver_section Port 80] == $canonical_port
+         && [ns_config $driver_section Port] == $canonical_port
      } {
+        ns_log notice "Is CanonicalServer"
         return 1
     }

+    ns_log notice "Not CanonicalServer"
     return 0
 }
Hi Marty,

Concerning the driver names (nssock_v4, nssock_v6, nsssl_v4, nsssl_v6): the names *_v4 and *_v6 are deprecated and not used in the recommended configuration file: the plain communication driver is just called "nssock", the secure one "nsssl" [1]. The necessity for that somewhat hacky suffix was removed with the introduction of NaviServer 4.99.19 (1.5 years ago), which supports multiple IP addresses registered to a single driver [2]. Unless you have to support older versions of NaviServer, I would recommend switching to the suffix-less names.
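An illustrative excerpt with the plain driver names might look as follows (addresses, ports, and paths are placeholders; see [1] for the full recommended file):

    ns_section ns/server/${server}/modules
    ns_param   nssock   ${bindir}/nssock.so
    ns_param   nsssl    ${bindir}/nsssl.so

    ns_section ns/server/${server}/module/nsssl
    ns_param   address       0.0.0.0
    ns_param   port          443
    ns_param   certificate   /usr/local/ns/etc/server.pem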

When I have some more time, I should set up a cluster for testing; I see many things in the old server-cluster-procs.tcl which should be streamlined and modernized. The code with the driver-name suffixes should not go into the common repository.

In case you want to do the intra-server talk via SSL, this will slow down the communication relative to the figures I've posted above, so I would not recommend it. But certainly, in case a company policy requires that all communication (including intranet) has to be secure, the code should support it. Otherwise, restricting access to the HTTP port via a firewall to a few selected hosts should be OK.

-g

[1] https://bitbucket.org/naviserver/naviserver/src/naviserver-4.99.22/openacs-config.tcl
[2] https://bitbucket.org/naviserver/naviserver/src/naviserver-4.99.22/NEWS#lines-489

Thanks Gustaf,

I was not aware of those changes to config.tcl. Thanks for pointing out the configuration changes and providing an example of an up to date config.tcl configuration.

Our company does require https for our web server communications so we will keep that in mind with our testing.

We appreciate whatever help you can give, knowing that this is your busy time.

Thanks,
Marty

Hi Gustaf,
Note that you need as well the newest code from OpenACS from the oacs-5-10 branch. Let me know, how this works out for you.

Do I need to download it directly from CVS somehow? I am unsure how to download the latest oacs-5-10 from CVS.

I went to https://openacs.org/projects/openacs/download/ and downloaded the OpenACS 5.10 and 5.10 beta tar files, but it does not look like the changes you were talking about are there.

parameter::* interface should now be cluster-proof (added acs::clusterwide for cleaning and callbacks)
I was expecting to see an acs::clusterwide inside the ad_parameter_cache proc. Where exactly did you make your changes? Marty.
When you install via install-oacs.sh, set oacs_core_tag and oacs_packages_tag in this script to oacs-5-10 (this is the newest version of the oacs-5-10 branch) and set oacs_tar_release_url to empty, e.g. as sketched below.
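I.e., the relevant lines in install-oacs.sh would then read:

    # check out the newest oacs-5-10 branch instead of a release tarball
    oacs_core_tag=oacs-5-10
    oacs_packages_tag=oacs-5-10
    oacs_tar_release_url=
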
These are the changes to check to be sure that you have the right version... -gn

[1] https://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-10%3Agustafn%3A20211004093700

Dear Marty,

Just now, I have committed an overhauled and streamlined version of the caching infrastructure to oacs-5-10 in CVS. Everything now uses the new infrastructure, and the code should now be more in line with modern NaviServer capabilities (IPv6, multiple IP addresses and multiple ports per server, global vs. local network drivers, etc.).

Since the cluster servers are provided as IP addresses, there is no configuration about the protocol (HTTP vs. HTTPS); so for the time being, all servers have to listen (in addition to the public port) on a plain HTTP port for cluster operations. This restriction can probably be lifted in the future, but first the basics should work correctly.

In my setup, I have essentially set CanonicalServer to "127.0.0.1:8100" and ClusterPeerIP to "127.0.0.1:8100 127.0.0.1:8101 127.0.0.1:8102", and this works now. When you test the cluster configuration, please update to the newest code in oacs-5-10 to avoid the problems I faced.
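For reference, these kernel parameters can be set, e.g., from ds/shell via the standard parameter API (looking up the kernel package_id as shown here is just one way to do it); they can of course also be changed via the parameter UI:

    set kernel_id [apm_package_id_from_key acs-kernel]
    parameter::set_value -package_id $kernel_id -parameter CanonicalServer \
        -value "127.0.0.1:8100"
    parameter::set_value -package_id $kernel_id -parameter ClusterPeerIP \
        -value "127.0.0.1:8100 127.0.0.1:8101 127.0.0.1:8102"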

all the best
-g

Thanks Gustaf,

I will pull the latest and try it. I did test the previous version with cachingmode=none and the latest OpenACS 5.10 changes as of ~Oct 10th. It worked well in my testing: 1.59s at the 95th percentile, which I thought was great.

I also verified that there was no caching of parameters.

I will let you know what I find with the latest pull.

Thanks! Marty