Forum OpenACS Q&A: Re: OpenACS clustering setup and how it relates to xotcl-core.

Thanks Gustaf,

I set log_min_duration_statement to 1 second and found some queries to optimize. I then ran our k6 load test again and we are down to 1.01 seconds at the 95th percentile! :)
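
(For anyone following along: the parameter is given in milliseconds when no unit is written, so in postgresql.conf our setting looks roughly like the line below; the value is the one we used, the comment is just for illustration.)

    # postgresql.conf: log any statement that runs longer than 1 second (value in ms)
    log_min_duration_statement = 1000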

I very much appreciate your master class on tuning NaviServer.

Now that we have optimized one instance, I ran the nginx-with-3-NaviServers test and found that we get the same times with nginx in front of three NaviServers, each using the config below. However, I had not yet put the clustering in place for that test.

# 48 CPUs, 200 GB RAM
# Nginx with 3 NaviServers
#   maxconnections  1000
#   maxthreads      20
#   minthreads      20
#   connsperthread  10000
#   highwatermark   100
#   compressenable  off
#   rejectoverrun   true
#   image pool 6/6
# ns_section ns/db/pool/pool1
	ns_param        connections        23
#   DB on same VM
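
For context, here is a minimal sketch of how those per-instance settings would look in a NaviServer config file. The server name ("openacs") and the separate "image" connection pool section are placeholders for illustration; only the parameter values in the summary above come from our actual setup.

    ns_section ns/server/openacs
        ns_param maxconnections   1000   ;# max connections waiting in the queue
        ns_param maxthreads       20
        ns_param minthreads       20
        ns_param connsperthread   10000  ;# recycle a conn thread after this many requests
        ns_param highwatermark    100
        ns_param compressenable   off
        ns_param rejectoverrun    true   ;# reject requests when the queue overruns

    # separate connection pool for images, 6 threads min/max
    ns_section ns/server/openacs/pools
        ns_param image "Image file pool"
    ns_section ns/server/openacs/pool/image
        ns_param minthreads 6
        ns_param maxthreads 6

    # database pool (DB on the same VM)
    ns_section ns/db/pool/pool1
        ns_param connections 23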

There were three main reasons we started looking into nginx in the first place.

  1. To get over the 'All available connections are used up' errors that we would see from time to time. This is now solved! Thank you :)
  2. If one NaviServer fails (core dumps with signal 11) and restarts, nginx just marks that server as down and skips it in the round robin, so we see no downtime (see the sketch after this list).
  3. Nginx can be set up to allow non-schema-change upgrades to happen in the middle of the working day without us restarting the server, so our users see no downtime. See: https://www.nginx.com/faq/how-does-zero-downtime-configuration-testingreload-in-nginx-plus-work/. Our developers and users would like this because we could make a bug fix and push it out during the day without restarting NaviServer, which takes about 1 minute.
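
For point 2, this is roughly the upstream block we have in mind; the hostnames, ports, and max_fails/fail_timeout values below are placeholders rather than our production settings:

    upstream naviservers {
        # round-robin is the default; a backend that fails max_fails times
        # within fail_timeout is marked unavailable and skipped until it recovers
        server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
        server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
        server 127.0.0.1:8003 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 443 ssl;
        # ssl_certificate / ssl_certificate_key ...
        location / {
            proxy_pass http://naviservers;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }

Point 3 would then roughly be a matter of taking one backend out (marking it "down" and doing an nginx -s reload), upgrading it, and putting it back, one server at a time.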

Let me explain number 2 a bit more, in case you have seen this error before. (I can open a new discussion thread for it if you prefer.) We have been getting a signal 11 core dump from time to time for some years now. It was not too big a deal because NaviServer would restart by itself and we would only be down for 30 seconds or so back then. However, over the past year we have noticed it happening more frequently (probably 3 or 4 times a week), but we have not been able to track it down. In the log file the core dump looks like the following:

...
[16/Sep/2021:08:59:39][1.7f36daffd700][-socks-] Fatal: received [16/Sep/2021:08:59:39][1.7f36e3fff700][-driver:nsssl_v4:0-] Notice: ... sockAccept accepted 2 connections
[16/Sep/2021:08:59:39][1.7f36e3fff700][-driver:nsssl_v4:0-] Notice: ... sockAccept accepted 2 connections
[16/Sep/2021:08:59:39][1.7f36daffd700][-socks-] Fatal: received fatal signal 11
ORA-24550: signal received: [si_signo=6] [si_errno=0] [si_code=-6] [si_int=0] [si_ptr=(nil)] [si_addr=0x1]
kpedbg_dmp_stack()+396 -kpeDbgCrash()+204 -kpeDbgSignalHandler()+113 -skgesig_sigactionHandler()+258 -__sighandler() -gsignal()+203 -abort()+299
[16/Sep/2021:08:59:48][1.7efe2e4f6800][-main:conf-] Notice: nsmain: NaviServer/4.99.21 (tar-4.99.21) starting
...

As you can see, it is an Oracle error (ORA-24550) coming from what looks like the Oracle driver. Our main DB is Postgres (99% of our queries), but we also connect to an Oracle DB for certain queries.

Again, I can open another discussion for this item if you like; I just wanted you to know the three reasons we were looking into nginx.

Thanks again for your willingness to share your expertise and for helping us scale and tune our server!

Sincerely, Marty