Forum OpenACS CMS: Configuring OpenACS for optimum performance

I am trying to find out how best to configure NaviServer and OpenACS for a system that hosts home-grown business applications for 1000 concurrent users. We have used NaviServer and OpenACS for years with very good success, but have started seeing issues as our user base has grown past 1000 users.

We are currently running OpenACS 5.9.1 and NaviServer 4.99.15 on a VM with 164 GB RAM and 8 processor cores.

The issues we are seeing include:

  • We see a lot of slowness, and in many cases requests that completely block others. We can watch two requests (not db-related) completely block all other requests. To demonstrate this we wrote a script that makes several concurrent requests for two pages with different run times: one with some db calls that takes about 3 seconds to render, and one with no db calls that does a Tcl wait of 20 seconds (a minimal sketch of such a page follows this list). We have 10 connection threads configured (see the configuration below), so it seems like all 10 requests should each get their own thread and all run at the same time. Instead, the requests are served two at a time: 2 come back after 20 seconds, 2 more 20 seconds after that, and so on. We have been working on moving to NaviServer 4.99.18 with OpenACS 5.10, where we do see slightly better but still very similar results, and we still wonder why this occurs. See below for the specific results of these test scripts.
  • nsstats sometimes spins for 8 or 10 seconds before the page is returned. Given that these requests are served by a separate connection pool and I am the only one hitting it during my testing, I am not sure why it would ever take so long to respond.
  • Under heavy load we start to see a proliferation of errors in the log: Error: socket(##): setsockopt(TCP_CORK) 0 [or 1]: Invalid argument
  • Recently we had a major slowdown when the server tried to serve many video files; these are served via a separate connection pool. The slowness was observed both in the initial startup of a video and in buffering later during playback. I am not sure of the full mechanics of a video stream, but I can see the file being served via a writer thread at the beginning; with subsequent reloads of the nsstats?_page=process page the download is no longer shown, yet I can still see requests from the browser obtaining subsequent segments of the video.
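
For reference, the 20-second page in our test is essentially just a Tcl wait; a minimal sketch of what such a page could look like (the file name and response text are made up for illustration):

# sleep-20.tcl -- test page with no db calls; it blocks its
# connection thread for 20 seconds, then returns a trivial response
ns_sleep 20
ns_return 200 text/plain "slept 20 seconds"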

Could you give us some direction on improving our configuration, on understanding how requests are handled, and on why the server seems to stall at times?

Is there a built-in way to turn on more consistent / persistent logging of the values provided in nsstats that would help diagnose these problems, and which values are the most important to watch?

Thanks. Any help is greatly appreciated.

Below you will find server run-time and configuration information that will hopefully convey more precisely what we are seeing and help with your response.

Here are the results of three runs of our request test, which show the blocking of server requests. Each line represents either a request being made (stating which page was requested) or a request returning (stating which child process returned and the elapsed time since the beginning of the run).

# This one fires all 10 forks within milliseconds – each page sleeps for 20 seconds
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (126632) Elapsed time = 20 seconds.
Child (126624) Elapsed time = 20 seconds.
Child (126625) Elapsed time = 40 seconds.
Child (126627) Elapsed time = 40 seconds.
Child (126630) Elapsed time = 60 seconds.
Child (126637) Elapsed time = 60 seconds.
Child (126622) Elapsed time = 80 seconds.
Child (126623) Elapsed time = 80 seconds.
Child (126626) Elapsed time = 100 seconds.
Child (126634) Elapsed time = 100 seconds.
 
# This one fires all 10 forks within milliseconds – first two pages are sleep 20, followed by 8 home pages that take ~3 seconds each
Sleep 20 Page
Sleep 20 Page
Home page
Home page
Home page
Home page
Home page
Home page
Home page
Home page
Child (49600) Elapsed time = 3 seconds.
Child (49607) Elapsed time = 3 seconds.
Child (49598) Elapsed time = 5 seconds.
Child (49595) Elapsed time = 8 seconds.
Child (49602) Elapsed time = 10 seconds.
Child (49596) Elapsed time = 13 seconds.
Child (49593) Elapsed time = 23 seconds.
Child (49604) Elapsed time = 25 seconds.
Child (49606) Elapsed time = 28 seconds.
Child (49594) Elapsed time = 33 seconds.
 
# In this script I sleep 5 seconds in between each fork()
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (94882) Elapsed time = 20 seconds.
Sleep 20 Page
Child (94951) Elapsed time = 20 seconds.
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (95045) Elapsed time = 30 seconds.
Sleep 20 Page
Child (95130) Elapsed time = 30 seconds.
Child (95245) Elapsed time = 40 seconds.
Child (95382) Elapsed time = 40 seconds.
Child (95436) Elapsed time = 50 seconds.
Child (95485) Elapsed time = 50 seconds.
Child (95558) Elapsed time = 60 seconds.
Child (95635) Elapsed time = 60 seconds.

Here is example output of the nsstats?_page=process page.

Key             |    Value
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Host            | serv (x.x.x.21) 
Boot Time       | Thu Jul 2 19:41:25 2020 
Uptime          | 3 days 16:31:53 (h:m:s) 
Process         | 22310 /.../nsd 
Home            | /.../ns 
Configuration   | /...serv.tcl 
Error Log       | /.../serv-error.log 
Log Statistics  | Notice 26K Warning 62K Error 3136 Fatal 0 Bug 0 Debug 0 Dev 0 Debug(ns:driver) 0 Debug(task) 0 Debug(request) 0 Debug(connchan) 0 Debug(sql) 0 Debug(nsproxy) 0 
Version         | 4.99.15 (tag tar-4.99.15) 
Build Date      | Nov 3 2017 at 14:57:26 
Servers         | serv 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
Driver Info     | thread nssock:0 module nssock received 35 spooled 0 partial 61 errors 0 
                | thread nsssl:0 module nsssl received 220K spooled 0 partial 489K errors 452 
Certificate     | /....... 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
DB-Pools    
pool1:          | statements 15M gethandles 206K handles 30 used 25 waittime 7.708260 sqltime 85906.834913 avgwaittime 37µ avgsqltime 6m 
pool2:          | statements 291K gethandles 16K handles 5 used 5 waittime 26.012233 sqltime 845.943847 avgwaittime 2m avgsqltime 3m 
pool3:          | statements 0 gethandles 2 handles 5 used 1 waittime 0.008694 sqltime 0.000000 avgwaittime 4m 
ProdPool:       | statements 0 gethandles 50 handles 5 used 3 waittime 0.416256 sqltime 0.000000 avgwaittime 8m 
PprdPool:       | statements 0 gethandles 4 handles 5 used 2 waittime 0.034431 sqltime 0.000000 avgwaittime 9m 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
Callbacks       | 
prestartup:     | ns:tclcallback  \_ns_load_global_modules 1 
signal:         | p:fffffd74172d70c0  a:0 
shutdown:       | nsproxy:shutdown    a:0 
shutdown:       | nslog:close /....log 
Socket Callbacks      
 Key                |  Value          
--------------------------------------------------------------------------------------------------------------------- 
 Address            |  x.x.x.21:80 
                    | x.x.x.21:443 
 Server Directory   |    /.../ns 
 Page Directory     | /... 
 Tcl Library        | /...serv/tcl 
 Access Log         | /.../serv.log 
 Writer Threads     | nssock: 2 nsssl: 5 
 Connection Pools   | image monitor {} 
-------------------------------------------------------------------------------------------------------------------
Pool                | 'default'   
Connection Threads: | min 2 max 10 current 5 idle 4 stopping 0 waiting 0 
Request Handling:   | requests 150K, queued 12K (8.22%), spooled 91K (60.60%) 
Request Timing:     | avg queue time 0.0194s, avg filter time 0.0028s, avg run time 0.2549s avg trace time 0.0007s 
Active Requests:    | cns220472 x.x.x.39 running GET /hr/app/pictures/item/ 0.179676 0 . 
-------------------------------------------------------------------------------------------------------------------
Pool                | 'image'     
Connection Threads: | min 2 max 10 current 4 idle 4 stopping 0 waiting 0 
Request Handling:   | requests 66K, queued 6K (8.90%), spooled 59K (88.83%) 
Request Timing:     | avg queue time 0.0290s, avg filter time 0.0036s, avg run time 0.1405s avg trace time 0.0009s 
Active Requests:    |  
Mapped:             | 13 
------------------------------------------------------------------------------------------------------------------
Pool                | 'monitor'   
Connection Threads: | min 2 max 2 current 2 idle 1 stopping 0 waiting 0 
Request Handling:   | requests 4.6K, queued 0 (0.00%), spooled 8 (0.18%) 
Request Timing:     | avg queue time 0.0001s, avg filter time 0.0035s, avg run time 0.0035s avg trace time 0.0004s 
Active Requests:    | cns220474 x.x.x.113 running GET /admin/nsstats 0.037008 0 . 
Mapped:             | 4 
-------------------------------------------------------------------------------------------------------------------- 
Active Writer Jobs  1594059180.021175 -writer0- nsssl:0 x.x.x.37 -1 2902395 2687156 {281529 /parc/data/program-data} 

Here are portions of our configuration:

#---------------------------------------------------------------------
#
# SSL
#
#---------------------------------------------------------------------
 
ns_section    "ns/server/${server}/module/nsssl"
    ns_param         address         $address
    ns_param         hostname        $hostname
    ns_param         port            $httpsport
    ns_param         Protocols       "!SSLv2:!SSLv3"
    # ns_param        CipherSuite     "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
    ns_param         CipherSuite     "ALL:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
    # ns_param      ciphers              "ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!RC4"
    # ns_param                ModuleDir       $serverroot/etc/certs
    ns_param         ModuleDir       /.../certs
    # ns_param         certificate     /.../mdc.pem
    ns_param         certificate     /....pem
    ns_param         verify          0
    ns_param         writerthreads   5
    ns_param         writersize      1024
    ns_param         writerbufsize   16384   ;# 8192, buffer size for writer threads
    ns_param         writerstreaming true    ;# false
    ns_param         deferaccept     true    ;# false, Performance optimization
    ns_param         maxinput        [expr {$max_file_upload_mb * 1024*1024}] ;# Maximum File Size for uploads in bytes
 
#
# Server parameters
#
ns_section ns/server/${server}
    #
    # Scaling and Tuning Options
    #
    # ns_param      maxconnections  100     ;# 100; number of allocated connection structures
    # ns_param      maxthreads      5       ;# 10; maximal number of connection threads
    ns_param        minthreads      2       ;# 1; minimal number of connection threads
    ns_param        connsperthread  10000   ;# 10000; number of connections (requests) handled per thread
    # ns_param      threadtimeout   120     ;# 120; timeout for idle threads
    # ns_param      lowwatermark    10      ;# 10; create additional threads above this queue-full percentage
    ns_param        highwatermark   100     ;# 80; allow concurrent creates above this queue-is-full percentage
                                            ;# 100 means to disable concurrent creates
 
#
# Server pools
# added 2020-02-26 to try to address server responsiveness issues
#
ns_section ns/server/${server}/pools
    ns_param        monitor         "Monitoring actions to check the health of the system"
    ns_param        image           "Image file requests"
ns_section ns/server/${server}/pool/monitor
    ns_param        minthreads 2
    ns_param        maxthreads 2
 
    ns_param        map "GET /admin/nsstats"
    ns_param        map "GET /SYSTEM"
    ns_param        map "GET /ds"
    ns_param        map "POST /ds"
ns_section ns/server/${server}/pool/image
    ns_param        minthreads 2
    # ns_param        maxthreads 2
 
    ns_param        map "GET /*.png"
    ns_param        map "GET /*.jpg"
    ns_param        map "GET /*.PNG"
    ns_param        map "GET /*.JPG"
    ns_param        map "GET /*.JPEG"
    ns_param        map "GET /*.jpeg"
    ns_param        map "GET /*.mp4"
    ns_param        map "GET /*.MP4"
    ns_param        map "GET /*.pdf"
    ns_param        map "GET /*.PDF"
    ns_param        map "GET /*.gif"
    ns_param        map "GET /*.GIF"
    ns_param        map "GET /doc-control/view/stored-record" ; # This is another url that serves .mp4 via a .vuh

The database pools

We are using PostgreSQL 10.0 with:

  • pool1 30 connections
  • pool2 5 connections
  • pool3 5 connections

We also have other connections to an Oracle instance.
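
For completeness, each pool is defined in the standard OpenACS way, roughly like this (the values below are placeholders, not our literal settings):

ns_section ns/db/pool/pool1
    ns_param        driver          postgres
    ns_param        connections     30
    ns_param        datasource      "${db_host}:${db_port}:${db_name}"
    ns_param        user            $db_user
    ns_param        logsqlerrors    $debug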

Posted by Gustaf Neumann on
Hi Tony,

I am trying to find out how to best configure NaviServer and OpenACS for a system that hosts home grown business applications for 1000 concurrent users.

Running a system with this configuration and 1K+ users should be OK, although 8 cores are not much these days. The strong multi-threading of NaviServer can make use of all the cores you provide. The largest installation of NaviServer + OpenACS I know of runs with ~9k concurrent users on a VM with 50 cores, where the same VM hosts both nsd and the PostgreSQL server. On our university's live system we see up to about 3k concurrent users under intense usage, such as serving one live video stream to 1500 users, with average latencies < 60ms.

The first question concerns the DB server: is the PostgreSQL server running on the same machine? Probably yes, but the numbers are not great (we see e.g. avgwaittime 3.3µs, avgsqltime 983.3µs). So it is probably a good time to look at the cache statistics (in nsstats) and the SQL statistics (as provided by PostgreSQL).
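
For the SQL side, assuming the pg_stat_statements extension is enabled (and noting that in PostgreSQL 10 the column is mean_time, not the newer mean_exec_time), something along these lines can be run from ds/shell to spot the most expensive statements:

# sketch: top-10 statements by mean execution time, via the OpenACS db API
set slowest [db_list_of_lists get_slow_statements {
    select substr(query, 1, 80), calls, round(mean_time::numeric, 2)
    from   pg_stat_statements
    order  by mean_time desc
    limit  10
}]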

The provided statistics show high numbers of queued requests (we see e.g. 0.12% and less), which also points to probable queue overruns. The queue length of a pool is determined by "maxconnections". You can identify queue overruns by "All available connections are used" entries in the system log (named error.log by default). From what you describe, I am pretty sure you will find such entries in your system log.
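
If you do find such entries, one knob is to enlarge the queue via "maxconnections" (the numbers below are illustrative, not a recommendation for your site):

ns_section ns/server/${server}
    ns_param        maxconnections  200     ;# 100; queue length, number of allocated connection structures
    ns_param        maxthreads      20      ;# 10; maximal number of connection threads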

The goal should be (a) to reduce the queue overruns and (b) to improve the server behavior in these cases. Concerning (a), one has to figure out which connection pool they are coming from. You also have quite high values for queued requests on the "image" pool. I would recommend defining minthreads = maxthreads = 4 (or maybe 6) for this pool, since image requests are often bulky. If one has a large blueprint (which is the case for most OpenACS installations), starting a thread can take 1s, so frequent starting and stopping of threads can be costly. In case you are using the xotcl-request-monitor, I would also recommend creating a "slow" pool: the request monitor can dynamically reassign requests from the default pool to the "slow" pool when they take a long time (longer than 3 seconds). This is part of the current xotcl-request-monitor in oacs-5-10.
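
In configuration terms, these recommendations would look roughly like this (the sizes for the "slow" pool are placeholders to be adapted to your load):

ns_section ns/server/${server}/pools
    ns_param        monitor         "Monitoring actions to check the health of the system"
    ns_param        image           "Image file requests"
    ns_param        slow            "Requests reassigned by the xotcl-request-monitor"
ns_section ns/server/${server}/pool/image
    ns_param        minthreads 4
    ns_param        maxthreads 4
ns_section ns/server/${server}/pool/slow
    ns_param        minthreads 2
    ns_param        maxthreads 5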

There is an improvement in the handling of queue overruns in the recent versions of NaviServer on Bitbucket [1], implemented ~6 weeks ago, which addresses the symptoms you describe above. NaviServer (and AOLserver before it) assumed that when a queue overruns, the server should stop accepting incoming requests until there is space in the queue again. This assumption was probably fine as long as there was only one queue (= connection pool), but it is not useful in multi-pool setups. It took me a while to figure out why, on some slow sites, nsstats suddenly stopped responding even though separate monitor threads were configured.

Is there a built-in way to turn on a more consistent / persistent logging of the values provided in nsstats
Look at [2]. The munin modules for NaviServer are constantly improved and provide an overview of what is normal, how the load patterns change over time, etc.
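
If you just want the numbers in the system log, a small scheduled proc can also dump the per-pool statistics periodically; a minimal sketch (the 60-second interval is arbitrary):

# log request-handling statistics for every connection pool once a minute
ns_schedule_proc -thread 60 {
    foreach pool [ns_server pools] {
        ns_log notice "pool '$pool' stats: [ns_server -pool $pool stats]"
    }
}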

Recently we had a major slowdown when the server tried to serve many video files.

How do you serve these video files? Do you serve them from the file system, or via the OpenACS file storage? Are they single chunks (e.g. mp4 files) or segmented video streams (e.g. HLS)?

Hope this helps as a starter...
-g

[1] https://bitbucket.org/naviserver/naviserver/commits/1a387ddff2108a2c12c4b48421c52a96edd3d843
[2] https://github.com/gustafn/munin-plugins-ns