Forum OpenACS CMS: Configuring OpenACS for optimum performance

I am trying to find out how best to configure NaviServer and OpenACS for a system that hosts home-grown business applications for 1000 concurrent users. We have used NaviServer and OpenACS for years with very good success, but have started seeing issues as our user base has grown past 1000 users.

We are currently running OpenACS 5.9.1 and NaviServer 4.99.15 on a VM with 164 GB RAM and 8 processor cores.

The issues we are seeing include:

  • We see a lot of slowness, and in many cases requests that completely block others. We can watch two requests (not db-related) completely block all other requests. To demonstrate this we wrote a script that makes several concurrent requests for two pages with different run times: one with some db calls that takes about 3 seconds to render, and one with no db calls that does a Tcl wait of 20 seconds (a minimal sketch of such a page follows this list). We have 10 connection threads configured (see the configuration below), so it seems like all 10 requests should each get their own thread and all run at the same time. Instead, the requests are served two at a time: 2 come back after 20 seconds, 2 more 20 seconds after that, and so on. We have been working on moving to NaviServer 4.99.18 with OpenACS 5.10, where we do see slightly better but still very similar results, and we still wonder why this occurs. See below for the specific results of these test scripts.
  • nsstats sometimes spins for 8 or 10 seconds before the page is returned. Given that these requests are served by a separate connection pool and I am the only one hitting it during my testing, I am not sure why it would ever take so long to respond.
  • Under heavy load we start to see a proliferation of errors in the log: Error: socket(##): setsockopt(TCP_CORK) 0 [or 1]: Invalid argument
  • Recently we had a major slowdown when the server tried to serve many video files; these are served via a separate connection pool. The slowness was observed both in the initial startup of a video and in buffering later during playback. I am not sure of the full mechanics of a video stream, but I can see the file being served via a writer thread at the beginning; with subsequent reloads of the nsstats?_page=process page the download is no longer shown, yet I can still see requests from the browser obtaining subsequent segments of the video.
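
For reference, the 20-second page in our test is essentially just a Tcl wait; a minimal sketch of what such a page could look like (the file name and response text are made up for illustration):

# sleep-20.tcl -- test page with no db calls; it blocks its
# connection thread for 20 seconds, then returns a trivial response
ns_sleep 20
ns_return 200 text/plain "slept 20 seconds"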

Could you give us some direction on improving our configuration, on understanding how requests are handled, and on why the server seems to stall at times?

Is there a built-in way to turn on more consistent / persistent logging of the values provided in nsstats that would help diagnose these problems, and which values are the most important to watch?

Thanks. Any help is greatly appreciated.

Below you will find server run-time and configuration information that will hopefully convey more precisely what we are seeing and help with your response.

Here are the results of three runs of our request test, which show the blocking of server requests. Each line represents either a request being made (stating which page was requested) or a request returning (stating which child process returned and the elapsed time since the beginning of the run).

# This one fires all 10 forks within milliseconds – each page sleeps for 20 seconds
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (126632) Elapsed time = 20 seconds.
Child (126624) Elapsed time = 20 seconds.
Child (126625) Elapsed time = 40 seconds.
Child (126627) Elapsed time = 40 seconds.
Child (126630) Elapsed time = 60 seconds.
Child (126637) Elapsed time = 60 seconds.
Child (126622) Elapsed time = 80 seconds.
Child (126623) Elapsed time = 80 seconds.
Child (126626) Elapsed time = 100 seconds.
Child (126634) Elapsed time = 100 seconds.
 
# This one fires all 10 forks within milliseconds – first two pages are sleep 20, followed by 8 home pages that take ~3 seconds each
Sleep 20 Page
Sleep 20 Page
Home page
Home page
Home page
Home page
Home page
Home page
Home page
Home page
Child (49600) Elapsed time = 3 seconds.
Child (49607) Elapsed time = 3 seconds.
Child (49598) Elapsed time = 5 seconds.
Child (49595) Elapsed time = 8 seconds.
Child (49602) Elapsed time = 10 seconds.
Child (49596) Elapsed time = 13 seconds.
Child (49593) Elapsed time = 23 seconds.
Child (49604) Elapsed time = 25 seconds.
Child (49606) Elapsed time = 28 seconds.
Child (49594) Elapsed time = 33 seconds.
 
# In this script I sleep 5 seconds in between each fork()
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (94882) Elapsed time = 20 seconds.
Sleep 20 Page
Child (94951) Elapsed time = 20 seconds.
Sleep 20 Page
Sleep 20 Page
Sleep 20 Page
Child (95045) Elapsed time = 30 seconds.
Sleep 20 Page
Child (95130) Elapsed time = 30 seconds.
Child (95245) Elapsed time = 40 seconds.
Child (95382) Elapsed time = 40 seconds.
Child (95436) Elapsed time = 50 seconds.
Child (95485) Elapsed time = 50 seconds.
Child (95558) Elapsed time = 60 seconds.
Child (95635) Elapsed time = 60 seconds.

Here is example output of the nsstats?_page=process page.

Key             |    Value
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Host            | serv (x.x.x.21) 
Boot Time       | Thu Jul 2 19:41:25 2020 
Uptime          | 3 days 16:31:53 (h:m:s) 
Process         | 22310 /.../nsd 
Home            | /.../ns 
Configuration   | /...serv.tcl 
Error Log       | /.../serv-error.log 
Log Statistics  | Notice 26K Warning 62K Error 3136 Fatal 0 Bug 0 Debug 0 Dev 0 Debug(ns:driver) 0 Debug(task) 0 Debug(request) 0 Debug(connchan) 0 Debug(sql) 0 Debug(nsproxy) 0 
Version         | 4.99.15 (tag tar-4.99.15) 
Build Date      | Nov 3 2017 at 14:57:26 
Servers         | serv 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
Driver Info     | thread nssock:0 module nssock received 35 spooled 0 partial 61 errors 0 
                | thread nsssl:0 module nsssl received 220K spooled 0 partial 489K errors 452 
Certificate     | /....... 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
DB-Pools    
pool1:          | statements 15M gethandles 206K handles 30 used 25 waittime 7.708260 sqltime 85906.834913 avgwaittime 37µ avgsqltime 6m 
pool2:          | statements 291K gethandles 16K handles 5 used 5 waittime 26.012233 sqltime 845.943847 avgwaittime 2m avgsqltime 3m 
pool3:          | statements 0 gethandles 2 handles 5 used 1 waittime 0.008694 sqltime 0.000000 avgwaittime 4m 
ProdPool:       | statements 0 gethandles 50 handles 5 used 3 waittime 0.416256 sqltime 0.000000 avgwaittime 8m 
PprdPool:       | statements 0 gethandles 4 handles 5 used 2 waittime 0.034431 sqltime 0.000000 avgwaittime 9m 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
Callbacks       | 
prestartup:     | ns:tclcallback  \_ns_load_global_modules 1 
signal:         | p:fffffd74172d70c0  a:0 
shutdown:       | nsproxy:shutdown    a:0 
shutdown:       | nslog:close /....log 
Socket Callbacks      
 Key                |  Value          
--------------------------------------------------------------------------------------------------------------------- 
 Address            |  x.x.x.21:80 
                    | x.x.x.21:443 
 Server Directory   |    /.../ns 
 Page Directory     | /... 
 Tcl Library        | /...serv/tcl 
 Access Log         | /.../serv.log 
 Writer Threads     | nssock: 2 nsssl: 5 
 Connection Pools   | image monitor {} 
-------------------------------------------------------------------------------------------------------------------
Pool                | 'default'   
Connection Threads: | min 2 max 10 current 5 idle 4 stopping 0 waiting 0 
Request Handling:   | requests 150K, queued 12K (8.22%), spooled 91K (60.60%) 
Request Timing:     | avg queue time 0.0194s, avg filter time 0.0028s, avg run time 0.2549s avg trace time 0.0007s 
Active Requests:    | cns220472 x.x.x.39 running GET /hr/app/pictures/item/ 0.179676 0 . 
-------------------------------------------------------------------------------------------------------------------
Pool                | 'image'     
Connection Threads: | min 2 max 10 current 4 idle 4 stopping 0 waiting 0 
Request Handling:   | requests 66K, queued 6K (8.90%), spooled 59K (88.83%) 
Request Timing:     | avg queue time 0.0290s, avg filter time 0.0036s, avg run time 0.1405s avg trace time 0.0009s 
Active Requests:    |  
Mapped:             | 13 
------------------------------------------------------------------------------------------------------------------
Pool                | 'monitor'   
Connection Threads: | min 2 max 2 current 2 idle 1 stopping 0 waiting 0 
Request Handling:   | requests 4.6K, queued 0 (0.00%), spooled 8 (0.18%) 
Request Timing:     | avg queue time 0.0001s, avg filter time 0.0035s, avg run time 0.0035s avg trace time 0.0004s 
Active Requests:    | cns220474 x.x.x.113 running GET /admin/nsstats 0.037008 0 . 
Mapped:             | 4 
-------------------------------------------------------------------------------------------------------------------- 
Active Writer Jobs  1594059180.021175 -writer0- nsssl:0 x.x.x.37 -1 2902395 2687156 {281529 /parc/data/program-data} 

Here are portions of our configuration:

#---------------------------------------------------------------------
#
# SSL
#
#---------------------------------------------------------------------
 
ns_section    "ns/server/${server}/module/nsssl"
    ns_param         address         $address
    ns_param         hostname        $hostname
    ns_param         port            $httpsport
    ns_param         Protocols       "!SSLv2:!SSLv3"
    # ns_param        CipherSuite     "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
    ns_param         CipherSuite     "ALL:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
    # ns_param      ciphers              "ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!RC4"
    # ns_param                ModuleDir       $serverroot/etc/certs
    ns_param         ModuleDir       /.../certs
    # ns_param         certificate     /.../mdc.pem
    ns_param         certificate     /....pem
    ns_param         verify          0
    ns_param         writerthreads   5
    ns_param         writersize      1024
    ns_param         writerbufsize   16384   ;# 8192, buffer size for writer threads
    ns_param         writerstreaming true    ;# false
    ns_param         deferaccept     true    ;# false, Performance optimization
    ns_param         maxinput        [expr {$max_file_upload_mb * 1024*1024}] ;# Maximum File Size for uploads in bytes
 
#
# Server parameters
#
ns_section ns/server/${server}
    #
    # Scaling and Tuning Options
    #
    # ns_param      maxconnections  100     ;# 100; number of allocated connection structures
    # ns_param      maxthreads      5       ;# 10; maximal number of connection threads
    ns_param        minthreads      2       ;# 1; minimal number of connection threads
    ns_param        connsperthread  10000   ;# 10000; number of connections (requests) handled per thread
    # ns_param      threadtimeout   120     ;# 120; timeout for idle threads
    # ns_param      lowwatermark    10      ;# 10; create additional threads above this queue-full percentage
    ns_param        highwatermark   100     ;# 80; allow concurrent creates above this queue-is-full percentage
                                            ;# 100 means to disable concurrent creates
 
#
# Server pools
# added 2020-02-26 to try to address server responsiveness issues
#
ns_section ns/server/${server}/pools
    ns_param        monitor         "Monitoring actions to check the health of the system"
    ns_param        image           "Image file requests"
ns_section ns/server/${server}/pool/monitor
    ns_param        minthreads 2
    ns_param        maxthreads 2
 
    ns_param        map "GET /admin/nsstats"
    ns_param        map "GET /SYSTEM"
    ns_param        map "GET /ds"
    ns_param        map "POST /ds"
ns_section ns/server/${server}/pool/image
    ns_param        minthreads 2
    # ns_param        maxthreads 2
 
    ns_param        map "GET /*.png"
    ns_param        map "GET /*.jpg"
    ns_param        map "GET /*.PNG"
    ns_param        map "GET /*.JPG"
    ns_param        map "GET /*.JPEG"
    ns_param        map "GET /*.jpeg"
    ns_param        map "GET /*.mp4"
    ns_param        map "GET /*.MP4"
    ns_param        map "GET /*.pdf"
    ns_param        map "GET /*.PDF"
    ns_param        map "GET /*.gif"
    ns_param        map "GET /*.GIF"
    ns_param        map "GET /doc-control/view/stored-record" ; # This is another url that serves .mp4 via a .vuh

The database pools

We are using PostgreSQL 10.0 with:

  • pool1 30 connections
  • pool2 5 connections
  • pool3 5 connections

We also have other connections to an Oracle instance.
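
For completeness, each pool is defined in the standard OpenACS way, roughly like this (the values below are placeholders, not our literal settings):

ns_section ns/db/pool/pool1
    ns_param        driver          postgres
    ns_param        connections     30
    ns_param        datasource      "${db_host}:${db_port}:${db_name}"
    ns_param        user            $db_user
    ns_param        logsqlerrors    $debug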

Posted by Gustaf Neumann on
Hi Tony,

I am trying to find out how to best configure NaviServer and OpenACS for a system that hosts home grown business applications for 1000 concurrent users.

Running a system with this configuration and 1K+ users should be OK, although 8 cores are not much these days. The strong multi-threading of NaviServer can make use of all the cores you provide. The largest installation of NaviServer + OpenACS I know of runs with ~9k concurrent users on a VM with 50 cores, where the same VM hosts both nsd and the PostgreSQL server. On our university's live system we see up to about 3k concurrent users under intense usage, such as serving one live video stream to 1500 users, with average latencies < 60ms.

The first question concerns the DB server: is the PostgreSQL server running on the same machine? Probably yes, but the numbers are not great (we see e.g. avgwaittime 3.3µs, avgsqltime 983.3µs). So it is probably a good time to look at the cache statistics (in nsstats) and the SQL statistics (as provided by PostgreSQL).
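
For the SQL side, assuming the pg_stat_statements extension is enabled (and noting that in PostgreSQL 10 the column is mean_time, not the newer mean_exec_time), something along these lines can be run from ds/shell to spot the most expensive statements:

# sketch: top-10 statements by mean execution time, via the OpenACS db API
set slowest [db_list_of_lists get_slow_statements {
    select substr(query, 1, 80), calls, round(mean_time::numeric, 2)
    from   pg_stat_statements
    order  by mean_time desc
    limit  10
}]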

The provided statistics show high numbers of queued requests (we see e.g. 0.12% and less), which also points to probable queue overruns. The queue length of a pool is determined by "maxconnections". You can identify queue overruns by "All available connections are used" entries in the system log (named error.log by default). From what you describe, I am pretty sure you will find such entries in your system log.
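
If you do find such entries, one knob is to enlarge the queue via "maxconnections" (the numbers below are illustrative, not a recommendation for your site):

ns_section ns/server/${server}
    ns_param        maxconnections  200     ;# 100; queue length, number of allocated connection structures
    ns_param        maxthreads      20      ;# 10; maximal number of connection threads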

The goal should be (a) to reduce the queue overruns and (b) to improve the server behavior in these cases. Concerning (a), one has to figure out which connection pool they are coming from. You also have quite high values for queued requests on the "image" pool. I would recommend defining minthreads = maxthreads = 4 (or maybe 6) for this pool, since image requests are often bulky. If one has a large blueprint (which is the case for most OpenACS installations), starting a thread can take 1s, so frequent starting and stopping of threads can be costly. In case you are using the xotcl-request-monitor, I would also recommend creating a "slow" pool: the request monitor can dynamically reassign requests from the default pool to the "slow" pool when they take a long time (longer than 3 seconds). This is part of the current xotcl-request-monitor in oacs-5-10.
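
In configuration terms, these recommendations would look roughly like this (the sizes for the "slow" pool are placeholders to be adapted to your load):

ns_section ns/server/${server}/pools
    ns_param        monitor         "Monitoring actions to check the health of the system"
    ns_param        image           "Image file requests"
    ns_param        slow            "Requests reassigned by the xotcl-request-monitor"
ns_section ns/server/${server}/pool/image
    ns_param        minthreads 4
    ns_param        maxthreads 4
ns_section ns/server/${server}/pool/slow
    ns_param        minthreads 2
    ns_param        maxthreads 5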

There is an improvement in the handling of queue overruns in the recent versions of NaviServer on Bitbucket [1], implemented ~6 weeks ago, which addresses the symptoms you describe above. NaviServer (and AOLserver before it) assumed that when a queue overruns, the server should stop accepting incoming requests until there is space in the queue again. This assumption was probably fine as long as there was only one queue (= connection pool), but it is not useful in multi-pool setups. It took me a while to figure out why, on some slow sites, nsstats suddenly stopped responding even though separate monitor threads were configured.

Is there a built-in way to turn on a more consistent / persistent logging of the values provided in nsstats
Look at [2]. The munin modules for NaviServer are constantly improved and provide an overview of what is normal, how the load patterns change over time, etc.
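
If you just want the numbers in the system log, a small scheduled proc can also dump the per-pool statistics periodically; a minimal sketch (the 60-second interval is arbitrary):

# log request-handling statistics for every connection pool once a minute
ns_schedule_proc -thread 60 {
    foreach pool [ns_server pools] {
        ns_log notice "pool '$pool' stats: [ns_server -pool $pool stats]"
    }
}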

Recently we had a major slowdown when the server tried to serve many video files.

How do you serve these video files? Do you serve them from the file system, or via the OpenACS file storage? Are they single chunks (e.g. mp4 files) or segmented video streams (e.g. HLS)?

Hope this helps as a starter...
-g

[1] https://bitbucket.org/naviserver/naviserver/commits/1a387ddff2108a2c12c4b48421c52a96edd3d843
[2] https://github.com/gustafn/munin-plugins-ns