Forum OpenACS Q&A: NaviServer cluster running inside of docker questions

We are continuing to test the NaviServer cluster using the cachingmode=none option and would like to restart all servers in the cluster after an upgrade. After an upgrade, the acs-admin/server-restart page is called, which restarts the current server, but we would like this URL to restart all of the servers in the cluster.

We are testing on the release/5.10 branch of OACS with a NaviServer 4.99.23 build.

Since we are running inside of Docker containers/networks, we do not know the IP addresses of the containers in advance and cannot enter them into the ClusterAuthorizedIP kernel parameter. IP ranges could be used, but that opens up the range too far, as Docker assigns IPs from a larger range than we are comfortable with.
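Just to illustrate what we would have to fall back to: a wildcard entry in ClusterAuthorizedIP along the following lines (172.17.0.* is only an example based on Docker's default bridge addressing, not our actual network) would match far more containers than just our cluster peers:

ClusterAuthorizedIP = 172.17.0.*   (example only)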

We have run into the following issues:

1. Trying to use the ::acs::clusterwide cmd currently does not appear to call the other servers in the cluster.

acs-cache-procs.tcl -> broadcast calls the following code:
foreach server [::acs::Cluster info instances] {
    ns_log notice "CALLING ===> $server message $args"
    $server message {*}$args
}

However, whenever [::acs::Cluster info instances] is called, it returns nothing, so the foreach loop is never entered. I have verified during boot-up that we have created the server instances (and can loop over them during boot-up), but after the server is all the way up, the [::acs::Cluster info instances] cmd does not return anything. I have tried running this command from acs-admin/server-restart.tcl and from the shell, but it still does not seem to work. Any ideas or insights you may have on this would be greatly appreciated.

2. Since we are inside a Docker network and cannot know the IPs of the servers in the cluster beforehand, I have written some code in cluster-init.tcl and server-cluster-procs.tcl that does an nslookup on the Docker network names to get their IP addresses. There are some timing issues with doing this, so I was wondering if you have any opinions/suggestions on a better way to do it.

cluster-init.tcl
#
# Check if the cluster is enabled, and if so, set up the cluster objects
#
if {[server_cluster_enabled_p]} {
    set myConfig [server_cluster_my_config]
    set cluster_do_url [::acs::Cluster eval {set :url}]

    #
    # Iterate over all servers in the cluster and add Cluster objects
    # for the ones, which are different from the current host (the
    # peer hosts).
    #
    foreach hostport [server_cluster_all_hosts] {
        set config [server_cluster_get_config $hostport]
        dict with config {

            # If inside Docker, get the IP of the host and put it into the allowed_host list.
            if {[info exists ::env(NS_INSIDE_DOCKER)] && $::env(NS_INSIDE_DOCKER) eq "true"} {
                set ip [docker_host_to_ip $host]
                if { $ip eq "0" } {
                    ns_log error "FAILED to find ip for $host !!"           
                } else {
                    ::acs::Cluster eval [subst {
                        set :allowed_host($ip) 1
                    }]
                }
            }

            if {$host in [dict get $myConfig host]
                && $port in [dict get $myConfig port]
            } {
                ns_log notice "Cluster: server $host $port is no cluster peer"
                continue
            }
            ns_log notice "===> Cluster: server $host $port is a cluster peer $cluster_do_url"
            ::acs::Cluster create CS_${host}_${port} \
                -host $host \
                -port $port \
                -url $cluster_do_url
        }
    }

    set info [::acs::Cluster info instances]
    ns_log notice "=====> CLUSTER INFO = $info !!"
    foreach server [::acs::Cluster info instances] {
        ns_log notice "==> SERVER ====> $server"
    }

    if {![info exists ::env(NS_INSIDE_DOCKER)] || $::env(NS_INSIDE_DOCKER) ne "true"} {
        ns_log notice "==>> NS_INSIDE_DOCKER is not set to true; using ClusterAuthorizedIP"
        foreach ip [parameter::get -package_id $::acs::kernel_id -parameter ClusterAuthorizedIP] {
            ns_log notice "==> AuthorizedIP = $ip"
            if {[string first * $ip] > -1} {
                ns_log notice "==> ALLOWED_HOST_PATTERN=$ip"
                ::acs::Cluster eval [subst {
                    lappend :allowed_host_patterns $ip
                }]
            } else {
                ns_log notice "===> Allowing Cluster IP=$ip"
                ::acs::Cluster eval [subst {
                    set :allowed_host($ip) 1
                }]
            }
        }
    }

    set url [::acs::Cluster eval {set :url}]

    #
    # TODO: The following test does not work yet, since
    # "::xo::db::sql::site_node" is not yet defined. This requires
    # more refactoring from xo* to the main infrastructure.
    #
    if {0} {
        # Check, if the filter url mirrors a site node. If so,
        # the cluster mechanism will not work, if the site node
        # requires a login. Clustering will only work if the
        # root node is freely accessible.

        array set node [site_node::get -url $url]
        if {$node(url) ne "/"} {
            ns_log notice "***\n*** WARNING: there appears a package mounted on" \
                "$url\n***Cluster configuration will not work" \
                "since there is a conflict with the filter with the same name! (n)"
        }
    }

    #ns_register_filter trace GET $url ::acs::Cluster
    ns_register_filter preauth GET $url ::acs::Cluster
    #ad_register_filter -priority 900 preauth GET $url ::acs::Cluster
}

We start all of our NaviServers at the same time on boot-up with docker-compose, so they are basically starting up and getting their networks defined at the same time. I added the retry logic in the following code because I know there could be a timing issue here. I have seen it need a retry once and then succeed on the second attempt, but for the most part it should just work without retries.

Your thoughts and insights would be greatly appreciated. Is there a better way to implement this for a Docker environment?

server-cluster-procs.tcl:

ad_proc docker_host_to_ip {
    {-retry_cnt 3}
    {-ms_sleep_on_retry 1000}
    docker_host
} {

    Use nslookup to resolve a Docker hostname to an IP address. Returns 0 if the lookup fails or clustering is disabled.

} {
    if { ![server_cluster_enabled_p] } {
        return 0
    }

    for {set i 0} {$i <= $retry_cnt} {incr i} {
        set cnt [expr {$i + 1}]
        # Note: nslookup must be present inside the docker image.
        # Note: 127.0.0.11 is always the docker resolver.
        set cmd [list nslookup $docker_host 127.0.0.11]

        if { [catch {set nslookup_info [exec {*}$cmd]} errmsg] } {
            ns_log notice " (try $cnt of $retry_cnt) executing $cmd: $errmsg"
        } else {
            set match [regsub {.*Address: ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*} $nslookup_info {\1} ip]
            if {$match} {
                ns_log notice "Matched IP=$ip for HOST=$docker_host"
                return $ip
            }
            ns_log notice " (try $cnt of $retry_cnt) no nslookup match found for ($docker_host) nslookup output=$nslookup_info"
        }
        # It is possible that this particular docker network is just not up yet in the docker boot-up process.
        # Sleep before retrying. One second should be more than enough.
        after $ms_sleep_on_retry
    }

    return 0
}

Also, I found a little issue in the cluster-init.tcl file concerning the setting of the allowed_host_patterns member variable: the leading colon on the member variable is missing. It should be 'lappend :allowed_host_patterns $ip':

::acs::Cluster eval [subst {
    lappend allowed_host_patterns $ip   ;# <== missing the leading colon
}]

Thanks for your assistance, Marty

We are required (for security reasons) to use HTTPS on port 443 for all communication between processes.

When trying to run the cluster behind nginx using HTTPS on port 443, I ran into a problem where the code only supports HTTP on port 80. Here is my solution to get around this issue.

This is on the release/5.10 branch of OACS with a NaviServer 4.99.23 build.

server-cluster-procs.tcl -> server_cluster_my_config

ad_proc -private server_cluster_my_config {} {

    Return the host (address) and port of the current server, preferring the
    nsssl driver configuration and falling back to nssock.

} {
    set driver_section [ns_driversection -driver nsssl]
    set my_ips   [ns_config $driver_section address]
    set my_ports [ns_config -int $driver_section port]

    if {$my_ips eq "" || $my_ports eq ""} {
        set driver_section [ns_driversection -driver nssock]
        set my_ips   [ns_config $driver_section address]
        set my_ports [ns_config -int $driver_section port]
    }

    return [list host $my_ips port $my_ports]
}
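Just for reference, the proc then returns a small host/port list; on a setup like ours the result looks roughly like this (the values below are made up for illustration, the real ones come from the nsssl/nssock driver config):

% server_cluster_my_config
host 0.0.0.0 port 8443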
If there is a better solution, please let me know. We appreciate your knowledge and assistance ;) Marty

You are addressing multiple points, and some of these look like feature requests.

However, whenever [::acs::Cluster info instances] is called, it returns nothing

This indicates that the kernel parameters of the cluster configuration are not set up correctly. You have to set "ClusterEnabledP" to "1" and provide "ClusterPeerIP" in the required format, i.e., a list of IP addresses with optional ports, e.g., "127.0.0.1:8100 127.0.0.1:8101".

Per design, all cluster specifications are IP addresses plus optional ports (no name resolving) and HTTP only (for performance reasons); therefore, the scheme is omitted.

behind nginx using https port 443 I ran into a problem where the code would only support http port 80.

Probably you meant that the intra-cluster talk is only HTTP and not HTTPS. There is no restriction to port 80; the intra-cluster talk is implemented via "ns_http run http://${:host}:${:port}" (note the hard-coded "http:").

A possible way to address these points would be a feature request to allow a list of server locations instead of just IP addresses and ports in "ClusterPeerIP" (e.g. 127.0.0.1:8100 https://localhost:8444). Still, I would not recommend the usage of HTTPS or the use of domain names, but if you have to do so...
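Roughly, the idea would be to build the intra-cluster request from a full location instead of the hard-coded "http://host:port", along these lines (just a sketch, not the actual implementation; the peer name, endpoint, and command below are made up):

set peer_location "https://backend-1:8443"        ;# e.g. taken from ClusterPeerIP
set cmd [ns_urlencode "acs::cache_flush_all"]     ;# made-up cluster command
set reply [ns_http run ${peer_location}/acs-cluster-do?cmd=$cmd]
ns_log notice "cluster peer answered: $reply"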

Would this help for your major requirements?

Use nslookup to resolve docker hostname to ip

One should use the NaviServer built-in command "ns_addrbyhost" [1] instead. Be aware that, in general, multiple IP addresses can be returned from the DNS lookup (might be different hosts or IPv4 and IPv6 of the same host).
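For example (the container name "backend-2" is just a placeholder):

set addrs [ns_addrbyhost -all backend-2]    ;# may return multiple IP addresses
ns_log notice "backend-2 resolves to: $addrs"
set ip [lindex $addrs 0]                    ;# first address, if a single one is needed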

Also, I found a little issue in the cluster-init.tcl

Thanks, fixed.

[1] https://naviserver.sourceforge.io/n/naviserver/files/ns_addrbyhost.html

Thanks Gustaf,

We do have ClusterEnabledP set to 1, but you are right, we are trying to put Docker DNS names into ClusterPeerIP, CanonicalServer, and even ClusterAuthorizedIP.

Your point is well taken that this really belongs under a feature request to make clustering work with Docker. Inside Docker we cannot know the IP addresses that are assigned because they are dynamic. Is the best way to request this enhancement to submit an improvement proposal?

In the meantime, cachingmode=none will work for us if we manually restart the backend NaviServers ourselves using 'docker restart {container names}'.

We do want to talk HTTPS to the backend cluster nodes for a number of reasons. Besides the fact that our company wants us to run HTTPS for everything, there is also the fact that the Canonical Server is the only one that can run cronjobs. So when we go to the /cronjobs URL, we cannot leave it up to Docker which server gets that request. We currently redirect to the external port 444, which is the Canonical Server port that we have exposed to the outside with Docker. Since only admins can get to /cronjobs, this seems to work out fine.

There may be another way to do this using nginx, but currently we like being able to reach each NaviServer directly on its outside port for debugging purposes, especially the Canonical Server. Also, when we go to acs-admin/install we redirect to the Canonical Server as well, so that if someone restarts the server we know it is the Canonical Server they restarted; then we can manually restart the other two after the Canonical has come all the way back up. Doing it this way, we have a zero-downtime restart for our users for non-schema-change upgrades.

##  I have added this code to both /cronjobs and acs-admin/install to allow redirection to the Canonical Server
util::split_location [util_current_location] proto host port
if {[info exists ::env(CANONICAL_OUTSIDE_PORT)] && $::env(CANONICAL_OUTSIDE_PORT) != $port} {
    ad_returnredirect -allow_complete_url "$proto://$host:$::env(CANONICAL_OUTSIDE_PORT)[util_current_directory]"
}

Thanks for pointing out ns_addrbyhost. I was hoping there was such a utility but could not find it.

I sure appreciate your insights and expertise

Thanks, Marty

Hi Marty,

I've added a small change to CVS to support the usage of HTTP locations in the kernel parameter ClusterPeerIP.

The change extends the values specified in ClusterPeerIP in two respects:

  • one can now specify the protocol (defaults to http)
  • while ClusterPeerIP previously required an IP address, it is now possible to specify a DNS name

For the time being, the DNS name is resolved at the start time of the OpenACS instance.

Sample supported values:

https://localhost http://[::1]:8443 127.0.0.1:8101

The last one defaults to "http", same as before. By using the "location" including the scheme, we can also support UDP in the future (via the nsudp NaviServer module) without refactoring. UDP will reduce the latency of the intra-cluster talk significantly.

Two questions:

  • is the DNS resolving at the start time of the OpenACS instance sufficient for your use case in Docker?
  • why did you specify the docker resolver (127.0.0.11) explicitly? It should be set up in the docker instance by default.

-g https://fisheye.openacs.org/changelog/OpenACS?cs=oacs-5-10%3Agustafn%3A20220614175936

When I click on the link to see the recent changes you made it says "No changeset with that ID found.".

Sorry, markdown syntax problem ... fixed.

Thanks, Gustaf,

to answer your questions:

I think resolving the DNS at startup could be a problem because of the timing issues. It is possible that the servers do not all start up at exactly the same time, and therefore the code would be unable to resolve the DNS names at startup.

Is there another place where we could also try to resolve the DNS, e.g., at the time a clusterwide command is issued and we detect that one or more of the servers have not been resolved?

As for why I specified 127.0.0.11 in the nslookup: I believe you are right, it should resolve without specifying the resolver explicitly.

Thanks for your help

Marty

Good point. The new version [1] adds dynamic reconfiguration and configuration checking. A background job regularly tests the availability of the cluster nodes and makes them available.

One should not accept requests on cluster nodes before they are registered in the cluster, but in the "nocache" case this is not a big issue.
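Just to sketch the general idea of that background check (this is not the committed code; the peer locations, the request target, and the 20-second interval are made up for illustration):

proc ::cluster_peer_check {} {
    foreach location {http://backend-1:8000 http://backend-2:8000} {
        if {[catch {ns_http run $location/} reply]} {
            ns_log warning "cluster peer $location is not reachable: $reply"
        }
    }
}
# run the check in its own scheduled thread every 20 seconds
ad_schedule_proc -thread t 20 ::cluster_peer_check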

[1] https://fisheye.openacs.org/changelog/OpenACS?cs=oacs-5-10%3Agustafn%3A20220621105039