Forum OpenACS Q&A: Issue with reachable method in 5.10.1 dynamic clustering setup

I have an OpenACS 5.10.1 environment with clustering enabled, using dynamic clustering.

Our member nodes join with addresses in the format https://10.x.y.z (no port).

Here are the log messages showing the join request:

[14/Nov/2023:11:04:32][31569.7f117cce3700][-conn:default:0:3269-] Notice: --cluster executes command '::acs::cluster join_request https://10.34.24.222'; from peeraddr 10.34.24.222 port 58181
[14/Nov/2023:11:04:32][31569.7f117cce3700][-conn:default:0:3269-] Notice: Cluster join_request from 'https://10.34.24.222';
[14/Nov/2023:11:04:32][31569.7f117cce3700][-conn:default:0:3269-] Notice: Cluster join_request https://10.34.24.222 accepted from https://10.34.24.222
[14/Nov/2023:11:04:32][31569.7f117cce3700][-conn:0:3269-] Notice: Cluster join_request leads to DynamicClusterPeers https://10.34.24.222

The issue occurs in the reachability check in packages/acs-tcl/tcl/cluster-procs.tcl (this is acs-tcl package version 5.10.1b1).

The code parses the URL, as seen below:

:method reachable {location} {
    #:log "reachable $location"
    set d [ns_parseurl $location]

and then further down attempts to connect with `ns_connchan connect $host $port`.

However, since our nodes register without a port, the $port variable is never set, which raises an exception. The exception causes the member node to be considered 'not reachable'.
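
To illustrate the failure mode, here is a minimal sketch (assuming a NaviServer shell such as ds/shell; the exact keys returned by ns_parseurl may differ by version):

    set d [ns_parseurl https://10.34.24.222]
    # => roughly: proto https host 10.34.24.222 path {} tail {}
    #    (no "port" key when the location has no explicit port)
    dict exists $d port
    # => 0, so a subsequent "ns_connchan connect $host $port" fails
    #    with an error like: can't read "port": no such variable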

In my own environment, I patched this by adding the following code:

try {
    #
    # Assume the protocol's default port when the location
    # does not include an explicit one.
    #
    if {![info exists port]} {
        if {$proto eq "http"} {
            set port 80
        } else {
            set port 443
        }
    }
    #ns_logctl severity Debug(connchan) on
    ns_connchan connect $host $port
} on error {} {

I assume the code expects nodes to always register as <proto>://<host>:<port>, but in our environment that is not how the nodes register to be listed in the dynamic cluster peers.

Is there a reason why you are not registering the cluster nodes with a port?

Our configuration uses the default ports of 443 and 80 for HTTPS and HTTP. Debugging the code, I found that the method current_server_locations uses the helper util::join_location, which drops the port from the location.

Here is some logging I added to the current_server_locations method for debugging:

[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: use module <nsssl> /web/dev/node-1/httpd/bin/nsssl.so
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: ip 10.34.203.163
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: port 443
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: use module <nssock> /web/dev/node-1/httpd/bin/nssock.so
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: ip 10.34.203.163
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations: port 80
[21/Nov/2023:09:29:25][21757.7f4344aeb840][-main:-1-] Notice: current_server_locations returns http://10.34.203.163 https://10.34.203.163

Here is the relevant code that results in the port being dropped:

packages/acs-tcl/tcl/cluster-procs.tcl

lappend result [util::join_location \
      -proto [dict get $protos $module_type] \
      -hostname $ip \
      -port $port]

packages/acs-tcl/tcl/utilities-procs.tcl

ad_proc util::join_location {{-proto ""} {-hostname} {-port ""}} {
    Join hostname and port and use IP-literal notation when necessary.
    The function is the inverse function of  util::split_location.
    @return location consisting of hostname and optionally port
    @author Gustaf Neumann
    @see util::split_location
} {
    set result ""
    if {$proto ne ""} {
        append result $proto://
        #
        # When the specified port is equal to the default port, omit
        # it from the result.
        #
        if {$port ne "" && $port eq [dict get {http 80 https 443 udp "" smtp ""} $proto]} {
            set port ""
        }
    }
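
The effect is easy to reproduce (hypothetical ds/shell calls; the expected results follow from the code above and match our logs):

    util::join_location -proto https -hostname 10.34.203.163 -port 443
    # => https://10.34.203.163        (default port omitted)
    util::join_location -proto https -hostname 10.34.203.163 -port 8443
    # => https://10.34.203.163:8443   (non-default port kept)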

Dear Jonathan,

I've looked into the problem with the incoming join request. I think the proper change is the following:

https://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-10%3Agustafn%3A20231127183257

all the best
-g

Hi Gustaf,

I tried this change out, but the method calls util::join_location, which has the problem I pointed out before: it drops the standard port.

https://cvs.openacs.org/browse/OpenACS/openacs-4/packages/acs-tcl/tcl/cluster-procs.tcl?r=1.1.2.9#to524

This causes the same problem, with nodes being considered not reachable due to the missing port:

[27/Nov/2023:11:37:02][27994.7f62556f3700][-conn:default:0:10-] Notice: Cluster join_request from 'https://10.34.203.85';
[27/Nov/2023:11:37:02][27994.7f62556f3700][-conn:default:0:10-] Notice: Cluster join_request https://10.34.203.85 accepted from https://10.34.203.85
[27/Nov/2023:11:37:02][27994.7f62556f3700][-conn::default:0:10-] Notice: Cluster join_request leads to DynamicClusterPeers https://10.34.203.85
[27/Nov/2023:11:37:05][27994.7f624d7fa700][-sched:0:8:3-] Notice: Running scheduled proc ::acs::cluster {*}update_node_info...
[27/Nov/2023:11:37:05][27994.7f624d7fa700][-sched:0:8:3-] Notice: cluster: node https://10.34.203.85 is reachable: 0 last_contact 1701113824362 last_request 1701113824362

Best,
Jon

Dear Jonathan,

Now I see the problem ... I was trusting my own comment:

Return a canonical representation of the provided location, where
the DNS name is resolved and the protocol and port is always included.

It is now fixed in the repository:
https://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-10%3Agustafn%3A20231128155508

Many thanks for insisting!

Hi Gustaf,

I tested the changes you made and found that a few more changes are required, plus one slight change to the util::join_location method.

Here is a git diff showing the necessary changes. Let me know if you need more info, since the line numbers may not match up on your side. It includes all of the changes from this thread and a couple of other changes, mainly these two:

1. current_server_locations needs to use the noabbrev flag as well; otherwise its results won't match when checking against the dynamic peers list (to see whether a node is already in the list) and in a few other checks.

2. In util::join_location, the port was being set to "" before the new if clause in your last patch, since $proto is set. I had to make sure that noabbrev_p is false in that case so the port is not cleared. Also, the new check needs to test that noabbrev_p is true rather than false in order to append the port.

diff --git a/packages/acs-tcl/tcl/cluster-procs.tcl b/packages/acs-tcl/tcl/cluster-procs.tcl
@@ -529,7 +530,7 @@ namespace eval ::acs {
            #
            dict set d host [ns_addrbyhost [dict get $d host]]
            dict with d {
-                set result [util::join_location -proto $proto -hostname $host -port $port]
+                set result [util::join_location -noabbrev -proto $proto -hostname $host -port $port]
            }
            return $result
        }
@@ -586,15 +587,17 @@ namespace eval ::acs {
                                lappend result [util::join_location \
-                                                    -proto [dict get $protos $module_type] \
+                                                    -noabbrev -proto [dict get $protos $module_type] \
                                                    -hostname $ip \
                                                    -port $port]
                            }
@@ -646,7 +649,8 @@ namespace eval ::acs {
                #
                ns_log notice "Cluster join_request $peerLocation accepted from $peerLocation"
                set dynamicClusterNodes [parameter::get -package_id $::acs::kernel_id -parameter DynamicClusterPeers]
-                set dynamicClusterNodes [lsort -unique [concat $dynamicClusterNodes $peerLocation]]
+                set dynamicClusterNodes [lsort -unique [concat $dynamicClusterNodes [:qualified_location $peerLocation]]]
                #
                # The parameter::set_value operation causes a
                # clusterwide cache-flush for the parameters

diff --git a/packages/acs-tcl/tcl/utilities-procs.tcl b/packages/acs-tcl/tcl/utilities-procs.tcl
index 4b41b73ba..89b289b22 100644
--- a/packages/acs-tcl/tcl/utilities-procs.tcl
+++ b/packages/acs-tcl/tcl/utilities-procs.tcl
@@ -1923,13 +1923,16 @@ ad_proc util::split_location {location protoVar hostnameVar portVar} {
    return $success
}

-ad_proc util::join_location {{-proto ""} {-hostname} {-port ""}} {
+ad_proc util::join_location {{-noabbrev:boolean} {-proto ""} {-hostname} {-port ""}} {
    Join hostname and port and use IP-literal notation when necessary.
    The function is the inverse function of  util::split_location.
+    @param noabbrev when specified, the location is joined as requested.
+                     Otherwise, default ports are omitted from the result.
    @return location consisting of hostname and optionally port
    @author Gustaf Neumann
    @see util::split_location
} {
    set result ""
    if {$proto ne ""} {
        append result $proto://
@@ -1937,7 +1940,7 @@ ad_proc util::join_location {{-proto ""} {-hostname} {-port ""}} {
        # When the specified port is equal to the default port, omit
        # it from the result.
        #
-        if {$port ne "" && $port eq [dict get {http 80 https 443 udp "" smtp ""} $proto]} {
+        if {!$noabbrev_p && $port ne "" && $port eq [dict get {http 80 https 443 udp "" smtp ""} $proto]} {
            set port ""
        }
    }
@@ -1946,7 +1949,10 @@ ad_proc util::join_location {{-proto ""} {-hostname} {-port ""}} {
    } else {
        append result $hostname
    }
-    if {$port ne ""} {
+    if {$noabbrev_p
+         && $port ne ""
+         && $port eq [dict get {http 80 https 443 udp "" smtp ""} $proto]
+     } {
        append result :$port
    }
    return $result
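
With these changes in place, a location built with -noabbrev should keep the default port (a hypothetical ds/shell check, not captured output):

    util::join_location -noabbrev -proto https -hostname 10.34.203.163 -port 443
    # => https://10.34.203.163:443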

Your diff probably got mixed up, since it partly proposes the changes that I made two days ago.

Anyhow, all elements of "currentServerLocations" should go through the method "qualified_location". You were right: this was still missing when updating it on a node running on a server with a default port. I have updated it in the repository.
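
A sketch of the idea (my own illustration, not the committed code; the method names are the ones used earlier in this thread):

    # Normalize every advertised location through qualified_location, so that
    # entries in DynamicClusterPeers always carry an explicit port.
    set locations [lmap location [:current_server_locations] {
        :qualified_location $location
    }]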

Sorry for the slow reactions; I am currently fully booked with other obligations. Obviously, all testing and setups on our side were done with non-default ports. Furthermore, due to rearrangements of our institution's server infrastructure, I have lost my environment for cluster testing. It will take some more time until I can reestablish it.

BTW: to check the state of the cluster, either use /acs-admin/cluster on the cluster node, or issue the command "acs::cluster serialize" in ds/shell on that node to see the defined variables.

Thanks for the update.

The combination of all the patches from this thread does fix the reachability issue we have been seeing, in my testing.

I do somewhat prefer the option of not having to include the port, which worked with my earlier patch to the reachable method, mainly because we have some supporting code that auto-configures the canonical server and cleans up any stale dynamic peers, and it now needs to account for the port being specified.

diff --git a/packages/acs-tcl/tcl/cluster-procs.tcl b/packages/acs-tcl/tcl/cluster-procs.tcl
index 3db0afb84..a6ffaab79 100644
--- a/packages/acs-tcl/tcl/cluster-procs.tcl
+++ b/packages/acs-tcl/tcl/cluster-procs.tcl
@@ -397,12 +397,20 @@ namespace eval ::acs {
                        # We can check via ns_connchan
                        #
                        try {
-                            #ns_logctl severity Debug(connchan) on
+                            if {![info exists port]} {
+                                if {$proto eq "http"} {
+                                    set port 80
+                                } else {
+                                    set port 443
+                                }
+                            }
+                            # ns_logctl severity Debug(connchan) on
                            ns_connchan connect $host $port
-                        } on error {} {
+                        } on error {errorMsg} {
                            #
                            # Not reachable, stick with the default 0
                            #
+                            ns_log Notice "'$host' is not reachable on port '$port': $errorMsg"
                        } on ok {chan} {
                            set result 1
                            ns_connchan close $chan

On a related topic, does the dynamic peer parameter ever clean itself up based on some inactivity timer?

We have an architecture that scales up and down based on load, and this list of dynamic peers would become cluttered if we didn't do some manual pruning of the parameter.

It is important to keep the cluster nodes in a canonical notation that is provided by a single function, which also works for multiple protocols (including UDP and other potential future protocols).

does the dynamic peer parameter ever clean itself up based on some inactivity timer?
Yes, there is a heartbeat (default: every 20s) checking the liveness of potentially silent nodes. In the latest version, one can configure this interval via the kernel parameter "ClusterHeartbeatInterval". Also, I've updated the cluster node status page to make it more self-explanatory.
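
For example (a hypothetical ds/shell snippet; the parameter name is from this post, and the time-unit value format is an assumption):

    # Run the heartbeat check every 10 seconds instead of the 20s default
    parameter::set_value -package_id $::acs::kernel_id \
        -parameter ClusterHeartbeatInterval -value 10s
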
I reviewed the changes, but I don't see anything that automatically removes inactive nodes from DynamicClusterPeers.

It looks like it would require manual cleanup using the status page as a guide for the level of inactivity?

The recent change was just about the display on the cluster admin page. The detection of not-reachable servers has been there for a while.
Here is a screenshot with the canonical server and an active node

... and a screenshot with the canonical server and an inactive node

It looks like it would require manual cleanup using the status page as a guide for the level of inactivity?
What makes you think this?
If you experience an error, please submit a bug report.

-g

In my testing, nodes have been inactive for an hour or more and were not removed from this page or from the DynamicClusterPeers parameter.

I haven't pulled down the latest, but I didn't see any code change in your commits or in the source I have that does this cleanup.

If there is code that does the automatic cleanup, can you send along a pointer so I can review it?

Thanks.

In my testing nodes have been inactive for an hour or more and not removed from this page or the DynamicClusterPeers parameter.

It is intentional that the cluster-admin page lists inactive cluster nodes, showing their state, last contact, etc. If these nodes were deleted from the admin page, administration would be much harder.

It would be possible to provide a display mode that omits these nodes from the listing. Such a mode might be useful when there is a high number of cluster nodes (more than one page), but I am not sure whether this is needed anywhere.

It would also be possible to distinguish between short network separations (short periods of lost reachability) and final deletion (when one scales a cluster up and later shrinks it down, without any intention of re-including these nodes after a network separation). Maybe this is what you are looking for?

Yes, we will have scaling events that bring up dynamic nodes with new IPs and then bring them down when load decreases. Over time this will lead to a large set of inactive hosts if the load increases and decreases multiple times a day or over a week.

We have some logic to clean up the similar parameters ClusterPeerIP and ClusterAuthorizedIP for this type of situation, but thought DynamicClusterPeers might be self-managing within the core functionality.

Yes, we will have scaling events ...
Since yesterday there is an updated version (to be included in OpenACS 5.10.1) with an additional parameter "ClusterAutodeleteInterval" (default: 2m), which performs an automatic cleanup of dynamic cluster nodes after the specified time without reachability.
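
For example, to inspect the configured value (a hypothetical ds/shell call; the parameter name is from this post):

    # Read the auto-delete interval (default per this post: 2m)
    parameter::get -package_id $::acs::kernel_id -parameter ClusterAutodeleteInterval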

We can also consider a "disconnect_request", in analogy to the "join_request", which could be issued automatically when a dynamic cluster node shuts down gracefully. This would not be a full replacement for the auto-delete mechanism described above.

This will be very helpful. Do you have a timeline for the 5.10.1 release that would include this functionality? I noticed some of the earlier changes in this thread were part of a 5.10.1b update to the relevant packages.
Do you have a timeline for the 5.10.1 release
The timeline for the release was actually the EuroTcl/OpenACS conference (this summer). As announced at the conference, "just" the documentation (a summary of the 45K-line changelog, if I remember correctly) was missing. I'll take this as an opportunity to bring it up again internally.

Tcl 9 should be released by the end of this year, and, judging from the large number of changes required in tcllib to make it Tcl 9 compliant, it is not unlikely that there will be many changes in OpenACS as well. Therefore, it might be a good idea to get OpenACS 5.10.1 out this year as well, to provide a "stable" version, and to continue with the Tcl 9 compliant version in a different branch.

OK. I see that in the last commit you bumped the package versions to support the new "auto cleanup" parameter, so we should be able to use those 5.10.1b versions until the release is official.

Thanks.

Dear Jonathan,
The change implementing automatic disconnects when a dynamic cluster node is shut down is now committed to the repository:

https://github.com/openacs/openacs-core/commit/7cbc3e63c6ac6b0ab5647f0d9fdf82fe883ded10

all the best
-g