Forum OpenACS Q&A: Naviserver upgrade issue on RHEL 7.9

Hi all,

We are upgrading an older installation running on RHEL 7.9 from OpenACS 5.8 to 5.10 and, in the process, upgrading NaviServer/Tcl. Using the latest install-ns script, we were able to build NaviServer 4.99.31 with the defaults for dependencies (Tcl 8.6.16, etc.).

When trying to start the server, even with the included simple config, the output looks like this:

$ sudo /usr/local/ns/bin/nsd -u nsadmin -g nsadmin -f -t /usr/local/ns/conf/simple-config.tcl
[-main:conf-] Notice: OpenSSL 1.0.2k-fips 26 Jan 2017 initialized (pid 23963)
[-main:conf-] Notice: initialized locale en_US.UTF-8 from environment variable LANG
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:conf-] Notice: nsmain: NaviServer/4.99.31 (tar-4.99.31) starting
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:conf-] Notice: nsmain: security info: uid=1002, euid=1002, gid=1002, egid=1002
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:conf-] Notice: nsmain: Tcl version: 8.6.16
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:conf-] Notice: nsmain: max files: soft limit 4096, hard limit 4096
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:conf-] Warning: nsmain: current limit of maximum number of files > FD_SETSIZE (1024), select() calls should not be used
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: init server default: using zlib version 1.2.7
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: pool default: queueLength 90 low water 9 high water 72
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: nsd/init.tcl[default]: booting virtual server: Tcl system encoding: "utf-8"
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: modload: loading module nslog from file nslog
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: nslog: opened '/usr/local/ns/logs/access.log'
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: modload: loading module nssock from file nssock
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: nssock:0: enable 0 spooler thread(s)
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Notice: nssock:0: enable 0 writer thread(s)
[30/Apr/2025:13:23:36][23963.7f58bdf0b980][-main:default-] Fatal: received fatal signal 11
Aborted

strace and gdb have not helped turn up any clues.

Any suggestions?

Thanks.

Posted by Gustaf Neumann on

Hi Michael,

My first guess is that this comes from a binary mismatch (C based components compiled with a different Tcl version).

Compile with debugging enabled in a fresh build directory, e.g. with the following command:

sudo with_debug_flags=1 build_dir=/usr/local/ns-src \
     bash install-ns.sh build

If you still see a crash, run nsd under gdb and show me the backtrace. If the problem persists after that, I will try to set up a VM with RHEL 7.9 somewhere.

all the best
-g

Posted by Michael Steigman on
Thanks, Gustaf. Clean build directory remedied the issue.

We're in the process of slowly upgrading our project to RHEL 9. The build was the first step.

We're currently running the same Nginx proxy settings as those in front of our stage/prod server and the same config.tcl. However, with this new build and OpenACS 5.10 imported into our code base, page requests for / result in "too many redirects" errors. We're looking at cookie settings and anything else that could be in play here. Do you have any suggestions? Are any adjustments necessary to config.tcl or on the NS side?

For example, from a different machine, a curl command like

curl -i -L https://mydomain.com/acs-admin/

results in a stream of 302s to

Location: https://mydomain.com/acs-admin/?
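
A chain like this is easier to read when the response headers are condensed to one line per hop. The following is only a minimal sketch: the function name summarize_redirects is made up for illustration, it expects the header dump that curl writes with -D - on stdin, and it reports a loop as soon as the same Location value shows up twice.

```shell
# Summarize a redirect chain from HTTP response headers read on stdin.
# Prints one "STATUS LOCATION" line per response that carries a Location
# header, followed by a final verdict line: "loop" or "no-loop".
# (summarize_redirects is a hypothetical helper, not part of curl or OpenACS.)
summarize_redirects() {
    awk '
        { sub(/\r$/, "") }              # strip the CR of CRLF line endings
        /^HTTP\// { status = $2 }       # remember the most recent status code
        tolower($1) == "location:" {
            print status, $2
            if (seen[$2]++) loop = 1    # same target seen before: a loop
        }
        END { print (loop ? "loop" : "no-loop") }
    '
}

# Possible usage against a live server (mydomain.com is the placeholder
# used above); --max-redirs caps the number of hops curl follows:
#   curl -s -o /dev/null -D - -L --max-redirs 5 \
#        https://mydomain.com/acs-admin/ | summarize_redirects
```

For the case above, every hop would print the same "302 https://mydomain.com/acs-admin/?" line, and the last line would read "loop".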

Thanks.

Posted by Gustaf Neumann on

Glad that the clean install helped for the original problem!

Concerning the redirects: Maybe the following can shed light on this.

Add the following code to the end of packages/acs-tcl/tcl/utilities-procs.tcl, effectively redefining ad_returnredirect to be verbose.

ad_proc -public ad_returnredirect {
    {-message {}}
    {-html:boolean}
    {-allow_complete_url:boolean}
    target_url
} {
    Write the HTTP response required to get the browser to redirect to
    a different page, to the current connection. This does not cause
    execution of the current page, including serving an ADP file, to
    stop. If you want to stop execution of the page, you should call
    ad_script_abort immediately following this call.

    <p>

    This proc is a replacement for ns_returnredirect, but improved in
    two important respects:
    <ul>
    <li>
    When the supplied target_url isn't complete, (e.g. /foo/bar.tcl or
    foo.tcl) the prepended location part is constructed by looking at
    the HTTP 1.1 Host header.
    </li>
    <li>
    If a URL relative to the current directory is supplied
    (e.g. foo.tcl) it prepends location and directory.
    </li>
    </ul>

    @param message A message to display to the user. See
                   util_user_message.

    @param html Set this flag if your message contains HTML. If
                specified, you're responsible for proper quoting of
                everything in your message. Otherwise, we quote it for
                you.

    @param allow_complete_url By default we disallow redirecting to
                              URLs outside the current host. This is
                              based on the currently set host header
                              or the hostname in the config file if
                              there is no host header. Set
                              allow_complete_url if you are
                              redirecting to a known safe external web
                              site. This prevents redirecting to a
                              site by URL query hacking.

    @see util_user_message
    @see ad_script_abort
} {
    ad_log warning "ad_returnredirect allow_complete_url $allow_complete_url_p target_url <$target_url>"
    if {$message ne ""} {
        #
        # Leave a hint, that we do not want to be consumed on the
        # current page.
        #
        set ::__skip_util_get_user_messages 1
        util_user_message -message $message -html=$html_p
    }

    if { [util_complete_url_p $target_url] } {
        ns_log notice "ad_returnredirect is complete <$target_url>"
        # http://myserver.com/foo/bar.tcl style - just pass to ns_returnredirect
        # check if the hostname matches the current host
        if {[util::external_url_p $target_url] && !$allow_complete_url_p} {
            error "Redirection to external hosts is not allowed."
        }
        set url $target_url
    } elseif { [util_absolute_path_p $target_url] } {
        #
        # The URL is an absolute path such as: /foo/bar.tcl
        #
        set url [expr {[::acs::icanuse "relative redirects"] ? "" : [util_current_location]}]
        append url $target_url
        ns_log notice "ad_returnredirect path is absolute, updated URL <$url>"
    } else {
        #
        # URL is relative to current directory.
        #
        set url [expr {[::acs::icanuse "relative redirects"] ? "" : [util_current_location]}]
        append url [ad_urlencode_folder_path [util_current_directory]]
        if {$target_url ne "."} {
            append url $target_url
        }
        ns_log notice "ad_returnredirect path is relative, updated URL <$url>"
    }

    # Sanitize URL to avoid potential injection attack
    regsub -all -- {[\r\n]} $url "" url

    ns_log notice "ad_returnredirect final redirect to <$url>"
    ns_returnredirect $url
}

I can't exclude that NaviServer 4.99.31 might contribute to the problem. To try with NaviServer 5, rebuild with

sudo with_debug_flags=1 version_ns=GIT build_dir=/usr/local/ns5-src \
     bash install-ns.sh build

all the best
-g

Posted by Michael Steigman on
Thanks for the suggestions. In trying to build NS from git, I ran into this error:

tls.c: In function ‘Ns_TLS_CtxClientCreate’:
tls.c:1309:9: error: unknown type name ‘SSL_verify_cb’
SSL_verify_cb verifyCB = NULL;
^

A little searching led me to install openssl11 and openssl11-devel to pick up this new type name. However, I haven't been able to instruct NS to use the newer version. I tried exporting OPENSSL_CFLAGS, OPENSSL_LIBS, CPPFLAGS and LDFLAGS along with modifying the script with --with-openssl=/usr.

Any ideas on how to move past this?

Posted by Michael Steigman on
I was able to move past that issue with SSL. I installed the optional openssl11 packages, then modified src/naviserver/include/Makefile.global and changed the following lines to reference the newer version:

OPENSSL_LIBS = -L/usr/lib64/openssl11 -lssl -lcrypto
CFLAGS += -I/usr/include/openssl11 -I/usr/include

I am in the process of trying to work through some Tcl errors but do not seem to be dealing with redirects under NS5 any more.

I will follow up with any other questions as I come across them. Thanks.

Posted by Gustaf Neumann on
New Insight: Infinite Redirection Caused by Root‑Node Permission Drop

We’ve identified a likely root cause of the infinite redirection issue: when the read permission on the top‑level site node (/) is unintentionally removed, anonymous visitors get stuck in a redirect loop.

What Happened

  1. Navigate to the “/” site‑map permissions form.
  2. Click Confirm Permission Settings without making any changes.
  3. A bug prevented direct (read‑only) permissions from being resubmitted, so they were dropped.
  4. As a result, anonymous users see: “The page isn’t redirecting properly”

This issue can occur when the read permissions are removed from the top-level site-node entry (/). These permissions were erroneously dropped when the “/” site‑map permissions form was submitted without any changes; a bug removed the permissions in this situation.

Huge thanks to Khy H for reporting this bug, which is present in the OpenACS 5.10.1 release!

The problem is fixed in the main and oacs-5-10 branches, and is tagged with openacs-5-10-compat. Users upgrading the acs-subsite package from the repository will automatically receive the patch (starting tomorrow, after the nightly rebuild of the repository archives).

See full details at:
https://openacs.org/bugtracker/openacs/bug?bug_number=3477

Posted by Michael Steigman on
Making progress on our upgrade. One issue I wanted to run by you, Gustaf, relates to the upgrade logic for xowiki. We're at v0.89 from 5.8 days. Our xotcl-core is at 0.126, and we're upgrading to NaviServer 5 with NSF 2.4.

The Tcl upgrade logic has a lot of these types of upgrades:

::$package_id import-prototype-page "page-type"

Ours is failing at the first version upgrade that includes an invocation of this method - 0.96.

::11675808: unable to dispatch method 'import-prototype-page'
while executing
"error "[self]: unable to dispatch method '$m'""

Hoping you can point us in the right direction on this.

Posted by Gustaf Neumann on

Hi Michael,

I did a bit of software archaeology, and the issue seems to be rooted in a very old upgrade path.

Some relevant points:

Regarding the upgrade path: The correct way to upgrade from OpenACS 5.4 to 5.10 is to go step by step through the intermediate releases:

5.4 → 5.5 → 5.6 → 5.7 → 5.8 → 5.9 → 5.10

If these steps are skipped, upgrade scripts will run in an environment that lacks the expected set of functions and APIs for that release. In that case, it’s not surprising that functions like this one fail - and it’s quite likely that additional issues will surface as well.
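
The staged path can be enumerated mechanically. Below is a minimal sketch with a hypothetical helper named upgrade_steps. It assumes plain major.minor version strings with the same major number, and it only lists the releases to pass through; it does not perform any upgrade.

```shell
# Print the minor releases to step through, one per line, when upgrading
# from version $1 to version $2 (both given as MAJOR.MINOR, e.g. 5.4).
# upgrade_steps is a hypothetical helper; it sets a few global shell
# variables and performs no upgrade itself.
upgrade_steps() {
    from_major=${1%%.*}; from_minor=${1#*.}
    to_major=${2%%.*};   to_minor=${2#*.}
    if [ "$from_major" != "$to_major" ]; then
        echo "major versions differ: $1 vs $2" >&2
        return 1
    fi
    m=$((from_minor + 1))
    while [ "$m" -le "$to_minor" ]; do
        echo "$from_major.$m"
        m=$((m + 1))
    done
}

# Example: list the intermediate targets for the path described above.
upgrade_steps 5.4 5.10
```

Each printed release would then be installed and brought up once, so that its upgrade scripts run against the API they were written for, before moving on to the next one.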

That said, I’ve added backward-compatibility procedures for the www-* methods to both the HEAD and oacs-5-10 branches. If you refresh your checkout, these procs might ease your upgrade, but they should be seen as a temporary aid rather than a substitute for a proper staged upgrade.

Best regards,
Gustaf

Posted by Michael Steigman on
Thanks for the help, Gustaf!

We are running on top of a base of 5.8.0. If my memory is correct, we hit some issues bringing xowiki up to the version shipped with 5.8 and left it where it was as it was working.

So we aren't trying to make a massive leap from 5.4. We did make an attempt to move to 5.9.1 but ran into unrelated stack issues (which resulted in the creation of this thread) and abandoned that in favor of jumping to 5.10. We have seen some inconsistencies across the upgrades, even within versions. For example, 5.9 removes acs_object_context_index, but we hit scripts in the 5.9 series that were still referencing it, sometimes indirectly. (If curious, I can share a diff...)

We eventually decided to try to move directly to 5.10, and that's where we are at the moment. Aside from the Xowiki issues and some adjustments to DML to account for the acs_object_context_index issues mentioned, we're very close.

Is there anything else we'd need for those deprecated aliases to work? We hit a new error when upgrading Xowiki with the deprecated procs in place:

invalid non-positional argument '-with_child_rels', valid are: -dbn, -name, -parent_id, -item_id, -locale, -creation_date, -creation_user, -context_id, -creation_ip, -item_subtype, -content_type, -title, -description, -mime_type, -nls_language, -text, -data, -relation_tag, -is_live, -storage_type, -package_id;
should be "::acs::dc call content_item new ?-dbn /value/? -name /value/ ?-parent_id /int32/? ?-item_id /int32/? ?-locale /value/? ?-creation_date /value/? ?-creation_user /int32/? ?-context_id /int32/? ?-creation_ip /value/? ?-item_subtype /value/? ?-content_type /value/? ?-title /value/? ?-description /value/? ?-mime_type /value/? ?-nls_language /value/? ?-text /value/? ?-data /value/? ?-relation_tag /value/? ?-is_live /value/? ?-storage_type /value/? ?-package_id /int32/?"
::acs::dc ::acs::db::nsdb-postgresql->call
invoked from within
"::acs::dc call content_item new -name ${:name} -parent_id ${:parent_id} -creation_user $creation_user -creation_ip $creatio..."
("uplevel" body line 25)

We also got the following error several times, when packages invoked a Tcl callback (I believe):

invalid command name "::xowiki::Package"
while executing
"::xowiki::Package is_xowiki_p $package_id"
(procedure "::callback::subsite::parameter_changed::impl::xowiki" line 3)

Thanks for all of your help!

Posted by Gustaf Neumann on
Hi Michael,

The version numbers you provided suggest a slightly different situation. A cherry-picked installation (as I understand from your last post) tends to make upgrades more complicated rather than easier.

The error about the missing `with_child_rels` flag is most likely due to an incomplete upgrade of the Content Repository. The `with_child_rels` option was introduced with OpenACS 5.9 (roughly 11 years ago). During startup, **xotcl-core** reads the definitions of the stored procedures present in the database and exposes them via its API. If `-with_child_rels` is not available in this API, this strongly indicates that the corresponding database procedures were not upgraded.

It will probably be easier to first upgrade **acs-core** cleanly to 5.10 and only afterward upgrade the application packages. This staged approach usually makes it much clearer where a problem originates.

The missing `::xowiki::Package` class points to a startup failure as well: the startup file likely aborted with an error before the class definition was reached.

Posted by Michael Steigman on
Thanks for the continued hand-holding! :) Upgrade of core alone followed by a restart got us further. Core upgraded cleanly.

Upon restart, we upgraded Xotcl-core and Xowiki together. Those two pulled in a few other packages as well. On the upgrade, we hit this error:

Notice: ### db_with_handle returned error <Database operation "0or1row" failed (exception ERROR, "ERROR: invalid input syntax for type integer: "top_portlet"
: LINE 4: where parameter_id = 'top_portlet';
: ERROR: invalid input syntax for type integer: "top_portlet"
: LINE 4: where parameter_id = 'top_portlet';

Looking at Tcl upgrade logic for Xowiki, there are a few instances of "copy/delete_parameter top_portlet" in the upgrade path. I checked the source of the utility proc but I don't have the context to make a guess as to what could have caused this specific issue. Any ideas?

Also, is there any specific ordering of Xotcl-core and Xowiki that's helpful during an upgrade?

Thanks again!

Posted by Gustaf Neumann on

Hi Michael — nice, that is good news.

About the top_portlet error: this looks like a bug/ambiguity in apm_parameter_unregister. The generated SQL tries to match

where parameter_id = 'top_portlet'

so Postgres complains because parameter_id is an integer.

This is likely triggered by XoWiki’s upgrade step to 0.120, which includes:

set v 0.120
if {[apm_version_names_compare $from_version_name $v] == -1 &&
    [apm_version_names_compare $to_version_name $v] > -1} {
  ns_log notice "-- upgrading to $v"
  delete_parameter top_portlet
}

Historically, an earlier upgrade (towards ~0.79, ages ago) renamed/copied the parameter from top_portlet to top_includelet. The later step (0.120 - 17 years ago) just tries to delete the obsolete name. Everything works fine even if this cleanup doesn’t happen; it’s housekeeping.
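
The guard in the snippet above runs a step only when the installed version is lower than the step's version and the target version is equal or higher. Here is a simplified shell sketch of that condition. The names version_compare and step_applies are hypothetical, the comparison relies on GNU sort -V, and the suffix handling of the real apm_version_names_compare (development, alpha and beta markers) is left out.

```shell
# Print -1, 0, or 1 depending on whether version $1 is lower than, equal
# to, or higher than version $2. Relies on GNU "sort -V" (version sort),
# which compares numeric components as numbers, so 0.89 sorts before 0.120.
# (Hypothetical stand-in for apm_version_names_compare, plain dotted
# numbers only.)
version_compare() {
    if [ "$1" = "$2" ]; then
        echo "0"
    elif [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]; then
        echo "-1"
    else
        echo "1"
    fi
}

# Succeed when an upgrade step tagged with version $3 should run for an
# upgrade from version $1 to version $2, i.e. when $1 < $3 <= $2.
step_applies() {
    [ "$(version_compare "$1" "$3")" -eq -1 ] &&
    [ "$(version_compare "$2" "$3")" -gt -1 ]
}

# Example: coming from 0.89, an upgrade that reaches 0.120 runs the step.
if step_applies 0.89 0.120 0.120; then
    echo "step 0.120 runs"
fi
```

Note that the comparison is by version component, not by decimal value: 0.89 counts as older than 0.120, which is why an installation at 0.89 still passes through this step.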

I’ve fixed the underlying apm_parameter_unregister issue in CVS HEAD and in oacs-5-10. So the practical fix is: update to a revision that includes this patch (acs-tcl/APM), then re-run the upgrade.

If you want to inspect your current DB state, you can check whether these parameters exist and what they contain:

select v.package_id, p.parameter_name, v.attr_value
from   apm_parameters p
join   apm_parameter_values v on (v.parameter_id = p.parameter_id)
where  p.parameter_name = 'top_portlet';

and:

select v.package_id, p.parameter_name, v.attr_value
from   apm_parameters p
join   apm_parameter_values v on (v.parameter_id = p.parameter_id)
where  p.parameter_name = 'top_includelet';

Regarding ordering: no worries there — OpenACS/APM computes the upgrade order from dependencies, so you don’t need to handle xotcl-core vs. xowiki sequencing manually.

all the best -g

Posted by Michael Steigman on
I think we're in the clear! I was able to upgrade core and then Xo* without errors. I did get the following error, which appears to have occurred after the APM logic and the package version update (note the "Package enabled"):

Package enabled.Error: required argument 'key' is missing, should be: ::xo::xotcl_object_type_cache flush ?-partition_key /value/? /key/

Is this something you've seen before?

Thanks again!

Posted by Gustaf Neumann on
Hi Michael,

this is very good news - congratulations on getting this far!

The message about the missing required argument is important and
should not be ignored. It is triggered by an upgrade script for
xowiki.127 (released about 17 years ago), where the cache is
flushed. The main issue is that this exception aborts the upgrade
process, which in turn may prevent later upgrade scripts from being
executed.

The root cause is a change in the caching infrastructure that happened
roughly ten years ago. At that time, OpenACS introduced partitioned
caches for improved scalability. With partitioned caches, a simple
flush operation now requires a key to determine the
partition. However, for the intended use case in this old upgrade
script, the correct operation is actually flush_all, which does not
require a key.

I have fixed this issue both in CVS HEAD and in the oacs-5-10 branch,
so upgrading from there should no longer hit this problem.

Please let me know if you run into anything else - upgrading a
Methuselah system is always an adventure!

all the best
-g