Forum OpenACS Q&A: connection socket is detached

Collapse
Posted by Claudio Pasolini on
I recently upgraded a customer installation from an old oacs-5-5 to an intermediate oacs-5-8 and from pg-8.2 to pg-9.6 on a new Debian 10 server with the same resources as the old server.

In general all went well, but sometimes a program don't comes to an end, finishing with a blank screen. The program uses an external binary via exec to produce a pdf that must be returned to the connection.

Once the problem appears, it persists. Restarting Naviserver resolves the problem, but with time the problem reappears.

In the error log there is the following:

: connection socket is detached
: while executing
: "ns_return 500 text/html $rendered_page"
: (procedure "rp_report_error" line 43)
: invoked from within
: "rp_report_error"
: (procedure "rp_handler" line 147)
: invoked from within
: "rp_handler"
: while executing callback
: ns:tclrequest rp_handler
: (context: request proc)

As a first measure I increased the swap file, but it seems that the swap memory is only partially used, so there must be some other bottleneck or race conditions.

The PostgreSQL log doesn't show any clue. The server runs naviserver-4.99.22 and tcl8.6.

Any suggestion would be very appreciated.

Claudio

Collapse
Posted by Gustaf Neumann on

Dear Claudio,

Probably you are facing different issues. Let's first look at "connection socket is detached", what it means, why this error makes sense, and how to avoid it.

Look at the following script:

 #
 # script www/test-me.tcl
 #
 if {[ns_queryget stop 0]} {
    ns_return 200 text/plain "break out"
 }
 set msg "end of script"
 ns_return 200 text/plain  $msg

When test-me.tcl is called from the browser without parameter, it returns a page with "end of script". When it is called with the query parameter stop=1, it enters the if branch and returns "break out" to the client. The important thing is after the ns_return: in both AOLserver and NaviServer, the script continues to execute and reaches (after probably some more commands) the 2nd ns_return. However, this 2nd ns_return can't work, since the script has already replied to "break out" to the user and cannot reply on the same connection once again. In general, the continuation after the command can lead to situations and strange error messages which are hard to debug.

Therefore, NaviServer raises since version 4.99.16 (released 2017-12-29) an exception in such situations, telling the user that this script can't work as programmed.

The continuation after ns_return and similar commands is probably not expected and not wanted by the developer. The right thing to address this issue is to add a return (in an executing script) or a ad_script_abort in library functions.

 #
 # script www/test-me.tcl
 #
 if {[ns_queryget stop 0]} {
    ns_return 200 text/plain "break out"
    return   ;# HERE IS THE CHANGE
    #ad_script_abort
 }
 set msg "end of script"
 ns_return 200 text/plain  $msg

In all the maintained packages (~100) in the public OpenACS repository, such issues has been fixed.

It is possible to avoid raising an exception in such cases by setting in the NaviServer configuration file in section ns/params the parameter rejectalreadyclosedconn to false (see e.g. in [2]). This is not recommended. There is as well a page in the wiki [3] addressing this situation.

Hope this helps
-g

[1] NaviServer 4.99.16
[2] OpenACS configuration file
[3] connection-already-closed

Collapse
Posted by Claudio Pasolini on
Dear Gustaf,

thank you for the pointer. Actually one of the programs made use of ad_progress_bar_end (with redirect) without calling ad_script_abort: fixing it solved the problem of the detached socket.

Unfortunately the program that presents the problem more frequently can't be fixed in the same way.

The program prints one or more delivery notes. It creates one file (possibly quite large), converts it to pdf calling via exec the external binary trml2pdf (so making use of ns_proxy), sends the pdf via email to the client and finally sends the pdf to the connection via ns_writer and then aborts. In case of multiple delivery notes, before sending the pdf to the connection, the program calls another external binary (pdftk) to concatenate the various pdf produced.

Apparently there are no intermediate ns_return that can trigger the problem. Restarting Naviserver makes the problem temporary disappear and there is no easy way to reproduce the error.

I suspect a possible race condition within ns_proxy. I'll try to get some stats logging the result of 'ns_proxy stats exec_proxy' before and after the exec calls.

Claudio

Collapse
Posted by Gustaf Neumann on
You have to watch out for all kind of commands trying to write to the console (including redirects, etc.). If you can come up with a simple test-case, I could have a look into this...
Collapse
Posted by Claudio Pasolini on
The socket is detached immediately after sending the pdf to the connection via:
ns_writer submitfile $path

After that it continues deleting some temporary files and then aborts without any further ns_return.

The error in the log is somewhat confusing, but what worried me was the occasional malfunction of trml2pdf.

Thank you again,
Claudio

Collapse
Posted by Maurizio Martignano on
Dear Claudio and Gustaf,
being this a legacy script, running also on Aolserver, I believe that the delete operations called after
"ns_writer submitfile $path"
are implemented via "ns_unlink".

But calling "ns_unlink" on Naviserver 4.99.22 generates the exception/message about the function being deprecated.

It is this exception/message that clashes with "ns_writer" and causes the error "connection socket is detached".

Replacing "ns_unlink" with "file delete" should avoid the exception/message about the function being deprecated and therefore should make the error "connection socket is detached" disappear.

Hope it helps,
Maurizio

Collapse
Posted by Gustaf Neumann on
Maurizio,
being this a legacy script, running also on Aolserver, I believe
that the delete operations called after
"ns_writer submitfile $path" are implemented via "ns_unlink".

About which script are you talking? Aolserver did not have an "ns_writer" command. In normal life, "ns_unlink" does not raise an exception, but writes a depreciation message to the log file.

Collapse
Posted by Maurizio Martignano on
Dear Gustaf,
I would ask Claudio, as quick test, to replace in his script all the "ns_unlink" calls after "ns_writer" with "file delete" and report if anything changes.

Then we'll have more information on the actual case at hands.

Cheers,
Maurizio

Collapse
Posted by Claudio Pasolini on
Dear Maurizio,

you are wrong. I put the following line immediately before and after ns_writer:
ns_log notice "\nThe connection is [ad_conn isconnected]"
and as I already said the connection is detached immediately after ns_writer.

Anyway it's a good idea to replace the deprecated calls with the suggested ones.

Claudio

Collapse
Posted by Maurizio Martignano on
Dear Claudio and Gustaf,
as Claudio knows I have access to his script.

Now replacing "ns_unlink" with "file delete" as I mentioned
produces the following log
"
ns_proxy proxies 8 waiting 0 maxworkers 8 free 8 used 0 requests 602 processes 2 runtime 600.150454
[15/Mar/2022:09:08:07][3830.7f68da7cd700][-conn:trinita:default:3:4841-] Notice:
: ns_proxy proxies 8 waiting 0 maxworkers 8 free 8 used 0 requests 603 processes 2 runtime 600.607638
[15/Mar/2022:09:08:07][3830.7f68da7cd700][-conn:trinita:default:3:4841-] Warning: ns_rename /tmp/delivery-1175494-6288602.pdf /tmp/delivery-1175494.pdf is deprecated. Use 'file rename' instead!
"

that is no "connectIon socket is detached".

On the contrary, going back to "ns_unlink" , produces this log:
"
[15/Mar/2022:09:14:28][3830.7f68da7cd700][-conn:trinita:default:3:4849-] Warning: ns_unlink /tmp/delivery-1175494-6288602.tmp.rml is deprecated. Use 'file delete' instead!
[15/Mar/2022:09:14:28][3830.7f68da7cd700][-conn:trinita:default:3:4849-] Error: GET /mis-sales/deliveries/print, PeerAddress: 92.222.178.125
: connection socket is detached
: while executing
: "ns_return 500 text/html $rendered_page"
: (procedure "rp_report_error" line 43)
: invoked from within
: "rp_report_error"
: (procedure "rp_handler" line 147)
: invoked from within
: "rp_handler"
: while executing callback
: ns:tclrequest rp_handler
: (context: request proc)
"
And this is a fact.

Now you both have more information to take into consideration...

Again, hope it helps.

Collapse
Posted by Gustaf Neumann on
Maybe, you guys sort this out internally, and tell me, when i can help. The message above indicates that the deprecated ns_rename is used as well....

The problem is that the script executed a "ns_return 500" triggered by "rp_report_error", which cannot work, when the connection is already gone.

Collapse
Posted by Maurizio Martignano on
Dear Gustaf,
eventually we modified the script replacing "ns_unlink" with "file delete".
This is what we have at the moment in the script:
...
ns_writer submitfile $path
# here it seems the connection gets closed
# file delete $some_other_path #is ok
# ns_unlink $some_other_path #produces the error message reported before

Hope this helps,
Maurizio

Collapse
Posted by Gustaf Neumann on
Glad, that everything is fine with your script. A few comments to this snippet

ns_writer submitfile $path
# here it seems the connection gets closed

It is rather unlikely that at the place of your comment, the connection is closed, but it is detached. What does this mean: the command "ns_writer submitfile" passes the control to a writer thread (assuming you have writer threads defined), which will asynchronously deliver the file content of this connection to the client. It is unlikely that at the time when you execute the next command in this script, the delivery is already finished. The duration of the delivery depends on the client and its connection: when the client is e.g. connected via smartphone over a cellular network, this will take longer than a user connected from the same LAN.

# file delete $some_other_path #is ok
# ns_unlink $some_other_path #produces the error message reported before

The deprecated command "ns_unlink" has three conditions [1], when it raises an exception (wrong number of arguments, trying to delete a directory, trying to delete a non-existing file) - all of these look reasonable to me, but probably, one of these fired in your case.... and for reason not visible from your report "file delete" does not raise these (note, that "ns_unlink" executes as last command "file delete" as well).

When the command raises an exception in a connection thread in detached state, "rp_handler" will catch the exception, and it tries to deliver the error message on the detached connection. This can't work, since the writer thread is at the same time busy to deliver the file on this connection.

Therefore, OpenACS got a change ~16 months ago to avoid the error message you have reported via change [2]. Probably, you are using some older version.

-gn

[1] https://bitbucket.org/naviserver/naviserver/src/bdaf38422b0da8ffd96623fb569ca131c6a1ea3e/tcl/compat.tcl#lines-159
[2] https://github.com/openacs/openacs-core/commit/20ac35a36395bd0bfd7e378a7192a4a359ab5c39

Collapse
Posted by Maurizio Martignano on
Dear Gustaf,
thank you for your analysis and explanations.
Maurizio