Forum OpenACS Q&A: Re: NaviServer stops serving connections when slow external api calls pile up.

Thank you for your earlier responses. I have sifted through our earlier logs to gather more information to, hopefully, more precisely describe the issue we are seeing.

To restate: we are seeing issues where ns_http does not timeout and return after the specified expire value. We cannot seem to manually, intentionally reproduce the problem. When it happens, it seems that we are either having network issues or there is a problem with the target system but, we have not been able to identify these details during an incident. It is precisely these conditions under which we would want ns_http to be able to timeout and abandon the request, releasing the attempted connection to the target system, and returning to finish processing the current request to NaviServer.

Instead we experience ns_http not timing out with more and more requests stacking up and the system getting bogged down. The current fix when this occurs, since we have access, is to restart the target system. Once we do, NaviServer returns to its functioning state.

So, in short, the timeout works if all is well with the network and the target system. If the request just takes longer than desired, it will timeout. However, if there are "other problems", the timeout does not work.

After your last comment about the -expire and the -timeout arguments to ns_http, I have done a considerable amount of testing where I am able to verify that they both work as described in the documentation, under normal, functioning conditions. In my case, I have found the successful threshold using -timeout 134ms and -expire 152ms. Shortening either of these values by 1 or 2 ms will normally cause the call to timeout. So I know that the timeout does work under normal conditions.

With all of that said, here are some log entries that show the problem as we experienced it during the last event. There are several requests to the target system shown. There are timings reported and the response from the call. These are listed sequentially as they were logged. Any log entries that did not pertain to the call in question are replaced with ... (3 dots). The call to the target system is made inside a process that gathers and returns a bunch of information. The log includes a run_num: that is a unique, random string to identify the process run (this was added to try to debug this issue), a start run time: and end run time: which is the time from the start of the process to right before the call is made and right after the call returns, respectively, the call: being made, the error (if there was one), and the dur:, the duration of the call (which is the difference between the start and end times). There are comments added before each process instance indicating if it was successful or not and whether the result was expected or not. This shows that the system went from timing-out correctly to "ignoring" the timeout and not returning for up to 30 seconds.

I will dissect the first instance in an attempt to explain the information provided and to set the pattern that the others follow.

######  -- Correct timeout
[04/Jan/2023:05:07:43][1.7fb883fff700][-conn:east_intra:default:19:4500-] Notice: pkg-notices: in_box_notices; run_num: 03A965C1; start run time: 1537 ms; call: wt::jira::notice::get_notice

The above, first entry, indicates that the response was Correct and did actually timeout appropriately. It shows the time in the process right before the call is made: start run time: 1537 ms; (::get_notice contains the call to ns_http)

[04/Jan/2023:05:07:44][1.7fb883fff700][-conn:east_intra:default:19:4500-] Error: The request to Jira expired after the specified length: 500ms, for user: user1;
:     status:
:     result: http request timeout
:     error_info: http request timeout
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:05:07:44][1.7fb883fff700][-conn:east_intra:default:19:4500-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira expired}

The above few lines show the result of the call. This call did in fact timeout and the response contained the result error information about the http request timing out.

[04/Jan/2023:05:07:44][1.7fb883fff700][-conn:east_intra:default:19:4500-] Notice: pkg-notices: in_box_notices; run_num: 03A965C1; end   run time: 2071 ms; call: wt::jira::notice::get_notice; dur: 534 ms

The above line shows the time right after the call which indicates the end run time: 2071 ms which then has a corresponding duration of dur: 534 ms. To make sure that this "end" entry corresponds with the above "start" entry, the run_num: 03A965C1 identifies these entries as belonging to the same "process instance". Yes, this is slightly longer than the 500 ms for the timeout but, there is the surrounding call and return that can account for that additional time.

The following are additional log excerpts showing subsequent calls with their entries following the same pattern as this one just explained.

######  -- Correct timeout
[04/Jan/2023:10:28:12][1.7fb8cfc7e700][-conn:east_intra:default:4:64408-] Notice: pkg-notices: in_box_notices; run_num: D9D43629; start run time: 1630 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:10:28:12][1.7fb8cfc7e700][-conn:east_intra:default:4:64408-] Error: The request to Jira expired after the specified length: 500ms, for user: user2;
:     status:
:     result: http request timeout
:     error_info: http request timeout
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:10:28:12][1.7fb8cfc7e700][-conn:east_intra:default:4:64408-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira expired}
[04/Jan/2023:10:28:12][1.7fb8cfc7e700][-conn:east_intra:default:4:64408-] Notice: pkg-notices: in_box_notices; run_num: D9D43629; end   run time: 2162 ms; call: wt::jira::notice::get_notice; dur: 532 ms

######  -- Successful run NO timeout
[04/Jan/2023:13:18:42][1.7fb8cdc7a700][-conn:east_intra:default:8:108041-] Notice: pkg-notices: in_box_notices; run_num: DDF11C73; start run time: 1606 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:18:42][1.7fb8cdc7a700][-conn:east_intra:default:8:108041-] Notice: pkg-notices: in_box_notices; run_num: DDF11C73; end   run time: 2034 ms; call: wt::jira::notice::get_notice; dur: 428 ms

######  -- Bad run, does not appear to timeout but has a duration of longer than 500 ms and no timeout  It is possible that the http call timed out but, something else got stuck or that it did not log the error response correctly. Wat!
[04/Jan/2023:13:18:51][1.7fb8a6ffd700][-conn:east_intra:default:21:108096-] Notice: pkg-notices: in_box_notices; run_num: 4F6A7C4C; start run time: 1708 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:18:55][1.7fb8a6ffd700][-conn:east_intra:default:21:108096-] Notice: pkg-notices: in_box_notices; run_num: 4F6A7C4C; end   run time: 5552 ms; call: wt::jira::notice::get_notice; dur: 3844 ms {{no timeout?}}

######  -- Bad run timed out but long after the 500 ms
[04/Jan/2023:13:19:45][1.7fb8cec7c700][-conn:east_intra:default:5:108393-] Notice: pkg-notices: in_box_notices; run_num: 2C241021; start run time: 1562 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:19:55][1.7fb8cec7c700][-conn:east_intra:default:5:108393-] Error: The request to Jira expired after the specified length: 500ms, for user: user3;
:     status:
:     result: http request timeout
:     error_info: http request timeout
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:13:19:55][1.7fb8cec7c700][-conn:east_intra:default:5:108393-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira expired}
[04/Jan/2023:13:19:55][1.7fb8cec7c700][-conn:east_intra:default:5:108393-] Notice: pkg-notices: in_box_notices; run_num: 2C241021; end   run time: 11185 ms; call: wt::jira::notice::get_notice; dur: 9623 ms

######  -- Bad run timed out but long after the 500 ms
[04/Jan/2023:13:19:49][1.7fb8cdc7a700][-conn:east_intra:default:8:108439-] Notice: pkg-notices: in_box_notices; run_num: D9370223; start run time: 1648 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:20:02][1.7fb8cdc7a700][-conn:east_intra:default:8:108439-] Error: The request to Jira expired after the specified length: 500ms, for user: user4;
:     status:
:     result: http request timeout
:     error_info: http request timeout
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:13:20:02][1.7fb8cdc7a700][-conn:east_intra:default:8:108439-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira expired}
[04/Jan/2023:13:20:02][1.7fb8cdc7a700][-conn:east_intra:default:8:108439-] Notice: pkg-notices: in_box_notices; run_num: D9370223; end   run time: 13709 ms; call: wt::jira::notice::get_notice; dur: 12061 ms

###### -- SSL connect failed and does not timeout for 30 seconds!
[04/Jan/2023:13:19:52][1.7fb8a7fff700][-conn:east_intra:default:11:108445-] Notice: pkg-notices: in_box_notices; run_num: B881E1A9; start run time: 1675 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:20:23][1.7fb8a7fff700][-conn:east_intra:default:11:108445-] Error: The request to Jira appeared to succeed but, the result variable did not contain an http status.  Additional info follows:
:     status:
:     result: ssl connect failed: error:00000000:lib(0):func(0):reason(0)
:     error_info: ssl connect failed: error:00000000:lib(0):func(0):reason(0)
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:13:20:23][1.7fb8a7fff700][-conn:east_intra:default:11:108445-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira appeared to succeed but,}
[04/Jan/2023:13:20:23][1.7fb8a7fff700][-conn:east_intra:default:11:108445-] Notice: pkg-notices: in_box_notices; run_num: B881E1A9; end   run time: 32666 ms; call: wt::jira::notice::get_notice; dur: 30991 ms

######  -- Super bad run no ssl error, timeout but appears to be stuck for 4 minutes!
[04/Jan/2023:13:21:29][1.7fb8a7fff700][-conn:east_intra:default:11:108951-] Notice: pkg-notices: in_box_notices; run_num: E5E4A969; start run time: 1639 ms; call: wt::jira::notice::get_notice
...
[04/Jan/2023:13:25:30][1.7fb8a7fff700][-conn:east_intra:default:11:108951-] Error: The request to Jira expired after the specified length: 500ms, for user: user5;
:     status:
:     result: http request timeout
:     error_info: http request timeout
:        while executing
:    "ns_http run -method GET -headers $request_headers -expire $timeout $url"
:        ("::try" body line 4)
[04/Jan/2023:13:25:30][1.7fb8a7fff700][-conn:east_intra:default:11:108951-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira expired}
[04/Jan/2023:13:25:30][1.7fb8a7fff700][-conn:east_intra:default:11:108951-] Notice: pkg-notices: in_box_notices; run_num: E5E4A969; end   run time: 242778 ms; call: wt::jira::notice::get_notice; dur: 241139 ms

This last excerpt is from a recent run and shows what happens if the url being requested does not exist. This is another data point to show the behavior of ns_http under different conditions in an attempt to reproduce the problems we are seeing. This, as can be seen, returns with the error even before the timeout is reached, which is what I would expect it to do.

###### -- Different time completely but checking if the url cannot be resolved: using a non-existing url
[19/Jan/2023:04:51:40][1.7f1524b09700][-conn:tkirkham_intrapg14a:default:4:4660-] Notice: pkg-notices: in_box_notices; run_num: BA7BA78D; start run time: 1757 ms; call: wt::jira::notice::get_notice
[19/Jan/2023:04:51:40][1.7f1524b09700][-conn:tkirkham_intrapg14a:default:4:4660-] Warning: SockConnect could not resolve host jirabad.com
[19/Jan/2023:04:51:40][1.7f1524b09700][-conn:tkirkham_intrapg14a:default:4:4660-] Error: The request to Jira appeared to succeed but, had additional info. (run_num: BA7BA78D) The result variable did not contain an http s
tatus.  The request took: 3 ms, for user: user1;. Additional info follows:
:     status:
:     result: can't connect to jirabad.com port 443: reason unknown
:     error_info: can't connect to jirabad.com port 443: reason unknown
:        while executing
:    "ns_http run -method GET -headers $request_headers -timeout $timeout -expire $expire_after $url"
:        ("::try" body line 5)
[19/Jan/2023:04:51:40][1.7f1524b09700][-conn:tkirkham_intrapg14a:default:4:4660-] Notice: The request to Jira for approval notices did not contain a group element. error message: {The request to Jira appeared to succeed but, had additional info.}
[19/Jan/2023:04:51:40][1.7f1524b09700][-conn:tkirkham_intrapg14a:default:4:4660-] Notice: pkg-notices: in_box_notices; run_num: BA7BA78D; end   run time: 1765 ms; call: wt::jira::notice::get_notice; dur: 8 ms

As it can be seen from these log entries, there are times when the ns_http call does not appear to honor the specified -timeout or -expire. It is getting stuck somewhere. We are setting up the http logging capability to see if we can see any additional information but, we will be waiting for another event to occur to be able to get any additional data. Does this additional information give any new insight into what may be causing this and how to go about finding a solution?

Thank you very much for any help you can give.

-Tony

Hi Tony,

we will try to help. What I read from your posting:
- you have no reliable way to reproduce the problem (being able to reproduce the issue at our side would help significantly)
- Only some ns_http requests have "-timeout" specified
- are all requests over HTTPS?
- is there anything special about the jira service (running on a signe/multiple servers, restarts/reboots at the time when problems happen, ...)?

removed the http url from the logs below. which messed up some of the times below
I don't understand what "messed up" means here. If you do not want to reveal the URL on a public site, just replace it by "127.0.0.1". If there are very long URLs, this is also important for debugging. In the log above, there are just 408 status codes. Is this really all? I need as well the success cases to get a full picture, to see parallel requests, typical answer times, etc. Can you mail the log to me?

Do you have a monitoring of the VM running? It is interesting to see the VM-size, the memory statistics etc. There are cases, where the full VM hangs, where you cannot do anything from a single process point of view, when the full process does not get CPU at all. The example in https://openacs.org/forums/message-view?message_id=5539060 shows a case, where the full process is stopped for 6 seconds simply because of a fork() happening in a Tcl pipe-open command just depending on the memory size of the process. This blocked can be much larger depending on the circumstances, there can be many other factors influencing such cases (e.g., file-system, etc.). When the whole process is stopped by the kernel for, e.g., 10 seconds, setting a timeout of 1 second does have the desired effect. To identify such cases it is not sufficient to look at a single thread, but necessary to know what's happening in the other threads (if there is something happening) and maybe also, what's happening on the machine at this time.

What are your version numbers of
- OpenSSL
- NaviSever

all the best
-gn