Forum OpenACS Q&A: Re: NaviServer stops serving connections when slow external api calls pile up.

Hi Marty,
here is a summary of the client-log you sent to me:

What I read from this is:
- a substantial number of requests were sent to the servers,
- a high number of the requests are perfectly fine, but during a certain time window there were problems.

From the more detailed file analysis, I read:
- around 13:18:50, the response time of Jira increased to 2s, ... 8s.
- starting at 13:31:39, there are many 503 results from Jira, taking around 8 seconds.
- then (e.g. at 13:31:42, or 13:31:46), there are often 50 or more requests per second, some of which were very fast (<20ms) requests answered by Jira with 503.
- between 13:19:40 and 13:44:37, there were no HTTP client requests answered by jira with 200.

- also, when looking only at the successful requests, the runtime of these requests went up to 20s (13:19:14: 19.6s, 13:19:40: 13s).

- The requests ending in 408 are distributed over the full day, but normally, these take ~500ms (the specified -expire time).
- Between 13:18:59 and 13:31:26, the reported timeout times are often substantially larger (5s, 18s, up to 391s), although "-expire 500ms" was used.

So, one should probably concentrate first on figuring out why the "-expire 500ms" flag was not honored in the successful requests, since this should be easier than trying to understand what happens with the timeout cases.
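
A minimal sketch of how such overruns could be filtered out of the client log, assuming each entry has already been parsed into a timestamp, an HTTP status, and an elapsed time in seconds. The record layout, the names `ClientRequest` and `overruns`, and the tolerance factor are assumptions for illustration and are not part of NaviServer. The first two sample rows are the successful requests mentioned above; the third is a made-up entry added only for contrast.

```python
from dataclasses import dataclass


@dataclass
class ClientRequest:
    time: str         # wall-clock time of the log entry, "HH:MM:SS"
    status: int       # HTTP status reported for the request
    elapsed_s: float  # measured duration of the request in seconds


def overruns(requests, expire_s=0.5, tolerance=2.0):
    """Return the requests that ran longer than tolerance * expire_s.

    expire_s mirrors the "-expire 500ms" setting; the tolerance factor
    is an arbitrary (hypothetical) margin for scheduling jitter.
    """
    limit = expire_s * tolerance
    return [r for r in requests if r.elapsed_s > limit]


sample = [
    ClientRequest("13:19:14", 200, 19.6),  # from the analysis above
    ClientRequest("13:19:40", 200, 13.0),  # from the analysis above
    ClientRequest("00:00:00", 408, 0.5),   # hypothetical "normal" timeout
]

for r in overruns(sample):
    print(r.time, r.status, f"{r.elapsed_s:.1f}s")
```

Grouping the result by status (200 vs. 408) would show whether the budget is exceeded only in the timeout path or also in requests that eventually succeed.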

Do you have any information about the state of the Jira system in this time window (hanging, overloaded, crashed)?
Does NaviServer talk directly to Jira, or is there a server in between (e.g. a reverse proxy or a load balancer)?
What was the time window in which NaviServer stopped serving requests? Did it recover by itself, or did you restart it?

Hi Gustaf,

Yes, everything you have said is what we are experiencing and it looks like the data is showing that too.

Yes, focusing on -expire 500ms would be a good place to start - to see why it is not honored in these cases.

As for the last three questions you asked:

Do you have any information about the state of the Jira system in this time window (hanging, overloaded, crashed)?

Answer: The Jira system was hanging and/or overloaded during the time frames above.

Does NaviServer talk directly to Jira, or is there a server in between (e.g. a reverse proxy or a load balancer)?

Answer: We are talking directly to JIRA -- there is no proxy and no load balancer involved.

What was the time window in which NaviServer stopped serving requests? Did it recover by itself, or did you restart it?

Answer: We never restarted NaviServer on Jan 4th; it recovered by itself after Jira was restarted.

Answer: NaviServer was unresponsive from 13:24 to 13:31. Then we started getting connections again at 13:32.