Forum OpenACS Q&A: limit size with util::http::get

Posted by Jeff Rogers on
Is there a way to limit the retrieved content size when using util::http::get?

My setup is that users can specify a link to an external home page. When they provide one, I use util::http::get to pull the page and parse it with 'dom parse' so that some HTML or Open Graph elements (title, description) can be pulled to decorate the link where it is displayed. This was working fine until one user specified a link to a Dropbox folder. When you visit the link in a browser you get a file folder view, but when you request it with util::http::get or curl (presumably absent some magic user-agent header) it turns into a zipped download of the folder, which was several hundred MB. Loading that into an ordinary variable ran the server out of memory, and the system killed it.

How can I prevent this from happening? The only apparent solution is to use the -spool option so that the call writes to a spool file instead of returning the page content directly, although even that could be problematic without some guardrails (I'd need to ensure the temporary directory has sufficient space for whatever gets linked to). Making a HEAD request and declining to GET the URL if the content is not text/html or if the size is too big would help with well-behaved servers, but is no guarantee. Specifying 'Accept: text/html' doesn't help either. Is there some other way to direct util::http::get to stop downloading after some maximum reasonable file size?

Posted by Gustaf Neumann on
Hi Jeff, nice to hear from you.

No, currently there is no such feature, but it certainly would make sense.
Concerning spooling, NaviServer has a limit (maxupload) that decides whether the content is kept in memory or written to a spool file. In the latter case, only the disk space is the limit. ns_http (which is used by default by util::http::get) also has a limit for spooling (spoolsize), takes the tmp directory from the configuration file, and tries to provide sensible defaults. However, when someone transfers a huge file (say in the TB range), you do not want to fill up all the disk space. Therefore, a size limit certainly makes sense.

Limiting the size would be easy in some cases: when the Content-Length is available, the transmission can be stopped without transferring all the data.

However, to address this perfectly is not so easy. There are at least the following complications:

  1. Many responses come without a Content-Length (streaming HTML, chunked transfer encoding).
  2. Content may be compressed (the transfer size will not be the same as the size of the content on disk).

In these cases, it is necessary to transfer the content and to make the decision based on the transferred and/or decoded content.
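
To illustrate the second complication, here is a minimal Python sketch (not NaviServer code) showing how far the size on the wire and the decoded size can diverge. The payload and the 100 KiB limit are made-up values for illustration only.

```python
import gzip

# Hypothetical payload: very repetitive HTML, which compresses extremely well.
body = b"<p>hello world</p>\n" * 50_000

# What would travel over the wire with Content-Encoding: gzip.
compressed = gzip.compress(body)

transfer_size = len(compressed)                  # bytes on the wire
decoded_size = len(gzip.decompress(compressed))  # bytes after decoding

limit = 100 * 1024  # hypothetical limit of 100 KiB

# A check against the transferred size and a check against the decoded
# size can come to different conclusions for the same response.
wire_ok = transfer_size <= limit
decoded_ok = decoded_size <= limit

print("transferred:", transfer_size, "within limit:", wire_ok)
print("decoded:", decoded_size, "within limit:", decoded_ok)
```

A limit that only looks at the transferred bytes would accept this response, although the decoded content is well above the limit.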

As far as I have checked, both NGINX and Apache offer some options. They rely on Content-Length, buffer limits, and timeouts for controlling data size, but they struggle with compressed data, chunked transfers, and streaming because they can't easily determine the uncompressed or total size of the data at the start of the transfer.

What is your major concern? As said, spooling to a file works with the existing NaviServer; the measuring approach requires some extension.

Posted by Gustaf Neumann on
Hi Jeff,

As a first shot, I've added the flag "-maxresponse" to ns_http to provide a response size limit. Currently, it just compares the provided value with the received value of "Content-Length" and raises an exception when the size is exceeded. Handling of cases where no Content-Length is provided will follow.

% ns_http run -maxresponse 10KB https://orf.at
response limit exceeded

I am not fully happy with the name of the option. First I had "-responselimit", but I changed this to max*, since we have the max* convention in NaviServer in many other places as well (e.g. maxinput, maxupload, ...). Naming these "uploadlimit" or "spoollimit" in 5.0 would be more self-explanatory, and we could deprecate the old names, but I am not sure about the full consequences.

Posted by Gustaf Neumann on
I have now extended the semantics in NaviServer to also handle the cases where no Content-Length is provided:
  • If the response includes a Content-Length header, the value is used for comparison, and the request is stopped after processing the header.
  • If no Content-Length header is present, the request is canceled once the number of received bytes exceeds the specified value.
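
The two cases above can be sketched roughly as follows. This is a Python illustration of the logic only, not the NaviServer implementation. The names read_limited and ResponseLimitExceeded, and the lower-case header dict, are assumptions made for the example.

```python
class ResponseLimitExceeded(Exception):
    """Raised when a response is larger than the allowed maximum."""


def read_limited(headers, chunks, max_response):
    """Collect the body from `chunks`, giving up once `max_response` is exceeded.

    `headers` is assumed to be a dict with lower-case header names, and
    `chunks` any iterable that yields the body as byte strings.
    """
    # Case 1: a Content-Length is declared, so the decision can be made
    # right after the header, before any body data is read.
    declared = headers.get("content-length")
    if declared is not None and int(declared) > max_response:
        raise ResponseLimitExceeded("declared size %s exceeds limit" % declared)

    # Case 2: no Content-Length (streaming, chunked encoding), so count
    # the received bytes and cancel as soon as the limit is exceeded.
    # (This sketch counts in both cases, which is a choice of the example.)
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > max_response:
            raise ResponseLimitExceeded("received more than %d bytes" % max_response)
    return bytes(received)


# Usage: a small response without Content-Length stays below the limit.
print(read_limited({}, [b"<html>", b"</html>"], 1024))
```

In the first case nothing of the body has to be transferred. In the second case at most one chunk beyond the limit is read before the request is given up.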
Posted by Michael Aram on
Sounds great!

I could imagine cases, though, where one wants to receive only the "head" of a file even if there is a Content-Length header. So maybe the decision regarding the variant could be left to the user?

Just an idea. All the best!

Posted by Gustaf Neumann on
Typically, you have no idea upfront whether the reply from the server will be a streaming response, use chunked encoding, or come from an HTTP/1.0 style server without a Content-Length.

One can always send a HEAD request via ns_http and make further decisions based on the result. In the (hopefully) rare case where the server does not support the HEAD request, there is also the possibility to set the "response_header_callback" and terminate the request via an exception from the callback proc.
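
The kind of decision one might make from the result of a HEAD request could look like the following Python sketch. The function name should_fetch, the lower-case header dict, and the default list of accepted types are assumptions for illustration, not part of ns_http.

```python
def should_fetch(head_headers, max_bytes, allowed_types=("text/html",)):
    """Decide from HEAD response headers whether a GET looks worthwhile.

    `head_headers` is assumed to be a dict with lower-case header names.
    """
    # Strip parameters such as "; charset=utf-8" before comparing.
    content_type = head_headers.get("content-type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in allowed_types:
        return False

    declared = head_headers.get("content-length")
    if declared is None:
        # Size unknown: the GET still needs a limit of its own.
        return True
    try:
        return int(declared) <= max_bytes
    except ValueError:
        # A malformed Content-Length is treated as a reason to decline.
        return False


# Usage with made-up header values:
print(should_fetch({"content-type": "text/html; charset=utf-8",
                    "content-length": "5120"}, 1_000_000))
print(should_fetch({"content-type": "application/zip",
                    "content-length": "300000000"}, 1_000_000))
```

Since a server may answer HEAD and GET differently, such a check only filters out the well-behaved cases; a limit during the GET is still needed.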

Does this address your concern?