Forum OpenACS Q&A: MS Proxy Server and Caching Problem with OpenACS 4.2 Web Site

Hello All,

I guess most of you guys would already know me by now thru the
questions that I've been asking over the past few weeks.
I must reiterate how your answers have been most valuable and I am
very thankful to everyone who took the time to share their time and
know-how.

Of all of those, however, I think this may be the one that takes
the cake.

The web application I was developing went into testing just this
weekend over several locations nationwide.

Each location was accessing the web application, mimicking the
expected load for the web site.

A very disturbing anomaly occurred to most of the users in each of
the locations. Each of the pages shows the currently logged-in user's
user name, so it is clearly visible who is logged in.
Most of the users at one time or another during training found that
this username would all of a sudden be someone else's.

What's even more disturbing is that the username that appears in
place of the user's username is a user which is situated in a
different location.

For example, we have area1 with user-a1. In area2, we have user-b1.

user-b1 while logged in would all of a sudden see the username of
user-a1 who is situated in area1.

However, when user-b1 refreshes his browser window the username will
become his own user name again.  This would happen intermittently
and in most cases a refresh does the trick.

We suspect a cache problem. Our suspicions were confirmed
when we were informed that all of the locations were using one
common proxy server (MS Proxy 2.0).

One of the alternatives that we have in mind right now is to pass
Pragma: no-cache and Expires: 0 in the response headers, but there is no
guarantee that the page will not be cached at all. Furthermore, this
may actually cause the web application to load much slower.

There are about 3 different locations. Each location has around
30-35 users accessing the site. They are all using IE6.0 because we
need to use a special ActiveX control for one module of the web app.

I guess what I'd like to find out is:

Are there any ideas or suggestions to overcome the caching problem?
Is there a setting in a proxy server that we can set to prevent this
from happening?
If someone can enlighten me on how MS Proxy in particular, or a proxy
server in general, works, that would be super (though I have more or
less a hunch).
If someone's about to suggest "throw out the darn proxy server and
get SQUID", well, they are doing that, but it won't be up and running
till the 2nd quarter of next year, so we need it to work with MS Proxy
in the short term.

Thank You

Hi all,

I'd like to follow up Ham's question.  Now, we're all trying to build beautiful, dynamic sites to be enjoyed by countless admirers around the world.

What if 2 or more such admirers are using the same HTTP proxy and trying to access www.ursite.com/myaccountinfo?

Assuming www.ursite.com/myaccountinfo is a typical dynamic ACS page and assuming that no special expires or pragma headers were set, could your typical proxy server (say Squid) cache said page and deliver the same version to all users?

I notice that Hotmail and even Yahoo embed a random string within their URLs, which I think is the session ID.  I thought this was for browsers that don't support cookies.  But is cache prevention another reason for this?  Could the ACS RP be hacked to do something similar?  Has anyone done this?  Perhaps this could be a configurable parameter?

We've built several ACS sites and have never worried about this issue.  However, with this project, it's definitely a problem.  Has anyone run into anything similar?  Is this of general concern?

Thanks!

Rob

If you're branding every page served with the name of the logged in user, you must prevent any shared cache from holding those pages. There's nothing wrong however with the pages being kept in a private cache such as the one your browser uses. You can achieve this by sticking
ns_set put [ns_conn outputheaders] "Cache-Control" "private"
ns_set put [ns_conn outputheaders] "Pragma" "no-cache"
in each page header, which at least tells sufficiently smart browsers that they can cache the page, but tells shared caches not to.

"but there is no guarantee that the page will not be cached at all"
If a shared cache sees both Pragma: no-cache and Cache-Control: private and still caches the page, then it is broken and nothing you can do on the web server will fix it. Tell whoever is responsible to not use a broken cache.

"furthermore this may actually cause the web application to load much slower"
That's the price you pay for sticking the user's name on each page. Either you accept that personalising every page breaks cacheability, or you let caching happen and break the user experience.

We do see this with some of N2H2's Bess proxies and some MS Proxies, I think. This is ACS 3.4.x; we get a lot of complaints from users claiming they see other users' names on our main page.

My take on that was it was a broken proxy, because doesn't the presence of a unique cookie change the headers and cause the proxy not to cache?

See http://www.arsdigita.com/bboard/q-and-a-fetch-msg?msg%5fid=000ZKi&topic%5fid=21&topic=web%2fdb on the aD bboard where we discussed this, and aD sort of revealed that they didn't set any headers.

Just for giggles, here are the standard headers thrown by one of our IIS installations. This is for a static page:

Date: Wed, 05 Dec 2001 17:16:27 GMT
Connection: Keep-Alive
Content-Length: 4663
Content-Type: text/html
Set-Cookie: ASPSESSIONIDGGQQGQWA=DKNPBLCDPDGMPCPDHHIAAEMA; path=/
Cache-control: private

And here's a dynamic page:

Date: Wed, 05 Dec 2001 17:16:36 GMT
Connection: Keep-Alive
Content-Length: 4208
Content-Type: text/html
Expires: Thu, 01 Jan 1998 07:00:00 GMT
Set-Cookie: Test=783687100; path=/aas
Set-Cookie: ASPSESSIONIDGGGGQRGF=ICOBGLOCBANJHKGLOOKAPFHE; path=/
Cache-control: private

I think we definitely have to do something about this. I'm just not sure what. Since every page is "dynamic" with the templating system, do we need to tag each page somehow? That seems like a pain. Or maybe the default should be no caching, and we tag the pages we know can be cached...

Hamilton pointed me to a great web page that explains caching in some detail:
    http://www.mnot.net/cache_docs/

As well as a cool tool which tells you the "cacheability" of a web resource:
    http://www.web-caching.com/cacheability.html

It seems that for dynamic pages, ACS 3/4 does NOT set any freshness information/validator headers such as Expires, Cache-Control, or Last-Modified.  According to the documentation above, such pages should NOT be cached by most proxies:

...if no validator is present, most caches will mark the object as uncacheable...

However, it seems that some proxies insist on caching such content anyway, which explains the strange behavior we are seeing.

With regard to static content, such as GIFs and vanilla HTML files, AOLServer sets the Last-Modified header.  Most proxies will then validate such content by checking with the origin server to see if the content has changed, fetching the latest copy if so.  This is a good thing, though some web servers use the more advanced ETag header, which assigns the content a unique ID that changes when the content changes.  AOLServer does not seem to support ETag currently.

Though there are more advanced pointers in the above document, here's my understanding of basic caching as it relates to simple ACS application development:

1) For the most part, the current default behavior of AOLServer/ACS is reasonable for most caches.  However, some aggressive proxies seem to cache even dynamic pages despite having no freshness information/validators.  This could be a very bad thing depending on your application.

2) If you wish to make (reasonably) sure that your page is not cached, then, like IIS, you should probably set:
    Expires: Thu, 01 Jan 1998 07:00:00 GMT (or some date in the past)
    Cache-Control: private

You could probably throw in a "Pragma: no-cache" or "Cache-Control: max-age=0" header in addition to, or instead of, the "Expires" header. However, the above document seems to suggest that "Pragma: no-cache" is NOT honored by many proxies.

BTW, this is basically what Russell said several posts ago (thanks Russell!)

3) If you wish to make sure that your dynamic content is cacheable by proxies (because it, say, doesn't change too often), then either (1) dump the content to a static page whenever it changes and link to the static page or (2) set an age-related header such as "Expires" or "Cache-Control: max-age=xxx".

4) Bottom line: proxy caching is not a huge issue for dynamic pages, but only when such pages are not personalized per user and not time sensitive.  If a dynamic page shows basically the same information for all users and is not time sensitive (such as, say, a news page), then the current behavior is probably not a big deal since only some users behind aggressive caches may get stale information from time to time.  However, if the page shows the user's name, is personalized in some other way, or is time-sensitive, getting a cached copy from a proxy is very undesirable behavior, and should probably be avoided by the means described above.
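
To make option (2) of point 3 concrete, here is a rough sketch of computing an age-related header pair. It is written in Python purely for illustration; in an ACS page the resulting values would be handed to ns_set put as shown earlier in the thread. The function name expiry_headers and the ten-minute lifetime are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expiry_headers(max_age_seconds, now=None):
    """Build headers that let caches reuse a page for max_age_seconds.

    Hypothetical helper, not part of ACS or AOLserver.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    expires = now + timedelta(seconds=max_age_seconds)
    return {
        # HTTP dates are always expressed in GMT.
        "Date": format_datetime(now, usegmt=True),
        "Expires": format_datetime(expires, usegmt=True),
        "Cache-Control": f"max-age={max_age_seconds}",
    }


if __name__ == "__main__":
    # A page we are happy to serve up to ten minutes stale.
    for name, value in expiry_headers(600).items():
        print(f"{name}: {value}")
```

Sending both Expires and max-age is redundant for HTTP/1.1 caches, but older HTTP/1.0 proxies only understand Expires, so the sketch emits both.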

Given the above, I agree with C.R.  I think that ACS/AOLServer should be changed such that, by default, the "Expires" and "Cache-Control" headers are set to something similar to the ones shown above in point (2).

This may seem like a big, scary change, but I don't think it is.  Since the majority of proxies don't currently cache dynamic ACS pages anyway, they won't be affected.  Basically, only a small percentage of aggressive proxies will be affected by preventing them from caching dynamic ACS content, making them behave like "normal" proxies.  Dynamic pages that can be cached by proxies or that have special caching needs can be handled as described in point (3).

What do you all think?

Thanks...

Ooops, just wanted to clarify.  I meant:

Given the above, I agree with C.R.  I think that ACS/AOLServer should be changed such that, by default, *for dynamic pages*, the "Expires" and "Cache-Control" headers are set to something similar to the ones shown above in point (2).

Thanks...

Taking a look through the Squid documentation, their behaviour (which is claimed to be RFC compliant) is
  • Do not cache anything with Cache-Control: {Private|No-Cache|No-Store}
  • Consider as cacheable if ANY of Date, Last-Modified, and Expires are present
  • Ignore Cookie/Set-Cookie in requests/responses for purposes of cacheability (but filter Set-Cookie out of CACHE_HIT responses)
So (according to the Squid folks), proxies are perfectly justified in caching default ACS pages, because while there are no Expires or Last-Modified headers returned, the presence of Date implies that the server knows how to keep track of time, and it isn't saying that the object is potentially stale, so it must be OK. Likewise the presence of cookies is irrelevant to cacheability.
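
The three rules above can be sketched as a small predicate. This is only an illustration of the rules as summarised in this post, written in Python; it is not Squid's actual code, and the name shared_cache_may_store is invented for the example. It also says nothing about freshness: a response can be storable and still need revalidation.

```python
def shared_cache_may_store(headers):
    """Apply the three rules summarised above to a response's headers.

    headers: dict mapping header name to value (hypothetical input shape).
    """
    normalized = {name.lower(): value for name, value in headers.items()}

    # Rule 1: private / no-cache / no-store forbid storing in a shared cache.
    directives = {
        part.strip().lower().split("=")[0]
        for part in normalized.get("cache-control", "").split(",")
        if part.strip()
    }
    if directives & {"private", "no-cache", "no-store"}:
        return False

    # Rule 3: Cookie and Set-Cookie are deliberately ignored here.

    # Rule 2: any one of these headers is enough to be considered cacheable.
    return any(name in normalized for name in ("date", "last-modified", "expires"))


if __name__ == "__main__":
    default_acs_page = {"Date": "Wed, 05 Dec 2001 17:16:36 GMT",
                        "Set-Cookie": "session=abc; path=/"}
    print(shared_cache_may_store(default_acs_page))
    print(shared_cache_may_store({**default_acs_page, "Cache-Control": "private"}))
```

Under these rules a default ACS page, which carries a Date header and a cookie but no Cache-Control, counts as storable, which matches the conclusion drawn above.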

I'd lean towards sticking non-cacheability headers in the master template, with an optional property pages could set saying "I'm cacheable until..."

On another note, it's interesting to see IIS set Cache-control: private on static pages... you might want to take a look at that...

Thanks, Russell, for that information.

On another note, it's interesting to see IIS set Cache-control: private on static pages... you might want to take a look at that...

Are you suggesting something is wrong with our IIS configuration? Highly possible--I'm no IIS guru.

Cache-Control: private indicates that the resource is cacheable but potentially different for every recipient, so shared caches (squid, MS Proxy, etc) should not cache it while private caches (your personal browser cache) can consider it fully cacheable. While this is how we want ACS pages to be treated it's probably not what you want for static HTML. No biggie, really, but it means some content you're serving isn't being treated as cacheable when it potentially should be.

"So (according to the Squid folks), proxies are perfectly justified in caching default ACS pages"

Interesting.  I'm not in the office, but in my initial experiments, a default Squid installation doesn't seem to cache dynamic ACS pages.  I need to study this further.

Perhaps what I missed was that the dynamic pages were getting cached, but Squid first attempts to validate the pages by sending an "If-Modified-Since" using the date that it previously got with the "Date:" header.  And ACS responds every time with a 200 instead of a 304 (Not Modified), which effectively made it seem like the pages were NOT getting cached.

In any case, I was able to verify, using Squid, that every request for a dynamic page resulted in a request directly to the ACS server - "effectively" no caching.

However, our initial findings with a default MS Proxy installation indicate that MS Proxy sometimes serves dynamic ACS pages straight out of its cache without any form of "If-Modified-Since" validation.  We were very easily able to replicate the bug where a user sees another user's home page - very bad behavior indeed!

Bottom line: I think the suggestion of having ACS/AOLServer explicitly set some combination of Expires/Cache-Control/Pragma as the default behavior for dynamic pages still stands.  We will include this in our master template and let you know the results.

Thanks.

Two things:

  1. Robert--did you have luck with setting cache control headers in the master template?
  2. I've been thinking through this today and trying to set up our master template to take a "cache-control" property, so I read through the excellent cache tutorial at http://www.web-caching.com. The property would have three possible values:
    1. "none": No caching allowed, basically setting Pragma: no-cache, an Expires in the past, and Cache-Control: no-cache
    2. "local": Allow local browser to cache. Cache-Control: private, Pragma: no-cache
    3. "any": Caches can treat this page however they want. Sends a Last-Modified header (see below).
    But where I'm stuck is how to figure out what date to send with a Last-Modified header. When using the templating system, ideally you would stat every file that the system uses to create a page, save the latest mtime, and send that with the Last-Modified header. I don't know how to do that--seems like it would take some changes in the rp. Am I right? Suggestions?
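
The three property values map onto headers roughly as follows. This is a sketch in Python for illustration only; the function name headers_for_cache_property is made up, and in the master template each returned pair would be passed to ns_set put on the output headers. How to obtain the Last-Modified date for "any" is exactly the open question above, so the sketch simply takes it as an argument.

```python
# Any date in the past will do; this is the one quoted earlier in the thread.
PAST_EXPIRES = "Thu, 01 Jan 1998 07:00:00 GMT"


def headers_for_cache_property(value, last_modified=None):
    """Return (name, value) header pairs for a "cache-control" page property.

    Hypothetical helper; last_modified is an already formatted HTTP date.
    """
    if value == "none":
        # No cache, shared or private, should reuse the page.
        return [("Pragma", "no-cache"),
                ("Expires", PAST_EXPIRES),
                ("Cache-Control", "no-cache")]
    if value == "local":
        # The browser's own cache may keep the page; shared caches may not.
        return [("Cache-Control", "private"),
                ("Pragma", "no-cache")]
    if value == "any":
        # Leave caches to their own devices, giving them a validator if known.
        return [("Last-Modified", last_modified)] if last_modified else []
    raise ValueError(f"unknown cache-control property: {value!r}")


if __name__ == "__main__":
    for name, val in headers_for_cache_property("local"):
        print(f"{name}: {val}")
```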

That's not enough - you need to know when the tables used to build the dynamic page were last updated, not just when the script and template files were last modified.

This is a problem I've been thinking about recently.  There's no easy way to cache some types of query information in [Open]ACS 4 - multirow, in particular.  This was easy in [Open]ACS 3 and earlier.

Rather than beating my head against that problem I've been thinking in terms of doing caching of the dynamic page that's generated itself, as you have, apparently.

I've got some ideas but won't have time to explore further until we start our next release cycle.

For a site that's under development, worrying about the mtime of template/tcl/proc files used in building a particular page makes sense, but once you've deployed those things should only change rarely as updates are pulled up from staging.

The real concern for a live site is, as Don said, the last change of data in the pages, and acs_objects.last_modified is the obvious place to find that info. It's more DB intensive, of course, but joining against acs_objects and selecting max(last_modified) for whatever rows contribute data to the current page will give you what you want as long as you're doing the right thing and keeping the object metadata in acs_objects up to date.

Hi there!

Yes, setting the cache control headers in the master template as follows solved all our problems.  I would highly recommend this as the default for almost all dynamically generated pages, since, as explained above, it forces "bad" proxies to behave properly:

ns_set put [ns_conn outputheaders] "Cache-Control" "private"

ns_set put [ns_conn outputheaders] "Expires" "Thu, 01 Jan 1998 07:00:00 GMT" (or some date in the past)

(You could also send a "Pragma: no-cache", though we did not find it to be necessary or sufficient.)

However, as Don and Russell mentioned, there is no easy way of determining whether a dynamic page has changed.  If performance is a serious issue, there are some techniques described in the article above.

For example, if you know the page won't change very often, and you roughly know the frequency of change, and if sending slightly stale data to your users is not a huge concern, then you could set an "Expires" header for some reasonable date/time in the future.

Or, you could set a "Last-Modified" header and, as appropriate, respond to "If-Modified-Since" requests from the client or proxy with a 304 (Not Modified) or 200 (send the new version of the page).  I imagine that would be pretty difficult, though Russell had some interesting ideas for implementing it.
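
For reference, the 304-versus-200 decision itself is small; the hard part discussed in this thread is knowing the page's real last-modified time. A minimal sketch in Python, assuming that time is already known (for example from max(last_modified) in acs_objects, as suggested earlier). The function name conditional_response is invented for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at last_modified.

    last_modified: timezone-aware UTC datetime (assumed to be known).
    if_modified_since: raw request header value, or None if absent.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable date: fall through to a full response.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # Cached copy is still good; send no body.

    return 200, headers  # Send the freshly built page.


if __name__ == "__main__":
    changed = datetime(2001, 12, 5, 17, 16, 27, tzinfo=timezone.utc)
    print(conditional_response(changed, "Wed, 05 Dec 2001 17:16:27 GMT")[0])
    print(conditional_response(changed, "Wed, 05 Dec 2001 17:00:00 GMT")[0])
```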

Hope this helps...

On second thought (i.e. having turned my brain on and thought
about the problem at hand for a minute or two) my response
above is totally missing the point. Caching has three aims:
reduce end-user latency, reduce server load and reduce
bandwidth consumption. For sites which are high on text content
and low on "high byte count" content (as I assume most ACS
sites are) the third point is of minimal importance relative to the
first two, leaving latency and load as the relevant issues.

Running the queries necessary to determine if a 304 Not
Modified response is appropriate is no easier or faster for the
server than running those queries and returning the page itself
(assuming time spent in the database is much greater than time
spent parsing the template), so you gain nothing other than
increased complexity by attempting to do this. It's much easier to
make a decision about how stale you're willing to allow pages
returned to the user to be and set an appropriate Expires
header. The "304 Not Modified" approach could be made to work
by caching DB queries, but you'll still be returning data as stale
as the age of your query cache, which is just the same result as
setting an Expires header in the future but with more server work
(and minimally more bandwidth consumed) than letting the
end-user's web cache take care of it all.

On a related note, squid in the recommended configuration will
not cache any requests containing a "?" character (i.e. GET
requests with a query string), which counts out pretty much every interesting page
in an ACS based site. Assuming most other caches operate
similarly I'd expect this makes most pages we serve
uncacheable for most users no matter what we do, apart from
those behind overly aggressive caches such as those
discussed above.

Actually, the two caching paradigms I've been thinking of looking into might be called "cache-interval" and "cache-key".

The former is similar to what Robert's talked about above.

The latter might be used for sets of pages that are closely coupled, for instance those that display forum and message information in the bulletin board.  These are all dependent on a single key, the forum_id, and there's no reason why message add, edit and delete pages couldn't essentially say "kill all cached pages dependent on this key".  This would require no db access.  It would take real care in the implementation of such closely coupled pages but the payoff could be very big.  Take caching a forum summary page until someone makes a new post, for instance: on a site like this, which is slowly growing a large number of posts yet only collects a few dozen posts a day, cached copies would have a significant lifetime.
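
A toy version of the "cache-key" idea, to show the shape of it. This is a Python sketch under my own assumptions, not a proposal for how OpenACS would implement it; the class and method names (KeyedPageCache, put, get, invalidate) are made up.

```python
from collections import defaultdict


class KeyedPageCache:
    """Cache rendered pages and drop every page that depends on a given key."""

    def __init__(self):
        self._pages = {}                       # url -> rendered page
        self._urls_by_key = defaultdict(set)   # key -> urls depending on it

    def get(self, url):
        """Return the cached page for url, or None on a miss."""
        return self._pages.get(url)

    def put(self, url, key, page):
        """Store page under url and record that it depends on key."""
        self._pages[url] = page
        self._urls_by_key[key].add(url)

    def invalidate(self, key):
        """Kill all cached pages dependent on key; return how many were dropped."""
        urls = self._urls_by_key.pop(key, set())
        for url in urls:
            self._pages.pop(url, None)
        return len(urls)


if __name__ == "__main__":
    cache = KeyedPageCache()
    cache.put("/forums/forum-view?forum_id=42", ("forum_id", 42), "<html>summary</html>")
    cache.put("/forums/message-view?message_id=7", ("forum_id", 42), "<html>thread</html>")
    # A message add/edit/delete page would call this after its DML:
    print(cache.invalidate(("forum_id", 42)))
```

Nothing here touches the database: the add, edit and delete pages only need to know the forum_id they just modified.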

Posting a late follow up here for reference:

OpenACS 5.0 can send HTTP headers to prevent caching, and does so by default. Check the acs-kernel parameter HttpCacheControlP.

This is wonderful news Til!  Thanks so much...