Forum OpenACS Development: Caching of database queries ...
I think we should add something like this to the toolkit, as the increase in performance can be major and doing caching in this way (by simply adding "-cache" to a particular db_* call) is simpler than using util_memoize. And util_memoize doesn't work well with db_multirow, either.
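To make the contrast concrete, here is a rough sketch (the "-cache" switch is the proposal, not an existing db API flag; the query names and SQL are made up):

    # Today one can wrap a value-returning call like db_list in util_memoize
    # (no bind variables here, since util_memoize evaluates the script in
    # its own scope):
    set titles [util_memoize {db_list all_titles "select title from cr_revisions"} 300]

    # But db_multirow writes its rows into the caller's scope instead of
    # returning a value, so it can't be wrapped that way. The proposal is a
    # switch on the call itself:
    db_multirow -cache news news_select "select revision_id, title from cr_revisions"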
Our current scheme at Greenpeace is oversimplified for generalized use - we just flush the cache every five minutes and take care to only cache content that we don't mind being slightly out-of-date. We're now serving 100,000 page views (many more hits) a day with negligible load on the servers, not bad for a site where every page is built from several queries on the content repository to build news and features side panels as well as the main content area.
I've been thinking a bit about a more generalized scheme, and think that a simple approach can get us a lot of performance benefit without making things overly complex.
For instance, in my existing implementation one can specify a database cache pool with the query. One possibility would be to define a cache pool specifically for permissions, which gets flushed whenever an object's permissions are modified or perms created for a new object. In most OpenACS 4 deployed sites both of these actions are relatively rare, meaning that for comparatively long stretches of time permission queries could come from cache. Permissions performance issues? What issues? Just cache!
Of course there are pitfalls in a simple approach of this sort. Killing the perms cache wouldn't kill content-building queries that interrogate permissions directly unless they used the same cache pool, sort of defeating the purpose of allowing for different pools. So some degree of sophistication is necessary.
Still ... given our experience with the Greenpeace site and the ease of implementing my overly-simple hack (I spent a couple of hours rewriting the db_* API code to cache, with the most complex piece being multirow, not surprisingly), I'm confident that a reasonably simple approach can succeed.
And of course we needn't force the use of caching. "disable/enable db_* query caching" and "cache pool timeout" could be parameters settable by the system admin. After all, plenty of folks are running today with no caching whatsoever.
I'm interested in comments from folks, that's the point of my post.
Fire away!
I just wanted to point out that the template db api already supports caching of query results, and it also supports individual timeouts for each query. I think it's worth keeping the individual timeouts if you add this feature to the standard db api. Probably you could set it up so that cached results could time out as part of a pool or individually.
The template db api uses nsv_sets for caching the results, and initially I planned on converting it to use the ns_cache module. I never went ahead with that since the general consensus was that we would replace calls to the template db api with calls to the standard db api. In any event, it probably would be a good idea to use the ns_cache module as the backend for your proposed changes.
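For reference, a minimal sketch of what the ns_cache module gives you (the cache name and size here are arbitrary):

    # At server startup: create a named cache
    ns_cache create db_cache -size 5000000

    # At query time: on a miss, run the script and store its result under
    # the key; on a hit, return the stored result without running the script
    set emails [ns_cache eval db_cache user_emails {
        db_list user_emails "select email from parties"
    }]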
Dan ... yes, I used ns_cache. Individual timeouts are definitely a good idea but have to be used carefully. For instance, if you have a set of queries that build pages with content plus panels that link to further content not already displayed, then you'd better be caching a consistent snapshot, otherwise the stars won't align properly. So I think there needs to be a mechanism beyond timeouts. For our system I implemented system-wide flushing to ensure that all the page pieces are displayed consistently ... not right for a generalized solution, but as a (literally) last-day pre-launch hack it gets the job done for us.
I guess whether in-database caching happens and how it performs depends on a myriad of circumstances - but could somebody give me a clue, maybe by answering some of the random questions below that come to mind:
What kind of queries are more likely to be cached by the database?
Probably database queries are more expensive in a setup like Greenpeace's, where the web server is on a different machine than the database. Or is the impact of the additional network traffic negligible?
How does the database know that a particular cached query result may have been invalidated by an update? Is there some reference to table/row ids in each cache item that is looked up on each DML statement, so that when they match the cache item gets removed?
How do oracle and postgresql compare in that regard? Does postgresql cache query results at all?
In an ideal situation, when AOLserver requests exactly the same query again and none of the involved tables have changed at all - how much faster, if at all, would it be to fetch the results from the AOLserver cache using the -cache switch, compared to getting them from the database?
Am I totally off and in-database result caching does not exist at all?
The net effect is similar to that described above for Story Server - public pages fly through the request processor with a simple check against a globally stored value. No db hit is involved.
As for the more general question raised by Stephen: there are complex approaches and simple approaches.
One relatively complex idea - use the dependencies declared by packages and already used by the APM to decide which cached content to zap.
One very simple idea - zap all cached query results whenever you do an insert or update or call PL/SQL (new etc) unless specifically told not to. Before rejecting this simplistic notion out of hand consider that even the short-term caching that would result from this brain-dead approach is better than no caching at all.
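A sketch of what that could look like (proc names are hypothetical, and I'm assuming an ns_cache pool like the db_cache above; an escape hatch for the "unless specifically told not to" case is omitted):

    # Flush every entry in the pool
    proc db_cache_flush_all {} {
        foreach key [ns_cache names db_cache] {
            ns_cache flush db_cache $key
        }
    }

    # Wrap db_dml so any insert or update zaps the whole pool
    proc db_dml_flushing {statement_name sql args} {
        eval db_dml $statement_name [list $sql] $args
        db_cache_flush_all
    }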
I'm sure everyone here can come up with multiple ideas which vary greatly in terms of the effort needed to implement them. I favor a relatively simplistic approach.
In GP Planet I'm just zapping the cache every 5 minutes. GP wanted minimal latency between the posting and appearance of content, and there was no time in our last-minute rush to production to implement anything sophisticated. The 5 minute zap means average latency between posting and appearance is 2.5 minutes.
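With the hypothetical db_cache_flush_all from the sketch above, the five-minute zap is one line at server startup:

    # Run the wholesale flush every 300 seconds in its own thread
    ns_schedule_proc -thread 300 db_cache_flush_all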
The boost in performance has been considerable. We couldn't measure it precisely, because after caching we were able to saturate GPI's incoming 2 Mbps internet connection while keeping server load at a relatively low level. In other words, our testing was limited by GPI's bandwidth at the office (the servers themselves are in Germany and we had no spare machine local to the datacenter to run tests from). Before caching we were serving up about 1 Mbps with heavy server load.
As far as RDBMS caching goes ... PG doesn't do it. There's been discussion in the PostgreSQL group, and the consensus, which I agree with, is that this is best done at the application level. One reason is that my caching versions of the db API procs cache more than the query result. For instance, in the case of db_multirow I'm caching the result array, which means that not only the cost of the query is avoided but also the cost of the (optional) Tcl block which is executed for each row returned by the query. For those of you thinking that any general implementation needs to allow the caller to override the caching of the results of the code block, you're right. But in the GP Planet context caching the full result is just what we want, and it saves processing cycles above and beyond those involved in executing the RDBMS query.
Likewise db_list caches the list that's built, avoiding the cost of lappend'ing the query result row by row.
So caching at this level gains more than would be gained by the RDBMS doing query result caching. This is the kind of reasoning that's led the PG group to reject RDBMS-level query caching (for now at least).
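As a sketch of the idea (the proc name is hypothetical, and note the key here ignores bind variable values - a problem discussed further below):

    # Cache the finished Tcl list, so a hit skips both the query and the
    # row-by-row lappend work
    proc db_list_cached {statement_name sql} {
        return [ns_cache eval db_cache $statement_name {
            db_list $statement_name $sql
        }]
    }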
"One very simple idea - zap all cached query results whenever you do an insert or update or call PL/SQL (new etc) unless specifically told not to. Before rejecting this simplistic notion out of hand consider that even the short-term caching that would result from this brain-dead approach is better than no caching at all."
Don: While I agree that this approach is better than no caching at all, I have to say that it would take just a little more effort to maintain the cache in a smarter fashion, and this would give us a great increase in performance in comparison to the simplistic approach of blowing away the cache completely every time. One concrete example of this is the site node cache. Until a few weeks ago the site node cache would be blown away completely on creation or update of a site node. On a site with a lot of site nodes (Sloan has around 8500) this was adding about 30 seconds of overhead to the creation of each site node. Creation of a community will usually cause the creation of about 8 to 10 site nodes, so you can see how this can become unreasonable. By being a little smarter about how the site node cache is maintained I was able to reduce the overhead to something reasonable (I believe the cache maintenance overhead is now on the order of XX milliseconds). Check out the site node code in the development tree.
I know that maintaining this cache isn't as easy as maintaining the site node cache, but we should be able to come up with a scheme that works. One possibility is to parse any FROM clauses and key our cache on that: when a table is updated, we mark any cache entries that have that table in the key as dirty. I know there are plenty of holes that need to be worked out in this method; it is only an initial suggestion.
Parsing the "from" clause is non-trivial to the point of probably not being useful (think "subselect") and of course tables whacked by PL/SQL can't be tracked by the db api running at the tcl level. This latter issue is the real killer when it comes to automatically tracking down which cache items to flush.
I don't have any brilliant ideas myself. Perhaps segregating database changes into two classes - "must take place now" (flush cache immediately) and "can be delayed" (flush at next timeout) - is a simple compromise that could work? Changes to permissions should take effect immediately; perhaps addition of content can wait.
Of course any caching policy will have to be configurable, and we'll always want to make it easy to turn off (in fact "off" should probably be the default). Only relatively busy sites need to cache. A big healthy dual-AMD DDR-based system with well-written queries should be able to run the majority of sites out there. We're really talking about the top few percent of busy sites when we talk about the need for caching.
Another issue with cache schemes more complex than "flush every N minutes" is the synchronization of cache between multiple AOLserver front-ends. The fact that Greenpeace runs two webserver front-ends was really fortunate in the sense that it kept me from even thinking about trying to implement any fancy cache scheme in the last day or so before launch :)
* Cache a portion of a page for some period of time
* Cache a particular query forever
* Cache a particular query for some period of time
* Cache a particular query until some event happens.
The event that indicates the need for cache flushing is not standard and is not trivial in some cases. Our solution to the above requirements was to use a key-based cache with an optional parameter specifying the length of time after which the cache entry expires. Whenever an event occurs that indicates the need for cache flushing (e.g. creating a new site node), we manually flush the appropriate cache entry using an appropriate key.
In the OACS world, one possible way to do the same for the queries is to use the unique keys we already have for the queries as the cache keys. When you create a site node, you flush that key from the cache (or at least mark it to be flushed at next access).
In practice, I have found that I really only need/want to cache a few queries - the expensive ones or the ones our users hit most often. As a developer, I know what these queries are - after all, I already had to tune them :) I would thus be very happy with an API to cache individual queries rather than a parameter to cache all my queries. With this type of design, it is up to the developers to decide explicitly what to cache, for how long, and when to flush.
From the previous posts, we have the following concrete needs:
* Greenpeace: flush individual db queries every 5 minutes
We can provide a -cache_time flag to the db_* procs and simply use the already unique name of the query as a cache key.
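The call site might look like this (the -cache_time switch is hypothetical, and as noted further down the thread, the query name alone isn't a sufficient key once bind variables are involved):

    # Cache this result for five minutes, keyed by the query name
    set ids [db_list -cache_time 300 live_item_ids "
        select item_id from cr_items where publish_status = 'live'
    "]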
* Permissions: "define a cache pool specifically for permissions, which gets flushed whenever an object's permissions change"
My assumption here is that we would be accessing the permissions we are checking through an existing api (e.g. ad_require_permission or something similar). These methods would then explicitly cache the results of the permission checks using a key which could be deterministically generated from the object_id.
The code which updates the permissions of an object could then flush the cache for the permissions key for that object.
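A sketch of both halves (PG function syntax; the proc and cache names are hypothetical, and a perm_cache pool is assumed to have been created at startup):

    # Check side: the key is derived deterministically from object, party,
    # and privilege
    proc permission_p_cached {object_id party_id privilege} {
        set key "$object_id,$party_id,$privilege"
        return [ns_cache eval perm_cache $key {
            db_string perm_check {
                select acs_permission__permission_p(:object_id, :party_id, :privilege)
            }
        }]
    }

    # Update side: when grants/revokes touch an object, flush every cached
    # check for that object_id
    proc perm_cache_flush {object_id} {
        foreach key [ns_cache names perm_cache] {
            if {[string match "${object_id},*" $key]} {
                ns_cache flush perm_cache $key
            }
        }
    }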
* Grouping of cached items - if you flush one item, you must flush these other items (from the example on keeping content in-sync with its associated side panels)
This is a pretty tough requirement. A cursory glance through the docs for ns_cache indicates it is already capable of flushing the cache based on a key pattern. If so, we can meet this requirement by creating composite cache keys and then flushing based on the first part of the key.
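For instance, with composite keys laid out as "group,item" (the layout and the build_news_panel proc are assumptions for illustration):

    # Store the panels of one page under a common prefix ...
    set key "frontpage,news_panel"
    ns_cache eval db_cache $key { build_news_panel }

Flushing "frontpage,*" then reduces to the same iterate-and-match loop shown in the permissions sketch above, even if ns_cache turns out not to support pattern flushing directly.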
* Flushing site nodes
The strategy here would be similar to permissions. We ensure a deterministic way to get the cache key for a given site node. When a site node is edited, we flush the cache for that key.
What do you think about requiring a bit more work from the developer to remove the need for complex logic pertaining to cache flushing? I haven't thought too much about the details of implementing the above in OACS, but I believe we have enough support with ns_cache to do the above. Please do correct me if I'm mistaken here.
"While I see the benefits of caching database queries, is it really a requirement to provide automatic caching of all queries?"

No, not in my mind. When I talk about being able to parameterize caching I mean I think it would be good to allow one to override a "-cache" switch passed to the db api. I think this is particularly important if we just implement a simple "flush occasionally" strategy. One way of thinking about this is that we allow the sitewide admin to set the cache timeout to a number that suits their needs, including "timeout in zero ticks" (which for efficiency's sake would turn off the cache code entirely). If simple timeouts are implemented, the tradeoff between the time before new content appears and performance is going to depend on traffic vs. one's hardware. And low-traffic sites needn't cache at all.
However ... in my thinking the above applies to db api calls that include an explicit "-cache" switch. I'm not suggesting that all queries be cached automatically.
The query name itself isn't sufficient, as the query result depends on the values assigned to bind vars as well. Thus we need a key that includes the fully qualified name from the query dispatcher (which the db api already has access to) plus the bind var values - most likely provided by the caller, since parsing them out of the query in the db api would be expensive given the legality of constructs like the SQL literal ':foo'.
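Roughly, something like this (the helper name is hypothetical):

    # A possible key: the fully qualified query name plus the bind values
    # the caller passes in
    proc db_cache_key {fq_query_name bind_values} {
        return "$fq_query_name,[join $bind_values ,]"
    }

    # e.g. db_cache_key acs-news.select_item_title {42 en_US}
    #   => "acs-news.select_item_title,42,en_US"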
Composite keys using a prefix value such as you suggest is an idea I've had, too. At first glance it's an attractive idea. It's also an idea that gets more complex the more you think about it. Well ... the more I think about it, anyway; maybe it will get less complex the more you think about it. If so ... share :) If we depend on the programmer to get their dependencies straight on each query we'll see errors. If we went this route I'd suggest specifying the dependencies in the Tcl call, not the .xql files, since queries depend on the shared architecture of the datamodels, not on Oracle/PG-specific query syntax.
The one place we already track package dependencies is in the APM which is why I'm thinking perhaps this could form a basis for figuring out when to flush queries. Any change to an acs-kernel datamodel would flush the cache for all packages, while changes to "news" wouldn't, and changes to the content repository would lie between these extremes. Is this a practical idea? I don't know, it just came to mind when I was thinking about dependencies and how to track them.
Your last post brought up a few new thoughts/questions:
It seems like you want to provide a generic user interface where the site wide admin can set global parameters related to caching. Is this realistic/desirable in practice or is it just a simple quick solution? I have found that the caching I've added tends to add confusion to the user experience. The only way I've been able to address this confusion is by explicitly providing controls for each thing that I've cached. Let's say the user is editing a live press release. If press releases are cached, I would want the cache controls directly available from the UI used to edit the press releases. Most of my users tend to create/edit items and then jump over to the public site to see how they look. If caching is enabled, they might be confused as to why their changes are not visible.
We could provide a more intuitive user experience by tying cache flushing into the user interface for the cached object. With these controls, I'm not sure we need global parameters like a default refresh time. Maybe your users have requested different functionality than mine - if so, please share :)
I agree that using composite keys is harder than it first seems. I'm also having a hard time understanding the real-life requirements for caching groups of objects together. In my experience, when I need to cache groups of objects together, I end up caching the end result of the top-level object only. For example, if I am going to cache my home page, and it contains a bunch of other expensive operations, I am perfectly happy caching the end result of building the home page (i.e. the string containing the page contents) and refreshing the entire string every N minutes. This strategy may address most of our needs.
I'm also having a hard time understanding what dependencies you want to track between different cached queries. Everything I can remember caching has had a pretty simple dependency tree, and I have been perfectly happy tracking those dependencies myself. I would also be wary of implementing cache flushing based on package dependencies. It is perfectly acceptable, and quite likely, that my package will depend on acs-kernel as a package, but one individual query I cache will itself have no dependencies on acs-kernel.
For our Red Hat CCM project, we use OSCache from OpenSymphony as part of our caching strategy. Their concept is to cache a fragment of a JSP for some period of time or until an explicit flush is signalled. They also provide an event-based framework to receive notification when various cache events happen (e.g. notify me when this site node is flushed from the cache). If we added similar functionality to OACS, then as a developer I can flush my own caches when something gets flushed from acs-kernel by registering a callback.
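A sketch of that notification idea in Tcl (all names here are hypothetical):

    proc cache_register_callback {event script} {
        # packages register a script to run when the event fires
        nsv_lappend cache_callbacks $event $script
    }

    proc cache_fire_event {event} {
        if {[nsv_exists cache_callbacks $event]} {
            foreach script [nsv_get cache_callbacks $event] {
                eval $script
            }
        }
    }

    # e.g. my package flushes its own pool whenever a site node is flushed:
    cache_register_callback site_node_flushed my_package_cache_flush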
This strategy has worked quite well for us so far. Here's a possible example of how this might work in OACS:
    set user_name [util_cache -key user_name_cache_${user_id} -refresh 60 {
        return [db_string select_user_name {
            select display_name from acs_objects where object_id = :user_id
        }]
    }]

The inner block could be as complicated as we need, but the end result would be a single cached string. The developer is responsible for any dependencies and key management. Key management might be a bit more work for the developer at first, but down the road it makes it explicit how to flush entries. Explicitly defined keys also make it clear how to flush multiple entries at one time.
To summarize:
- Do we need a generic "Cache administration" page or should this functionality be integrated with the user interface for each object we are caching?
- Can we do away with auto-generated cache keys? They add complexity to the caching system and creating them explicitly shouldn't be too difficult.
- Is "cache a group of objects and refresh together" a unique requirement or is it something we can support by saying "Cache the top-level object only?"
- Is it really too much work for developers to track their own cache dependencies?
- Should we add support for caching a portion of a generated html page?
"We could provide a more intuitive user experience by tying cache flushing into the user interface for the cached object. With these controls, I'm not sure we need global parameters like a default refresh time. Maybe your users have requested different functionality than mine - if so, please share :)"

Valid points here and in your previous paragraph.
In the Greenpeace project we
- had a non-negotiable launch date to meet
- inherited hopelessly broken code from a previous vendor that we could only partially fix given the non-negotiable launch date
- had hardware spec'd by a third party, on which the site must run for about one year while growing steadily in traffic (as additional Greenpeace national sites are added to the server)
- had a fixed budget
You're right that caching portions of pages would be a big win. I've been thinking about ways to cache the output of the ATS "multiple" tag, for instance. This would be even better than caching the underlying "db_multirow" query.
Is it too much work to ask developers to track their own cache dependencies? Well ... it's certainly error prone. Interestingly a couple of days ago Peter Marklund asked Lars Pind if he was caching the front page of pinds.com. Lars said "yes". Peter said "I thought so, I see something doesn't change on the front page when I (post? - Don forgets the exact point)". Lars said "oops, that's a bug" and then fixed it of course.
The point of this anecdote is that Lars is a very good programmer, his site's fairly simple, and yet he managed to slightly botch his personal cache code. We have a bunch of skilled volunteers working on OpenACS 4, and we have the beginnings of a formal testing program, but there's no way I'd claim that at the moment we can exercise the level of Q/A that a mature software company can. And we know how hard it is for software companies to exercise good Q/A. So I worry about introducing mechanisms that are potentially highly error-prone.
In practice if a package has its own cache pool for query results (or html snippets from the template system or both), it should be easy for the developer to remember to zap the cache when new content (a bboard post, etc) is added. Wholesale zapping of package content should still result in enough caching to greatly boost overall performance, because content is added fairly infrequently to any given package even on relatively busy systems.
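A sketch of what that looks like for a single package (the pool name and size are hypothetical):

    # One pool per package, created at server startup ...
    ns_cache create bboard_cache -size 2000000

    # ... and zapped wholesale from, say, the end of the new-post code path
    foreach key [ns_cache names bboard_cache] {
        ns_cache flush bboard_cache $key
    }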
That's the easy part. Permissions changes are problematical - maybe we define package callbacks for permissions changes on content the package owns? Other than permissions, what other global tables invalidate package content when they change? Any? Permissions is what comes to mind.
Perhaps simply caching multirow (or template multiple tag) results and not caching other queries would be sufficient for the general toolkit. After all, most sites work fine without any caching whatsoever. Perhaps the more sophisticated caching needed for very busy, complex sites need not be addressed by the toolkit out-of-the-box. If caching were restricted to such queries in the toolkit proper, I think we could get away with simple flushing of package-specific cache pools whenever the package modifies its content, or whenever permissions on an object owned by the package change.
Would this get us, say, 75% the benefit of a more complex scheme? If so, how many sites would need anything more complex? Hardly any, I imagine.
I don't know, though. The purpose of this thread is to trigger discussion, so let's keep on with it. I appreciate your sharing your experiences ...