Forum OpenACS Improvement Proposals (TIPs): TIP #57 (Approved): Add caching db_* API (based on Greenpeace/Berkeley work) to 5.2

When implementing Greenpeace planet I enhanced the database API to allow for optional caching of rows returned by queries.  We found this *extremely* useful.  The Berkeley School of Music picked it up and enhanced it slightly when implementing their Learning Management System and also found it extremely useful.

Why use a caching DB API rather than util_memoize?  Among other things you don't have to write a special non-cached Tcl proc to contain the query ... at its simplest you just add "-cache" to a db_* call and the query result is cached for a configurable amount of time.  This makes it very simple to cache individual nasty queries contained within content templates.  It also will cache structures like ns_sets properly.

The original implementation uses the query itself as the key, which is unsatisfactory (it didn't subst Tcl bindvars so you needed to "$" those you want to control the uniqueness of the query).  For general use, I'd implement a couple of different ways to uniquely tag queries - the query name plus a list of key/value pairs, for instance, with flush-by-key/value pair as one option.  Another would be util_memoize-like regexp flushing with your own custom key.

Don,

When I first read your proposal casually, I had a hard time understanding it, namely separating your information about the existing implementation (util_memoize), what you implemented previously, and your proposal.

The recommended format for TIPs is described in TIP #2. Basically the different parts of the proposal are labelled with standard headers. This would make it easier to read and understand TIPs quickly. Imho TIP #29 is a good example of the suggested format.

I'd suggest that future TIPers keep this in mind.

Thanks, Guan

Hi Don,

Would you be able to set the timeout on a per query basis or is it site wide?  Or something in between.

At any rate this is a real great addition and one that I am waiting for.  I just hope it doesn't get abused.

Hi Don,

I would like to ask if its possible to cache a query on the duration of a single request.

For example rather than passing around a big data structure around procs, etc.  You just rerun the query since its already been cached.  So the cached data structure now becomes your way of accessing your needed data, rather passing around across your procs using custom code.  Although I would like it to be in just for the duration of a single request.  So its going to be like db_transaction of some sorts, well only on selects that is.

Thanks looking forward to use this new facility in oacs.

I like the idea, and if you can key it on the values of the tcl bind variables it will be great.

Jun, look into using namespace variables to pass around data strucutres instead of upvar. I should start a thread on this.

"Would you be able to set the timeout on a per query basis or is it site wide?"

I think for Greenpeace I implemented a timeout value per cache pool ... it's not clear that using different timeouts for different queries is a very good idea.  If one (say) forum content query caches for two minutes and another for five minutes page-to-page results might be ... confusing!

Dave, I'll probably use the query with tcl bindvars substituted as the default key.  But more useful will be the ability to define a set of objects you want to key a query to and to then be able to flush the cache based on elements of that key.

For instance ... the portal system in various scenarios will want to flush all queries for a particular portal, a particular portal page, or a particular portlet.  By using these three values as a key and then by implementing a flush mechanism that flushes based on a partial key (say a portlet object id only) it will be easy to do so.

One reason I didn't put the caching db_* API into OpenACS two summers ago (though several folks asked me to do so) was that I was hoping to develop a more general caching scheme that would maintain coherency.  But our thin-layer Tcl+db query structure makes this impossible to do automatically IMO (and I've thought long and hard about it several times).  But a flexible key and flushing capability should let individual application authors do so without much trouble.

Jun ... I agree with Dave that some other technique, such as the namespace technique he refers to, is better suited for the case you brought up, which isn't really query result caching but rather a more efficient way to pass query results around.

Guan ... at its simplest the caching db_* API looks like something like this:

db_1row -cache my_query_name {}

as opposed to:

db_1row my_query_name {}

As far as syntax for building composite keys from object_ids etc I won't be working on that until I return to PDX, and then only if folks approve of the general notion.  Before implementation I'll post syntax for comment but if people don't like the idea of direct query caching there's not much point in my putting effort into further design ...

Don,

Using different timeout per query might be helpful.  For example the query to out put "Today's Features" will timeout after 24 hours.  While the forums query will timeout in 2 mins or less.  Although since you mentioned that it will be per cache pool.  So its possible to create 1 day pool, 1 min pool?

modified title of posting to conform with TIP standards
Jun, I think that sounds like a very error-prone way to deal with timed queries, which should depend on having date calculations as part of the query itself.

This is true whether or not one uses a caching db_* API, util_memoize, or ns_cache natively.

It would take a much better example than this for me to believe that per-query timeouts are more useful than dangerous.

If it is up to the develop to decide how long to cache when using util_memoize, I would vote (if I were a voter) for giving the developer the choice of when using db_* caching, perhaps with some guidance in the api-doc.

With or without caching, though, this is a great contribution. I'm hoping it gets approved!

maybe ignore the following harebrained idea..

Instead of using a cache time-to-expire,

have a tcl array (somewhere in tcl namespace) that tracks time of last write to the table (and time query was cached).  Then have the query automatically clear the cache when the time of the cached query is older than the latest write to that table.

One problem with this suggestion is that queries often reference several tables, and that tables can be bashed by triggers or PL/[pg]SQL ...

I really think composite keys, typically based on object_ids, with individual applications being responsible for flushing cached query results when content is updated or added is the only practical approach.  At least I've not been able to think of any practical automagic cache coherence approach that works for general queries.

I'm not sure I understand the concept of composite keys. Why not just attach a list of object_ids to a cached query, and whenever at least one of those ids is mentioned in a flush statement the query is flushed?

Thus the example could look like this:

set cache_ids [list $my_message_id $my_reply_id $my_forum_id $some_other_arbitrary_id ...]

db_1row -cache my_query_name -cache_ids $cache_ids {}

Tilmann, one problem with your scheme is that for a lot of queries you do not know the ids when you do the query (eg, for a thread in forums you know the root message id, but not all the childrens ids).

Another is that with the forums example, adding a post adds an id rather than changing any existing ids so you have to do an explicit flush on the parent_id anyway.

A third really hairy issue is that for queries which involve permissions checks, it would be prohibitive to try to do any real dependency checking; and I given the varying need for security such queries probably should not be cached (or at least there should be some option to disable caching of queries involving such checks). Certainly, the experiences with PermissionsCacheP point to a lot of problems with caching such things.

I also think its really easy to go overboard on dependency checking, I have worked with a couple of different systems that had dependency checking for caching or for lazy evaluation and both had numerous places where checking the dependencies was slower than simply recalculating.

In any case, I definitely approve the tip, although I think it will be a while before we are really happy with how things work.

Tilman, the problem is that you might want to use something other than an object_id as a key.

Jeff ... the permissions case is not easily solvable and naive use of the db_* caching API will work exactly as the current util_memoize approach to caching forums or other content.  In other words properly written queries (that do an EXISTS query against the permissions view) won't honor a perm change until your cache is flushed.

This isn't good ... but then again it's true today with util_memoized procs containing the equivalent queries.  This extension to the db_* API won't solve such problems, it will just simplify caching.  Coherent permissions caching is a problem unto itself which needs a special solution regardless of whether caching is implemented via a proc cache method like util_memoize or a smart db_* caching API.

A simple solution would be to have the permissions Tcl API flush all cache pools registered to the db_* API.  While this is gross overkill you'd still get the advantages of caching most of the time since ... callbacks are another approach?  Not sure ... it's not a simple problem.

Ok, not keying always to object_ids makes sense - I still don't get what is meant by composite keys though. Just a list of arbitrary strings where each string is a freely choosable key? Anyway,

Approve.

And yes, when permission checks are involved caching should be avoided of course. Flushing in the tcl api permission call won't catch all cases either, there are lots of calls to the pl/sql permission api in the toolkit.

I'm guessing, if this is similar, but easier to use than util_memoize, that you would use this on a production site, where the developer would be free to choose the keys. You wouldn't normally write a package with this switch, would you? Although a comment with suggesions might be helpful.

2 approvals and no vetos in 1 week; marking approved.