Forum OpenACS Q&A: Some thoughts on OpenACS caching

Posted by Brian Fenton on

Hi all

I've been testing some OpenACS caching mechanisms in our product, and thought people might be interested in this. I particularly like the -cache_keys parameter available on some of the db_ procs. Of course, the easy part is caching the database call, the hard part is knowing where to flush the cache.

One proc that I noticed gets called a lot and that isn't using caching is apm_get_installed_versions https://openacs.org/api-doc/proc-view?proc=apm_get_installed_versions&source_p=1&version_id=

Now, db_foreach doesn't support -cache_keys, but I tried a quick rewrite using db_list_of_lists e.g.

ad_proc -public apm_get_installed_versions {
    -array:required
} {
    Sets the current installed version of packages installed on this system
    in an array keyed by package_key.

    @param array Name of array in caller's namespace where you want this set
} {
    upvar 1 $array installed_version

    set lol [db_list_of_lists -cache_key installed_packages installed_packages {
        select package_key, version_name
        from apm_package_versions
        where enabled_p = 't'
    }]

    template::util::list_of_lists_to_array $lol installed_version
}

Some quick timing tests gave an average speed improvement from 90745 microseconds to 603 on our ancient dev server, which is pretty good. However, I'd need to think further about flushing the cache. Generally we would never enable or disable packages without a system restart, so in our case it's probably safe not to worry about flushing.
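For completeness, if packages ever were enabled or disabled at runtime, the cached entry could be dropped with db_flush_cache (the same call used elsewhere in OpenACS). A sketch, assuming the exact cache key from the rewrite above:

```tcl
# Flush the single cached entry whenever apm_package_versions changes
# (a package is enabled, disabled or installed).  The pattern here is
# the exact cache key used in the db_list_of_lists call above, so no
# wild card is involved.
db_flush_cache -cache_key_pattern installed_packages
```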

It may also be preferable to use util_memoize in this case instead of -cache_keys, since there is still the per-call cost of template::util::list_of_lists_to_array, but I thought it was interesting to report either way.
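A util_memoize variant could look like the sketch below; note that apm_installed_versions_list is a hypothetical helper (not a stock proc), and the max_age of 600 seconds is only an illustration:

```tcl
# Hypothetical helper so the memoized script is a plain,
# argument-free call.
ad_proc -private apm_installed_versions_list {} {
    Returns the enabled package versions as a list of lists.
} {
    db_list_of_lists installed_packages {
        select package_key, version_name
        from apm_package_versions
        where enabled_p = 't'
    }
}

# Cache the result for up to 10 minutes; the array conversion in the
# caller still runs on every call, as noted above.
set lol [util_memoize apm_installed_versions_list 600]
```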

Posted by Gustaf Neumann on

I particularly like the -cache_keys parameter available

Well, the db interface with "-cache_key" is not a recommended interface, since it does not scale and leads to many locks. Flushing of this cache is brute-force or not happening at all. In the sketched case, as you noted, the cache has to be flushed (in the full cluster) whenever a package is enabled (or installed) or disabled.

If you do not care about getting proper results in these situations, one can use the version below on oacs-5-10, which uses a lock-free cache (and is likewise never flushed).

one proc that I noticed gets called a lot ...

Why does apm_get_installed_versions get called a lot on your site? On stock instances, it is only called by the package installer...

-g

ad_proc -public apm_get_installed_versions {
    -array:required
} {
    Sets the current installed version of packages installed on this system
    in an array keyed by package_key.

    @param array Name of array in caller's namespace where you want this set
} {
    upvar 1 $array installed_version

    array set installed_version [acs::per_thread_cache eval -key acs-tcl-apm_get_installed_versions {
        db_list_of_lists -cache_key installed_packages installed_packages {
            select package_key, version_name
            from apm_package_versions
            where enabled_p = 't'
        }
    }]
}
Posted by Brian Fenton on
Hi Gustaf

Many thanks for the information - as always, very informative. I don't understand what you meant by "does not scale and leads to many locks. Flushing of this cache is brute-force or not happening at all." If we are not flushing, then this shouldn't be a problem, right? In cases where we are flushing, would it be a reasonable approach to use a different cache pool for each db_* call we want to cache? That would create a different ns_cache under the hood, right?

Does the util_memoize approach also have this problem?

Thanks for the code sample - sadly our OpenACS version doesn't have this feature.

I checked to see why we are running apm_get_installed_versions so much, and it looks like one of my colleagues added that in some custom code, so yes you are correct that it's not stock code.

Brian

Posted by Gustaf Neumann on
does not scale ...

On large applications of OpenACS (many cache entries, many threads) the performance will degrade substantially.

... and leads to many locks. Flushing of this cache is brute-force

The reason is wild-card flushes like the ones below.


./new-portal/tcl/portal-procs.tcl: db_flush_cache -cache_key_pattern portal::get_page_header_stuff_${portal_id}_*
./acs-subsite/tcl/application-group-procs.tcl: db_flush_cache -cache_key_pattern application_group_*

The problem is that with a substantial cache (e.g. 100K cache entries) every wild-card flush requires

1. Lock the cache
2. Transfer all cache keys (e.g. 100K) into the Tcl interpreter
3. Iterate over every cache entry and perform a "string match" operation
4. Flush the matching entries
5. Unlock the cache
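Schematically, steps 2-4 amount to something like the following (expressed with the NaviServer ns_cache_* API; the cache name and pattern are just illustrative), all of which happens while the cache mutex is held:

```tcl
set pattern "application_group_*"
# step 2: copy every cache key into the Tcl interpreter
foreach key [ns_cache_keys db_cache_pool] {
    # step 3: "string match" on each key
    if {[string match $pattern $key]} {
        # step 4: flush the matching entry
        ns_cache_flush db_cache_pool $key
    }
}
```

The loop's duration grows with the total number of entries in the cache, not with the number of matches, which is why a large cache makes every wild-card flush expensive.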

The more entries, the longer the lock duration will be. When multiple such locks happen, the second one has to wait until the first finishes before it gets the lock; therefore, the waiting times can pile up. In the following figure, t1, t2, t3 ... are mutex lock operations issued from multiple threads.

Note that these locks bring those threads to a full stop, and even fast calls have to wait. I have seen real-world examples where such locks take 100ms or longer, which means they bring the server to a crawl. It is not unusual that even in current versions of OpenACS there are many (100+) locks on the db_cache_pool.

The main problem is the wild-card flushes. Even with the cache key you are using (without wild cards), some other cache-flushing operation might cause a long lock and make your requests slow.

The solution is to avoid wild-card flushes (which is sometimes hard for db queries) and to use partitioned caches. I will probably say something about this at the forthcoming OpenACS conference.

Does the util_memoize approach also have this problem?

Yes, pretty much. This applies to wild-card flushes on all ns_caches, including util_memoize, which is a kitchen-sink cache. You might have noticed that the usage of the util_memoize cache has been significantly reduced over the last years. It is much better to have multiple specialized caches than one large cache.

sadly our OpenACS version doesn't have this feature.
If you are concerned about performance, upgrading will improve it. Caching has been greatly improved over the last years.

Hope this explains.
-g

Posted by Brian Fenton on
That was very very helpful, I understand the problem better now. Thank you so much, Gustaf.

My takeaway from your comments is that if I never flush, then it's safe to use this caching.
And if I do need to flush, it is best to use a partitioned cache, i.e. a dedicated cache pool, and to keep it as small as possible.
Also I will take a look at our usage of util_memoize and see if there are any potential issues there.

Upgrading our system would be wonderful, but that ship has sailed unfortunately.

Best wishes
Brian

Posted by Gustaf Neumann on
My takeaway from your comments is that if I never flush ...
The problem is not cache flushing in general, but wild-card flushes (which have to iterate over all the cache keys). Also note that wild-card flushes on a cache stop all other operations on that cache as well (e.g. the flush from portal-procs above will also block "your" harmless operation).

... then it's safe to use this caching.
Whatever "safe" means. When something is cached from the database, and the database is updated, but the cache is not flushed, stale results (incorrect) will be reported.

In general, it is better to use specialized caches with precise flush semantics (typically via an API), not based on wild-card flushes. Many small caches are also better, since this leads to better concurrency. If there is only one cache, and this cache is locked, then everything comes to a halt. On our LEARN system, we have on average 320 locks per page view; each of these can bring all running threads to a full stop.

that ship has sailed unfortunately.
When your application is performance-sensitive, an upgrade will help. Since acs-core is adapted to work with Oracle 19c, upgrading your core should be feasible. This is probably less effort than cherry-picking some changes to improve performance, increase security, etc. Let me know if I can help....

all the best -g

Posted by Brian Fenton on
Thanks for the clarifications. Understood.

Brian

Posted by Andrew Piskorski on
Wait, so using db_multirow with -cache_key is not recommended? In that case, what is the recommended scaleable way to cache the results from db_* API calls?

I'd already noticed that the DB API caching, although very handy, has some surprising limitations. In particular, I want to report how old a particular cached result is. OpenACS does not store that info at all, so I stored it myself in an ancillary nsv. But this isn't the best solution, because if the cache is flushed in the background, my nsv storing the cache create time does not know, and gets out of sync.

Also, in the log when NaviServer starts up, my OpenACS installation says it's "Using ns_cache based on NX 2.4.0". That comes from NaviServer's tcl/aolserver-openacs.tcl file, which it describes in a comment as a, "Minimal ns_cache implementation based on NX".

So this NX-based stuff is a backwards compatibility wrapper for old AOLserver ns_cache calls? And OpenACS still uses those old AOLserver-style calls, instead of the full ns_cache_* stuff built into NaviServer? Which cache API am I best off using for my own work going forward?

Posted by Gustaf Neumann on

Wait, so using db_multirow with -cache_key is not recommended?

Using the -cache_key option with db_* functions is generally problematic because the SQL queries can combine data from multiple tables. When data in these tables is added, updated, or deleted, the cached data can become inconsistent, resulting in discrepancies between cached and uncached queries. The developer is responsible for flushing the cache in these cases, which can be challenging.

If a single cache is used for many queries, it can become overloaded (blocked requests), containing numerous entries, and potentially locking the entire system to a standstill. Wildcard cache flushes are particularly problematic in such scenarios (see above in this thread).

Explicit flushing can be avoided by limiting the cache time (e.g., to 5 minutes) and accepting temporary inconsistency, similar to the "eventually consistent" model from high-availability systems. This approach may be suitable for some applications, but not for others, where users expect immediate updates after changing some content.
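With the NaviServer ns_cache_* API, this time-limited approach can be sketched as follows (my_cache, get_value and the query are illustrative; the cache is assumed to be created elsewhere with ns_cache_create):

```tcl
# Accept up to 5 minutes of staleness instead of explicit flushing:
# ns_cache_eval re-runs the script once the entry has expired.
set value [ns_cache_eval -expires 300 -- my_cache $key {
    db_string get_value {select ... from ... where ...}
}]
```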

For certain applications, using -cache_key might be acceptable, but it’s not ideal for building scalable systems. By "scalable," I mean caches with hundreds of thousands of entries, processing hundreds of requests per second, each triggering hundreds of locks, reaching peaks of 500,000 locks per second (measured in real OpenACS applications). Most of these locks are from ns_cache or nsv.

In particular, I want to report how old a cached result is. OpenACS doesn’t store that information, so I stored it in an ancillary nsv. But this isn't the best solution...

This is actually problematic - not only performance-wise (since it requires locks for both the cache and the nsv), but also in terms of atomicity, leading to race conditions. If you want to expire old entries, use the -expire option of ns_cache when creating a cache or a single cache entry. If you want to report the time when an entry was added to the user, store a small dictionary containing the timestamp and value instead of the pure value. You can consider using nsv_dict to retrieve just the value.
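The timestamp-in-the-entry idea can be sketched like this (the cache and query names are illustrative; the cache is assumed to be created elsewhere):

```tcl
# Store a small dict of timestamp + value as the cache entry, so the
# entry's age travels with the value; no separate nsv is needed, which
# avoids both the extra locks and the race condition described above.
set entry [ns_cache_eval -- my_cache $key {
    dict create timestamp [clock seconds] value [db_string get_value {
        select ... from ... where ...
    }]
}]

# Age in seconds, and the cached value itself:
set age   [expr {[clock seconds] - [dict get $entry timestamp]}]
set value [dict get $entry value]
```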

The recommended scalable approach to caching is to use an API rather than raw SQL queries, which allows full control over update operations. For managing ns_caches, use ::acs::HashKeyPartitionedCache for non-numeric keys or ::acs::KeyPartitionedCache for numeric keys. One can specify the desired number of cache partitions either at creation time, or from e.g. the configuration file (using different sizes for development and production).

Usage example:

::acs::HashKeyPartitionedCache create ::acs::misc_cache \
    -package_key acs-tcl \
    -parameter MiscCache \
    -default_size 100KB

set x [::acs::misc_cache eval -key foo-$id {
    db_string .... {select ... from ... where ... = :id ...}
}]
::acs::misc_cache flush foo-$id

In OpenACS 5.10.1, we have additionally lock-free caches (for per_request_cache and per_thread_cache) with very similar interfaces.

So this NX-based stuff is a backwards compatibility wrapper for old AOLserver ns_cache calls? And OpenACS still uses those old AOLserver-style calls instead of the ns_cache_* API built into NaviServer? Which cache API should I use for future work?

Since OpenACS 5.10.1 requires NaviServer, it’s possible to remove the compatibility wrapper and use the underlying functions for all packages in the oacs-5-10 branch. However, retaining the wrapper still makes sense for legacy or site-specific packages to ease the transition to OpenACS 5.10.1.

Posted by Gustaf Neumann on
There is more about the caching reform in OpenACS in the presentation OpenACS 5.10 and Beyond given at the OpenACS 2022 conference.

all the best -g