Forum OpenACS Development: Strange problem with util_memoize flush

Hi everybody,

For the last few days I've been facing a strange problem with the util_memoize flush that I haven't seen before. The problem happens when the util_memoize cache gets close to its size limit (the acs-kernel MaxSize parameter).

When the cache is close to the limit and util_memoize is called, the following command is executed:


set pair [ns_cache eval util_memoize $script {
    list $current_time [eval $script]
}]

At this point the ns_cache eval operation keeps running forever, and the util_memoize call never returns.

This is a serious problem: if a user who was accessing the system had, for instance, some permission data cached, then when he accesses a page again the system will look for the stored value, it won't be there, the call that should recompute it hangs, and so the user request never gets an answer.

From my understanding of ns_cache, it should release the memory used by the older cache values and mark those entries as no longer cached. I guess the system is somehow not performing the right operation, which would mean it releases the memory where the old value was stored but doesn't tell ns_cache that the old entry no longer exists.

Does anybody have a clue about why this is happening?

Posted by Gustaf Neumann on
What version of OpenACS and which of aolserver/naviserver are you running? aolserver 4.0, aolserver 4.5.1 and naviserver have three different implementations of ns_cache.
Posted by Eduardo Santos on
Hi Gustaf,

Thanks for the reply. I'm using the HEAD version of AOLServer 4.5.1.

Posted by Eduardo Santos on
Oops, I forgot the OpenACS version: I'm also using OpenACS 5.5.1 (from the oacs-5-5 tag in CVS).
Posted by Gustaf Neumann on
Can you verify that you don't load nscache.so from your config.tcl?
Posted by Eduardo Santos on
Hi Gustaf,

I remember this problem from when I updated the version, and I've already taken the loading of nscache.so out of my config.tcl. This is the module loading part of my log:


[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Error: pidfile: failed to open pid file '/usr/local/aolserver/log/nspid.ct-gcie': 'Permission denied'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: encoding: loaded: utf-8
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: fastpath[ct-gcie]: mapped GET /
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: fastpath[ct-gcie]: mapped HEAD /
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: fastpath[ct-gcie]: mapped POST /
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: adp[ct-gcie]: mapped GET /*.adp
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: adp[ct-gcie]: mapped HEAD /*.adp
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: adp[ct-gcie]: mapped POST /*.adp
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/bin/nssock.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/bin/nslog.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: nslog: opened '/srv/logs/ct-gcie/uxmpog001/ct-gcie.log'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/bin/nssha1.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/bin/nsdb.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/bin/nspostgres.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: PostgreSQL loaded.
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: modload: loading '/usr/local/aolserver/lib/thread2.6.5/libthread2.6.5.so'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: conf: [ns/server/ct-gcie]enabletclpages = 0
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: default thread pool: minthreads 20 maxthreads 30 idle 0 current 0 maxconns 100 queued 0 timeout 120 spread 20
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: stats: web stats enabled for '/_stats'
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: XOTcl version 1.6.3 loaded
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: Loading OpenACS, rooted at /var/www/ct-gcie
[13/Jan/2011:03:00:05][15181.18446744071738828512][-main-] Notice: Sourcing /var/www/ct-gcie/packages/acs-bootstrap-installer/bootstrap.tcl

Thanks for the help.

Posted by Gustaf Neumann on
My "hope" was that you had some misconfiguration, but so far, i can't see anything obvious. How large is your util_memoize cache? I tried for a while to reproduce the problem with variations of
string length [util_memoize {string repeat abcde 30000}]
in the ds/shell, but could not reproduce the problem. Can you produce a sequence of util_memoize commands like the one above (e.g. loop) to reproduce the problem on your site?
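For example, a loop of roughly this shape (illustrative only; the repeat counts are arbitrary, and each iteration caches a value of about 30 KB under a distinct key) should push a 200000-byte cache over its limit after a handful of iterations:

# each distinct script becomes a separate cache entry of roughly 30 KB
for {set i 1} {$i <= 20} {incr i} {
    string length [util_memoize [list string repeat abcde [expr {6000 + $i}]]]
}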
Posted by Eduardo Santos on
Hi Gustaf,

My cache is 200000 MB large, and the problem clearly happens when the cache gets full. I can see the following info from the AOLServer stats:

Cache                     Max       Current  Entries  Flushes  Hits     Misses  Hit Rate
xowiki_cache              200000    2026     6        0        110      106     50%
xotcl_object_type_cache   10000     346      37       35809    92407    35846   72%
xotcl_object_cache        400000    19591    16       0        6152     16      99%
util_memoize              200000    199939   2852     7557     1109321  26009   97%
secret_tokens             32768     600      15       0        300      15      95%
nsfp:ct-gcie              5120000   934418   137      0        1557     137     91%
nsfp:                     5120000   0        0        0        0        0       0%
ns:dnshost                100       0        0        0        0        0       0%
ns:dnsaddr                100       5        5        0        230      5       97%
ds_page_bits              10000000  0        0        0        0        0       0%
db_cache_pool             200000    20       2        0        7841     2       99%

When the cache gets close to the limit and the system tries to reload it, it fails and the reloading process lasts forever.

Posted by Gustaf Neumann on
The statistics indicate that your util_memoize cache is 200000 bytes, not MB! I do not understand the last sentence. What exactly is "it"?

Yet another attempt to help you:
Run the following script in your ds/shell:

foreach i {1 2 3 4} {util_memoize [list string repeat abcde 3998$i]}
append _ "nearly full: " [ns_cache_size util_memoize] " "
foreach i {4 3 2 1} {util_memoize [list string repeat abcde $i]}
append _ "over full: " [ns_cache_size util_memoize]
The first line fills your cache to a size of 199931 bytes. The third line crosses the 200k border in several steps. Does this work on your machine?
Posted by Gustaf Neumann on
We have seen a few occurrences where an ns_cache eval blocks and can eventually lead to all connection threads blocking. The situation happens when the script executed by ns_cache eval requires a database handle, but the server has run out of handles. Make sure that you have maxthreads+15 handles in pool1 and pool2, and that pools is e.g. pool1,pool2,pool3 (and not *, as was allowed in earlier aolserver versions). The value of 15 is an estimate, assuming that no more than 15 scheduled procs needing database handles are running.
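For reference, in a stock OpenACS config.tcl the relevant pieces look roughly like the sketch below; the connection count is only an illustration of the maxthreads+15 rule, and ${db_host}, ${db_port}, ${db_name}, $db_user and ${server} stand for the variables defined at the top of that file:

ns_section ns/db/pools
ns_param   pool1        "Pool 1"
ns_param   pool2        "Pool 2"
ns_param   pool3        "Pool 3"

ns_section ns/db/pool/pool1
ns_param   connections  50                  ;# at least maxthreads + 15
ns_param   driver       postgres
ns_param   datasource   ${db_host}:${db_port}:${db_name}
ns_param   user         $db_user
ns_param   password     ""

# pool2 needs the same number of connections; pool3 is analogous

ns_section ns/server/${server}/db
ns_param   pools        pool1,pool2,pool3   ;# an explicit list, not "*"
ns_param   defaultpool  pool1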

Maybe this can help someone
-gustaf neumann

Posted by Eduardo Santos on
Hi Gustaf,

Thank you for your answer. I had the same thought as you, but I don't think this is the issue. I have 100, 50 and 50 connections for pool1, pool2 and pool3, and my maxthreads is set to 35.

Checking the threads currently running on the system, I have the following list:

:throttle 1141053776 NS_THREAD_DETACHED 03:00:53 AM on 03/03/2011 NULL NULL
::bgdelivery 1164183888 NS_THREAD_DETACHED 08:15:22 AM on 03/03/2011 NULL NULL
-sched:idle8- -sched- 1166289232 NS_OK 08:25:04 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x8
-sched:idle7- -sched- 1155791184 NS_OK 03:50:18 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x7
-sched:idle6- -sched- 1153685840 NS_OK 03:12:50 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x6
-sched:idle5- -sched- 1151580496 NS_OK 03:05:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x5
-sched:idle4- -sched- 1149475152 NS_OK 03:04:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x4
-sched:idle3- -sched- 1147369808 NS_OK 03:02:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x3
-sched:idle1- -sched- 1143159120 NS_OK 03:01:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x1
-sched:idle0- -sched- 1094764880 NS_OK 03:01:15 AM on 03/03/2011 p:0x7ff42aa5c760 NULL
-sched:22- -sched- 1145264464 NS_OK 03:01:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x2
-sched- -main- 1084238160 NS_OK 03:00:05 AM on 03/03/2011 p:0x7ff42aa5c970 NULL
-nssock:driver- -main- 1130555728 NS_OK 03:00:52 AM on 03/03/2011 p:0x7ff42aa4beb0 a:0x6292b0
-main- 719976160 NS_THREAD_DETACHED 03:00:05 AM on 03/03/2011 NULL NULL
-default:9- -main- 1107396944 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:8- -main- 1105291600 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:7- -main- 1103186256 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:6- -main- 1101080912 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:5- -main- 1098975568 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:4- -main- 1096870224 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:3- -main- 1092659536 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:20- -nssock:driver- 1132661072 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:2- -main- 1090554192 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:19- -main- 1128450384 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:18- -main- 1126345040 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:17- -main- 1124239696 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:16- -main- 1122134352 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:15- -main- 1120029008 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:14- -main- 1117923664 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:13- -main- 1115818320 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:12- -main- 1113712976 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:11- -main- 1111607632 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:10- -main- 1109502288 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread
-default:0- -main- 1086343504 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread 7482 200.198.213.70 running GET /_stats/threads.adp 0.7389 0

When the cache problem happens, however, the DB queries seem to last forever:

[03/Mar/2011:10:32:57][31076.1166289232][-sched:9-] Warning: db_exec: longdb 928 seconds nsdb0 select dbqd.acs-mail-lite.tcl.acs-mail-lite-procs.acs_mail_lite::sweeper.get_queued_messages

The problem doesn't happen with scheduled procs only. Regular DB calls show errors too:

[03/Mar/2011:09:51:35][31076.1132661072][-default:20-] Warning: db_exec: longdb 929 seconds nsdb1 0or1row dbqd.acs-tcl.tcl.request-processor-procs.rp_lookup_node_from_host.node_id
[03/Mar/2011:09:51:35][31076.1132661072][-default:20-] Error: Error in include template "/var/www/ct-gcie/packages/xowiki/www/portlets/portlet-minimo-comunidade": Database operation "0or1row" failed
(exception ERROR, "could not receive data from server: Connection timed out
")

The problem with caching is: when the cache grows to its limit, the flush operation seems to last forever and the DB call to reload the cache runs into a timeout.

I still have no clue about the reasons for this behaviour.

Posted by Gustaf Neumann on
The symptoms on your system are that every database query suddenly takes 928 seconds! Unless you have millions of host-node entries defined, rp_lookup_node_from_host.node_id should take significantly less than one ms. Running out of db resources can explain the symptoms; running out of space in an aolserver cache cannot (caches are not supposed to keep all the information put into them, they are supposed to keep the most important entries once the space is no longer sufficient to keep everything).

What is the configuration value for max_connections in your postgres configuration?

Posted by Eduardo Santos on
Hi Gustaf,

I had the same thoughts. The thing is that I've already reviewed the PostgreSQL configuration 100 times and I couldn't find any bottleneck. The machine is fine on resources, PostgreSQL is well tuned, and we never run out of resources.

I've set max_connections to 200, but there are never more than 30 connections at the same time. I've double-checked the warm standby config, shared_buffers, kernel shared memory, everything I know, and PostgreSQL seems to be ok.

It seems like a cache problem to me because it only happens when the util_memoize cache gets full. In any other situation the queries take the regular amount of time.

In the PostgreSQL log I keep seeing the same timeout message:


2011-01-14 21:08:43 BRST LOG: não pôde receber dados do cliente: Tempo esgotado para conexão

It's a timeout message saying that the server got an EOF from the client connection.

This problem is proving very difficult to catch.

How does AOLServer implement connection pooling? Are these persistent connections? Is there a separate pool for cache-refresh queries?

I'm asking because there could be some connection reset in the network infrastructure based on connection time. Maybe this could be the problem (I'm running out of ideas).

Gustaf, thank you very much for your time and your answers. They have been very helpful.

Posted by Gustaf Neumann on
If i trust google translate, the message says "could not receive data from client: Connection timed out", which clearly indicates a database problem (as do the 900+ second times).

Do you have MaxIdle and/or MaxOpen set for pool1 or pool2? If so, remove your settings.
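If they are set, they typically show up in the pool sections of config.tcl as lines like the following (the exact values vary; this is just what to look for):

ns_section ns/db/pool/pool1
ns_param   maxidle   1000000000   ;# remove or comment out
ns_param   maxopen   1000000000   ;# remove or comment out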

Concerning your questions: ns_cache is completely agnostic about the database. The worker threads (e.g. connection threads) acquire db handles (references to db connections) via gethandle; in most cases they need a pool1 connection, and sometimes a pool2 connection (for subqueries). The handles are released either explicitly (which happens seldom) or automatically when the request is done and the worker thread is finished.
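As a purely illustrative sketch of that handle mechanism (OpenACS code normally goes through the db_* API, which acquires and releases the handle for you; the table queried here is just an example):

# explicit handle allocation at the ns_db level
set db  [ns_db gethandle pool1]
set row [ns_db 1row $db "select count(*) as n from acs_objects"]
ns_log notice "acs_objects count: [ns_set get $row n]"
ns_db releasehandle $db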