Forum OpenACS Development: Re: Strange problem with util

10: Re: Strange problem with util_memoize flush (response to 1)

Posted by Gustaf Neumann on 03/03/11 10:24 AM

We have seen a few occurrences, where a ns_cache eval blocks and can lead to eventually all connection threads blocking. The situation happens, when the script executed by ns_cache eval requires a handle, but the server runs out of handles. Make sure, that you have maxthreads+15 handles in pool1 and pool2, and that pools is eg. pool1,pool2,pool3 (and not * as it was allowed in earlier aolserver versions). The value of 15 is an estimate, assuming that there are not more than 15 scheduled procs running needing database handles.

Maybe this can help someone
-gustaf neumann

11: Re: Strange problem with util_memoize flush (response to 10)

Posted by Eduardo Santos on 03/03/11 03:04 PM

Hi Gustaf,

Thank you for your answer. I've had this same tought as you, but I don't think this is the issue. I have 100, 50 and 50 connections por pool1, pool2 and pool3, and may maxthreads is setup to 35.

Checking the threads currently running on the system, I have the following list:

:throttle 1141053776 NS_THREAD_DETACHED 03:00:53 AM on 03/03/2011 NULL NULL ::bgdelivery 1164183888 NS_THREAD_DETACHED 08:15:22 AM on 03/03/2011 NULL NULL -sched:idle8- -sched- 1166289232 NS_OK 08:25:04 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x8 -sched:idle7- -sched- 1155791184 NS_OK 03:50:18 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x7 -sched:idle6- -sched- 1153685840 NS_OK 03:12:50 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x6 -sched:idle5- -sched- 1151580496 NS_OK 03:05:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x5 -sched:idle4- -sched- 1149475152 NS_OK 03:04:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x4 -sched:idle3- -sched- 1147369808 NS_OK 03:02:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x3 -sched:idle1- -sched- 1143159120 NS_OK 03:01:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x1 -sched:idle0- -sched- 1094764880 NS_OK 03:01:15 AM on 03/03/2011 p:0x7ff42aa5c760 NULL -sched:22- -sched- 1145264464 NS_OK 03:01:45 AM on 03/03/2011 p:0x7ff42aa5c760 a:0x2 -sched- -main- 1084238160 NS_OK 03:00:05 AM on 03/03/2011 p:0x7ff42aa5c970 NULL -nssock:driver- -main- 1130555728 NS_OK 03:00:52 AM on 03/03/2011 p:0x7ff42aa4beb0 a:0x6292b0 -main- 719976160 NS_THREAD_DETACHED 03:00:05 AM on 03/03/2011 NULL NULL -default:9- -main- 1107396944 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:8- -main- 1105291600 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:7- -main- 1103186256 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:6- -main- 1101080912 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:5- -main- 1098975568 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:4- -main- 1096870224 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:3- -main- 1092659536 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:20- -nssock:driver- 1132661072 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:2- -main- 1090554192 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:19- -main- 1128450384 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:18- -main- 1126345040 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:17- -main- 1124239696 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:16- -main- 1122134352 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:15- -main- 1120029008 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:14- -main- 1117923664 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:13- -main- 1115818320 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:12- -main- 1113712976 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:11- -main- 1111607632 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:10- -main- 1109502288 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread -default:0- -main- 1086343504 NS_OK 03:00:52 AM on 03/03/2011 ns:connthread 7482 200.198.213.70 running GET /_stats/threads.adp 0.7389 0

When the cache problem happens however, the DB queries seem to last forever:

[03/Mar/2011:10:32:57][31076.1166289232][-sched:9-] Warning: db_exec: longdb 928 seconds nsdb0 select dbqd.acs-mail-lite.tcl.acs-mail-lite-procs.acs_mail_lit e::sweeper.get_queued_messages

The problem doesn't happend with scheduled procs only. Regular DB calls show errors:

[03/Mar/2011:09:51:35][31076.1132661072][-default:20-] Warning: db_exec: longdb 929 seconds nsdb1 0or1row dbqd.acs-tcl.tcl.request-processor-procs.rp_lookup_ node_from_host.node_id [03/Mar/2011:09:51:35][31076.1132661072][-default:20-] Error: Error in include template "/var/www/ct-gcie/packages/xowiki/www/portlets/portlet-minimo-comunid ade": Database operation "0or1row" failed (exception ERROR, "could not receive data from server: Connection timed out ")

The problem with caching is: when it grows to the limit, the cache flush operation seems to last forever and the DB call to reload the cache goes to a timeout.

I still have no clue about the reasons for this behaviour.

12: Re: Strange problem with util_memoize flush (response to 11)

Posted by Gustaf Neumann on 03/04/11 02:26 PM

The symptoms on your system are that every database query suddenly takes 928 seconds! Unless you have millions of host-node-entries defined,

rp_lookup_

node_from_host.node_id

should take significantly less than one ms. Running out of db-ressources can explain the symptoms, running out of space in a cache of the aolserver not (caches are no supposed to keep all information put into it, they are supposed to keep the most important ones, once the space is not sufficient to keep all information).

What is the configuration value for max_connections in your postgres configuration?

13: Re: Strange problem with util_memoize flush (response to 12)

Posted by Eduardo Santos on 03/04/11 04:14 PM

Hi Gustaf,

I had also these same toughts. The thing is that I've already reviewed 100 times PostgreSQL configuration and I couldn't find any bottleneck. The machine is ok with resources, PostgreSQL is well tuned and we never get out of resources.

I've configured 200 in max_connections, but it never get more than 30 at the same time. I've double checked Warm Stand By config, shared_buffers, kernel shared memory, everyhing that I know and PostgreSQL seems to be ok.

It seems like a cache problem to me because it only happens when util_memoize cache goes full. In any other situation the queries take the regular amount of time.

In PostgreSQL log I keep seeing the same timeout messages:

2011-01-14 21:08:43 BRST LOG: não pôde receber dados do cliente: Tempo esgotado para conexão

It's a timeout message saying that the server got a EOF message from the client connection.

This problem is being very difficult to catch.

How do AOLServer implement connection pooling? Are they persistent connections? Is there a separated pool for cache renew queries?

I'm asking this because there can be some connection reset on the network structure based on connection time. Maybe this can be problem (I'm running out of ideas).

Gustaf, thank you very much for your time and your answers. They're being very helpfull.

14: Re: Strange problem with util_memoize flush (response to 13)

Posted by Gustaf Neumann on 03/05/11 02:01 PM

If i trust google translate, the message says: "could not receive data from client: Connection timed out" which clearly indicates a database problem (as the 900+ secs times).

Do you have MaxIdle and/or MaxOpen set for pool1 or pool2? If so, remove your settings.

Concerning your questions: ns_cache is completely agnostic about the database. The worker threads (e.g. connection threads) require db-handles (references to db-connections) via gethandle, they require in most cases a pool1 connection and sometimes a pool2 connection (for subqueries). The handles are released either explicitly (happens seldomly) or automatically when a request is done and the worker thread is done.

Forum OpenACS Development: Re: Strange problem with util_memoize flush