Forum OpenACS Development: Re: Scalability in site node initialization routine

We are facing bad scalability problems with .LRN under 5.0 and 5.1. We have about 37465 site_nodes, and ALL the pages are quite slow (using PG). Has anyone experienced this kind of bad performance with this many site_nodes? So far it seems that the problem isn't on the DB side...
Posted by Malte Sussdorff on
It does not sound like the site nodes are the problem, unless AOLserver runs out of cache. What does developer support say? Since you say all pages are slow, the culprit should be easy to find: some common ground should show up when comparing the DS reports for multiple pages.
OK, DS always says that the DB queries usually take less than half a second.
And all the pages spend most of their time in this step (according to DS). Examples from different requests:

+31.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/one-community.adp with adp_parse_ad_conn_file - 38167.7 ms

OR

+31.9 ms: Served file /var/www/migration_test2/packages/dotlrn/www/members.adp with adp_parse_ad_conn_file - 64022.8 ms

OR

+40.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/index.adp with adp_parse_ad_conn_file - 17027.9 ms

OR

+39.4 ms: Served file /var/www/migration_test2/packages/acs-admin/www/index.adp with adp_parse_ad_conn_file - 31766.9 ms

OR

+31.1 ms: Served file /var/www/migration_test2/packages/acs-lang/www/admin/index.adp with adp_parse_ad_conn_file - 6373.0 ms

Whenever I hit a page, the processor on the machine running AOLserver goes to almost 100% usage, while about 512MB of RAM is still free. The DB server never gets fully used; in fact, it always returns the query results quite fast.

This is our configuration:
AOLserver (oacs): Dual Pentium III 1.4 GHz, 2GB RAM, SCSI drives (Dell blade server)
PG 7.4.1: Dual Xeon 2.8 GHz, 4GB RAM, SCSI

Any suggestions?

Posted by Tom Ayles on
When I was testing how a project would perform with large numbers of site nodes, I came across a similar problem. The way I went about tracing the issue was to stick a bunch of statements like the following:

ns_log Notice "[clock clicks -milliseconds] entered blah.tcl"

...into the top of each main Tcl script and template script that got called in turn, so that I could get a rough idea from the error log of which Tcl file was taking the time. I also wrote a little 10-line Perl script that takes these log statements and calculates the deltas, though that's more for convenience than anything (I can send you the script if you want, let me know). Once I'd found which scripts were taking the time, I added more logging statements like that to track which Tcl calls were to blame, until I isolated the problem. In my case (on an OpenACS 5.0-based system), I tracked it down to site_node::get_url_from_object_id, which used to scan through the entire site nodes nsv to get its result. A glance at the current code makes me think this issue is fixed in 5.1 with an nsv set that maps object_id to url.
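
For anyone who wants to try the same approach, here is a rough sketch of the delta calculation in Python. It is not the Perl script mentioned above, just an illustration of the idea. It assumes only that each marker line contains the "<milliseconds> entered <script>" text written by the ns_log statement; the function name `deltas`, the regex, and the sample lines (including their prefix) are made up for the example.

```python
import re

# Matches the message part produced by the ns_log statement:
# "<milliseconds> entered <script>". Any prefix the server adds is ignored.
MARKER = re.compile(r"(-?\d+) entered (\S+)")

def deltas(lines):
    """Return (script, milliseconds until the next marker) pairs in log order."""
    marks = []
    for line in lines:
        m = MARKER.search(line)
        if m:
            marks.append((int(m.group(1)), m.group(2)))
    # Time attributed to a script = next marker's timestamp minus its own.
    return [(name, nxt - t) for (t, name), (nxt, _) in zip(marks, marks[1:])]

# Hypothetical sample lines; the real error log prefix will look different.
sample = [
    "Notice: 1000 entered index.tcl",
    "Notice: 1040 entered one-community.tcl",
    "Error: something unrelated",
    "Notice: 39200 entered members.tcl",
]

for name, ms in deltas(sample):
    print(f"{ms:>8} ms  {name}")
```

The last marker gets no delta because nothing follows it. The sketch also assumes one request at a time; with concurrent requests the markers interleave, so you would need to group lines by connection thread first.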

Posted by Jeff Davis on
Make sure you have vacuumed (vacuumdb -f -z -v DBNAME). You should also check that your PG buffers and sort_mem are large enough and that you have upped the kernel shared memory limits (all covered in the standard docs, IIRC).

How large is the AOLserver process? Have you checked whether the machine is thrashing (with vmstat, for example)? I know it says 512MB is free, but that might mean you are bumping into some per-process memory limit.