Forum OpenACS Development: Re: Scalability in site node initialization routine

OK, DS always reports that the DB queries usually take less than half a second.
All the pages spend most of their time in this step (according to DS). Examples from different requests:

+31.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/one-community.adp with adp_parse_ad_conn_file - 38167.7 ms

OR

+31.9 ms: Served file /var/www/migration_test2/packages/dotlrn/www/members.adp with adp_parse_ad_conn_file - 64022.8 ms

OR

+40.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/index.adp with adp_parse_ad_conn_file - 17027.9 ms

OR

+39.4 ms: Served file /var/www/migration_test2/packages/acs-admin/www/index.adp with adp_parse_ad_conn_file - 31766.9 ms

OR

+31.1 ms: Served file /var/www/migration_test2/packages/acs-lang/www/admin/index.adp with adp_parse_ad_conn_file - 6373.0 ms

Whenever I hit a page, the processor on the machine running AOLserver goes to almost 100% usage, while about 512 MB of RAM is still free. The DB server never gets fully used; in fact, it always returns query results quite fast.

This is our configuration:
AOLserver (OpenACS): Dual Pentium III 1.4 GHz, 2 GB RAM, SCSI drives (Dell blade server)
PG 7.4.1: Dual Xeon 2.8 GHz, 4 GB RAM, SCSI

Any suggestions?

Posted by Tom Ayles on
When I was testing how a project would perform with large numbers of site nodes, I came across a similar problem. The way I went about tracing the issue was to stick a bunch of statements like the following:

ns_log Notice "[clock clicks -milliseconds] entered blah.tcl"

...into the top of each main Tcl script and template script that got called in turn, so that I could get a rough idea from the error log of which Tcl file was taking the time. I also wrote a little 10-line Perl script that takes these log statements and calculates the deltas, though that's more for convenience than anything (I can send you the script if you want, let me know). Once I'd found which scripts were taking the time, I added more logging statements like that to track which Tcl calls were to blame, until I had isolated the problem. In my case (on an OpenACS 5.0-based system), I tracked it down to site_node::get_url_from_object_id, which used to scan through the entire site nodes nsv to get its result. A glance at the current code makes me think this issue is fixed in 5.1 with an nsv set mapping object_id to url.
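For anyone who wants to compute the deltas without the Perl script, here is a minimal sketch of the same idea in Python. It is not the script mentioned above. It assumes each marker ends up in the error log as a line containing "<milliseconds> entered <script name>", ignores whatever prefix the server puts in front of the message, and the name deltas is made up for illustration.

```python
import re

# Matches the message produced by:
#   ns_log Notice "[clock clicks -milliseconds] entered blah.tcl"
# Assumption: the timestamp is an integer immediately followed by " entered ".
ENTRY = re.compile(r"(-?\d+) entered (\S+)")


def deltas(lines):
    """Return (script, elapsed_ms) pairs.

    The elapsed time for a marker is the gap until the next marker,
    so the last marker in the log gets no entry.
    """
    marks = []
    for line in lines:
        m = ENTRY.search(line)
        if m:
            marks.append((int(m.group(1)), m.group(2)))
    # Pair each marker with the one that follows it.
    return [(name, nxt - ts) for (ts, name), (nxt, _) in zip(marks, marks[1:])]
```

Feeding it the lines of the error log (for example an open file object) gives one pair per marker. Note that it treats the log as a single sequence, so markers from concurrent requests will interleave and skew the numbers; it is best used while testing with one request at a time.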

Posted by Jeff Davis on
Make sure you have vacuumed (vacuumdb -f -z -v DBNAME). You should also check that your pg buffers and sortmem are large enough and that you have raised the kernel shared memory limits (all covered in the standard docs, I think, iirc).

How large is the AOLserver process? Have you checked whether the machine is thrashing (with vmstat, for example)? I know it says 512 MB is free, but that might mean you are bumping into some per-process memory limit.