Forum OpenACS Development: Scalability in site node initialization routine

I ran into an issue today with the 5.1 version of site-nodes-init.tcl taking 40 minutes to run where the 5.0 version took 90 seconds.

The culprit appears to be the lsort in site_node::update_cache which is run each time the list of urls for an object_id grows.

This is slow because our .LRN has a 4674 urls, one per community, for the same instance of the attachments package.  The list of urls for this object was getting resorted thousands of times.  Ouch.

The fix is to move the sort outside loop, so that the url list for each object_id are only sorted once.

I'll commit this to 5.1 branch and HEAD if nobody objects.

Alternatively, we could consider whether the sort should be moved out to site_node::get_url_from_object_id or even further to consuming code to sort.

we are facing bad scalability problems with .LRN under 5.0 & 5.1, about 37465 site_nodes, and ALL the pages are quite slow (using PG). Does anyone has experienced this bad performance with such amount of site_nodes? so far seems that the problem isn't in the DB side...
Collapse
Posted by Malte Sussdorff on
It does not sound like the site nodes would be the problem, unless AOLserver runs out of cache. What does developer support say, especially as you say all pages are slow, the culprit should be easy to find, as some common ground should show up when comparing DS report on multiple pages.
Ok, DS always says that the DB queries takes usually less that half a second.
And all the pages takes the most of its time in this step (according to DS), examples of different requests:

+31.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/one-community.adp with adp_parse_ad_conn_file - 38167.7 ms

OR

+31.9 ms: Served file /var/www/migration_test2/packages/dotlrn/www/members.adp with adp_parse_ad_conn_file - 64022.8 ms

OR

+40.1 ms: Served file /var/www/migration_test2/packages/dotlrn/www/index.adp with adp_parse_ad_conn_file - 17027.9 ms

OR

+39.4 ms: Served file /var/www/migration_test2/packages/acs-admin/www/index.adp with adp_parse_ad_conn_file - 31766.9 ms

OR

+31.1 ms: Served file /var/www/migration_test2/packages/acs-lang/www/admin/index.adp with adp_parse_ad_conn_file - 6373.0 ms

whenever I hit a page, in the machine that has aolserver running, the processor gets almost 100% used, and like 512MB Ram is still free. The DB server never gets fully used, actually, is always returning the query request quite fast.

This is our configuration:
Aolserver (oacs): Dual Pentium III 1.4Ghz, 2GB ram, SCSI drives (dell blade server)
PG 7.4.1: Dual Xeon 2.8 Ghz 4GB ram, SCSI

Any suggestions?

Collapse
Posted by Tom Ayles on
When I was testing how a project would perform with large numbers of site nodes, I came across a similar problem. The way I went about tracing the issue was to stick a bunch of statements like the following:

ns_log Notice "[clock clicks -milliseconds] entered blah.tcl"

...into the tops of each main Tcl script and template script that got called in turn, so that I could get a rough idea of which Tcl file was taking the time from looking at the error log. I also wrote a little 10-line Perl script that could take these log statements and calculate the deltas - though that's more for convenience than anything (can send you the script if you want, let me know). Once I'd found which scripts were taking the time, I added more logging statements like that to track which Tcl calls were to blame until I isolated the problem. In my case (on OpenACS 5.0-based system), I tracked it down to site_node::get_url_from_object_id which used to scan through the entire site nodes nsv to get its result. A glance at current code makes me think this issue is fixed in 5.1 with an nsv set from object_id to url.

Collapse
Posted by Jeff Davis on
make sure you have vacuumed (vacuumdb -f -z -v DBNAME) and you should check that your pg buffers and sortmem are large enough and that you have upped the kernel shared mem limits (all I think covered in the standard docs iirc).

How large is the aolserver process? Have you checked if the machine is thrashing (with vmstat for example), I know it says 512mb is free but that might mean you are bumping into some per process memory limit.

Collapse
Posted by Don Baccus on
Oh, you might want to talk to Lars since he's the person who I believe made the site node changes.  These slowed down one of the procs horribly though I don't remember offhand which.

Or you might look at the change history to the file to see who else has been fiddling with that code.

It used to work well, obviously, so I'm not sure what the motivation for change was in the first place.

Collapse
Posted by Andrew Grumet on
I've committed the lsort move to site-nodes-procs.tcl on the 5.1 branch and on HEAD.
Collapse
Posted by Caroline Meeks on
I've also had curriculum bar in site-master cause scalability issues.
Collapse
Posted by Andrew Grumet on
Yes I'm seeing that too.  Short term it's commented it out.
Collapse
Posted by Ola Hansson on
Caroline, Andrew,

The "set curriculum_bar_p" statement in site-master.tcl has been cached on openacs-5-1-0-final, openacs-5-1-compat and HEAD.

Jeff wrote me (in private) and pointed out that I shouldn't use util_memoize in the page. I agree with that. He suggested a proc like so:

site_nodes::package_mounted_p package_name site_node

...which would return the memoized result for the curriculum package and other packages as well.

He said he would also add a flush in the mount to flush the memoized ones for a given package_name.

I haven't gotten around to do this yet, but it is my intention to follow his advice sooner or later.

I had the same problem, basically it finally call site_node::get_children, which when you have lot of nodes can be very slow (since it scan all the nodes when filters are defined: package_key curriculum), for +30K nodes takes about 5sec to serve any page....  has this fix done for oacs-5-1?
Collapse
Posted by Ola Hansson on
Yes, it is cached on oacs-5-1. However, the problem is that it still takes a long time on the first page hit.

I think (but haven't tested it to verify) that a small increase in speed to site_node::get_children could be achieved by changing site_node::get_from_url (and also get_url) and use something like this:


if { ![catch { nsv_get site_nodes $url } result] } {
    return $result
}

instead of this:
    if {[nsv_exists site_nodes $url]} {
        return [nsv_get site_nodes $url]
    }

It might help since I suppose nsv_exists would take almost as long to execute as nsv_get ... but that is only my assumption.
Collapse
Posted by Jeff Davis on
I timed this and your method is 3 usec/iter for one thats there and the old version is 4 usec/iter. If it's not there the old version is 2 and your new version is 9.

I think we should look elsewhere for optimizations as shaving off a microsecond or two here is not going to make a difference.

Collapse
Posted by Ola Hansson on
How about this, then? In site_node::get_children, at the top or somewhere more global, add this:
array set site_nodes_all [nsv_array get site_nodes]
.. then in the "filter loop" replace
array set site_node [get_from_url -exact -url $child_url]
with
if { [catch { array set site_node $site_nodes_all($child_url) } errmsg] } {
   error "$errmsg"
}
That may be quicker. Again, I haven't tested 😟
Collapse
Posted by Ola Hansson on
There could also be a switch added to site_node::get_children that when enabled would make the proc return "1" as soon as a child url had "passed" and "0" otherwise. That way you wouldn't have to wait for it to loop through hundreds or thousands of iterations in vain ...

I don't know what the switch would be called, though, or how beneficial you think it would be.

Collapse
Posted by Ola Hansson on
When I am running this command in the OpenACS shell...
time { site_node::get_children -all -package_key curriculum -node_id _main-site-node-id_ } 10
...I get the following timings on my Athlon XP2500 / 1 GB ram:

- 10 nodes
existing proc: 1700 micro seconds/iteration
changed proc: 800 micro seconds/iteration

- 1123 nodes
existing proc: 170000 micro seconds/iteration
changed proc: 85000 micro seconds/iteration

- 2233 nodes
existing proc: 340000 micro seconds/iteration
changed proc: 180000 micro seconds/iteration

So the speed is only doubled and the gain seems to be decreasing as the number of nodes increase. This doesn't seem too worthwhile but I don't know ...

One other thing I think generally slows down "get_childern" unnecessarily (although not in this case) is the repeted foreach loop for the child urls that passed the filter when you specify a specific element to be in the returned list instead of the default url. It would be rather easy to move (or copy) that part of the code and let it exist inside the "if $passed_p" clause in the main filter loop, I think.

Also, is it intentional that "-exact" is not being used in the call to "get_children" in the last foreach, or is it by misstake? I fail to see why this call should differ from the one in the filter loop, at least.

Collapse
Posted by Jeff Davis on
I committed I revised version of site_node::get_children to HEAD which is about 3 times faster on my local install (with about 100 nodes).

The big win was not doing the array set twice, and not copying the full list of child_urls unless you needed to. And in a somewhat gratuitous optimization, I got rid of the regexp to figure out if a node was an immediate child (which I think saved 2usec/node).

the -exact did not really matter since it tries an exact match first and since it got the names from the nsv in the first place there is always an exact match (although I guess if you deleted a node while this call was running you might get the wrong result there). And in any case I switched it to do a direct nsv_get.

Collapse
Posted by Ola Hansson on
Thank you Jeff!

With my 2233 nodes your version is around 3.4 times faster than the original.

Jeff, I just tested this with 32068 on oacs-5-1 and is fast enough (like if there were not so many nodes), can you commit this to oacs-5-1?
I will do further testing on this in the next days and keep you posted if we found more problems like this.
Collapse
Posted by Ola Hansson on
I have committed a new version of site-master.tcl to HEAD and oacs-5-1 which actually should be very fast.

I changed to a new check that uses a call to "site_node::get_package_url -package_key curriculum" to determine whether to include the bar or not. That proc has been changed by Lars so that it uses a site node-nsv keyed by package_key. Previously it used "get_children" internally.

Sadly, it looks like I made a misstake when I util_memoize'd "set curriculum_bar_p" in site-master.tcl ...

This works:

time { set curriculum_bar_p [llength [util_memoize [list site_node::get_children -all -package_key curriculum -element directory_p -node_id 405]]] } 10
This doesn't work (the result doesn't get cached):
time { set curriculum_bar_p [util_memoize [list llength [site_node::get_children -all -package_key curriculum -element directory_p -node_id 405]]] } 10
And it was the latter that was in site-master.tcl in cvs, so for everyone that uses .LRN and haven't commented out the statement, this explains the extreme slowness! But if you update your code you should be fine.

I am very sorry for the inconvenience.