Forum OpenACS Development: Site nodes scaling problem

Posted by Andrew Grumet on
We're running into some scaling issues on a .LRN site with a large number (>33,000) of site nodes.

Each call to site_node::update_cache takes about 10 seconds.  When creating a new .LRN class, this proc is called 22 times.  Creating a new class takes about 6 minutes during which the nsd process maxes out the CPU.

Most of the time in site_node::update_cache is spent in the four calls to "array set" at the top and in the four calls to "nsv_array reset" at the bottom.  This is not surprising given the size of the data structures involved.

The question that occurs to me is, do we have to deal with the entire site map atomically, or can we cache at the individual node/url level?

Looking back through old code, we actually did it this way up until late November 2003.  Only with r1.48 of site-nodes-procs.tcl did we begin to deal with the site map as a whole.  Timo's commit message says "populate site-nodes-cache by resolving urls in tcl instead of using the slow plsql".  I'm not clear yet on whether this change strictly requires whole-map caching instead of individual-url caching.  But from a scaling pov I think we want to go back to individual caching if possible.

Comments?

Posted by Don Baccus on
Did the code use a nsv_array before Timo's change, or did he just replace some PL/SQL with Tcl in the code that rebuilds the array?

We know that AOLserver 4 will help speed up the Tcl side, but 6 minutes divided by (say) a twofold performance boost is still 3 minutes.

As I commented in e-mail ... if we used regular nsvs with the URL as the key, rather than an array, we could modify the nsv for an individual URL atomically and also add new URLs to the map atomically without going through all this update_cache crap...
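
A minimal sketch of the per-URL idea above, not the actual OpenACS code. The nsv names (demo_node_id_by_url, demo_url_by_node_id), the helper procs, and the example URL and node id are hypothetical; only nsv_set, nsv_get, nsv_exists and nsv_unset are AOLserver commands. The shims at the top are an assumption added so the sketch can also be tried in plain tclsh.

```tcl
# Fallback shims (assumption): used only outside AOLserver, where the real
# nsv_* commands do not exist. Inside AOLserver this block is skipped.
if {[llength [info commands nsv_set]] == 0} {
    proc nsv_set {a k v} { set ::nsvshim_${a}($k) $v }
    proc nsv_get {a k} { set ::nsvshim_${a}($k) }
    proc nsv_exists {a k} { info exists ::nsvshim_${a}($k) }
    proc nsv_unset {a k} { unset ::nsvshim_${a}($k) }
}

# Hypothetical helper: publish one node without touching the rest of the map.
proc demo_cache_add {url node_id} {
    nsv_set demo_node_id_by_url $url $node_id
    nsv_set demo_url_by_node_id $node_id $url
}

# Hypothetical helper: drop one node from both lookups.
proc demo_cache_remove {url} {
    if {![nsv_exists demo_node_id_by_url $url]} { return }
    set node_id [nsv_get demo_node_id_by_url $url]
    nsv_unset demo_node_id_by_url $url
    nsv_unset demo_url_by_node_id $node_id
}

# Adding a node costs two single-key writes, regardless of map size.
demo_cache_add /classes/demo-101/ 4711
```

Each nsv_set is atomic on its own, but the pair of writes is not: another thread could briefly see the forward entry without the reverse one, so a real implementation would have to decide whether that window matters.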

So, what are the comparative costs of a nsv vs. nsv_array?

Maybe dossy's in IRC...

Posted by Andrew Piskorski on
An nsv is an nsv, there is no such thing as "a nsv_array".

Other than that, Don sounds exactly right. :)

Posted by Rocael Hernández Rizzardini on
Galileo is experiencing the same trouble with AOLserver 4. We have about 3.8k class instances and about 40k site nodes, and it's quite slow to create a new class instance.
Posted by Dirk Gomez on
Andrew, what happens if you back out those changes on your development instance that has loaded data?

What about populating the caches asynchronously, e.g. in another thread? Or caching lazily, i.e. only after a site node has actually been hit? Or just adding the new site nodes to the cache?
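
The lazy option could look roughly like the sketch below. It is an illustration under stated assumptions, not the site-nodes implementation: demo_db_lookup stands in for whatever database query resolves a URL, and the nsv name demo_lazy_nodes and the counter are made up. The shims are an assumption added so the sketch can also be tried in plain tclsh.

```tcl
# Fallback shims (assumption): used only outside AOLserver.
if {[llength [info commands nsv_set]] == 0} {
    proc nsv_set {a k v} { set ::nsvshim_${a}($k) $v }
    proc nsv_get {a k} { set ::nsvshim_${a}($k) }
    proc nsv_exists {a k} { info exists ::nsvshim_${a}($k) }
}

# Counts how often the (hypothetical) expensive lookup runs.
set ::demo_db_hits 0

# Hypothetical stand-in for the real database lookup of a URL.
proc demo_db_lookup {url} {
    incr ::demo_db_hits
    return "node-for-$url"
}

# Resolve a URL, filling the cache only on the first hit for that URL.
proc demo_node_for_url {url} {
    if {![nsv_exists demo_lazy_nodes $url]} {
        nsv_set demo_lazy_nodes $url [demo_db_lookup $url]
    }
    return [nsv_get demo_lazy_nodes $url]
}

demo_node_for_url /classes/demo-101/
demo_node_for_url /classes/demo-101/   ;# second call is served from the nsv
```

Two threads that miss at the same moment may both run the lookup; they would store the same value, so the cost is a duplicated query rather than a wrong cache entry.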

Posted by Jonathan Ellis on
nsv is quite fast -- not as fast as a native tcl array for lookup and setting but it's going to be very significantly faster than an array set approach for large structures.  array set is O(N) for N keys, after all.
Posted by Jonathan Ellis on
However, there is a command, nsv_array, that provides for nsvs an API similar to what the Tcl array command provides for built-in Tcl arrays.  E.g., "nsv_array names my_nsv_name".  I assume that is what Don was referring to...
Posted by Ola Hansson on
If most of the time is spent in the four calls at the top and the four calls at the bottom, how about not calling them at all, and thus avoiding copying the data structures (call by value, I suppose)? For instance, it should be perfectly possible to call "nsv_array get site_node_url_by_node_id $node_id" inline to get the url for a node, and quite quickly at that (about the same speed as retrieving it from a copy in a native Tcl array, is my guess).
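
A small sketch contrasting the two read paths. As far as I know, nsv_array get returns the whole key/value list (optionally filtered by a pattern), so the single-key read is usually written with nsv_get. The nsv name demo_url_by_id and its contents are hypothetical, and the shims are an assumption added so the sketch can also be tried in plain tclsh.

```tcl
# Fallback shims (assumption): used only outside AOLserver. The nsv_array
# shim supports just the "get" subcommand needed here.
if {[llength [info commands nsv_get]] == 0} {
    proc nsv_set {a k v} { set ::nsvshim_${a}($k) $v }
    proc nsv_get {a k} { set ::nsvshim_${a}($k) }
    proc nsv_array {sub a} {
        if {$sub ne "get"} { error "shim supports only 'get'" }
        array get ::nsvshim_$a
    }
}

# Populate a hypothetical nsv with a few node_id -> url entries.
foreach {id url} {1 / 2 /classes/ 3 /classes/demo-101/} {
    nsv_set demo_url_by_id $id $url
}

# Whole-map copy: every entry is copied into a local array first,
# so the cost grows with the number of site nodes.
array set local_copy [nsv_array get demo_url_by_id]
set via_copy $local_copy(3)

# Single-key read: only the one value is fetched.
set via_nsv [nsv_get demo_url_by_id 3]
```

Both paths return the same url; the difference is that the first one pays for all entries on every call, which is the cost described at the top of the thread.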

As for updating the cache(s) it ought to be possible to do that inline as well, in a surgical fashion, as opposed to bulk over-writing the old nsv_arrays with the updated Tcl arrays. (I think!)

Another thing which struck me: Assuming I'm right in that the data structures (nsv_array and Tcl array) are copied "by value", is it then possible to do that "by reference" instead?

Posted by Andrew Grumet on
There's an update on the oacs-5-1 branch that does targeted cache writes for individual nodes in site_node::new and site_node::mount, which is where most of the delay comes from.

I don't think we've seen the end of site node related scaling fixes but this solves some important ones.

Posted by Rocael Hernández Rizzardini on
<blockquote>I don't think we've seen the end of site node related scaling fixes but this solves some important ones.
</blockquote>

True, site nodes need more work ... the curriculum bar should be removed from dotlrn-master.tcl, since it performs really badly on any site with more than 20k site nodes, even if you don't have the package installed...