Forum OpenACS Development: Site nodes scaling problem

Posted by Andrew Grumet on 08/27/04 06:42 PM

We're running into some scaling issues on a .LRN site with a large number (>33,000) of site nodes.

Each call to site_node::update_cache takes about 10 seconds. When creating a new .LRN class, this proc is called 22 times. Creating a new class takes about 6 minutes during which the nsd process maxes out the CPU.

Most of the time in site_node::update_cache is spent in the four calls to "array set" at the top and in the four calls to "nsv_array reset" at the bottom. This is not surprising given the size of the data structures involved.

The question that occurs to me is, do we have to deal with the entire site map atomically, or can we cache at the individual node/url level?
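To make the trade-off concrete, here is a rough Python sketch (not OpenACS code) that counts how many cache entries each strategy writes when a single node is added. Plain dicts stand in for the shared site map, and the names `rebuild_whole_map` and `update_one_node` are made up for illustration.

```python
def rebuild_whole_map(shared, nodes):
    """Whole-map strategy: replace every entry in the shared cache.

    Returns the number of entries written, which grows with the map size.
    """
    shared.clear()
    writes = 0
    for url, node_id in nodes.items():
        shared[url] = node_id
        writes += 1
    return writes


def update_one_node(shared, url, node_id):
    """Per-node strategy: write only the entry that changed."""
    shared[url] = node_id
    return 1


# Hypothetical site map of 33,000 nodes, matching the size mentioned above.
nodes = {f"/class-{i}/": i for i in range(33000)}
shared = {}
rebuild_whole_map(shared, nodes)

# Adding one node: the whole-map approach rewrites everything again,
# while the per-node approach touches a single entry.
nodes["/class-new/"] = 33000
whole_map_writes = rebuild_whole_map(shared, nodes)
single_writes = update_one_node(shared, "/class-new/", 33000)
```

The whole-map cost scales with the number of site nodes, while the per-node cost stays constant, which is the scaling argument for individual caching.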

Looking back through old code, we actually did it this way up until late November 2003. Only with r1.48 of site-nodes-procs.tcl did we begin to deal with the site map as a whole. Timo's commit message says "populate site-nodes-cache by resolving urls in tcl instead of using the slow plsql". I'm not clear yet on whether this change strictly requires whole-map caching instead of individual-url caching. But from a scaling pov I think we want to go back to individual caching if possible.

Comments?

2: Re: Site nodes scaling problem (response to 1)

Posted by Don Baccus on 08/27/04 06:51 PM

Did the code use a nsv_array before Timo's change, or did he just replace some PL/SQL with Tcl in the code that rebuilds the array?

We know that AOLserver 4 will help speed up the Tcl side, but 6 minutes divided by (say) a twofold performance boost is still 3 minutes.

As I commented in e-mail ... if we used regular nsvs with the URL as the key, rather than an array, we could modify the nsv for an individual URL atomically and also add new URLs to the map atomically without going through all this update_cache crap...
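A rough Python sketch of the URL-as-key idea, not OpenACS or AOLserver code: the class `UrlKeyedCache` and the example URLs are hypothetical, and a lock-protected dict stands in for the shared store. Each write touches exactly one entry, so adding or changing a URL needs no rebuild of the whole map.

```python
import threading


class UrlKeyedCache:
    """Hypothetical stand-in for a shared store keyed by URL."""

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}

    def set(self, url, node):
        # Only this one entry is touched; the lock makes the write atomic
        # with respect to other threads using this cache.
        with self._lock:
            self._nodes[url] = node

    def get(self, url, default=None):
        with self._lock:
            return self._nodes.get(url, default)


cache = UrlKeyedCache()
cache.set("/dotlrn/classes/", {"node_id": 101})

# Adding or changing a URL never requires touching the other entries.
cache.set("/dotlrn/classes/new-class/", {"node_id": 102})
cache.set("/dotlrn/classes/", {"node_id": 101, "name": "classes"})
```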

So, what are the comparative costs of a nsv vs. nsv_array?

Maybe dossy's in IRC...

6: Re: Site nodes scaling problem (response to 2)

Posted by Andrew Piskorski on 08/28/04 10:53 PM

An nsv is an nsv, there is no such thing as "a nsv_array".

Other than that, Don sounds exactly right. :)

3: Re: Site nodes scaling problem (response to 1)

Posted by Rocael Hernández Rizzardini on 08/27/04 07:18 PM

Galileo is experiencing the same troubles with aol4 ....
We have about 3.8k class instances and about 40k site nodes ... it's quite slow to create a new class instance ...

4: Re: Site nodes scaling problem (response to 1)

Posted by Dirk Gomez on 08/27/04 09:15 PM

Andrew, what happens if you back out those changes on your development instance that has loaded data?

What about populating the caches asynchronously, e.g. in another thread? Or caching lazily, i.e. only after a site node has actually been hit? Or just adding the new site nodes to the cache?
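The lazy option and the add-only-new-nodes option could look roughly like the Python sketch below. It is an illustration only, not OpenACS code: `LazyNodeCache` and `fake_db_lookup` are hypothetical names, and the lookup function stands in for whatever query resolves a URL to a site node.

```python
class LazyNodeCache:
    """Caches a node only after it has been requested at least once."""

    def __init__(self, load_node):
        self._load_node = load_node  # callable: url -> node (hypothetical)
        self._nodes = {}
        self.loads = 0  # how many times the slow lookup actually ran

    def get(self, url):
        # Cache miss: run the slow lookup once and remember the result.
        if url not in self._nodes:
            self._nodes[url] = self._load_node(url)
            self.loads += 1
        return self._nodes[url]

    def add(self, url, node):
        # A newly created node is written directly, with no rebuild.
        self._nodes[url] = node


def fake_db_lookup(url):
    """Stand-in for the real database lookup (an assumption)."""
    return {"url": url, "node_id": len(url)}


cache = LazyNodeCache(fake_db_lookup)
first = cache.get("/dotlrn/")    # miss: triggers the lookup
second = cache.get("/dotlrn/")   # hit: served from the cache

# A new class adds its own node without touching existing entries.
cache.add("/dotlrn/new-class/", {"url": "/dotlrn/new-class/", "node_id": 999})
third = cache.get("/dotlrn/new-class/")
```

With lazy population the startup cost disappears, at the price of a slower first hit per URL.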

5: Re: Site nodes scaling problem (response to 1)

Posted by Jonathan Ellis on 08/28/04 04:24 PM

nsv is quite fast -- not as fast as a native tcl array for lookup and setting but it's going to be very significantly faster than an array set approach for large structures. array set is O(N) for N keys, after all.

7: Re: Site nodes scaling problem (response to 1)

Posted by Jonathan Ellis on 08/29/04 01:32 AM

However, there is a command, nsv_array, that provides for nsvs an API similar to what the Tcl array command provides for built-in Tcl arrays. E.g., "nsv_array names my_nsv_name". I assume that is what Don was referring to...

8: Re: Site nodes scaling problem (response to 1)

Posted by Ola Hansson on 08/29/04 12:33 PM

If most of the time is spent in the four calls at the top and the four calls at the bottom, how about refraining from calling them and thus avoiding copying the data structures (call by value, I suppose)? For instance, it should be perfectly possible to call "nsv_array get site_node_url_by_node_id $node_id" inline to get the url for a node, and quite quickly at that (about the same speed as retrieving it from a copy in a native Tcl array is my guess).

As for updating the cache(s) it ought to be possible to do that inline as well, in a surgical fashion, as opposed to bulk over-writing the old nsv_arrays with the updated Tcl arrays. (I think!)

Another thing which struck me: Assuming I'm right in that the data structures (nsv_array and Tcl array) are copied "by value", is it then possible to do that "by reference" instead?

9: Re: Site nodes scaling problem (response to 1)

Posted by Andrew Grumet on 08/29/04 10:59 PM

There's an update on the oacs-5-1 branch that does targeted cache writes for individual nodes in site_node::new and site_node::mount, which is where most of the delay comes from.

I don't think we've seen the end of site node related scaling fixes but this solves some important ones.

10: Re: Site nodes scaling problem (response to 1)

Posted by Rocael Hernández Rizzardini on 08/30/04 05:55 AM

> I don't think we've seen the end of site node related scaling fixes but this solves some important ones.

True, site nodes need more work ... the curriculum bar should be removed from dotlrn-master.tcl, since it causes really bad performance on any site with more than 20k site nodes, even if you don't have the package installed...