Forum .LRN Q&A: Cluster and performance

Posted by Jose Agustin Lopez Bueno on 02/17/04 06:38 PM

Hello, all!

Continuing our performance tests with dotLRN,
we are seeing a large drop in performance
as more classes are added to the site.
With 2000 classes and subgroups, it takes
5 seconds to navigate between pages.

We need to add more classes. With those
response times the system will be unusable.
Is it possible to use the cluster referenced in
the Kernel parameters?
Is anybody using it? Is there any documentation?

(We are using:
2 GB RAM, dual-processor Opteron,
AOLserver 4.0, dotlrn-2.0.0rc1, postgresql-7.4.1)

Regards,
Agustin

2: Re: Cluster and performance (response to 1)

Posted by Dirk Gomez on 02/17/04 07:58 PM

You should first analyze your performance problem. From your description, the database is more likely to be the bottleneck than AOLserver.

There's a page in the docs somewhere that describes how to set up acs-developer-support, which will help you gather statistics.

3: Re: Cluster and performance (response to 1)

Posted by Caroline Meeks on 02/17/04 08:05 PM

Let me echo Dirk. Please use developer support to analyze which queries are taking the most time. Then you can think about query optimization and caching strategies as well as adding more hardware. There are a number of different caching strategies that have been used in dotLRN and for Greenpeace.

Developer support is easy to install from your "install software" page. One warning: don't turn on "User Switching" if you have more than 50 or so users in your database.

4: Re: Cluster and performance (response to 3)

Posted by Joel Aufrecht on 02/17/04 10:36 PM

Diagnosing Performance Problems: https://openacs.org/doc/openacs-HEAD/maint-performance.html

5: Re: Cluster and performance (response to 1)

Posted by Jose Agustin Lopez Bueno on 02/18/04 08:19 AM

Hello again!

Sorry. I have installed developer support and
the bottleneck is not Postgres. It is OpenACS
(or dotLRN).

For example, the page that shows the users
of one class takes 20 seconds. If I comment out the code
that assigns a URL to the first name and a URL
to the last name of every user, the page loads in 2
seconds.

We have detected two more problems like this.
But we cannot get the load time below 3 - 5
seconds when the number of classes is greater
than 2000.

Any pointers, please?

Regards,
Agustin

6: Re: Cluster and performance (response to 5)

Posted by Dirk Gomez on 02/18/04 09:50 AM

What is the name of the function you commented out, and which file is it in?

What are the two other problems?

7: Re: Cluster and performance (response to 1)

Posted by Jose Agustin Lopez Bueno on 02/18/04 04:21 PM

Hello again!

At last I have found where the problem is:

The function is
site_node::get_url_from_object_id
in
packages/acs-tcl/tcl/site-nodes-procs.tcl
(the source is below)

This function is VERY, VERY slow and inefficient
when the number of objects is large.
Its behavior affects the whole application.

If you can give me a pointer on how to modify the
code, please email me.

Source code:

***************************************************
site_node::get_url_from_object_id__arg_parser

set sort [list]
foreach url [nsv_array names site_nodes] {
    ns_log Notice $url
    lappend sort [list $url [string length $url]]
}

set sorted [lsort -index 1 $sort]

foreach elm $sorted {
    set url [lindex $elm 0]
    array unset site_node
    array set site_node [site_node::get_from_url \
        -url $url]
    if { $site_node(object_id) == $object_id } {
        return $url
    }
}

return {}

return [db_list select_url_from_object_id {}]
***************************************************

Postscript: What does the second return do? Or is it a mistake?

Best regards,
Agustin

8: Re: Cluster and performance (response to 1)

Posted by Don Baccus on 02/18/04 08:48 PM

The second return is a mistake and the entire procedure is HORRIBLY written, thanks for uncovering it ...

9: Re: Cluster and performance (response to 1)

Posted by Don Baccus on 02/18/04 09:39 PM

OK, I've looked into the history of the proc. It appears to be the result of a misguided attempt to speed up the original version which queried the database to get the data.

Unfortunately this replaced the index-driven db query - O(log2(n)) - with a linear search of the cached site node map - O(n). The rewrite also changed the semantics of the proc, without changing the documentation. And of course the original return was left in, dangling after the return {}.

Though the rewrite is probably faster for site node maps of modest size (since it is done directly in Tcl rather than via a db query), eventually O(n) loses to O(log2(n)) every time, as n grows. Apparently you've shown that when n is "several thousand" the rewrite loses quite badly :)
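To make the complexity argument concrete, here is a minimal sketch in plain Tcl. It is only an illustration, not the fix that went into OpenACS. The array node_map and the procs url_by_scan, build_reverse_index and url_by_index are hypothetical names, and node_map is a simplified stand-in for the cached site node map (URL to object_id only). It contrasts a scan over every URL with a lookup through a reverse index that is built once.

```tcl
# Hypothetical stand-in for the cached site node map: url -> object_id.
array set node_map {
    /                      100
    /dotlrn/               200
    /dotlrn/classes/       300
    /dotlrn/classes/math/  400
    /dotlrn/clubs/math/    400
}

# Linear scan: every lookup visits every URL in the map, so the cost
# grows in proportion to the number of site nodes.
proc url_by_scan {object_id} {
    global node_map
    set matches [list]
    foreach url [array names node_map] {
        if {$node_map($url) == $object_id} {
            lappend matches $url
        }
    }
    return [lsort -decreasing $matches]
}

# Reverse index, built once: object_id -> list of URLs.
proc build_reverse_index {} {
    global node_map urls_by_object
    array unset urls_by_object
    foreach url [array names node_map] {
        lappend urls_by_object($node_map($url)) $url
    }
}

# Indexed lookup: a single array access, whatever the size of the map.
proc url_by_index {object_id} {
    global urls_by_object
    if {![info exists urls_by_object($object_id)]} {
        return [list]
    }
    return [lsort -decreasing $urls_by_object($object_id)]
}

build_reverse_index
puts [url_by_scan 400]
puts [url_by_index 400]
```

Both procs return the same URLs. The catch with an in-memory index like this is that it would have to be updated whenever a node is mounted or unmounted, which the database index handles for free.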

I've contacted the author of the rewrite and we'll work out a fix of some sort.

Thanks for your detective work!

10: Re: Cluster and performance (response to 1)

Posted by Jose Agustin Lopez Bueno on 02/19/04 09:10 AM

Hello, Don!

Thanks for the quick answer!

I have replaced the code of the proc with the
old version (see below) and performance has improved
greatly. Some pages went from a load time of 20 seconds
to 2-3 seconds.

I hope you will change the code in the official release.

Thanks very much!
Agustin

*************************************************
ad_proc -public site_node::get_url_from_object_id {
    {-object_id:required}
} {
    returns a list of urls for site_nodes that have the given object
    mounted or the empty list if there are none. The
    url:s will be returned in descending order meaning any children will
    come before their parents. This ordering is useful when deleting site nodes
    as we must delete child site nodes before their parents.
} {
    return [db_list select_url_from_object_id {}]
}
*************************************************
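For anyone who wants to see how a scan-based lookup scales before swapping procs on a live site, Tcl's built-in time command is enough. The sketch below is an assumption-laden toy: timing_map, make_map and scan_lookup are hypothetical names, the data is synthetic, and the real proc did more work per URL (an ns_log call and a site_node::get_from_url call), so absolute numbers will not match a real installation. Only the growth with the number of nodes is of interest.

```tcl
# Build a synthetic url -> object_id map with n entries (made-up data).
proc make_map {n} {
    global timing_map
    array unset timing_map
    for {set i 0} {$i < $n} {incr i} {
        set timing_map(/class-$i/) $i
    }
}

# Linear scan over every URL, in the spirit of the slow version.
proc scan_lookup {object_id} {
    global timing_map
    foreach url [array names timing_map] {
        if {$timing_map($url) == $object_id} {
            return $url
        }
    }
    return {}
}

foreach n {200 2000 20000} {
    make_map $n
    # Look up an id that is not in the map, so the scan visits all n entries.
    set result [time {scan_lookup -1} 20]
    puts "$n nodes: $result"
}
```

Each line of output reports the average microseconds per call for that map size, so the effect of adding more classes can be read off directly.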

11: Re: Cluster and performance (response to 1)

Posted by Caroline Meeks on 03/19/04 07:19 PM

Hi,

I've also just loaded data into dotLRN and I'm tracking down performance issues.

Has this fix made it to CVS yet? I'm working on a code base that is a few months old, so I am trying to find the specific files to upgrade, and I'm not having much luck. Agustin, do you happen to have a patch you can email?

thanks
Caroline

12: Re: Cluster and performance (response to 11)

Posted by Tracy Adams on 03/26/04 08:49 PM

Caroline/Don/Jose,

I'm about to cut .LRN 2.0.3 and want to make sure critical items get in it.

From reading the thread above:
-- It looks like the root of the problem is in OpenACS core code?
-- Has this been fixed in the central repository?
-- If it is fixed, what release will it appear in? (I'm guessing the next release - 5.1)

Thank you,
Tracy