Forum .LRN Q&A: Cluster and performance

Posted by Jose Agustin Lopez Bueno on
Hello, all!

Continuing our performance tests with dotLRN,
we are seeing a large drop in performance
as more classes are added to the site.
With 2000 classes and subgroups, it takes
5 seconds to navigate between pages.

We need to add more classes. With those
response times the system will be unusable.
Is it possible to use the cluster referenced in
the Kernel parameters?
Is anybody using it? Is there any documentation?

(We are using:
2 GB RAM, Opteron Dual Processor,
aolserver 4-0, dotlrn-2.0.0rc1, postgresql-7.4.1)

Regards,
Agustin

Posted by Dirk Gomez on
You should first analyze your performance problem. From your description it is more likely that the database is the performance bottleneck, not AOLserver.

There's a page in the docs somewhere that describes how to set up acs-developer-support, which will help you gather statistics.

Posted by Caroline Meeks on
Let me echo Dirk. Please use developer support to analyze which queries are taking the most time; then you can think about query optimization and caching strategies as well as adding more hardware. There are a number of different caching strategies that have been used in dotLRN and for Greenpeace.

Developer support is easy to install from your "install software" page. One warning: Don't turn on "User Switching" if you have more than 50 or so users in your database.
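
As a rough illustration of the caching idea mentioned above, here is a minimal sketch in plain Tcl. It is not one of the strategies actually used in dotLRN or for Greenpeace; it only shows the general pattern of storing the result of an expensive call, keyed by its argument, so that repeated requests skip the slow work. The names `slow_lookup` and `cached_lookup` are hypothetical.

```tcl
# Hypothetical expensive operation; the counter records how often it really runs.
set slow_calls 0
proc slow_lookup {key} {
    global slow_calls
    incr slow_calls
    # Stand-in for a slow query or computation.
    return "value-for-$key"
}

# Cache wrapper: computes each key once and keeps the result in an array.
proc cached_lookup {key} {
    global cache
    if {![info exists cache($key)]} {
        set cache($key) [slow_lookup $key]
    }
    return $cache($key)
}

cached_lookup class-1
cached_lookup class-1
cached_lookup class-2
puts "slow_lookup ran $slow_calls times for 3 requests"
```

A real cache also needs a way to expire or flush entries when the underlying data changes, which this sketch leaves out.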

Posted by Joel Aufrecht on
Posted by Jose Agustin Lopez Bueno on
Hello again!

Sorry. I have installed developer support, and
the bottleneck is not Postgres. It is OpenACS
(or dotLRN).

For example, the page that shows the users
of one class takes 20 seconds. If I comment out the code
that assigns a URL to the first name and a URL
to the last name of every user, the page loads in 2
seconds.

We have detected two more problems like this.
But we cannot reduce the load time of 3 to 5
seconds when the number of classes is greater
than 2000.

Any pointers, please?

Regards,
Agustin

Posted by Dirk Gomez on
What is the name of the function you commented out, and which file is it in?

What are the two other problems?

Posted by Jose Agustin Lopez Bueno on
Hello again!

At last I have found where the problem is:

The function is
  site_node::get_url_from_object_id
in
  packages/acs-tcl/tcl/site-nodes-procs.tcl
  (the source is below)

This function is VERY, VERY slow and inefficient
when the number of objects is large.
Its behavior affects the whole application.

If you can give me a pointer on how to modify the
code, please email me.

Source code:

***************************************************
    site_node::get_url_from_object_id__arg_parser

    set sort [list]
    foreach url [nsv_array names site_nodes] {
        ns_log Notice $url
        lappend sort [list $url [string length $url]]
    }

    set sorted [lsort -index 1 $sort]

    foreach elm $sorted {
        set url [lindex $elm 0]
        array unset site_node
        array set site_node [site_node::get_from_url \
          -url $url]
        if { $site_node(object_id) == $object_id } {
            return $url
        }
    }

    return {}

    return [db_list select_url_from_object_id {}]
***************************************************

P.S.: What does the second return do? Or is it a mistake?

Best regards,
Agustin

Posted by Don Baccus on
The second return is a mistake and the entire procedure is HORRIBLY written. Thanks for uncovering it ...
Posted by Don Baccus on
OK, I've looked into the history of the proc.  It appears to be the result of a misguided attempt to speed up the original version which queried the database to get the data.

Unfortunately this replaced the index-driven db query - O(log2(n)) - with a linear search of the cached site node map - O(n).  The rewrite also changed the semantics of the proc, without changing the documentation.  And of course the original return was left in, dangling after the return {}.

Though the rewrite is probably faster for site node maps of modest size (since it is done directly in Tcl rather than via a db query), eventually O(n) loses to O(log2(n)) every time, as n grows.  Apparently you've shown that when n is "several thousand" the rewrite loses quite badly :)
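
To make the scaling argument concrete, here is a minimal plain-Tcl sketch. It is not the OpenACS code and not the eventual fix. It assumes a hypothetical in-memory map `site_nodes` from URL to object_id (a plain array standing in for the cached site node map) and contrasts a linear scan with a hash-based reverse index keyed by object_id. The reverse index is a different technique from the index-driven db query described above, but it makes the same point: a lookup that does not walk every node stays fast as n grows. The proc names `find_url_linear`, `build_reverse_index`, and `find_urls_indexed` are invented for illustration.

```tcl
# Hypothetical stand-in for the cached site node map: url -> object_id.
# A plain array is used so the sketch runs in a bare tclsh.
array set site_nodes {
    /                 100
    /dotlrn/          200
    /dotlrn/classes/  300
    /dotlrn/clubs/    400
}

# Linear scan: examines URLs one by one until one matches, so the cost
# grows in proportion to the number of site nodes.
proc find_url_linear {object_id} {
    global site_nodes
    foreach url [lsort [array names site_nodes]] {
        if {$site_nodes($url) == $object_id} {
            return $url
        }
    }
    return {}
}

# Reverse index: object_id -> list of urls, built once in a single pass.
proc build_reverse_index {} {
    global site_nodes urls_by_object
    array unset urls_by_object
    foreach url [array names site_nodes] {
        lappend urls_by_object($site_nodes($url)) $url
    }
}

# Indexed lookup: a single array probe, independent of the node count.
proc find_urls_indexed {object_id} {
    global urls_by_object
    if {[info exists urls_by_object($object_id)]} {
        return $urls_by_object($object_id)
    }
    return {}
}

build_reverse_index
puts [find_url_linear 300]
puts [find_urls_indexed 300]
```

The trade-off is that a reverse index has to be rebuilt or updated whenever a site node is mounted, unmounted, or deleted; otherwise it returns stale URLs.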

I've contacted the author of the rewrite and we'll work out a fix of some sort.

Thanks for your detective work!

Posted by Jose Agustin Lopez Bueno on
Hello, Don!

Thanks for the quick answer!

I have replaced the code of the proc with the
old version (see below) and performance has improved
greatly. Some pages that took 20 seconds to load
now take 2-3 seconds.

I hope you will change the code in the official release.

Thanks very much!
Agustin

*************************************************
ad_proc -public site_node::get_url_from_object_id {
    {-object_id:required}
} {
    returns a list of urls for site_nodes that have the given object
    mounted or the empty list if there are none. The
    url:s will be returned in descending order meaning any children will
    come before their parents. This ordering is useful when deleting site nodes
    as we must delete child site nodes before their parents.
} {
  return [db_list select_url_from_object_id {}]
}
*************************************************

Posted by Caroline Meeks on
Hi,

I've also just loaded data into dotLRN and I'm tracking down performance issues.

Has this fix made it to CVS yet? I'm working on a code base that is a few months old, so I am trying to find the specific files to upgrade and I'm not having much luck. Agustin, do you happen to have a patch you can email?

thanks
Caroline

Posted by Tracy Adams on
Caroline/Don/Jose,

I'm about to cut .LRN 2.0.3 and want to make sure critical items get in it.

From reading the thread above
-- It looks like the root of the problem is in OpenACS core code?
-- Has this been fixed in the central repository?
-- If it is fixed, what release will it appear in? (I'm guessing the next release - 5.1)

Thank you,
Tracy