Forum .LRN Q&A: .LRN v2 Performance

Collapse
Posted by Alfred Essa on
We upgrade to .LRN 2.x. Overall, we have been successful but we are focusing now on performance. The technical team includes Mark Riedesel, Janine Sisk, Andrew Grumet, and Tracy Adams. As usual we are getting assistance from Don Baccus.

We would like encourage everyone to post their performance issues and experiences with v2 so that we can all learn from each other's experiences.

Valencia, how is your installation going?

Collapse
2: Re: .LRN v2 Performance (response to 1)
Posted by Rocael Hernández Rizzardini on
Galileo found slow (cpu eaters) on /admin/users pages, plus in /dotlrn/admin pages, we'll post specifics plus patches as well.
Also the login is a bit slow.
In the PG case, some troubles with the CR and inline query function calls that does sequential scans.

Now moving all the lob's to the file system (file-storage & now evaluation supports storage on the file system as well!), lob calls tend to be slow in PG.

Specifics comming ...

Collapse
3: Re: .LRN v2 Performance (response to 2)
Posted by Matthias Melcher on
Our newest performance problem was fixed by Dirk,
  https://openacs.org/bugtracker/openacs/bug?bug_number=2089
Collapse
4: Re: .LRN v2 Performance (response to 1)
Posted by Tracy Adams on
One thing just fixed -  The Oracle query for sending bulk mail to people who HAVE not taken the survey for large groups was very slow.  I rewrote the query and submitted. I'm committing now.

We are also having issues with sending lots of emails (notificaiton and bulk mail) which Andrew is working on.

It's uncanny, the Oracle on our new box is coming up with very different query plans that we are used to seeing, particually with IN clauses with subqueries.  The same query will run very differently on our old production box than on our new dev and production boxes.

Collapse
5: Re: .LRN v2 Performance (response to 1)
Posted by Rafael Calvo on
Using RSS news feeds was giving us a big load in Elnet's, even with very few users.
Mark Aufflick fixed it.

Rafael

Collapse
6: Re: .LRN v2 Performance (response to 1)
Posted by Dave Bauer on
Rafael,

Can you give a little more detail (or ask Mark)? I am wondering what caused this and if it made it back into openacs (and when it was comitted, I'd like to look.)

Thanks

Collapse
7: Re: .LRN v2 Performance (response to 4)
Posted by Dirk Gomez on
Tracy, the same Oracle versions? Vastly different init.ora parameters? Up2date statistics?

Actually the CBO should be deterministic. Kinda...it has earned its crystal ball repuation because it is different from minor version to minor version.

You also don't say whether the queries on your new system perform worse?

Alas, stored outlines (asktom.oracle.com and the Kyte book have a lot on it) are not an option because the in subclauses are not using bind variables and there is simply too many different queries executed even with cursor_sharing=force.

(Have you set cursor_sharing = force ?)

Collapse
8: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
Rafael is right. I had to stop the usage of news aggregator. It took a simple community page 5-6 seconds to load.
Collapse
9: Re: .LRN v2 Performance (response to 1)
Posted by Janine Ohmer on
Dirk, I've been looking into the query Tracy found.

On our old system it ran relatively fast. Actually it was still kind of slow, but it seems blazingly fast compared to the hour and 20 minutes or so it takes to run on our new system!  As I think Tracy mentioned, performance is fine on the new system if you remove the subquery at the end.

The versions of Oracle are the same and although the init.ora parameters are not identical, they are different in reasonable ways.  So I ran some traces and found that the plans were very different:

Old:

Rows    Row Source Operation
-------  ---------------------------------------------------
    46  FILTER
    385  NESTED LOOPS
    385    HASH JOIN
    384    INDEX RANGE SCAN (object id 29508)
  25322    TABLE ACCESS FULL PARTIES
    768    TABLE ACCESS BY INDEX ROWID ACS_OBJECTS
    768    INDEX UNIQUE SCAN (object id 27037)
    384  NESTED LOOPS
    384    TABLE ACCESS BY INDEX ROWID SURVEY_RESPONSES
12830992    INDEX RANGE SCAN (object id 29670)
    338    INDEX UNIQUE SCAN (object id 27037)

New:

Rows    Row Source Operation
-------  ---------------------------------------------------
    46  FILTER
    385  NESTED LOOPS
    385    HASH JOIN
    384    INDEX RANGE SCAN (object id 26840)
  25417    TABLE ACCESS FULL PARTIES
    768    TABLE ACCESS BY INDEX ROWID ACS_OBJECTS
    768    INDEX UNIQUE SCAN (object id 26395)
    384  NESTED LOOPS
    384    NESTED LOOPS
12334565    VIEW
12334565      SORT GROUP BY
15171456      TABLE ACCESS FULL SURVEY_RESPONSES
12334519    TABLE ACCESS BY INDEX ROWID SURVEY_RESPONSES
12345637      INDEX UNIQUE SCAN (object id 26950)
    338    INDEX UNIQUE SCAN (object id 26395)

Yikes!  Obviously, I went looking for new or changed views.  What I found was that the survey_responses_latest view, which (no surprise) was used in the subquery, has changed.

Old:

create or replace view survey_responses_latest as
select sr.*, o.creation_date,
      o.creation_user,
      survey_response.initial_user_id(sr.response_id) as initial_user_id
  from survey_responses sr,
  acs_objects o
  where sr.response_id = o.object_id
  and latest_response_p = 't';

New:

create or replace view survey_responses_latest as
select sr.*, o.creation_date,
      o.creation_user,
        survey_response.initial_user_id(sr.response_id) as initial_user_id
from survey_responses sr,
  acs_objects o,
  (select max(response_id) as response_id
          from survey_responses
        group by nvl(initial_response_id, response_id)) latest
  where nvl(sr.initial_response_id,sr.response_id) = o.object_id
  and sr.response_id= latest.response_id;

And when I temporarily replaced the new view with the old one, I got the same results as on the old system.

So I think it is not a problem with subqueries in general, just with *this* subquery. :)

Collapse
10: Re: .LRN v2 Performance (response to 9)
Posted by Dirk Gomez on
Why has the latest_response_p field disapperead? The first query just looks fine.
Collapse
11: Re: .LRN v2 Performance (response to 1)
Posted by Janine Ohmer on
The second one is how it has been all along; the first one (the one from our old system) was optimized at Sloan a while back, probably for the same reasons we are seeing now.  Tracy did the original work, so I'll let her explain what she did and why.  Clearly her change, or something similar, needs to go into the toolkit as well.
Collapse
12: Re: .LRN v2 Performance (response to 1)
Posted by Don Baccus on
A big heads-up for PostgreSQL and .LRN which I'm trying to wrap my head around is the fact that on Galileo's installation, certain queries buried inside PL/pgSQL do sequential scans rather than use the primary key index.

With about 1.6 million objects this is annoying, to put it kindly.  When executed in PSQL the index is used.

(this is the situation Rocael refered to rather cryptically above).

I've eliminated various possibilities of the "we just have to force the function to be re-compiled after all the data's been loaded and ANALYZE'd" variety and now believe that it has to do with the optimizer's default notion of how many rows will be processed by a query when an unknown variable rather than specific key appears in the query.

Anyway we're seeing these simple PL/pgSQL calls to the CR take several seconds rather than the milliseconds it takes to do the query inline.

More to come ...

Collapse
13: Re: .LRN v2 Performance (response to 11)
Posted by Tracy Adams on
I don't remember exactly - it was well over a year ago.

A guess, the latest_response_p = t probably was the cause of the full table scan:

15171456      TABLE ACCESS FULL SURVEY_RESPONSES

This is because latest_response_p had lots of values that were "t" and doesn't have differing values to really make it and effectively indexed column.  So Oracle scans at the whole table.

You'll notice that I replaced the lastest_response_p = t with

(select max(response_id) as response_id
          from survey_responses
        group by nvl(initial_response_id, response_id)) latest

that worked more effectively.

Collapse
14: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
Have you already tested the two parameters?

- PermissionCacheP
- PerformanceModeP

I remember they had a gib impact on the performance but broke some packages.

Greetings,
Nima

Collapse
15: Re: .LRN v2 Performance (response to 1)
Posted by Don Baccus on
Performance mode shouldn't break anything ...permissions caching will, I believe.
Collapse
16: Re: .LRN v2 Performance (response to 13)
Posted by Tracy Adams on
I had old and new backwards.

In reality, I had added and maintained a column latest_response_p in order to speed up the survey section.

Collapse
17: Re: .LRN v2 Performance (response to 1)
Posted by Andrew Grumet on
Server restart times can be shorted (by 45 seconds in Sloan's case) by disabling the Automated Testing package in APM.  This service really doesn't belong on a production server, and probably should be disabled by default.
Collapse
18: Re: .LRN v2 Performance (response to 1)
Posted by Rafael Calvo on
Dave

I will try to find out more from Mark. I think he just changed the frequency in which it checks for the feeds.

cheers

Rafael

Collapse
Posted by Rocael Hernández Rizzardini on
Galileo has just moved files of file-storage in the DB (lobs) to the file system (about 3GB of files), a faster render of files has been detected, plus avoiding the use of the DB for getting the files which lets the DB available for whatever else its needed (pg 7.4.3).
We'll do the same for the evaluation package (move lobs to file system) since now that package allows storage on the file system.
Collapse
20: Re: .LRN v2 Performance (response to 1)
Posted by Dario Roig on
Hi!

Now our database is empty and we can't test it, sorry.
The problems the scalability detected with 40.000 users are because the table acs_permissions have 500.000 of rows.

The first of otcubre will have the 15.000 users loaded in the database and then we will make yield tests.

Thanks.
University of Valencia

Collapse
21: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
I think permissions is a real problem too. Why do we need more than a hundered different permission types?
Collapse
22: Re: .LRN v2 Performance (response to 21)
Posted by Dario Roig on
Hi!

If we want all the dotlrn_user to be non guests and ilimited, could we assigned to dotlrn_users the permissions "read_private_data" and  "dotlrn_browse", to have less rows in the table "acs_permissions".

The object_id "dotlrn" have the permissions "read_private_data" and  "dotlrn_browse" for each user.

For example with 40.000 users * 2 permissions = 80.000 rows less in "acs_permissions".

Thanks

Collapse
23: Re: .LRN v2 Performance (response to 20)
Posted by Rocael Hernández Rizzardini on
Galileo has:
select count(*) from acs_permissions;
count
--------
383347
(1 row)

and getting this speed:
132 ms       0or1row       dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1 from dual
where exists
          ( select 1
                  from acs_object_party_privilege_map ppm
        where ppm.object_id = '1228' and ppm.party_id = '16319' and ppm.privilege = 'dotlrn_browse' )

As you see, the permission call has been changed lately, with 500K entries on acs_permissions you should be fine I think (though it can still be improved), specially since your db server if bigger than ours.

Collapse
24: Re: .LRN v2 Performance (response to 1)
Posted by Don Baccus on
132 ms is still slow ... to answer Nima, no, we don't need 100 permission types, and I plan to start whittling them down for OACS 5.2/.LRN 2.2 ...

I think we may see as much as a 5x improvement in permission check speed in this case, after that I still have a couple of thoughts in regard to how to improve performance more.

Collapse
25: Re: .LRN v2 Performance (response to 1)
Posted by Don Baccus on
Also the denormalized table party_object_privilege_map is much larger in PG than in Oracle and that means permissions checking is going to scale better for Oracle than for PG, unfortunately.  This is not a disaster but something to be aware of (PG outperforms Oracle at some other stuff, just to make you feel better!)
Collapse
Posted by Nima Mazloumi on
Hi Rocael,

thank you very much for the script. I added this line to set all file storages to store to file system:
db_dml update_storeFilesInDatabaseP {update apm_parameter_values set attr_value = 0 where parameter_id in (select p.parameter_id from apm_parameters p where parameter_name ='StoreFilesInDatabaseP' and package_key = 'file-storage')}


Greetings,
Nima
Collapse
27: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
We have to leave off the file-storage package_key clause if homework instances should also be included.
Collapse
28: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
Hi all,

I've been working on the dotlrn/www/members page in the last couple of days.

For 835 rows in the multirow the page took 60 sec to display (6 sec database and 51 sec page rendering).

So I moved to
- list builder --> 30 sec less
- changed the line

set member_url [acs_community_member_url -user_id $user_id]

to

set member_url "[dotlrn::get_url]/community-member?user_id=$user_id"

--> 21 sec less

So right now the page is rendered in 9 sec.
The database call to get all current members takes 900 ms while the database call to add the list of supergroup members in case of subgroups takes 5 sec. So next is to improve that query. The other option would be to split that page in two pages - one for current and one for possible members.

The current members page would then take 2 sec to render while the other would take 7 sec.

Please test the above and check if it breaks anything.

Greetings,
Nima

Collapse
29: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
The query that takes 5sec is:

in:

dotlrn_community::list_possible_subcomm_users -subcomm_id subcomm_id

called:

dotlrn_community::list_possible_subcomm_users.select_possible_users

select dotlrn_member_rels_approved.rel_id,
                  dotlrn_member_rels_approved.rel_type,
                  dotlrn_member_rels_approved.role,
                  dotlrn_member_rels_approved.user_id,
                  acs_users_all.first_names,
                  acs_users_all.last_name,
                  acs_users_all.email
            from acs_users_all,
                dotlrn_member_rels_approved,
                dotlrn_communities
            where dotlrn_communities.community_id = :subcomm_id
            and dotlrn_communities.parent_community_id = dotlrn_member_rels_approved.community_id
            and acs_users_all.user_id = dotlrn_member_rels_approved.user_id
            and acs_users_all.user_id not in (select dm.user_id
                                                from dotlrn_member_rels_full dm
                                                where dm.community_id = :subcomm_id)
            order by last_nam

Collapse
30: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
Alright, just realized that the above query is the already improved query (takes 773 ms). here is the corresponding one in 2.0.3 which actually takes 5 sec:

            select dotlrn_member_rels_approved.rel_id,
                  dotlrn_member_rels_approved.rel_type,
                  dotlrn_member_rels_approved.role,
                  dotlrn_member_rels_approved.user_id,
                  registered_users.first_names,
                  registered_users.last_name,
                  registered_users.email
            from registered_users,
                dotlrn_member_rels_approved,
                dotlrn_communities
            where dotlrn_communities.community_id = :subcomm_id
            and dotlrn_communities.parent_community_id = dotlrn_member_rels_approved.community_id
            and registered_users.user_id = dotlrn_member_rels_approved.user_id
            and registered_users.user_id not in (select dm.user_id
                                                from dotlrn_member_rels_full dm
                                                where dm.community_id = :subcomm_id)
            order by last_name

So now the page is served in 5 sec.

Collapse
31: Re: .LRN v2 Performance (response to 1)
Posted by Ola Hansson on
Hi Nima,

I'm not sure whether ~0.8 secs for ~800 rows is considered still being a problem but I suppose it
can be if the number of members are, say, ten times that of your example.

Anyway, it might be interesting to try and replace the last WHERE clause (the one with the subselect)
with a construct using EXCEPT (Postgres vocabulary) and MINUS (Oracle vocabulary) respectively.

Unless I'm overlooking something here I do believe that an EXCEPT (or MINUS) could be useful here:

EXCEPT acts in the opposite way of a UNION in that it subtracts the rows returned by the second query part
from those returned by the first query part instead of adding them like UNION does. I think it would be analogous to
the current "not in (*subselect*)", which is also aimed to reduce the number of rows, but probably "EXCEPT" does it faster.

Just a thought.

Collapse
32: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
I tried already
- left join
- not exists
with poor results. So testing with EXCEPT/MINUS could be a better alternative. Thank you.
Collapse
33: Re: .LRN v2 Performance (response to 1)
Posted by Andrew Grumet on
set member_url [acs_community_member_url -user_id $user_id]

to

set member_url "[dotlrn::get_url]/community-member?user_id=$user_id"

Nima, this will break Sloan, and anyone else that has overridden the default CommunityMemberURL in the kernel parameters. This should be clear from the code.

acs_community_member_url needs to be fixed, or made more general.

Collapse
34: Re: .LRN v2 Performance (response to 1)
Posted by Andrew Grumet on
Alternatively, we could rewrite things to generalize in dotlrn's community-member page.  But this needs to be configurable, and there need to be upgrade scripts to take care of sites that are relying on the CommunityMemberURL kernel parameter.

I don't know if Tracy's cut the beta yet, but if you committed this, please revert it asap.

Collapse
35: Re: .LRN v2 Performance (response to 1)
Posted by Nima Mazloumi on
Alright,

here is the solution:

Moved out the part

"[subsite::get_element -element url -notrailing][ad_parameter -package_id [ad_acs_kernel_id] CommunityMemberURL]"
from acs_community_member_url into a new proc called:
ad_proc -public acs_community_member_page {} {  
     @return the url to the community member page  
 } { 
     return "[subsite::get_element -element url -notrailing][ad_parameter \ 
             -package_id [ad_acs_kernel_id] CommunityMemberURL]" 
 }
which is now used by acs_community_member_url internally:
return "[acs_community_member_page]?[export_vars user_id]" 
now the member_url is generated by this. Outside db_multirow we get the member_page only once:
set member_page [acs_community_member_page]
and then add the user_id for a given user inside db_multirow:
set member_url "$member_page?user_id=$user_id"
This shouldn't break anything. I committed that to oacs-5-1