Forum OpenACS Development: Slow queries on Bug-Tracker. Some SQL Tuning

Hi everybody,

I've lately been working on bug-tracker to make it more accessible to users, as we are using it as an error reporting tool. However, the way things are now, it's completely impossible for them to even access the index page, because it takes an incredible 50 seconds (yes, fifty seconds) to load.

In order to improve it, I've recently upgraded to the newest version on HEAD. I've found some small bugs, such as upgrade scripts in the wrong folders, and some other issues, and I've also noticed someone has been doing some work on it. There were three very slow queries, which seem to have been reduced to only one now. The query I'm talking about looks like this:

select q.*,
km.keyword_id,
assign_info.*
from (
select b.bug_id,
b.project_id,
b.bug_number,
b.summary,
lower(b.summary) as lower_summary,
b.comment_content,
b.comment_format,
b.component_id,
to_char(b.creation_date,'YYYY-MM-DD HH24:MI:SS') as creation_date,
to_char(b.creation_date, 'YYYY-MM-DD HH24:MI:SS') as creation_date_pretty,
b.creation_user as submitter_user_id,
submitter.first_names as submitter_first_names,
submitter.last_name as submitter_last_name,
submitter.email as submitter_email,
lower(submitter.first_names) as lower_submitter_first_names,
lower(submitter.last_name) as lower_submitter_last_name,
lower(submitter.email) as lower_submitter_email,
st.pretty_name as pretty_state,
st.short_name as state_short_name,
st.state_id,
st.hide_fields,
b.resolution,
b.found_in_version,
b.fix_for_version,
b.fixed_in_version,
cas.case_id

from bt_bugs b,
acs_users_all submitter,
workflow_cases cas,
workflow_case_fsm cfsm,
workflow_fsm_states st
where submitter.user_id = b.creation_user
and cas.workflow_id = '1500160'
and cas.object_id = b.bug_id
and cfsm.case_id = cas.case_id
and cfsm.parent_enabled_action_id is null
and st.state_id = cfsm.current_state
and cfsm.current_state = '4'

) q
left outer join
cr_item_keyword_map km
on (bug_id = km.item_id)
left outer join
(select cru.user_id as assigned_user_id,
aa.action_id,
aa.case_id,
wa.pretty_name as action_pretty_name,
p.first_names as assignee_first_names,
p.last_name as assignee_last_name
from workflow_case_assigned_actions aa,
workflow_case_role_user_map cru,
workflow_actions wa,
persons p
where aa.case_id = cru.case_id
and aa.role_id = cru.role_id
and cru.user_id = p.person_id
and wa.action_id = aa.action_id
) assign_info
on (q.case_id = assign_info.case_id)
order by bug_number desc

I've run it through EXPLAIN ANALYZE and this is what I found:

"Nested Loop Left Join (cost=6549.55..113191.28 rows=221 width=3204) (actual time=900.479..41514.489 rows=35 loops=1)"
" Join Filter: (cas.case_id = assign_info.case_id)"
" -> Nested Loop (cost=4.95..159.62 rows=1 width=2338) (actual time=0.408..38.840 rows=35 loops=1)"
" -> Nested Loop (cost=4.95..158.53 rows=1 width=1186) (actual time=0.403..38.408 rows=35 loops=1)"
" -> Nested Loop (cost=4.95..152.23 rows=1 width=1168) (actual time=0.392..37.839 rows=35 loops=1)"
" -> Nested Loop (cost=4.95..145.99 rows=1 width=1140) (actual time=0.380..37.314 rows=35 loops=1)"
" -> Nested Loop (cost=4.95..139.34 rows=1 width=1136) (actual time=0.365..36.695 rows=35 loops=1)"
" Join Filter: (cfsm.case_id = cas.case_id)"
" -> Nested Loop Left Join (cost=4.95..134.38 rows=1 width=1132) (actual time=0.310..21.372 rows=236 loops=1)"
" -> Nested Loop (cost=4.95..134.05 rows=1 width=1128) (actual time=0.296..20.012 rows=236 loops=1)"
" Join Filter: (cas.object_id = b.bug_id)"
" -> Index Scan Backward using bt_bugs_bug_number_idx on bt_bugs b (cost=0.00..123.79 rows=236 width=1124) (actual time=0.021..0.369 rows=236 loops=1)"
" -> Materialize (cost=4.95..4.96 rows=1 width=8) (actual time=0.000..0.032 rows=236 loops=236)"
" -> Seq Scan on workflow_cases cas (cost=0.00..4.95 rows=1 width=8) (actual time=0.012..0.097 rows=236 loops=1)"
" Filter: (workflow_id = 1500160)"
" -> Index Scan using cr_item_keyword_map_pk on cr_item_keyword_map km (cost=0.00..0.32 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=236)"
" Index Cond: (b.bug_id = km.item_id)"
" -> Seq Scan on workflow_case_fsm cfsm (cost=0.00..4.95 rows=1 width=8) (actual time=0.007..0.056 rows=35 loops=236)"
" Filter: ((parent_enabled_action_id IS NULL) AND (current_state = 4))"
" -> Index Scan using users_pk on users u (cost=0.00..6.63 rows=1 width=4) (actual time=0.012..0.014 rows=1 loops=35)"
" Index Cond: (u.user_id = b.creation_user)"
" -> Index Scan using persons_pk on persons pe (cost=0.00..6.22 rows=1 width=28) (actual time=0.010..0.012 rows=1 loops=35)"
" Index Cond: (pe.person_id = u.user_id)"
" -> Index Scan using parties_pk on parties pa (cost=0.00..6.29 rows=1 width=30) (actual time=0.010..0.013 rows=1 loops=35)"
" Index Cond: (pa.party_id = pe.person_id)"
" -> Seq Scan on workflow_fsm_states st (cost=0.00..1.07 rows=1 width=1156) (actual time=0.003..0.008 rows=1 loops=35)"
" Filter: (4 = state_id)"
" -> Hash Join (cost=6544.60..111857.00 rows=52060 width=458) (actual time=400.595..1184.905 rows=119 loops=35)"
" Hash Cond: (COALESCE(dep.deputy_user_id, u.user_id) = p.person_id)"
" -> Hash Join (cost=4221.18..108297.15 rows=52060 width=434) (actual time=398.557..1182.511 rows=119 loops=35)"
" Hash Cond: ((wcrpm.case_id = wcea.case_id) AND (wcrpm.role_id = wa.assigned_role))"
" -> Hash Join (cost=4183.83..93405.47 rows=716687 width=16) (actual time=398.522..1182.158 rows=472 loops=35)"
" Hash Cond: (pamm.member_id = u.user_id)"
" -> Merge Join (cost=29.68..72230.01 rows=716687 width=12) (actual time=395.941..1178.950 rows=472 loops=35)"
" Merge Cond: (pamm.party_id = wcrpm.party_id)"
" -> Index Scan using party_approved_member_map_pk on party_approved_member_map pamm (cost=0.00..59210.21 rows=1066296 width=8) (actual time=0.014..933.179 rows=1059493 loops=35)"
" -> Sort (cost=29.68..30.86 rows=472 width=12) (actual time=0.013..0.128 rows=472 loops=35)"
" Sort Key: wcrpm.party_id"
" -> Seq Scan on workflow_case_role_party_map wcrpm (cost=0.00..8.72 rows=472 width=12) (actual time=0.009..0.151 rows=472 loops=1)"
" -> Hash (cost=3222.25..3222.25 rows=74552 width=8) (actual time=90.017..90.017 rows=74552 loops=1)"
" -> Hash Left Join (cost=13.00..3222.25 rows=74552 width=8) (actual time=0.028..56.866 rows=74552 loops=1)"
" Hash Cond: (u.user_id = dep.user_id)"
" -> Seq Scan on users u (cost=0.00..2929.52 rows=74552 width=4) (actual time=0.011..23.485 rows=74552 loops=1)"
" -> Hash (cost=12.80..12.80 rows=16 width=8) (actual time=0.002..0.002 rows=0 loops=1)"
" -> Seq Scan on workflow_deputies dep (cost=0.00..12.80 rows=16 width=8) (actual time=0.001..0.001 rows=0 loops=1)"
" Filter: ((now() >= start_date) AND (now() <= end_date))"
" -> Hash (cost=33.74..33.74 rows=240 width=434) (actual time=0.994..0.994 rows=240 loops=1)"
" -> Hash Join (cost=10.13..33.74 rows=240 width=434) (actual time=0.300..0.821 rows=240 loops=1)"
" Hash Cond: (wcea.action_id = wa.action_id)"
" -> Hash Join (cost=7.31..27.62 rows=240 width=12) (actual time=0.212..0.592 rows=240 loops=1)"
" Hash Cond: (wcea.case_id = c.case_id)"
" -> Seq Scan on workflow_case_enabled_actions wcea (cost=0.00..17.01 rows=240 width=8) (actual time=0.013..0.263 rows=240 loops=1)"
" Filter: ((NOT completed_p) AND assigned_p)"
" -> Hash (cost=4.36..4.36 rows=236 width=4) (actual time=0.184..0.184 rows=236 loops=1)"
" -> Seq Scan on workflow_cases c (cost=0.00..4.36 rows=236 width=4) (actual time=0.009..0.068 rows=236 loops=1)"
" -> Hash (cost=2.65..2.65 rows=14 width=430) (actual time=0.075..0.075 rows=14 loops=1)"
" -> Hash Join (cost=1.32..2.65 rows=14 width=430) (actual time=0.044..0.057 rows=14 loops=1)"
" Hash Cond: (wa.action_id = wa.action_id)"
" -> Seq Scan on workflow_actions wa (cost=0.00..1.14 rows=14 width=8) (actual time=0.006..0.008 rows=14 loops=1)"
" -> Hash (cost=1.14..1.14 rows=14 width=422) (actual time=0.020..0.020 rows=14 loops=1)"
" -> Seq Scan on workflow_actions wa (cost=0.00..1.14 rows=14 width=422) (actual time=0.004..0.011 rows=14 loops=1)"
" -> Hash (cost=1391.52..1391.52 rows=74552 width=28) (actual time=71.027..71.027 rows=74552 loops=1)"
" -> Seq Scan on persons p (cost=0.00..1391.52 rows=74552 width=28) (actual time=0.009..24.650 rows=74552 loops=1)"
"Total runtime: 41514.945 ms"

It seems like this query has two big problems:

1 - The Index Scan on party_approved_member_map is incredibly slow.

2 - The Hash Join with the users table is also taking a long time.

Most of the problems come from the huge number of users we have (around 65,000), but I guess this performance can be improved. I've tried some configuration tuning in postgresql.conf, but I wasn't very successful. So, can anybody give me a hand with that? I believe there are other systems with the same number of users running bug-tracker, and I don't see how it can be fast enough to be usable.
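
One thing I've sketched out (completely untested yet, and I don't know whether the planner will actually take advantage of it) is to restrict the assign_info subquery to this workflow's cases before the outer join, so it doesn't have to expand the user map for every case in the system. Roughly:

-- sketch only: same assign_info subquery as above, but limited to the
-- cases of this workflow ('1500160' is the workflow_id already used
-- in the inner query)
left outer join
    (select cru.user_id as assigned_user_id,
            aa.action_id,
            aa.case_id,
            wa.pretty_name as action_pretty_name,
            p.first_names as assignee_first_names,
            p.last_name as assignee_last_name
       from workflow_case_assigned_actions aa,
            workflow_case_role_user_map cru,
            workflow_actions wa,
            persons p
      where aa.case_id in (select case_id
                             from workflow_cases
                            where workflow_id = '1500160')
        and aa.case_id = cru.case_id
        and aa.role_id = cru.role_id
        and cru.user_id = p.person_id
        and wa.action_id = aa.action_id
    ) assign_info
on (q.case_id = assign_info.case_id)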

Any help is appreciated.

Posted by Dave Bauer on
I think you might need to tune your database.

Or maybe the query is only slow with the large amount of data you have.

How many shared buffers do you have? Also make sure, in general, that your server has enough RAM for the operating system to cache the disk reads (on Linux anyway; I am not sure how it works on other platforms).

Generally around 128MB of shared buffers seems to help in many cases; I haven't found a site that needed more, but your results may vary. I would experiment by increasing shared buffers to at least 128MB if it is not there already.
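
To see where you are now, a quick check from psql is enough; shared_buffers itself is set in postgresql.conf and needs a server restart to change:

show shared_buffers;
-- effective_cache_size is the related setting that tells the planner how
-- much the OS is likely to be caching; worth checking at the same time
show effective_cache_size;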

Posted by Eduardo Santos on
Hi Dave,

I don't think I can improve this by tuning. I've already played with the indexes, changed the shared buffers, etc. I've gone up to 3GB of shared buffers, which is close to the best result I've got from pgbench. The machine is a dedicated IBM server with 16GB RAM and two quad cores (Linux sees it as 8 processors). If there's any way to improve this query with server tuning, it's beyond my knowledge to do it.

That's why I've thought about tuning the query itself, which seems to be a complex task. Do you think there's any way it can be tuned? I couldn't find one looking at the data model.

Posted by Dave Bauer on
That's a lot of RAM :)

Yeah, the complexity of the query is just too much. I don't have any off-hand advice. It looks like it would require a serious analysis of the workflow and bug-tracker data models.

Which version of PG are you using? The planner's handling of left outer joins is known to be slow in some versions.

Posted by Eduardo Santos on
Hi Dave,

I'm still using 8.2, because I didn't have time to upgrade my OpenACS version to be compatible with 8.3 and beyond.

The problem is the left outer join. However, if we don't use it, the data won't be in the format the list expects. So, I'm out of ideas.

It seems like there's a view in Workflow which joins against party_approved_member_map, and it's also very slow. I don't see how to improve it without changing the whole data model for workflow and bug-tracker.
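
For reference, this is how I've been pulling up that view's definition (assuming the view involved is workflow_case_role_user_map, which is the name in my query and which the plan expands into workflow_case_role_party_map joined to party_approved_member_map):

select definition
  from pg_views
 where viewname = 'workflow_case_role_user_map';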

Posted by Nathan Lunt on
I don't have a lot of time to delve into this right now, but I wanted to throw this out: is there a chance that performance may be improved by doing two separate queries and appending the data to the list as necessary? Something along the lines of the sketch below. Sorry I can't find the answer for myself, but I have seen situations where this was the case.
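
Roughly something like this (completely untested, and I'm only guessing at how painful the merge into the multirow would be on the Tcl side):

-- first query: bugs plus workflow case, without the outer joins
-- (the real version would keep the full column list from the original query)
select b.bug_id,
       b.bug_number,
       b.summary,
       cas.case_id
  from bt_bugs b,
       workflow_cases cas,
       workflow_case_fsm cfsm
 where cas.workflow_id = '1500160'
   and cas.object_id = b.bug_id
   and cfsm.case_id = cas.case_id
   and cfsm.parent_enabled_action_id is null
 order by b.bug_number desc;

-- second query: assignee info only for this workflow's cases,
-- appended to the list afterwards
select aa.case_id,
       cru.user_id as assigned_user_id,
       wa.pretty_name as action_pretty_name,
       p.first_names as assignee_first_names,
       p.last_name as assignee_last_name
  from workflow_case_assigned_actions aa,
       workflow_case_role_user_map cru,
       workflow_actions wa,
       persons p
 where aa.case_id = cru.case_id
   and aa.role_id = cru.role_id
   and cru.user_id = p.person_id
   and wa.action_id = aa.action_id
   and aa.case_id in (select case_id
                        from workflow_cases
                       where workflow_id = '1500160');
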
Posted by Eduardo Santos on
Hi Nathan,

Thank you very much for your answer, but I don't think that will work here. The list is very complicated, with filters and pagination, and my guess is that the list builder will need all the data in the same multirow.

Anyway, I'll try to follow your suggestion and see what happens.

Posted by Dave Bauer on
You might try the geqo_threshold setting, increasing it to 24 or so, to see if it speeds up this particular query.

geqo_threshold (integer)

Use genetic query optimization to plan queries with at least this many FROM items involved. (Note that a FULL OUTER JOIN construct counts as only one FROM item.) The default is 12. For simpler queries it is usually best to use the deterministic, exhaustive planner, but for queries with many tables the deterministic planner takes too long.

See: http://www.postgresql.org/docs/8.2/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-GEQO
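
You can try it at the session level first, before touching postgresql.conf (untested against your data, of course):

show geqo_threshold;
set geqo_threshold = 24;
-- then re-run the EXPLAIN ANALYZE from above in the same session
-- and compare the plans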

Posted by Dave Bauer on
Here is an interesting post about statistics; you almost certainly need to tune this on a database that large (based on the content of the post, not my own experience):

http://it.toolbox.com/blogs/database-soup/selectively-setting-statistics-sample-size-37565?rss=1
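
The usual pattern is something like the following; I'm only guessing at the columns based on the join conditions in your plan, so adjust as needed:

-- raise the sample size on the columns the planner is misestimating,
-- then re-analyze so the new statistics take effect
alter table party_approved_member_map alter column party_id set statistics 500;
alter table party_approved_member_map alter column member_id set statistics 500;
analyze party_approved_member_map;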

Posted by Eduardo Santos on
Hi Dave,

Thank you very much for your tips. PostgreSQL really has some behaviors I didn't know about. I was already suspecting that the optimizer was not choosing the best plan for this query, but it's very difficult to verify.

I don't have access to the database right now, but as soon as I can I'll do this tuning and post the results back here.

Thank you very much again.