Forum OpenACS Q&A: optimizing query with aggregates

Posted by Jonathan Ellis on
My schema deals with "parties" that fight "battles," and I want to see what percent of battles each party has won. The battles table looks like this:
create table battles (
       attacker             integer not null references parties(id),
       defender             integer not null references parties(id),
       winner               integer not null check (winner in (attacker, defender, 0)),
       ...
);
I made a view for the information I want so I don't have to reproduce the query on more than one page:
create view won_percent_raw_v as
select p.id, coalesce(
    (select 100*sum(case when winner = p.id then 1 else 0 end) / count(*)
    from battles b
    where p.id in (b.attacker, b.defender))
    , 0) as won_percent_raw
from parties p;
This is taking roughly 0.1 seconds per aggregate, or 5 seconds for the 50 parties (and 60k battles). Here's the explain:
explain select * from won_percent_raw_v;
NOTICE:  QUERY PLAN:

Seq Scan on parties p  (cost=0.00..1.43 rows=43 width=4)
  SubPlan
    ->  Aggregate  (cost=1472.20..1472.20 rows=1 width=4)
          ->  Index Scan using battles_attacker_i, battles_defender_i on battles b  (cost=0.00..1460.21 rows=2397 width=4)
    ->  Aggregate  (cost=1472.20..1472.20 rows=1 width=4)
          ->  Index Scan using battles_attacker_i, battles_defender_i on battles b  (cost=0.00..1460.21 rows=2397 width=4)
I'm wondering if there is a better way to do this, because it's obviously not scaling very well. (Select * is actually a query I need to do to compute the global stats, so I can't just narrow down which parties I'm computing.) I did try getting rid of the subselect and instead doing
create view won_percent_raw_v as
select p.id,
       coalesce(100*sum(case when winner = p.id then 1 else 0 end) / count(*), 0) as won_percent_raw
from parties p, battles b
where p.id in (b.attacker, b.defender)
group by p.id;
but that gives the same query plan with an added sort due to the group by.
Posted by Patrick Giagnocavo on
What indexes have you created?
Posted by Jonathan Ellis on
create index battles_attacker_i on battles(attacker);
create index battles_defender_i on battles(defender);
Posted by Robert Locke on
It sounds like an expensive query, but I'm sure the big brains in this group can help you optimize it.

Perhaps this is not the answer you wish to hear, but another option would be for you to maintain the aggregate information PER USER through the use of triggers (say, the number of games they've played and won, or simply, their winning percentage). Sure, the information would be redundant, and, in an ideal world, unnecessary, but I'm sure it would vastly improve your aggregate reports.
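
A minimal sketch of what that could look like, assuming parties grows battles_fought and battles_won counter columns (untested, and the function/trigger names here are made up; older PostgreSQL versions declare the trigger function as "returns opaque" rather than "returns trigger"):
create function update_battle_counts() returns trigger as '
begin
    -- both participants fought one more battle
    update parties set battles_fought = battles_fought + 1
     where id in (new.attacker, new.defender);
    -- the winner, if any (0 means nobody won), won one more
    update parties set battles_won = battles_won + 1
     where id = new.winner;
    return new;
end;
' language 'plpgsql';

create trigger battles_count_tr after insert on battles
    for each row execute procedure update_battle_counts();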

Rob

Posted by Don Baccus on
It's not a bad plan - note that it is using the btree indexes to grab the right rows from your battles table.  You might time the sum and count separately; my guess is that the count portion's noticeably faster, but it would be interesting to see whether that's true.
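
For instance, something along these lines (party id 42 is just a placeholder, and explain analyze assumes a PostgreSQL version recent enough to have it):
explain analyze
select count(*)
from battles b
where 42 in (b.attacker, b.defender);

explain analyze
select sum(case when winner = 42 then 1 else 0 end)
from battles b
where 42 in (b.attacker, b.defender);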

Why are you joining on parties instead of users?  It won't make much difference unless or until you start using groups, etc., but I assume you're tracking results for individual users, so you should use that table.

Since you're computing this for every user who participates in at least one battle, obviously it's not going to scale very well no matter what.  Presuming that all your users participate in battles, there will be one row returned for each of your users, meaning that the time spent will be linearly proportional to the number of users.

So even if it only took 0.01 seconds, 500 users would eat up the same five seconds now taken by 50 users.

One possible workaround is to use "limit" and "offset" to paginate your output, limiting the number of users displayed per page and incidentally also placing a ceiling on the number of rows computed per query, making the page display time independent of the number of users in the system.
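
For example (the page size and ordering are arbitrary here, and how much of the per-row subquery work actually gets skipped depends on the plan):
select *
from won_percent_raw_v
order by id
limit 25 offset 0;   -- next page: offset 25, then 50, ...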

Are you sure the GROUP BY plan is the same?  I would expect a single index scan on battles with a sort rather than the two index scans.  It may or may not be faster, but it's worth timing, at least.

60K rows of battles - are you sure your shared buffer space is large enough to contain the tables during the query?  If not, even though your system probably has enough RAM for the OS filesystem cache to avoid disk I/O, you'll chew up a fair amount of time shuffling data between the PG and filesystem caches.
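
A quick way to check what you're running with (the units vary by version: older releases count 8KB buffers set in postgresql.conf or via the postmaster's -B flag, newer ones accept memory sizes directly):
show shared_buffers;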

Robert's answer's probably the right one, though ... by using a trigger to maintain a summary table you're distributing the cost of keeping track of this over each update to the battles table, trading off a bunch of inexpensive operations in order to avoid one big one.

Posted by Jonathan Ellis on
I'm joining on parties instead of users because each user can have multiple parties.

You're right about the group by plan being different; I killed my old emacs buffer, so I'm not sure what I was doing differently. But here's what I'm getting now:
Subquery Scan won_percent_raw_v  (cost=0.00..94421.44 rows=10874 width=16)
  ->  Aggregate  (cost=0.00..94421.44 rows=10874 width=16)
        ->  Group  (cost=0.00..93334.01 rows=108743 width=16)
              ->  Nested Loop  (cost=0.00..93062.15 rows=108743 width=16)
                    ->  Index Scan using parties_pkey on parties p  (cost=0.00..258.47 rows=43 width=4)
                    ->  Seq Scan on battles b  (cost=0.00..1173.49 rows=65649 width=12)
In my experience nested loops are to be avoided whenever possible, and this does seem to be somewhat slower, although I'm not sure why it does a seq scan here instead of the index scan it does with the subquery. (I double-checked after vacuum analyze in case my playing with index creation/destruction was confusing it, and I get the same results in both cases.)

But as you say, it's going to be less expensive in any case to track the info separately... I do an update to Parties after each battle anyway, so adding battles_won and battles_fought to that table won't add significant overhead. It is uglier, though. :(
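
With those two columns in place, the view collapses to a plain scan of parties. Roughly (a sketch, using the column names above):
create view won_percent_raw_v as
select id,
       case when battles_fought = 0 then 0
            else 100 * battles_won / battles_fought
       end as won_percent_raw
from parties;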
Posted by Don Baccus on
Hmmm...yes, that plan doesn't look as nice as the previous one.  Interesting, isn't it?