Forum OpenACS Q&A: How to go about debugging a Postgres performance problem?

We have a situation with one of our hosting clients and I'm not quite sure how to proceed.  I would know what to do if it was Oracle, but I can't find much in the way of tools to debug these things in Postgres.

They have a dev server set up in their office, and their live site runs on one of ours.  Their dev box is newer and therefore faster, and clearly less busy, so it's not surprising that things run faster there.  But it's not that simple.

The client reports that this:

explain analyze update mytable set mycolumn = '' where id = 123;

ran in 0.696 ms on his dev box and 25.825 ms on the production box.  Looks bad.  But then he ran vacuum full analyze on his dev box and the time for the query *increased* to 34.859 ms.  Given this, I'm thinking it's not a hardware problem but something going on inside Postgres.  And that's where I get stuck.

The plan is the same for the query in each case, and the estimated costs are almost the same, but the actual elapsed time is much different.
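(Side note for anyone reproducing this: explain analyze on an update really does execute the update, so a safe way to re-run the same test is to wrap it in a transaction and roll it back, something like this:)

begin;
explain analyze update mytable set mycolumn = '' where id = 123;
-- the rollback undoes the update, so the same test can be repeated on unchanged data
rollback;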

We have already followed the Postgres tuning recommendations and made sure that both boxes are roughly the same as far as shared memory, buffers, etc. go.  I've looked around for tools or even articles on PG that might shed some light on what else we can do, but I'm not finding much.
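For anyone wanting to double-check that sort of thing, comparing the relevant settings from psql on each box is quick, something along these lines:

-- run on both servers and compare the output; "show all;" lists every setting
show shared_buffers;
show effective_cache_size;
show max_connections;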

Any suggestions?

What about a disk I/O difference?  I'm not sure how much impact it would have on the updates, but it seems a likely place to look.

Mike just headed out the door for lunch, and on his way he stopped by to tell me that he'd run some timing tests on the production server and that disk performance seems just fine.  Hopefully he'll jump in and tell us what he did.  But anyway, it doesn't look like that's the problem.

It seems to me that it's got to have something to do with Postgres, if running a vacuum analyze could adversely affect performance by that much. But I'm definitely open to hearing reasons why that's a false assumption. ;)

Thanks!

One possibility is that the statistics are not accurate enough, so when the analyze part of vacuum analyze runs, sometimes PG ends up with accurate stats and sometimes it does not.

See this:

http://www.varlena.com/varlena/GeneralBits/89.php
(second subject)

As I understand it, Postgres does not actually grind through all the values in a table when it updates the statistics; instead, it samples some of the values and works from that statistical sample.  Thus different runs of analyze might generate different statistics, and therefore different performance.  In the case of the dev server vs. the production server, my guess would be that there is a difference in the way the database is used that is causing PG to calculate statistics differently.
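(If you're curious what target analyze is sampling with, you can check it from psql:)

-- the statistics target used for any column without an explicit per-column setting
show default_statistics_target;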

Probably what you should do is increase the statistics target, as mentioned in the article above, for the affected tables, e.g. "ALTER TABLE TABLE ALTER COLUMN COLUMN SET STATISTICS 100;" where TABLE and COLUMN are the table and column affected.  A higher target means slower analyze runs, but more accurate statistics.
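To make that concrete with the table and column names from Janine's query (it's the column in the where clause whose statistics matter for the plan), it would look something like:

alter table mytable alter column id set statistics 100;
-- re-run analyze so the new target actually takes effect
analyze mytable;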

Also, just for fun, run "SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'TABLE';" on both the dev and production server and see what the most common values are... this may give you an insight into "why" it is happening.

If you want to see what Postgres is up to, you could try pgmonitor: http://gborg.postgresql.org/project/pgmonitor/projdisplay.php

It will show you what your pg processes are up to and give you some idea of the load they are imposing, although his choice of colours leaves a lot to be desired.

Whilst we are about it, pgadmin3 (http://www.pgadmin.org/pgadmin3/index.php) is also useful for running speed-test queries on remote DBs, etc.

    - Steve

Two quick pieces of advice:

1/ PQA query time analysis: http://pqa.projects.postgresql.org/

This tool will tell you, over time, which queries are run most often, which take the longest to run, etc.

2/ The postgres performance email list : http://archives.postgresql.org/pgsql-performance/

It's moderate traffic, the archives are good, and people will spend time on your problem as long as you post enough info in your question to give them something to go on.

Without seeing the plan I would also guess that the statistics are misleading the planner into poor choices.

I looked into this for Janine, and the problem is simply that the Postgres buffer cache, the Linux disk cache, and the disk drive's own cache all get flushed of the "mytable" blocks during the VACUUM FULL (there are many large tables on this box).

And on the production box there are a couple of other sites running for the same client; when the staging site sits idle, all its database blocks get flushed from cache by activity on the live sites, given the size of their databases.

Running the update a couple of times shows the effectiveness of the caching, as the second time runs many times faster!
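For anyone who runs into the same thing, the effect is easy to see by running the statement twice back to back, roughly:

-- first run with cold caches: the "mytable" blocks have to come off disk
explain analyze update mytable set mycolumn = '' where id = 123;

-- run it again immediately: the blocks are now cached and the actual time drops dramatically
explain analyze update mytable set mycolumn = '' where id = 123;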