Forum OpenACS Q&A: How to go about debugging a Postgres performance problem?
They have a dev server set up in their office, and their live site runs on one of ours. Their dev box is newer and therefore faster, and clearly less busy, so it's not surprising that things run faster there. But it's not that simple.
The client reports that this:
explain analyze update mytable set mycolumn = '' where id = 123;
ran in 0.696 ms on his dev box and 25.825 ms on the production box. Looks bad. But then he ran vacuum full analyze on his dev box and the time for the query *increased* to 34.859 ms. Given this, I'm thinking it's not a hardware problem; it's something going on with Postgres. And that's where I get stuck.
The plan is the same for the query in each case, and the estimated costs are almost the same, but the actual elapsed time is much different.
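To make that concrete, the two numbers to compare in EXPLAIN ANALYZE output are the planner's estimated cost and the measured "actual time". A sketch of what that looks like (the plan shape and timings here are hypothetical, just for illustration):

```sql
-- Hypothetical output; the numbers are illustrative only.
explain analyze update mytable set mycolumn = '' where id = 123;
--  Index Scan using mytable_pkey on mytable
--      (cost=0.00..5.99 rows=1 width=38)
--      (actual time=0.042..0.044 rows=1 loops=1)
--  Total runtime: 25.825 ms
```

When the cost estimates are nearly identical on both boxes but the actual time diverges, the planner is choosing the same plan everywhere; the difference lies in execution (cache hits, disk I/O, dead tuples), not in plan choice.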
We have already followed the Postgres tuning recommendations and made sure that both boxes are roughly the same as far as shared memory, buffers, etc. go. I've looked around for tools or even articles on PG that might shed some light on what else we can do, but I'm not finding much.
It seems to me that it's got to have something to do with Postgres, if running a vacuum analyze could adversely affect performance by that much. But I'm definitely open to hearing reasons why that's a false assumption. ;)
As I understand it, Postgres does not actually grind through all the values in a table when it updates the statistics; instead, it samples some of the values and works from that statistical sample. Thus different runs of analyze might generate different statistics, giving you different performance. In the case of the dev server vs. the production server, my guess would be that a difference in the way the database is used is causing PG to calculate statistics differently.
Probably what you should do is increase the statistics target, as mentioned in the article above, for the affected tables, e.g. "ALTER TABLE ALTER COLUMN SET STATISTICS 100;" where TABLE and COLUMN are the table and column affected. A higher value means slower analyze times, but more accurate statistics.
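As a sketch, the full sequence would look something like this ("mytable" and "mycolumn" are hypothetical stand-ins for the affected table and column):

```sql
-- Raise the per-column statistics target from the default so
-- analyze samples more values for this column.
ALTER TABLE mytable ALTER COLUMN mycolumn SET STATISTICS 100;

-- The new target only takes effect at the next analyze run.
ANALYZE mytable;
```

Note that changing the target does nothing by itself; the planner only sees the improved statistics after the table is analyzed again.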
Also, just for fun, run "SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'TABLE';" on both the dev and production server and see what the most common values are... this may give you an insight into "why" it is happening.
It will show you what your pg processes are up to and give you some idea of the load they are imposing, although his choice of colours leaves a lot to be desired.
Whilst we are about it, pgadmin3 http://www.pgadmin.org/pgadmin3/index.php is also useful for running speed-test queries against remote DBs, etc.
1/ PQA Query time analysis : http://pqa.projects.postgresql.org/
This tool will tell you, over time, which queries are run most often, which take the longest to run, etc.
2/ The postgres performance email list : http://archives.postgresql.org/pgsql-performance/
It's moderate traffic, the archives are good, and people will spend time on your problem as long as you post enough info in your question to give them something to go on.
Also, a couple of other sites for the same client run on the production box. While the staging site sits idle, its database blocks get flushed from the cache by activity on the live sites, given the size of their databases.
Running the update a couple of times shows the effectiveness of the caching, as the second time runs many times faster!
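That cache effect is easy to demonstrate by timing the same statement back to back (illustrative only; exact timings will vary):

```sql
-- First run: the relevant blocks must be read from disk (cold cache),
-- so the actual time is dominated by I/O.
explain analyze update mytable set mycolumn = '' where id = 123;

-- Second run: the same blocks now sit in the buffer cache,
-- so the actual time drops to a fraction of the first run.
explain analyze update mytable set mycolumn = '' where id = 123;
```

If the second run is dramatically faster than the first, cold cache (rather than planning or statistics) is the likely explanation for the slow production timings.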