Forum OpenACS Q&A: What expectations/settings should I have for Postgres?

I am using Postgres 7.0 on a dual PIII 750 with 1 gig of ram and 1
gig of swap space. I am running a rather expensive query on a table
with around 30 columns and 2121429 rows. Postgres is started with the
command:

postgres -c '/usr/local/pgsql/bin/postmaster -o -F -S 16384 -i -B 512 -D /var/pgsql'

The query I ran is a real odd ball:

select count(distinct(time || extract(month from date) ||
       extract(day from date) || extract(year from date) ||
       sid || membid)),
       sid, membid, date, time
from logs
group by membid, sid, date, time
having count(*) > 1;

I wound up getting an error saying that Postgres was out of memory
and lost the connection to the database, which it later recreated. No
other queries than the one above were running at the time. I watched
top throughout the running of the query; after a couple of minutes the
query was taking up 98% of the CPU time. During this time the SIZE
grew to 1690MB and the RSS never went higher than 793MB.

I also ran the query:

select count(membid) from logs;

It took about 20 seconds to complete. I have no idea why it takes
so long; or maybe it should, and my expectations for this server
configuration are too high.

I am left trying to figure out where my problem lies. How does one
figure out if they have enough RAM or the right configuration for
Postgres? Any suggestions? The last time I had a 3-million-plus-row
table I had a loaded Sun E450; maybe I need one here.

This is not an openACS issue, although I am an avid fan (slowly
moving my ACS servers to openACS when time permits). I just
figured since I have seen openACS tables get pretty big someone here
might have some advice.

Thanks,

Jeff Barrett

A couple suggestions off the top of my head:

What does EXPLAIN show for your queries?  For count(*) in particular it should be able to use any unique index and not read each actual block of the table.  For the first query I think you are out of luck, since the string concatenations will make the optimizer throw up its hands and give up.

Is there a way you can get distinct rows in the first query without actually using distinct?  I can see how sorting 2M rows might be hard on the DB.

does the count(*) improve after you run it a second time, i.e. after presumably the data is already in memory?

what options are you starting postmaster with?  The defaults suck.

As to the bug in the first query, all I can think of is that they definitely fix bugs between versions; maybe you should upgrade. :)

Have you done a "vacuum analyze"?
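
For this particular table that would simply be:

vacuum analyze logs;

Analyze is what populates the statistics the planner works from, so without it the plans (and the cost estimates EXPLAIN reports) can be way off.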

Prefix your query with "explain" and you will see the query plan it decided upon. From the looks of it, it will generate an **incredible** amount of data before moving on to GROUP BY and then HAVING, particularly with 2.1M rows.
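
For example, for the simpler count query:

explain select count(membid) from logs;

In 7.0 the plan comes back as a NOTICE; for this query you'd expect something along the lines of an Aggregate node sitting on top of a Seq Scan on logs, along with the planner's row and cost estimates.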

I don't know your data model, but I can't help but wonder if the query couldn't be simplified. Do you have indices created on functions? See the CREATE INDEX documentation, particularly the ON phrase.
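
As a purely illustrative sketch (I'm guessing at column types here): in 7.x the ON phrase accepts a function applied to the table's own columns, so if sid were a text column you could write:

create index logs_sid_lower_idx on logs (lower(sid));

The index name is made up, and whether any functional index helps depends entirely on the queries you actually run against logs.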

There have been some in-statement memory leak issues surrounding functions, i.e. memory wasn't freed until the statement (rather than the function) was complete. I believe that many/most/all of these were fixed for PG 7.1 so as always I recommend you try PG 7.1.

If it still runs out of memory, pass it along as a bug, as I suspect the concatenation, as suggested, is not freeing memory until statement end, which in this case is too late.

As far as count goes, no, PG can't just use the unique index. This is a weakness in the PG indexing model. PG indexes don't carry transaction data, so each row in the table has to be "poked at" to see if it is still live for the current transaction. Adding transaction data to the index table is a double-edged sword - the index tables themselves become larger and slower to update, and it becomes a lot harder for users to write new index types (unless the PG framework were expanded to accommodate both transaction-aware and transaction-stupid indexes).

This is, I suspect, why MySQL shines in their favorite benchmark - "count(*)". Traditional MySQL doesn't have transactions, so of course they never had to deal with the problem or its implications! They may just count the index hits.

Why can't you just do this?

select count(*), sid, membid, date, time
from logs
group by sid, membid, date, time
having count(*) > 1;

Isn't the concat already unique after the group operator?

512 * 8192 is only 4 MB. You're only giving it 4 MB for its shared buffer cache and you've got 1 GB of RAM??? How miserly of you! :)

You've got two million rows and 30 columns per row - try giving it enough shared buffer space so the whole table fits in memory.  The first count(*) won't be any faster, but subsequent ones will.
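
The arithmetic: -B counts 8 KB disk blocks, so the shared buffer cache is B * 8192 bytes. -B 512 gives the 4 MB above; to give it, say, 256 MB you'd need -B 32768 (and a kernel shared memory limit large enough to allow a segment that size).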

PG is particularly stupid about sequential scans and its LRU cache.  It turns out that if the table is just one block longer than your buffer cache, then each sequential scan will read every block of the table from the disk; no blocks will come from cache.  Oracle and probably other commercial DBs (at least some of them) try to mitigate this disaster by limiting the use of cache for sequential scans, but PG doesn't.

So it is extremely important to have your buffer cache larger than your largest table if at all possible, at least if you expect to do things like count the entries.
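
You can estimate how big the table actually is from the system catalog; after a vacuum, relpages is the number of 8 KB blocks the table occupies:

select relname, relpages from pg_class where relname = 'logs';

Multiply relpages by 8192 to get bytes, and size -B from that.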

Yes, the filesystem buffers too, but PG still has to read/write the blocks from the buffer cache, and this involves library and kernel overhead, and of course the memory bandwidth to shuffle those bytes back and forth between the filesystem cache and PG's buffer cache.

16 MB for your sort buffers is too low, too, since the "group by" sorts the result table in order to pick out the duplicates as it groups.
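
(-S is in kilobytes, so -S 16384 is the 16 MB above; something like -S 65536 would give each sort 64 MB to work with. Keep in mind it applies per sort, per backend, so don't go overboard if you expect many concurrent connections.)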

Of course, your server may already be devoting all its memory to something serious like whomping up web pages like crazy, in which case you'll have to live with slower query times for your log reports.

Actually indexes won't help these queries - he's looking at the entire table.  Indexes only help when you're slicing bits out of a table, or in some RDBMS systems (but not PG) when you're doing queries that only involve the column(s) in the index.

Is there a way you can get distinct rows in the first query without actually using distinct?

Yes, Don mentioned a way above. The query I used was a nasty mess; I cleaned it up and got it to work, but I wanted to learn what was going on, and learning I am!

does the count(*) improve after you run it a second time, i.e. after presumably the data is already in memory?

No, it does not seem to improve. Overall on this system, queries do not seem to be any faster the next time around. It did seem to improve a bit with a higher value for the -B setting for PG.

Have you done a "vacuum analyze"?

Yep, did that; I do it every night, too.

Do you have indices created on functions?

No indices on this table yet; the table does not have a primary key yet. This query is step 1 in fixing that problem. I am not sure how one would place an index on a function.

Why can't you just do this? select count(*), sid, membid, date, time from logs group by sid, membid, date, time having count(*) > 1

I did do that and it ran; I was just very curious as to why the original failed. It took a long time, about 3 minutes, to run the query above, but that is fine.

The whole reason I had to run this query is that this table has no primary key and I needed to figure out how to select every distinct row. It winds up there is no way to distinguish a row; there are massive amounts of duplicate values, so it's time for me to fix that problem. This is the joy of adopting an ill-designed table: now I get to clean it up.
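
My cleanup plan, sketched here on the assumption that (sid, membid, date, time) is what ought to be unique: every PG row carries a hidden oid, so I can keep the lowest oid in each duplicate group and delete the rest:

delete from logs
where oid not in (select min(oid)
                  from logs
                  group by sid, membid, date, time);

On a 2M-row table that subselect will be slow, so I will try it on a copy first; selecting the distinct rows into a fresh table may well be faster.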

I never realized how poorly configured this box was for PG. I have done what Don suggested and moved the shared memory buffer up to 3000 * 8192, about 24 MB. Evidently the kernel for this box is only set to handle 24 MB or so as a max (see below). I really want to learn more about this; it seems very important. The more I get into PG, the more I want to understand it. Where can I get the in-depth details on the -S and -B switches for PG and the Linux kernel settings? I need some resources on what these different memory areas are for, etc.
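
For reference, the limit I bumped into seems to be the kernel's shmmax parameter, the maximum size of a single shared memory segment. On Linux you can check it and raise it on the fly, e.g. (the 128 MB here is just an example value):

cat /proc/sys/kernel/shmmax
echo 134217728 > /proc/sys/kernel/shmmax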

Thanks for the advice, it was great.

-Jeff