Forum OpenACS Q&A: pg miscellanea

Posted by Jonathan Ellis on
pgsql-general is eating my mail w/o either posting it or sending me an error message, so I'll post here instead. :0

First, a workaround if you get "Cannot insert a duplicate key into unique index pg_statistic_relid_att_index" during vacuum analyze in 7.2: other people have had this problem and Tom Lane said basically you have to shut down the database and use a standalone backend to REINDEX. Shutting down sucks on a live system, so in the spirit of it-can't-hurt-to-try I found which specific table was having issues, dropped, and recreated it. Vaccum was happy again. Obviously if the table has too many constraints to recreate by hand, this is even worse thann the REINDEX solution but for me in this case it was far better.

Second, a couple questions.

I'm experiencing a VERY slow delete on my largest table (250k rows). I trim the table every day to prevent it from getting unmanagably large:

delete from battles where date_fought < ago('48 hours'::interval)
but this is ENORMOUSLY slow. even after running this once, then vacuum analyze (and reindex just to be sure), the planner thinks it's going to hit 180k rows with this when run a second time, when really only 500-1000 rows will be. explain says:
Seq Scan on battles  (cost=0.00..28305.75 rows=182361 width=6)
If I set enable_seqscan=off, it does use index scan, but is not noticably faster. I have shared memory segment set to 256M, and the largest SHARE segment reported by top is 120M. But that the index scan is slow (time in minutes, not seconds) when run twice in a row would seem to indicate that the index isn't staying in memory. I'm baffled. (Note that I've created an iscachable function "ago" to avoid the problems with directly using now() in such a query.)

Finally, related to the above: date_fought is declared timestamp, but in 7.2 that includes ms. Anyone know if it is faster to use timestamp(0) when you don't need the extra precision?

2: Response to pg miscellanea (response to 1)
Posted by David Walker on
Assuming you have a battles_id field, how does the speed compare when doing
select max(battles_id) from battles where date_fought <  
now()::date - '48 hours'::interval;  
delete from battles where battles_id < $battles_id   
(That also makes the cutoff midnight instead of 48 hours prior to the current time due to the ::date line)
3: Response to pg miscellanea (response to 1)
Posted by Jonathan Ellis on
that's what I'll have to try if nothing else works, but currently I have no need for an int id.
4: Response to pg miscellanea (response to 1)
Posted by David Walker on
how about
delete from battles where date_fought < ( select now()::date -  
interval('48 hours') )  
Does that still cause a seq scan? is the index scan still slow?
5: Response to pg miscellanea (response to 1)
Posted by russ m on
um, are you sure that ago() should be declared iscachable? the
result returned by ago() presumably depends on more than just
the parameters passed in (ie the current time when the call was

I don't see how that would cause the problem you're seeing, but
it still looks wrong to me...

6: Response to pg miscellanea (response to 1)
Posted by Jonathan Ellis on
David: after moving PG to a separate disk (see my thread about forces unknown hitting my root fs :) both this and my original query run fast with enable_seqscan turned off.  I didn't think that would be such a big factor but apparently it was.

Russel: yes, this is how it's supposed to be.  W/o iscachable, PG's query planner recomputes function calls for each row (which is what David was working around with the subselect above).  Even if you set enable_seqscan=off in such a case, EXPLAIN will still output a seq scan plan.  I'm not sure how PG knows to make this function not "globally cachable", but if you do select ago('1 minute') twice, you will get two different values.