Forum OpenACS Q&A: Response to Database errors after power fails

Posted by Don Baccus on
For starters, every RDBMS is exposed to potential failures when there's a power outage, depending on how your hardware is configured.

A nasty "feature" of many modern disks is that they buffer writes, and  return "success" to the operating system when data makes it to the BUFFER rather than waiting until it makes it to the PLATTER.  The OS then has no way of knowing exactly when the data is truly safely stuffed onto the physical platter.

Then the OS tells the RDBMS that the data's successfully written and the RDBMS by necessity believes what it is told.
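To make that chain of trust concrete, here's a rough Python sketch of what any program that cares about durability ends up doing.  The function name durable_write is made up for illustration, and this is not how PostgreSQL itself is coded.  The point is that os.fsync() is the strongest request the program can make, and when it returns, all the program knows is that the OS believes the drive has the data.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path, then ask the OS to push them to the device.

    Hypothetical helper for illustration only. A successful return means
    the OS reported success; a drive with write buffering enabled may
    still be holding the data in its own buffer at that point.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        # Flush the OS's cached copy of the file to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "commit.log")
    durable_write(demo_path, b"commit record\n")
    print("OS reported the write as durable:", demo_path)
```

Nothing in that code can tell the difference between a drive that really wrote to the platter and one that only says it did, which is exactly the problem.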

Of course, disk manufacturers are interested in maximizing performance, so they generally ship disks from the factory set up in this mode.

This is insidious, and fixing it requires that you check those little jumpers stuck into the back of each and every drive in your system. Each manufacturer has a different jumper scheme, and the scheme often changes within a single manufacturer's disk line.

Expensive disk arrays have enough battery backup to ensure that all data can physically be written before shutting down.  If you have a UPS hooked up to your box and you've taught the system to shut down the database when power fails, before the battery runs out, you'll also circumvent the problem.

Having passed along this truly depressing packet of information, I'll have to say that this doesn't look like the cause of your problem.  You're on the right track running pg_dump and following it with a vacuum.  The table that's missing is defined in notifications.sql, so it should be there.  Maybe you are seeing a subtle form of corruption; I hope not!
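If you want to script the dump-then-vacuum routine, something like the following sketch would do.  The function names, the database name, and the dump path are placeholders I've made up, and I'm assuming the pg_dump and vacuumdb command-line tools are on your PATH; check the options against the docs for the release you're actually running, since they may differ on older versions.

```python
import subprocess


def dump_then_vacuum_commands(dbname, dump_path):
    """Build the two commands: dump first, then vacuum.

    Hypothetical helper; dbname and dump_path are supplied by the caller.
    """
    return [
        # Take the dump first so there is a copy before touching anything.
        ["pg_dump", "-f", dump_path, dbname],
        # Then vacuum the same database.
        ["vacuumdb", dbname],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "mydb" and the dump path are assumed example values.
    run_all(dump_then_vacuum_commands("mydb", "/tmp/mydb.dump.sql"))
```

Because check=True raises on a non-zero exit, a failed dump stops the script before the vacuum runs.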

BTW, you should really upgrade to PG 7.1; it's considerably more robust (mostly due to bug fixes).