Forum OpenACS Q&A: Re: PostgreSQL data disappeared

Collapse
Posted by Patrick Giagnocavo on
These are my suggestions:

(Needless to say, make a full backup of all files if doing any file or db modifications.)

First, check the $PGHOME/data/base directory.  Do an 'ls' - perhaps the data is still there, just not accessible at this point in time.

If I recall correctly, each database has its own subdirectory under data/base .  The numbered directories are the OID of the database, I think.

One way (may take a while if you have a lot of data loaded) is to use grep(for text files) and the "od -a" or "strings" command (for binary files) to look through the various database files provided they still exist.

If you know a word or phrase that could only be in the database that is missing, grep through the various files there and see if you get a match.

Second, if the databases are present, but simply not recognized, the information may still be in the system catalog but messed up somehow.

Use pg_dump to dump out the information in text form and look at what there is.  Perhaps you can decipher a way to clean up the system catalog and get things working again.  No doubt the experts on the postgresql mailing lists will be able to assist.

The next question concerns your OS.  I assume you are running Linux - check your kernel version and OS for updates.  I know that I was bitten by a very bad bug in a now-obsolete distribution - the "bdflush" daemon that calls fsync() every 30 seconds was not running when the machine was booted, so changes I made would be recorded in the disk cache in RAM, but never actually written to the drive.

You may want to look through the "messages" log file and other logs in search of clues as to why this happened.  Usually the previous kernel messages from the next-to-last boot are recorded.

Finally, you should test your hardware, now that you know that it has failed.  The usual test is to compile a kernel, since it stresses the disk , CPU and memory subsystems.

Another alternative that has worked well for me is the Mersenne prime number detector, which heavily stresses CPU and memory subsystem - see mersenne.org and read up on the torture test setting.

Best of luck!

Collapse
Posted by Jarkko Laine on
Patrick,

There is three subdirs in base directory, of which, I think, two first are template0 and 1 and the last one the only real database left. So it seems that two databases are totally lost from the data directory. Fortunately they were just test servers.

What's more unfortunate, dumping the only database gives just the database creation phase, so it seems to have forgotten that there's been any new data after that.

I'll get into the OS and hardware stuff later, now I have to clean my room and fly back home for Xmas.

I was so pissed of for all of this that I hacked the nightly PostgreSQL backup script to use scp instead of ftp. It seems you have to first lose three months worth of blog entries before you really get the important things done. I'll share the script on Next Steps page as soon as I can see it working.