Forum OpenACS Q&A: PostgreSQL data disappeared
After installing ora8 database I tried according to Bruno's instructions to reboot the server to test if Oracle starts on boot. Well, the server didn't reboot. What makes everything a bit more tricky is that the server is in Finland and I'm in Switzerland.
So I asked some help and my colleagues back home discovered that the shutdown process hadn't gone through, but the umount of /usr had resulted in an illegal seek. They made a hard reboot and the server had booted fine when I came to school today. Or so I thought...
My suspicions woke up when trying to load front page of my site welcomed me to install OpenACS. A dotlrn installation in another port said it couldn't connect to database.
Trying to launch psql revealed the dark truth: two of the (test) databases had totally disappeared, and the only one left was empty:
suisse-dev=# \d No relations found. suisse-dev=#I'm totally knocked out. What on earth can have happened?
NOTICE: identifier "acs_object_util__get_object_type" will be truncated to "acs_object_util\ __get_object_typ" NOTICE: identifier "acs_object_util__get_object_type" will be truncated to "acs_object_util\ __get_object_typ" DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection FATAL 1: user "root" does not exist Lock file "/usr/local/pgsql/data/postmaster.pid" already exists. Is another postmaster (pid 1075) running in "/usr/local/pgsql/data"? DEBUG: smart shutdown request DEBUG: shutting down DEBUG: database system is shut down Lock file "/tmp/.s.PGSQL.5432.lock" already exists. Is another postmaster (pid 32391) using "/tmp/.s.PGSQL.5432"? /usr/local/pgsql/bin/postmaster: cannot create UNIX stream port DEBUG: database system was shut down at 2002-09-16 16:30:24 EEST DEBUG: checkpoint record is at 0/13FA0A74 DEBUG: redo record is at 0/13FA0A74; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 1071418; next oid: 216948 DEBUG: database system is ready FATAL 1: Database "testo-dev" does not exist in the system catalog. FATAL 1: Database "dotlrn-test" does not exist in the system catalog. ERROR: Relation "users" does not exist FATAL 1: Database "dotlrn-test" does not exist in the system catalog. Lock file "/usr/local/pgsql/data/postmaster.pid" already exists. Is another postmaster (pid 1219) running in "/usr/local/pgsql/data"? DEBUG: smart shutdown request DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: pq_recvbuf: unexpected EOF on client connection DEBUG: shutting down FATAL 1: The database system is shutting down FATAL 1: The database system is shutting down FATAL 1: The database system is shutting down DEBUG: database system is shut down DEBUG: database system was shut down at 2002-12-18 13:57:42 EET DEBUG: checkpoint record is at 0/13FA6C98 DEBUG: redo record is at 0/13FA6C98; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 1071434; next oid: 216949 DEBUG: database system is ready DEBUG: fast shutdown request DEBUG: shutting down DEBUG: database system is shut down DEBUG: database system was shut down at 2002-12-18 14:05:05 EET DEBUG: checkpoint record is at 0/13FA6CD8 DEBUG: redo record is at 0/13FA6CD8; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 1071437; next oid: 216949 DEBUG: database system is ready FATAL 1: Database "nsadmin" does not exist in the system catalog.dstem is shutting down FATAL 1: The database system is shutting down FATAL 1: The database system is shutting down DEBUG: database system is shut down DEBUG: database system was shut down at 2002-12-18 13:57:42 EET
When they did the reboot, the postmaster.pid file was not removed. This means the postmaster process didn't startup as usual.
First check to verify that postmaster isn't running:
$ps axww|grep post
if you don't see a process for postmaster with your data directory, then you need to remove the postmaster.pid file and try starting it again.
What I don't get is that you were following Bruno's instructions on Oracle. Are you installing both databases?
The problem with the Oracle install is that the startup files in /etc/rc.d/init.d/oracle8i
shut Oracle down after sshd goes down, and also the websites are usually still running. If the webserver is still running, Oracle will never shut down. The local keyboard will already not work, and you can no longer log in via ssh to cure the error. The only thing you can do is to power cycle the machine.
So, how do you control the webserver to ensure it goes down before Oracle? Also, what are the links in /etc/rc.d/rc6.d
that contain oracle8i and sshd?
If you use daemontools (svc) to shut down the webservers, I have an init.d script that does that first on shutdown of the server. I'll post some better instructions later today.
I also found bugs in the init.d scripts of listener8i and postgresql: both fail to create a lock file like the Oracle8i script, which means that the processes are not correctly stopped when you shutdown or reboot.
The problem is that postmaster was running. I was able to connect to one of three databases there should have been, but it was empty.
And yes, I was installing Oracle for testing of dotLRN for my school, my own site uses PostgreSQL (or used, now it doesn't exist anymore).
I don't think I had any scripts for AOLserver in rc.d:
[root@swissnet rc6.d]# ls K02postgresql K25squid K49oracle8i K70bcm5820 K86nfslock K03rhnsd K25sshd K50snmpd K72autofs K87portmap K05anacron K30sendmail K50snmptrapd K74apmd K88syslog K05atd K34yppasswdd K50tux K74nscd K90network K05keytable K35smb K50xinetd K74ntpd K92ipchains K10wine K44rawdevices K60crond K74ypserv K92iptables K10xfs K45winbind K60lpd K74ypxfrd K95kudzu K15gpm K46radvd K65identd K75netfs S00killall K20nfs K49listener8i K70aep1000 K80random S01rebootBut I do use svc to control web servers. It's just this was probably the first time I had to reboot the machine...
(Needless to say, make a full backup of all files if doing any file or db modifications.)
First, check the $PGHOME/data/base directory. Do an 'ls' - perhaps the data is still there, just not accessible at this point in time.
If I recall correctly, each database has its own subdirectory under data/base . The numbered directories are the OID of the database, I think.
One way (may take a while if you have a lot of data loaded) is to use grep(for text files) and the "od -a" or "strings" command (for binary files) to look through the various database files provided they still exist.
If you know a word or phrase that could only be in the database that is missing, grep through the various files there and see if you get a match.
Second, if the databases are present, but simply not recognized, the information may still be in the system catalog but messed up somehow.
Use pg_dump to dump out the information in text form and look at what there is. Perhaps you can decipher a way to clean up the system catalog and get things working again. No doubt the experts on the postgresql mailing lists will be able to assist.
The next question concerns your OS. I assume you are running Linux - check your kernel version and OS for updates. I know that I was bitten by a very bad bug in a now-obsolete distribution - the "bdflush" daemon that calls fsync() every 30 seconds was not running when the machine was booted, so changes I made would be recorded in the disk cache in RAM, but never actually written to the drive.
You may want to look through the "messages" log file and other logs in search of clues as to why this happened. Usually the previous kernel messages from the next-to-last boot are recorded.
Finally, you should test your hardware, now that you know that it has failed. The usual test is to compile a kernel, since it stresses the disk , CPU and memory subsystems.
Another alternative that has worked well for me is the Mersenne prime number detector, which heavily stresses CPU and memory subsystem - see mersenne.org and read up on the torture test setting.
Best of luck!
There is three subdirs in base directory, of which, I think, two first are template0 and 1 and the last one the only real database left. So it seems that two databases are totally lost from the data directory. Fortunately they were just test servers.
What's more unfortunate, dumping the only database gives just the database creation phase, so it seems to have forgotten that there's been any new data after that.
I'll get into the OS and hardware stuff later, now I have to clean my room and fly back home for Xmas.
I was so pissed of for all of this that I hacked the nightly PostgreSQL backup script to use scp instead of ftp. It seems you have to first lose three months worth of blog entries before you really get the important things done. I'll share the script on Next Steps page as soon as I can see it working.
Yeah, your boot/reboot scripts are messed up just like mine were. Sshd goes down before Oracle, so if it hangs you can't login and fix it.
Also, I can't tell but if you used the stock postgresql scripts and the stock listerner8i script, they never actually are killed.
First to correct the order of kill/startup problem:
#!/bin/sh # # chkconfig: 345 98 02 # description: Oracle8i starts the oracle database deamons ...
Note changes to the chkconfig line to kill on 02 (almost first) and start on 98 (almost last). Then as root:
# chkconfig --del oracle8i # chkconfig --add oracle8i
Repeat the same steps for listener8i and postgresql. Also these scripts probably are broken. I'll show you my svc script so you can see what needs to be in them to work correctly.
#! /bin/sh # chkconfig: 345 99 01 # description: svc start and kill all # This is an example of a start/stop script for SysV-style init, such # as is used on Linux systems. You should edit some of the variables # and maybe the 'echo' commands. # # add to starup via: # chkconfig --add svc # # The path that is to be used for the script PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin # What to use to start up the postmaster DAEMON="/usr/local/bin/svc" # Only start if we can find svc. test -f $DAEMON || exit 0 # Parse command line parameters. case $1 in start) echo "Starting All SVC Daemons:" $DAEMON -u /service/* touch /var/lock/subsys/svc echo "ok" ;; stop) echo -n "Stopping All SVC Daemons: " svc -d /service/* rm -f /var/lock/subsys/svc echo "ok" ;; restart) echo -n "Restarting All SVC Daemons: " $0 stop $0 start echo "ok" ;; status) svstat /service/* ;; *) # Print help echo "Usage: $0 {start|stop|restart|status}" 1>&2 exit 1 ;; esac exit 0
Add the svc script (as root):
# chkconfig --add svc
Notice the touch
and rm -f
lines in start and stop parts of the script. These lines were missing in my copies of listener9i and postgresql.
Also note that my script is a little aggressive, starting all scripts in /service/
, even ones with the down
file in their run
directory. There is a definite need for improvement here for it to work according to original design.