Forum OpenACS Q&A: PostgreSQL data disappeared

Collapse
Posted by Jarkko Laine on
I already thought that my so far unfortunate Oracle installation process finally ended happily. Too early.

After installing ora8 database I tried according to Bruno's instructions to reboot the server to test if Oracle starts on boot. Well, the server didn't reboot. What makes everything a bit more tricky is that the server is in Finland and I'm in Switzerland.

So I asked some help and my colleagues back home discovered that the shutdown process hadn't gone through, but the umount of /usr had resulted in an illegal seek. They made a hard reboot and the server had booted fine when I came to school today. Or so I thought...

My suspicions woke up when trying to load front page of my site welcomed me to install OpenACS. A dotlrn installation in another port said it couldn't connect to database.

Trying to launch psql revealed the dark truth: two of the (test) databases had totally disappeared, and the only one left was empty:

suisse-dev=# \d
No relations found.
suisse-dev=# 
I'm totally knocked out. What on earth can have happened?
Collapse
Posted by Jarkko Laine on
Another odd thing is, that server.log file doesn't give too much information (at least to me):
NOTICE:  identifier "acs_object_util__get_object_type" will be truncated to "acs_object_util\
__get_object_typ"
NOTICE:  identifier "acs_object_util__get_object_type" will be truncated to "acs_object_util\
__get_object_typ"
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
FATAL 1:  user "root" does not exist
Lock file "/usr/local/pgsql/data/postmaster.pid" already exists.
Is another postmaster (pid 1075) running in "/usr/local/pgsql/data"?
DEBUG:  smart shutdown request
DEBUG:  shutting down
DEBUG:  database system is shut down
Lock file "/tmp/.s.PGSQL.5432.lock" already exists.
Is another postmaster (pid 32391) using "/tmp/.s.PGSQL.5432"?
/usr/local/pgsql/bin/postmaster: cannot create UNIX stream port
DEBUG:  database system was shut down at 2002-09-16 16:30:24 EEST
DEBUG:  checkpoint record is at 0/13FA0A74
DEBUG:  redo record is at 0/13FA0A74; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 1071418; next oid: 216948
DEBUG:  database system is ready
FATAL 1:  Database "testo-dev" does not exist in the system catalog.
FATAL 1:  Database "dotlrn-test" does not exist in the system catalog.
ERROR:  Relation "users" does not exist
FATAL 1:  Database "dotlrn-test" does not exist in the system catalog.
Lock file "/usr/local/pgsql/data/postmaster.pid" already exists.
Is another postmaster (pid 1219) running in "/usr/local/pgsql/data"?
DEBUG:  smart shutdown request
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  pq_recvbuf: unexpected EOF on client connection
DEBUG:  shutting down
FATAL 1:  The database system is shutting down
FATAL 1:  The database system is shutting down
FATAL 1:  The database system is shutting down
DEBUG:  database system is shut down
DEBUG:  database system was shut down at 2002-12-18 13:57:42 EET
DEBUG:  checkpoint record is at 0/13FA6C98
DEBUG:  redo record is at 0/13FA6C98; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 1071434; next oid: 216949
DEBUG:  database system is ready
DEBUG:  fast shutdown request
DEBUG:  shutting down
DEBUG:  database system is shut down
DEBUG:  database system was shut down at 2002-12-18 14:05:05 EET
DEBUG:  checkpoint record is at 0/13FA6CD8
DEBUG:  redo record is at 0/13FA6CD8; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 1071437; next oid: 216949
DEBUG:  database system is ready
FATAL 1:  Database "nsadmin" does not exist in the system catalog.dstem is shutting down
FATAL 1:  The database system is shutting down
FATAL 1:  The database system is shutting down
DEBUG:  database system is shut down
DEBUG:  database system was shut down at 2002-12-18 13:57:42 EET
Collapse
Posted by Tom Jackson on

When they did the reboot, the postmaster.pid file was not removed. This means the postmaster process didn't startup as usual.

First check to verify that postmaster isn't running:

$ps axww|grep post

if you don't see a process for postmaster with your data directory, then you need to remove the postmaster.pid file and try starting it again.

What I don't get is that you were following Bruno's instructions on Oracle. Are you installing both databases?

The problem with the Oracle install is that the startup files in /etc/rc.d/init.d/oracle8i shut Oracle down after sshd goes down, and also the websites are usually still running. If the webserver is still running, Oracle will never shut down. The local keyboard will already not work, and you can no longer log in via ssh to cure the error. The only thing you can do is to power cycle the machine.

So, how do you control the webserver to ensure it goes down before Oracle? Also, what are the links in /etc/rc.d/rc6.d that contain oracle8i and sshd?

If you use daemontools (svc) to shut down the webservers, I have an init.d script that does that first on shutdown of the server. I'll post some better instructions later today.

I also found bugs in the init.d scripts of listener8i and postgresql: both fail to create a lock file like the Oracle8i script, which means that the processes are not correctly stopped when you shutdown or reboot.

Collapse
Posted by Jarkko Laine on
Tom,

The problem is that postmaster was running. I was able to connect to one of three databases there should have been, but it was empty.

And yes, I was installing Oracle for testing of dotLRN for my school, my own site uses PostgreSQL (or used, now it doesn't exist anymore).

I don't think I had any scripts for AOLserver in rc.d:

[root@swissnet rc6.d]# ls
K02postgresql  K25squid       K49oracle8i   K70bcm5820  K86nfslock
K03rhnsd       K25sshd        K50snmpd      K72autofs   K87portmap
K05anacron     K30sendmail    K50snmptrapd  K74apmd     K88syslog
K05atd         K34yppasswdd   K50tux        K74nscd     K90network
K05keytable    K35smb         K50xinetd     K74ntpd     K92ipchains
K10wine        K44rawdevices  K60crond      K74ypserv   K92iptables
K10xfs         K45winbind     K60lpd        K74ypxfrd   K95kudzu
K15gpm         K46radvd       K65identd     K75netfs    S00killall
K20nfs         K49listener8i  K70aep1000    K80random   S01reboot
But I do use svc to control web servers. It's just this was probably the first time I had to reboot the machine...
Collapse
Posted by Patrick Giagnocavo on
These are my suggestions:

(Needless to say, make a full backup of all files if doing any file or db modifications.)

First, check the $PGHOME/data/base directory.  Do an 'ls' - perhaps the data is still there, just not accessible at this point in time.

If I recall correctly, each database has its own subdirectory under data/base .  The numbered directories are the OID of the database, I think.

One way (may take a while if you have a lot of data loaded) is to use grep(for text files) and the "od -a" or "strings" command (for binary files) to look through the various database files provided they still exist.

If you know a word or phrase that could only be in the database that is missing, grep through the various files there and see if you get a match.

Second, if the databases are present, but simply not recognized, the information may still be in the system catalog but messed up somehow.

Use pg_dump to dump out the information in text form and look at what there is.  Perhaps you can decipher a way to clean up the system catalog and get things working again.  No doubt the experts on the postgresql mailing lists will be able to assist.

The next question concerns your OS.  I assume you are running Linux - check your kernel version and OS for updates.  I know that I was bitten by a very bad bug in a now-obsolete distribution - the "bdflush" daemon that calls fsync() every 30 seconds was not running when the machine was booted, so changes I made would be recorded in the disk cache in RAM, but never actually written to the drive.

You may want to look through the "messages" log file and other logs in search of clues as to why this happened.  Usually the previous kernel messages from the next-to-last boot are recorded.

Finally, you should test your hardware, now that you know that it has failed.  The usual test is to compile a kernel, since it stresses the disk , CPU and memory subsystems.

Another alternative that has worked well for me is the Mersenne prime number detector, which heavily stresses CPU and memory subsystem - see mersenne.org and read up on the torture test setting.

Best of luck!

Collapse
Posted by Jarkko Laine on
Patrick,

There is three subdirs in base directory, of which, I think, two first are template0 and 1 and the last one the only real database left. So it seems that two databases are totally lost from the data directory. Fortunately they were just test servers.

What's more unfortunate, dumping the only database gives just the database creation phase, so it seems to have forgotten that there's been any new data after that.

I'll get into the OS and hardware stuff later, now I have to clean my room and fly back home for Xmas.

I was so pissed of for all of this that I hacked the nightly PostgreSQL backup script to use scp instead of ftp. It seems you have to first lose three months worth of blog entries before you really get the important things done. I'll share the script on Next Steps page as soon as I can see it working.

Collapse
Posted by Tom Jackson on

Yeah, your boot/reboot scripts are messed up just like mine were. Sshd goes down before Oracle, so if it hangs you can't login and fix it.

Also, I can't tell but if you used the stock postgresql scripts and the stock listerner8i script, they never actually are killed.

First to correct the order of kill/startup problem:

#!/bin/sh
# 
# chkconfig: 345 98 02
# description: Oracle8i starts the oracle database deamons ...

Note changes to the chkconfig line to kill on 02 (almost first) and start on 98 (almost last). Then as root:

# chkconfig --del oracle8i
# chkconfig --add oracle8i

Repeat the same steps for listener8i and postgresql. Also these scripts probably are broken. I'll show you my svc script so you can see what needs to be in them to work correctly.

#! /bin/sh

# chkconfig: 345 99 01
# description: svc start and kill all

# This is an example of a start/stop script for SysV-style init, such
# as is used on Linux systems.  You should edit some of the variables
# and maybe the 'echo' commands.
#
# add to starup via:
# chkconfig --add svc
#


# The path that is to be used for the script
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# What to use to start up the postmaster
DAEMON="/usr/local/bin/svc"


# Only start if we can find svc.
test -f $DAEMON || exit 0

# Parse command line parameters.
case $1 in
   start)
         echo "Starting All SVC Daemons:"
         $DAEMON -u /service/*
         touch /var/lock/subsys/svc
         echo "ok"
         ;;
   stop)
         echo -n "Stopping All SVC Daemons: "
         svc -d /service/*
         rm -f /var/lock/subsys/svc
         echo "ok"
         ;;
   restart)
         echo -n "Restarting All SVC Daemons: "
         $0 stop
         $0 start
         echo "ok"
         ;;
   status)
         svstat /service/*
         ;;
   *)
         # Print help
         echo "Usage: $0 {start|stop|restart|status}" 1>&2
         exit 1
         ;;
esac

exit 0


Add the svc script (as root):

# chkconfig --add svc

Notice the touch and rm -f lines in start and stop parts of the script. These lines were missing in my copies of listener9i and postgresql.

Also note that my script is a little aggressive, starting all scripts in /service/, even ones with the down file in their run directory. There is a definite need for improvement here for it to work according to original design.