Forum OpenACS Q&A: Oracle problem - any ideas?

Posted by Janine Ohmer on
I've just posted this to Metalink but I thought I'd post it here too
in case this has happened to someone else:

This system crashed while I was importing a large table (lots of
LOBs). When I start up I get a message that says that system.dbf is
more recent than the control file.

First I tried starting up with each control file in turn - no luck
(though now it won't start up at all if I switch away from the control
file I'm currently using). So I moved on to "recover database using
backup controlfile".

It needs sequence #16163. I tried telling it to read from redo01.log,
which was the active redo log at the time of the crash, but it tells
me that log contains sequence 16162. So then I tried redo02.log, which
it was unable to read (internal error code, arguments: [3020],
[12609233], [1], [16163], [84], [424], [], []) and redo03.log, which
of course was too old (sequence 16161).

It looks to me like the needed sequence wasn't written, due to the
crash. Maybe it was about to switch to redo log 2, and crashed during
the switch? Whatever, I just need to get this database up and running!

Is there anything I can do here, short of reinstalling and loading
last night's dump?

It is times like this when I realize that I don't know as much about
Oracle as I need to! :)

Posted by Janine Ohmer on
After much blood, sweat, and reading of bad writing on Metalink :), I solved the problem.  Here is my solution for posterity:
(I'm leaving out a *lot* of steps I went through to figure this out!):
Edit init.ora and try each control file in turn until you find one that Oracle likes (it will spit more errors about your control file being out of sync if it doesn't like the one you chose).
startup mount<br>
recover database using backup controlfile<br>
Add these two lines to your init.ora file:<br>
Presto!  Now I'm doing a full export, and will recreate the database and import the data.
Even after getting a successful startup, which should have reconciled everything, it will not start up without fast_start_parallel_rollback set to false.
At this point, only the control file you've been working with is valid;  the others are hopelessly out of date.  I'm not entirely sure what the recovery process is for that, but it doesn't matter since everything's getting blown away shortly.
Time to read up on archive logs.  I have a sneaking suspicion that doing nightly dumps is really not all that's required here! :)
Posted by Andrew Piskorski on
Janine, I have no idea about that particular problem, but I also wasn't clear, was your database running in ArchiveLog mode or not?
Posted by Andrew Piskorski on
Janine, my Oracle notes have the "let's improve how we run Oracle" stuff I learned - setting up archive logs and nightly hotbackup, moving control file copies to different disks, adding redo log members, etc. Also, the Oracle hotbackup scripts thread has related info.
Posted by Janine Ohmer on

All good stuff, and we do need to improve in this area.  However, as far as I know it wouldn't have helped with yesterday's problem.

What happened was that the system dropped like a rock while I was doing an import.  The initial problem was that the control file was newer (had a higher SCN) than the system dbf file.  The solution to this problem is to either move forward using log files, or roll back to a point in time where all data is available.

I had all the pieces I needed.  The problem was that the very last change wasn't written to the redo log, because of the unexpectedness of the system going down.  I should have been able to get Oracle to cleanly roll back to a previous point in time, but for unknown reasons it just wouldn't do it.  So I finally resorted to that undocumented flag, which overlooks the fact that the redo logs are junky and reconciles things as best it can.

With the backup strategies you're talking about, all of which are better than nightly dumps, I could have restored from backups without losing a day's worth of entries, and that's a good thing.  I've been meaning to look into better strategies and haven't had time, so I'll definitely look at your notes.  But at the moment I'm more interested in understanding recovery strategies.  As I understand it, even if I had archive logs I would have had the same chance of them being out of sync, and I would have been in the same situation I ended up in anyway.

In any case, I got past that problem only to create a new one.  I installed the patch on this system, which is apparently incompatible with RH 7.2.  Completely hosed the Oracle installation. :(  At first I wasn't sure if the problems I was having were due to database problems (Oracle did warn about subtle corruption after using that flag) or what, so I did a fresh Oracle install, added the patch, and boom, Oracle no workie once again.  So now I'm moving the site to a temporary location so we can upgrade that system to 7.3, which I know does work with the patch.

This is *not* how I wanted to spend this weekend! :)

Posted by Andrew Piskorski on
Yes, if things broke because Oracle didn't write a transaction to the
redo log properly during an import, well, that sounds like exactly the
sort of thing using Oracle is supposed to avoid, doesn't it?  But if
Oracle crashed, then maybe all bets are off - not good.  I don't
remember enough about how the control file is supposed to interact
with the redo logs to guess what's really going on there, though.
Definitely sounds like one to open an Oracle TAR on.  Please let us
know if you get any useful info out of them!  (And whatever else you
figure out yourself, of course.)
Posted by Janine Ohmer on
Yes, this definitely sounds like the sort of thing Oracle is supposed to avoid.  I suppose if we had archive logging turned on we'd have one or more extra logs being written to and more chances for the needed info to make it onto disk, but we'd have more chances for failed writes too, so I'm not sure whether that's a net plus or minus on this particular point.

I won't be opening a TAR on this.  In order to do that I'd have to leave the system untouched, and I've already done as much recovery as one can do so the evidence has been spoiled.  And today we'll be reinstalling the system with RedHat 7.3 so I can install Oracle with all the proper patches, so then the evidence will be gone.  This is the most powerful system we have, and the site's currently on a lesser system and really chugging.  I can't wait around on this.

I did post about this to the forum on Metalink, so if someone from Oracle is interested in knowing that it happened, the record is there for them to see.  There were plenty of people with this problem posting there, but most of them had come about it differently.  They'd experienced a full system failure, had to restore from cold backups, and found that their backup was hosed because Oracle wasn't really shut down when they took the backup like they thought it was.

On the whole, I'm actually pretty happy with the nightly-dump backup strategy.  There are plusses and minuses to cold and hot backups;  with the dump method there's a high likelyhood of having to reinstall Oracle to recover from a catastrophic failure, which the cold backup method avoids.  But IMHO the likelyhood of either AOLserver or Oracle failing to start back up some night is greater than the likelyhood of having a catastrophic failure (I think this is the first one I've had in 3.5 years).  I think if I add archive logging, which saves me from losing the day's changes, we'll be in good shape.

Posted by Jon Griffin on
I got burned by not running in archive mode in almost the exact same way and had to just settle for importing last nights dump.

After that I set up everything for archive mode and also a seperate server to have a backup of the archives. I will try to search for the notes, but I may be stuck typing them back in.

In any case, I also learned, if you value your data you better run in archive mode ( and even that has its pitfalls, although not being a full time dba I may have just done it wrong).

Posted by Kevin Crosbie on
When you say a nightly dump, do you mean a cold copy, or using oracle export?
Posted by Janine Ohmer on
I do nightly dumps.  I know some other folks do cold backups, but I don't like to take the risk of shutting down and restarting Oracle and AOLserver to make them.