Forum OpenACS Q&A: PostgreSQL and Journaling FS

Posted by Dave Hwang on

I'm trying to get familiar with Postgres, and I ran across this statement in the PostgreSQL FAQ:

PostgreSQL runs in two modes. Normal fsync mode flushes every completed transaction to disk, guaranteeing that if the OS crashes or loses power in the next few seconds, all your data is safely stored on disk. In this mode, we are slower than most commercial databases, partly because few of them do such conservative flushing to disk in their default modes. In no-fsync mode, we are usually faster than commercial databases, though in this mode, an OS crash could cause data corruption. We are working to provide an intermediate mode that suffers less performance overhead than full fsync mode, and will allow data integrity within 30 seconds of an OS crash.
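For reference, the mode the FAQ describes is controlled by a single switch in postgresql.conf; the exact name and syntax can vary by version, so treat this fragment as illustrative:

```
# postgresql.conf -- illustrative fragment
# fsync = true is the safe default; setting it to false gives the
# faster "no-fsync" mode the FAQ warns about.
fsync = true
```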

Would using a journaling FS like ReiserFS or ext3 protect against the type of data corruption possible in no-fsync mode?

Posted by Jerry Asher on

That's a great question and I look forward to the answer.  I'll add this to your question: for those of us not using journaling file systems, what is it that "most commercial databases do"?  Does a standard installation of Oracle (that is not using an Oracle file system or appliance) operate in fsync mode, no fsync mode, or something in between?  Is there a period of time in which a power failure or OS crash would cause data loss?

Posted by Patrick Giagnocavo on
Postgres 7.1 and higher use a recovery log, a.k.a. write-ahead log.  Changes to the database are written to this logfile first; once a transaction's changes are safely in the data files, the corresponding log entries can be discarded, IIRC.

The idea is that by simply replaying whatever is in the write ahead log in the event of a crash, all changes can be properly recorded, and the database can be back in production mode that much quicker.
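The replay idea can be sketched in a few lines of Python. This is a toy model, not PostgreSQL's actual on-disk format: every change is appended to a log and fsync'd before commit returns, so after a crash the log alone is enough to reconstruct the data.

```python
import json
import os
import tempfile

class ToyWAL:
    """Toy write-ahead log: a dict 'database' recovered by replaying a log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()

    def _recover(self):
        # Crash recovery: replay every logged change, in order.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def commit(self, key, value):
        # Log first, fsync, and only then update the in-memory state.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

# Demo: commit twice, then "crash" and recover from the log alone.
log = os.path.join(tempfile.mkdtemp(), "wal.log")
db = ToyWAL(log)
db.commit("balance", 100)
db.commit("balance", 250)

recovered = ToyWAL(log)   # a fresh process replaying the same log
print(recovered.data)     # {'balance': 250}
```

A "crash" here is just losing the in-memory dict; constructing a new ToyWAL over the same log file recovers the committed state.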

I doubt that using a journaling fs would change much.  I believe that PG no longer does an fsync to disk on every transaction as of 7.1.

Oracle and DB2, at least, offer the option to use "raw" devices, where you have a disk or part of a disk that does not have any filesystem on it.  The DBMS has exclusive control over it.

Posted by Roberto Mello on
From the PostgreSQL documentation:

"Briefly, WAL's central concept is that changes to data files (where tables and indices reside) must be written only after those changes have been logged - that is, when log records have been flushed to permanent storage. When we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log."

"The first obvious benefit of using WAL is a significantly reduced number of disk writes, since only the log file needs to be flushed to disk at the time of transaction commit; in multi-user environments, commits of many transactions may be accomplished with a single fsync() of the log file. Furthermore, the log file is written sequentially, and so the cost of syncing the log is much less than the cost of flushing the data pages."

What I've read on other threads is that if you are going to use a journaled file system (and it is a good idea to do so), then you should put the PostgreSQL WAL files on a separate, non-journaled partition to avoid double-logging.
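The usual trick is to move the WAL directory onto the non-journaled partition and leave a symlink behind in the data directory. Sketched here with Python's os module against temp directories standing in for the real locations (the paths and the fake segment name are invented for the demo):

```python
import os
import shutil
import tempfile

# Stand-ins for the real locations -- in practice these would be
# $PGDATA/pg_xlog and a mount point on a non-journaled (e.g. ext2) partition.
pgdata = tempfile.mkdtemp(prefix="pgdata_")
ext2_part = tempfile.mkdtemp(prefix="ext2_")

wal_dir = os.path.join(pgdata, "pg_xlog")
os.mkdir(wal_dir)
# Fake WAL segment so there is something to move.
open(os.path.join(wal_dir, "0000000000000000"), "w").close()

# Move the WAL directory to the other "partition" and symlink it back.
target = os.path.join(ext2_part, "pg_xlog")
shutil.move(wal_dir, target)
os.symlink(target, wal_dir)

print(os.path.islink(wal_dir))  # True -- the server still sees $PGDATA/pg_xlog
```

The server keeps writing to $PGDATA/pg_xlog as before; the filesystem transparently redirects those writes to the other partition.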

Anyone care to expand?

Posted by Don Baccus on
To follow up on Patrick's comments, the quote from the FAQ is out of date.  The "we are working to provide an intermediate mode" is exactly how 7.1 works now, and is how most commercial DBs work:

1. The system writes information about changed tuples (rows) to the WAL (write-ahead log, functionally equivalent to Oracle's REDO log) when it processes UPDATE/INSERT/CREATE commands.  The WAL is fsynch'd at the end of every transaction.

2. Every 30 seconds (a configurable number, BTW) the dirty pages in the shared memory buffer are flushed to disk and the data files are fsynch'd.  This is done to prevent the need for an unbounded WAL (WAL segments can be reused once the data being tracked has been safely fsynch'd on disk).
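That checkpoint interval and the amount of reusable WAL are the settings in postgresql.conf; the names below are from memory of the 7.1-era docs and the values are illustrative, so check the documentation for your version:

```
# postgresql.conf -- checkpoint tuning (illustrative)
checkpoint_timeout = 300     # seconds between forced checkpoints
checkpoint_segments = 3      # WAL segments written between checkpoints
wal_buffers = 8              # shared-memory buffers for WAL pages
```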

This increases the speed of inserts and updates by roughly an order of magnitude when the DB is subjected to typical workloads that change relatively short rows.  Inserting big chunks into large objects won't perform as well, but no one expects that to run fast anyway.

I've heard mixed results of PG on a ReiserFS partition (it only journals metadata, BTW, so won't prevent data loss on a crash).

Personally my bet's riding on RAID 1, a good UPS and a non-journaling file system ...

Posted by mark dalrymple on
> I've heard mixed results of PG on a ReiserFS partition
> (it only journals metadata, BTW, so won't prevent data
> loss on a crash)

Gee, right when I was about to make that point too. Journaling file systems just ensure that the file system metadata (directory structure, file location tables, etc.) is consistent. They don't journal the user data changes. Applications are still responsible for doing whatever is appropriate to protect their data.

Posted by Dave Hwang on

Ext3 works in several modes: writeback, journal, and ordered, which log, respectively, metadata only, both data and metadata, or something in between (Ext3 FAQ). So at least with ext3 it is possible to protect data in addition to metadata, but it seems like this is already being accomplished by WAL.
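The mode is chosen per mount with ext3's data= option. A hypothetical /etc/fstab line for a data-journaling mount might read (the device and mount point here are made up):

```
# /etc/fstab -- illustrative; device and mount point are invented
/dev/hdb1  /var/lib/pgsql  ext3  data=journal  0 2
# alternatives: data=ordered (the default) or data=writeback
```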

So using a journaling fs may not give me additional coverage over the write-ahead log, but I'm still inclined to use it so that I can avoid a costly fsck on my data partition. However, now there is the issue Roberto raised about potential double-logging if you try to use both. It seems odd that my decision to use a journaling fs on a partition would depend on whether or not an application writing to that partition has its own journaling method. I'll have to think about that more.

Posted by Don Baccus on
Yes, ext3 will journal data if you ask it to.  This makes it potentially a great choice for general desktop systems where you're using lots of applications that do simple-minded file I/O.  This set-up should be more robust than an ext2 filesystem (though don't make the mistake of thinking you don't need backups!).

But all this extra work done by the filesystem just gets in the way of an RDBMS.

An important point - filesystem journaling can't replace RDBMS REDO/WAL logging.  The write-ahead log is logging changes made to pages in the RDBMS's shared buffer pool that haven't been written to the filesystem. (Oracle uses different terminology but the same technique).

Such changes are naturally invisible to the filesystem, thus to the filesystem journal, even if it is logging data as well as metadata changes (such as is possible with ext3).

So RDBMS ACIDity requires the db server to journal whether or not the filesystem journals data.  You could, I suppose, write each and every change to the shared buffer pool to the filesystem but the performance  implications of doing so are obvious and not pleasant to contemplate.

Hopefully ext2 partitions won't go away entirely.  If journaling filesystems become the only ones available the PG group may be forced to support raw partitions, something they dearly want to avoid.

Posted by Rodger Donaldson on
A data point on file system overhead with databases: Oracle developers recently announced that they see a 30%+ performance improvement on Linux when moving from ext2 to raw partitions.

I would expect to see more overhead from ext3 (or any other journalled FS), since there are additional seeks/writes to the journals.

As far as running no-fsync + journal protecting you while gaining performance: the data corruption comes in the window where you're holding the data in RAM, before the FS layer writes it down to disk[1]; a journalling FS won't help with that, because it only commits data to the journal periodically.  You could reduce that period, but then you'll be paying the same performance penalty you do when letting PostgreSQL run fsync() constantly.
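That window is just buffered data that hasn't reached the disk; in code, it's the gap between write() and fsync(). A minimal Python illustration (the filename is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "committed.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"transaction record\n")
# At this point the bytes may exist only in the OS page cache; an OS
# crash or power loss here can lose them even though write() succeeded.
os.fsync(fd)
# After fsync() returns, the kernel has handed the data to the drive --
# though, per the footnote below, the drive's own write cache can still
# hold it hostage unless that cache is disabled or battery-backed.
os.close(fd)
```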

Also, note that Reiser only journals metadata, not data; it'll leave you with consistent filesystems, but can leave you with corrupt data on them.  It's also optimised for the case of many small files (proxy caches, news servers, etc), not for a small number of large files (RDBMS).

ext3 can journal data, but it'll cost you for performance, since data is being written twice.

Really, the best answer for data you really care about is dual power supplies, one on city power, and one running through a UPS, and disks with small caches, or caches that can be write-disabled.  If IO is your bottleneck, RAID 10 is the answer.

[1] Of course, with modern discs with 4+ MB caches that buffer writes, even calling fsync leaves a window of vulnerability, since data in the cache of the disk may be lost.  This gets worse if you use hardware RAID that doesn't have battery backup for the drives and its cache, since it's unlikely you'll save your 64 MB+ of buffered writes to disc in the event of a power loss.

Posted by Don Baccus on
Don't run PG without fsynch.  PG 7.2 greatly improves insert/update performance due to fsynch'd write-ahead logging and periodic checkpointing rather than constant data fsynching. The disks should finish writing their cached data in the event of a software crash (of the RDBMS or OS), so you are protected in this case.

However, you're vulnerable to power failures.

Ideally you get around this problem by having enough battery to allow all writes to finish before your system shuts itself down when told to by your smart UPS.  The one that never fails and the one whose battery  you remember to check every once in a while, even though that cheap, low-end local ISP colo service you're using doesn't provide 24/7 access and is in a very bad neighborhood.

That gets you around the power failure problem.

It doesn't save you from hardware problems.  This isn't terribly important in the sense that you can't protect against hardware problems in any absolute sense, but can only try to remove single points of failure and hope that redundancy keeps problems away.  Mirroring with separate controllers for each side, for instance.