Basically the customer wants a network layer in which they can
press the "enter" key on the client, and moments later have the
client be completely destroyed, or have any intermediary suffer a
power failure or software crash, and still have the system as a
whole lose no data.
There's always going to be some window of time, whether it is
microseconds or minutes, between the moment the enter key is pressed
and the moment the transaction is in a secure state. All you can do
is make sure no half-transactions get through and that the window is
as small as possible.
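One common way to get the "no half-transactions" property is a write-ahead log: only acknowledge the client after the complete record is fsync'd to stable storage. This is a minimal sketch of that idea (the function name and record format are my own, not from any particular product); a crash before the ack means the client can retry, a crash after it means the data survived.

```python
import json
import os
import tempfile


def append_committed(log_path, record):
    """Append a complete transaction record and fsync before acknowledging.

    A crash before fsync returns means the client never got an ack, so the
    transaction can safely be retried; a crash after it means the record is
    on stable storage. Either way, no acknowledged half-transaction.
    """
    line = json.dumps(record) + "\n"
    with open(log_path, "a") as f:
        f.write(line)          # stage the full record
        f.flush()
        os.fsync(f.fileno())   # the "secure state": forced to disk
    return "ack"
```

The fsync is the whole point: buffered writes alone can sit in memory for seconds, which is exactly the window you're trying to shrink.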
How would a two-phase commit system handle a failure in the network
layer?
Say you're committing to database servers db1 and db2 from client1
and client2 (it doesn't have to be a database server, but that's the
easiest way for me to tell it). A failure in the network layer
causes a split between area1 and area2, so that client1 can only see
db1 and client2 can only see db2. Then the split heals, but db1 and
db2 are no longer in sync, and each has commits that the other
doesn't.
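For what it's worth, textbook two-phase commit avoids the divergence by refusing to commit at all during the split: every transaction must be prepared on *both* databases before either commits, so an unreachable participant forces an abort. You trade availability for consistency. A toy sketch (class and method names are mine, just to illustrate the protocol):

```python
class Unreachable(Exception):
    pass


class Participant:
    """A toy replica that can vote on and apply transactions."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.committed = []
        self.staged = None

    def prepare(self, txn):
        if not self.reachable:
            raise Unreachable(self.name)
        self.staged = txn  # vote "yes": txn durably staged, can't renege

    def commit(self):
        self.committed.append(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None


def two_phase_commit(txn, participants):
    """Phase 1: every participant must vote yes; phase 2: commit everywhere.

    If any participant is unreachable during prepare, the coordinator
    aborts, so the replicas never silently diverge -- at the cost of
    blocking all commits while the partition lasts.
    """
    try:
        for p in participants:
            p.prepare(txn)
    except Unreachable:
        for p in participants:
            if p.reachable:
                p.abort()
        return "aborted"
    for p in participants:
        p.commit()
    return "committed"
```

During the split, client1's transaction aborts because db2 never votes; after the split heals, commits go through and both logs stay identical. (Real 2PC also has its own failure mode: if the coordinator dies between the phases, prepared participants are stuck holding locks until it recovers.)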
(Syncing file systems can have similar problems. Say two people
edit the same file and both save it during the split. Which one wins?)
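One standard answer (not from the discussion above, just a common technique) is a version vector per file: each replica counts its own edits, and on sync you compare the counters. If one side strictly dominates, it wins; if each side has edits the other hasn't seen, there is no automatic winner and the sync layer can only flag a conflict for a human or a policy to resolve.

```python
def merge(local, remote):
    """Compare per-replica version vectors to decide which save wins.

    Each value is a dict like {"vv": {"alice": 2, "bob": 1}} counting
    edits made on each replica. Concurrent edits (each side ahead on
    some counter) are a genuine conflict, not a silent overwrite.
    """
    local_ahead = any(
        n > remote["vv"].get(rep, 0) for rep, n in local["vv"].items()
    )
    remote_ahead = any(
        n > local["vv"].get(rep, 0) for rep, n in remote["vv"].items()
    )
    if local_ahead and remote_ahead:
        return "conflict"
    return "remote" if remote_ahead else "local"
```

Last-writer-wins by timestamp is simpler, but it quietly throws away one person's edits, which is exactly the data loss the customer said they couldn't accept.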
Our approach has just been a single powerful machine with RAID and
that kind of redundancy. I can't tell you whether it's a good or bad
solution, as that machine and that network haven't been under much stress.