Forum OpenACS Q&A: High availability of .LRN under Linux

Hi,

We are working on installing .LRN using AOLserver + PostgreSQL under Linux.
The goal is a 100% available system, so that if one of the servers (there will in fact be two) is unavailable for a certain time, the impact on the end user is kept to a minimum.
Can somebody give us a clue whether this is possible and how?

Thanks indeed. Victor

Posted by Don Baccus on
The biggest issue is replication of the database ... you should check with the PG group to see what the current status is.  Without replication you're reduced to restoring from your most recent backup if you have a failure which causes your RAID disk data to go bad (you are, of course, running RAID, preferably RAID 1 not 5, aren't you?)

On the other hand you may want to think hard about whether the odds of losing data on both sides of a disk mirror are high enough to justify keeping an extra server around for database replication (not to mention the fact that this technology is still immature in the PG world).  The chance of losing both sides of a disk mirror is really very low, and if you have hot-swap drives, recovering from the loss of one is automatic once you replace the bad drive and shouldn't involve any downtime.

Posted by Andrew Piskorski on
There is no such thing as a 100% available system, and any goal of trying to achieve that is not a real goal. Serious High Availability people talk about 99.9999% uptime, etc. But without knowing more, my best guess is that you don't need anything like that at all, because very few people do.
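
To put numbers on uptime percentages like these, here is a minimal sketch that converts an availability figure into the downtime it allows per year. It assumes a 365-day year, and the function name and the list of percentages are only illustrative.

```python
# Allowed downtime per year for a given availability percentage.
# Assumption: a 365-day year; the percentages in the loop are just examples.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year permitted at this availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.9999, 100.0):
        secs = allowed_downtime_seconds(pct)
        print(f"{pct}% -> {secs / 3600:.3f} hours/year ({secs:.1f} seconds)")
```

At 99.9999% the budget works out to roughly half a minute per year, and at 100% it is zero, which is why 100% is not a usable target.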

Just how many users are you planning to serve with that dotLRN instance? How many of them will be using the site each day? How much does downtime hurt your users? And how important, critical, or irreplaceable is the content each user contributes each day? You have to start with those questions first.

Last I heard, PostgreSQL has several asynchronous replication tools that meet some people's needs, and a better asynchronous replication tool (Slony) on the way. It currently does not have anything equivalent to Oracle's Archivelog Mode at all, but the Point in Time Recovery work currently underway should allow building something like it, and there seems to be interest in doing so.

Note that with Oracle properly set up in Archivelog mode, you can, at least in theory, guarantee that you will never lose a committed transaction. With PostgreSQL that is currently not possible. But in reality, how many PostgreSQL or Oracle installations does that really matter to, in practice? Probably not very many. (But the Archivelog feature certainly is nice to have.)

Ask yourself how much data you can afford to lose, and how much time you can afford to lose.

On the data, ask yourself: "Worst case, can I afford to lose all data in my database since my last PostgreSQL backup (typically 1 day's worth)?" If yes, great, just back up nightly like most all OpenACS users do.

On the other hand, if your answer is no, but your answer to the question, "Worst case, can I afford to lose some transactions, maybe a few minutes' or a few hours' worth?" is yes, then investigate the various current PostgreSQL asynchronous replication tools.

But in the unlikely event that your answer really is, "No, I can't afford to ever lose a transaction, no matter what.", then you should not be using PostgreSQL. You need to use Oracle (or something equivalent), and you need to put a whole lot more time, money, and thought into the problem.
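
The three questions above amount to a small decision rule, which can be sketched as follows. The function name is hypothetical, and the 24-hour cutoff is an assumption taken from the "typically 1 day's worth" of a nightly backup cycle.

```python
# Map the amount of data you can afford to lose onto the approaches
# discussed above. The 24-hour cutoff assumes a nightly backup cycle.

def minimum_protection(max_loss_hours: float) -> str:
    """Return the least demanding approach that stays within the loss budget."""
    if max_loss_hours < 0:
        raise ValueError("tolerable loss cannot be negative")
    if max_loss_hours >= 24:
        # Losing everything since the last nightly dump is acceptable.
        return "nightly backup"
    if max_loss_hours > 0:
        # Losing a few minutes or hours of transactions is acceptable.
        return "asynchronous replication"
    # No committed transaction may ever be lost.
    return "archive-log style recovery (e.g. Oracle Archivelog mode)"


if __name__ == "__main__":
    for hours in (48, 2, 0):
        print(hours, "->", minimum_protection(hours))
```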

In all cases, make sure you have good backups, and a good procedure for bringing up a new website using them, preferably on a spare machine. If you can't afford to be down for very long, there are various Linux High Availability tools that can help you fail over to a backup server, etc. Googling should find them. So far I haven't heard of anyone at all doing that sort of stuff with OpenACS, but it should be feasible.

Posted by Joel Aufrecht on
For discussion of everything but db replication, see High Availability Configurations
Posted by Steve Manning on
Don

I'm curious as to why you favour RAID 1 over RAID 5?

    - Steve

Posted by Denis Roy on
It sounds like you want to run your service from one box and use the other one in case the primary server fails. In this case, you would either need a small third box as a router with fail-over service or you should use a DNS provider who does this for you. In case your primary server is not available for some time, all requests get automatically redirected to the secondary server.
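
The redirect decision such a fail-over router makes can be sketched roughly like this. It is only an illustration of the idea, not how any particular router or DNS provider implements it: the health check is a plain TCP connect, and the names `tcp_is_up` and `choose_backend` are hypothetical.

```python
import socket
from typing import Callable, Tuple

Backend = Tuple[str, int]  # (host, port)


def tcp_is_up(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds within the timeout."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def choose_backend(
    primary: Backend,
    secondary: Backend,
    is_up: Callable[[Backend], bool] = tcp_is_up,
) -> Backend:
    """Prefer the primary; fall back to the secondary only when the primary is down."""
    return primary if is_up(primary) else secondary
```

Calling `choose_backend(("192.0.2.10", 80), ("192.0.2.11", 80))` with your own addresses (the ones here are placeholders) returns the server that should receive traffic. A real setup would also want repeated checks before switching, so that one dropped packet does not trigger a fail-over.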

Depending on your budget, you should seriously consider buying more hardware. With only two servers, you put all your bets on just one server including all software and hardware working fine. And in this case you depend on database replication which for Postgres isn't quite up to the task yet according to others here in this forum (regarding replication, I am only familiar with Oracle).

You should also take into consideration that not only the data has to be the same on both servers but also the code of your .LRN installation. This sounds trivial but depending on how many developers are updating code and who can update the code on the production servers, this can be a bit tricky sometimes. We use CVS with different tags for staging and production and if new code was tagged for production by the project manager, the production servers of our load-balanced web services get updated automatically by a cron script.

In general, not only for performance reasons but also to cover downtime due to server maintenance, I strongly recommend separating database and webserver and then running two webservers with a load balancer, and maybe even having a backup server for your database (which might even be one of the webservers in case you are on a tight budget).

There are many more things to think about. The higher the uptime that you require at all times, the more time and money you need to put into planning and hardware. And don't forget to think about performance. We will probably be able to help you a bit more specifically if you give us more information about your hardware setup, how many users you expect, and whether you can afford some more hardware.

Posted by Don Baccus on
Steve ... RAID 1 gives better performance and disks ample enough for your average .LRN installation aren't terribly expensive in the Big Picture view of things.
8: More complicated RAIDs (response to 7)
Posted by Andrew Piskorski on
Steve, RAID 5 is typically slower for writes than RAID 1. I think it is supposed to be just as fast for reads. But of course RAID 10 generally gives better performance than either RAID 1 or 5, for both reads and writes. :)

Incidentally, I've never seen any good performance comparison for more complicated RAID setups using more disks. Say you wanted one volume with the fastest IO you could get. The traditional answer is, "Buy 4 of the very fastest SCSI disks you can get, and run them in RAID 10."

But that is sort of a silly answer, because it assumes you are using only 4 disks, when in reality those 4 15,000 RPM SCSI disks might cost as much as 12 7,200 RPM IDE disks. So is there some more complicated RAID configuration that would give you faster IO using those 12 IDE disks? I'd bet there is, but I don't know.

Basically, for different types of disks (size, speed, and cost), what are the optimum RAID configurations for the various trade-off points of storage vs. speed vs. cost? If anyone's done a good study on that anywhere, I'd like to see it.
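
The storage side of that trade-off, at least, is simple arithmetic. Here is a minimal sketch of usable capacity per RAID level. It says nothing about speed, it assumes equal-sized disks and two-way mirrors for RAID 10, and the disk sizes in the example are made up for illustration.

```python
# Usable capacity for the RAID levels discussed in this thread.
# Assumptions: all disks are the same size; RAID 10 uses two-way mirrors.

def usable_capacity(level: str, disks: int, disk_gb: float) -> float:
    """Return usable capacity in GB for RAID level "1", "5", or "10"."""
    if level == "1":
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_gb  # every disk holds a full copy of the same data
    if level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb  # one disk's worth goes to parity
    if level == "10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return (disks // 2) * disk_gb  # half the disks are mirror copies
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical sizes: 4 x 73 GB SCSI disks vs. 12 x 120 GB IDE disks.
    print("4 x 73 GB, RAID 10:", usable_capacity("10", 4, 73), "GB")
    print("12 x 120 GB, RAID 10:", usable_capacity("10", 12, 120), "GB")
    print("12 x 120 GB, RAID 5:", usable_capacity("5", 12, 120), "GB")
```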

Of course if you are mounting all the disks locally, there are practical limits to how many you can stuff into one box. But once you start talking about stand-alone storage boxes talking over a HyperSCSI, iSCSI, or Fibre Channel SAN to your other servers, the possibilities start looking much more open-ended...