Forum OpenACS Development: High Availability Configurations
- Access logs: These all need to be collected, so each hot server needs to log to a uniquely named file, but all such files must be identifiable. Any server that isn't hot mustn't log within the naming scheme of hot logs. It should be easy to find the access log for any server, even warm ones, for debugging purposes.
- Error logs: Each server should have a uniquely named error log.
- Outgoing email: Warm servers should send email to the log or to a safe address, to prevent accidental spam. However, if you're not careful, a warm server pointing to a live database may do a batch run and send what should be real email to a log file. I think the Clustering params help address this by telling a server not to run sweep procs, but I haven't found the docs yet, and I don't see how to set the clustering params from config files.
- Database for warm servers: Should warm servers point to the live database? If not, to what database?
- Read-only database mode: It would be nice to keep a full site up while upgrading or otherwise offline; for sites with constant input, it may be a necessity. One way to upgrade is to make a copy and then upgrade the copy. The original site must then not accept any new data, or else that data will not be on the upgraded copy. But a "down for maintenance" sign is unimpressive. How can an OpenACS site be flagged read-only so that most functionality is still available, but incoming data is rejected without ugly error messages?
- Load balancing. I've had good success with BigIP in the past, but it's not free in any sense. What are people using to load-balance multiple web servers? I found a little app called "balance" which appears to be capable of doing the basics, but I've only used it for same-machine hot/warm, not for hot servers on multiple machines.
- Splitting db and web server: This one, at least, is easily solved and is documented. Is it providing a big performance boost? And then there are the hard things: scaling the database up or database clustering. Anybody working with this/having need for this?
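On the log-naming question above, one approach is to encode the server's role, host, and port into each filename, so hot logs are collectible with a single glob while warm logs can never match it. A minimal sketch in Python (the naming scheme itself is hypothetical, not anything OpenACS prescribes):

```python
import fnmatch

# Hypothetical scheme: access-<role>-<host>-<port>.log. Hot logs are then
# "access-hot-*.log", and a warm server can never collide with that pattern.
def access_log_name(role, host, port):
    assert role in ("hot", "warm")
    return "access-%s-%s-%d.log" % (role, host, port)

def is_hot_log(filename):
    # The collector gathers only hot logs; warm logs stay findable for
    # debugging but are excluded from the merged access statistics.
    return fnmatch.fnmatch(filename, "access-hot-*.log")
```

The same role-host-port convention works for error logs, which gives every server the unique error log asked for above.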
Load balancing (high performance) and HA (high availability) interact but really aren't the same thing at all. By "load-balancing" people generally mean using multiple front-end web server boxes all talking to one big honking RDBMS box. If you're concerned about HA, then your number one concern is, "How do I make sure that the one RDBMS doesn't go down, and what happens when it does?" Which means you've got to decide whether you're going to lose all data back to your last nightly backup or dump, or whether you're going to sign up for making sure you never lose a committed transaction, and how certain you need to be that you really, really never lose a committed transaction.
Depending on your uptime requirements, never losing a committed transaction means looking closely into things like where (and in how many redundant places) to put Oracle's archived transaction logs, storage area networks or other ways for multiple Oracle instances to read the same physical database files, master-slave databases with failover, stuff like that. And remember that if you plan to restore from backup, how long that restore takes could be a real problem too.
In all cases, whether you're concerned with high availability (HA), high performance (HP), or both, the RDBMS is typically the most complicated and thus most difficult part. AFAIK there aren't any out-of-the-box solutions to any of that.
Note that PostgreSQL currently has fewer features than Oracle for this sort of stuff (e.g., no archived transaction logs and thus no "point in time recovery"), but (unsurprisingly, being open source) has more flexibility and variety of possible tools and solutions that might be useful in the future, and more opportunity to roll your own. The Oracle stuff isn't necessarily too friendly even when it does work, though (e.g., archive log mode is instance-wide; there's no way to turn it on or off with any finer granularity than that).
Regarding scheduled maintenance, any real site should have a simple "down for maintenance, come back at time XYZ" tool no matter what. There will always be some upgrade that needs it, no matter what other fancy uptime features you have.
Making the site work properly in a read-only, limited-functionality mode during upgrades or whatever is a nice feature, but that's real development work and is probably quite site-specific in many cases. Probably nobody's going to do that unless it's a real business requirement for their site, not just an "Oh, that would be nice to have" feature. I'd be curious to know if anyone's done it in practice. The business case for some sites allows the luxury of scheduled downtime during certain non-business hours - if you can get that, grab it!
On front-end load balancers, something functionally like the Big IP router (as opposed to round-robin DNS or whatever) is the way to go, but I've been told that underneath, the Big IP is basically just standard PC hardware plus proprietary custom software. A Linux box with the right software should be able to do the same thing, and generally would be better. (E.g., back at aD, I remember people complaining that the stupid ad-hoc configuration language to tell the Big IP what requests to forward where didn't let them do what they wanted. An open source solution wouldn't have that problem.)
I'm not familiar with software to turn a Linux box into a big-IP-like front-end load balancing router, though. Presumably it is out there in some fashion. I too would like to hear what others have done there.
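The basics of a BigIP-like forwarder really are simple enough to roll yourself. Below is a minimal sketch in Python of a round-robin TCP forwarder in the spirit of the "balance" tool mentioned above: no health checks, no session stickiness, no SSL handling, just connection distribution. The backend addresses are made up for illustration:

```python
import itertools
import socket
import threading

# Hypothetical pool of hot web servers behind the balancer.
BACKENDS = [("10.0.0.1", 8000), ("10.0.0.2", 8000)]
_rr = itertools.cycle(BACKENDS)

def next_backend():
    """Plain round-robin selection; a smarter balancer would skip dead
    or overloaded backends here."""
    return next(_rr)

def _pipe(src, dst):
    # Copy bytes one way until the source closes, then close the sink.
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    dst.close()

def serve(listen_port=80):
    lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    lsock.bind(("", listen_port))
    lsock.listen(64)
    while True:
        client, _ = lsock.accept()
        backend = socket.create_connection(next_backend())
        # Shuffle bytes in both directions until either side closes.
        threading.Thread(target=_pipe, args=(client, backend)).start()
        threading.Thread(target=_pipe, args=(backend, client)).start()

if __name__ == "__main__":
    serve()
```

Because it's open code, the forwarding policy in next_backend() can be anything you like, which is exactly the flexibility the Big IP's configuration language was complained about lacking.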
- Intel 7115 hardware SSL (ebay $300)
- Cisco/Arrowpoint load balancer (ebay $500)
- Sun v100 web servers (sun $1000)
- Sun ?? database server (sun >6000)
- Oracle $$$
I prefer Sun hardware (for remote management) but Linux would work fine. The problem is the only place you'll save any money is the database server and that should really be a 64bit box if your database is more than a few gig. Plus the database server is not the place to cut corners. 64bit Linux may be too new to be considered high-availability. You'll need at least 6 webservers and 2 of everything else. Hardware cost > $20,000.
I used Oracle standby server to replicate the data to a warm duplicate setup in a separate data center. Making the switch to the backup data center was a manual process. In theory it's not that hard. We never had to do it for real. You might be able to do this with Postgresql now.
Access and error logs are copied back to a separate machine every day.
I had one machine dedicated to scheduled tasks. It was the only one that sent email, plus various other things. If that machine failed it was a manual process to designate another as master.
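The "one designated master" approach above can be expressed as a simple guard that each server runs before registering its scheduled procs. A sketch in Python (the parameter name and hostname are hypothetical; in OpenACS this would live in server config or kernel parameters):

```python
import socket

# Hypothetical parameter: the one host that runs scheduled jobs
# (sweeps, outgoing email, batch runs).
MASTER_HOST = "web1.example.com"

def i_am_master(hostname=None):
    """Only the designated master should register scheduled procs."""
    if hostname is None:
        hostname = socket.getfqdn()
    return hostname == MASTER_HOST

def maybe_schedule(job, hostname=None):
    # Failover is manual, as described above: repoint MASTER_HOST at a
    # surviving box and restart it.
    if i_am_master(hostname):
        job()
        return True
    return False
```

This also addresses the warm-server email worry earlier in the thread: a warm server is never the master, so its batch runs simply don't happen.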
Oracle standby in 9i is pretty good. Unfortunately, because ACS uses index-organized tables, you lose the read-only functionality.
With the load balancers you can do upgrades to each web server without taking the site offline. This works well for small incremental releases. Large upgrades might be different. It took 48 hours to do an 8i-to-9i upgrade, mostly because of import/export time.
The database server is the weakest link. Not because it's unreliable but because it's central to the process and difficult to have backups. 10g rac on 64bit Linux might change that.
In short: cheap front-end stuff (SSL, load balancers, web servers), and have a bunch of them. Don't worry about dual power supplies, RAID drives, etc.; have whole spare boxes. Put your money in the database server, with dual supplies, RAID, etc.
- There's no reason that smaller sites, without $20,000 hardware farms (which are of course themselves only medium, not large), shouldn't have nearly the same uptime. The biggest uptime killer for small sites I've worked on has been software error - specifically, openacs.org upgrade issues and platform upgrade issues. I think that going to an A/B rollover approach for most routine upgrades (A is live; copy A's database; upgrade B; test B; replace A with B) would help a lot, and I'd like to make it part of the standard, documented setup for production sites.
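The A/B rollover sequence described above can be written down as an ordered plan, which is a useful first step toward documenting it as standard practice. A sketch in Python that just emits the steps as a dry run; every command name and the database name are hypothetical stand-ins, not real OpenACS tooling:

```python
# Hypothetical dry-run planner for an A/B rollover: A is live, B is the
# copy that gets upgraded and tested before taking over.
def ab_rollover_plan(live="a", db="openacs"):
    standby = "b" if live == "a" else "a"
    return [
        "createdb %s_%s" % (db, standby),                # make B's empty database
        "pg_dump %s_%s | psql %s_%s" % (db, live, db, standby),  # copy A's data
        "upgrade %s" % standby,                          # run upgrade scripts on B
        "test %s" % standby,                             # smoke-test B
        "switch-live %s" % standby,                      # repoint the front end at B
    ]
```

The ordering is the whole point: A keeps serving (read-only from the copy's perspective) until B has passed its tests, so the worst case is rolling back to an untouched A.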
At this level, the problems I outlined above apply - non-conflicting logs, smooth load transferring across several servers, etc. It's almost all trivial, but if it isn't documented as a standard, then each new admin has to figure it out for themselves and each OpenACS install is different. Barry, how did you designate a single machine to run scheduled tasks? Through the kernel parameters via the admin UI?
Another benefit of using basic HA tools even on a single box is that it makes it very easy to start scaling up.
- The first step up from everything on one box is to get the web server on one box and the database on another. This is already documented for PostgreSQL - anybody have similar docs for Oracle?
- The next step up is multiple web servers, as has been described above. Arrowpoint and BigIP have both been used to load balance OpenACS and I'm experimenting with balance, which appears to offer at least 50% of what BigIP does (no heartbeat or smart load balancing, no SSL-specific stuff, but has failover and session-maintaining load-balancing). (BigIP is now mostly on custom, proprietary hardware, I understand. One reason is to get good performance for encrypted connections.) Is this something Apache, pound, or even AOLserver 4 could also do?
- The fourth level would be multiple databases; is anybody running OpenACS with multiple database servers? What's the largest PostgreSQL database with acceptable performance? Monitoring applies across all levels; we have uptime as one tool. Of course inittab/daemontools is a given. A per-server keepalive tests each server for actual responsiveness and kills the OpenACS process if it becomes too unresponsive - crude but effective. Other monitoring tools?
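The crude-but-effective keepalive mentioned above amounts to: probe each server with a timeout, count consecutive failures, and kill/restart past a threshold. A minimal sketch in Python (the URL, limit, and restart hook are all hypothetical; the restart would typically just kill the process and let daemontools respawn it):

```python
import urllib.request

FAIL_LIMIT = 3  # consecutive failed probes before we restart the server

def check_alive(url, timeout=5, opener=urllib.request.urlopen):
    """One keepalive probe: any response counts as alive; a hung or
    refused connection counts as dead."""
    try:
        opener(url, timeout=timeout)
        return True
    except Exception:
        return False

def sweep(url, failures, restart, opener=urllib.request.urlopen):
    """Run one probe, tracking consecutive failures across calls.
    Returns the updated failure count; calls restart() past the limit."""
    if check_alive(url, opener=opener):
        return 0
    failures += 1
    if failures >= FAIL_LIMIT:
        restart()  # e.g., kill the AOLserver process; supervisor respawns it
        return 0
    return failures
```

Run sweep() from cron or a watchdog loop every minute or so; requiring several consecutive failures avoids restarting a server that was merely slow once.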
However, perhaps the most important (and overlooked) piece of high availability is how long it takes to recover from a catastrophic failure. My goal was a max of 4 hours. The longer you have, the cheaper things get. If you don't need a 2nd data center you can cut the cost in half. Unfortunately, without one it's difficult to calculate max downtime.
I remember seeing a project which provided a really simple replicator for PostgreSQL, but can't find the link now. It was basically a proxy which emulates the PostgreSQL backend and sends off the requests to 2 separate PostgreSQL instances.
Is pound now stable with streaming pages? It used to break OACS.
From the Pound man page:
HIGH-AVAILABILITY

Pound attempts to keep track of active back-end servers, and will temporarily disable servers that do not respond (though not necessarily dead: an overloaded server that Pound cannot establish a connection to will be considered dead). However, every alive_check seconds, an attempt is made to connect to the dead servers in case they have become active again. If this attempt succeeds, connections will be initiated to them again. In general it is a good idea to set this time interval as low as is consistent with your resources in order to benefit from resurrected servers at the earliest possible time. The default value of 30 seconds is probably a good choice. Set the interval to 0 to disable this feature. Clients that happen upon a dead back-end server will just receive a 503 Service Unavailable message.

The ha_port parameter specifies an additional port that is used only for viability checks: if this port is specified in a BackEnd directive, Pound will attempt periodically (every Alive seconds) to connect to this port. If the port does not respond, the server is considered dead.