Forum OpenACS Q&A: Five 9s reliability, how would you do it?

1: Five 9s reliability, how would you do it?

Posted by Jerry Asher on 11/27/01 08:03 AM

I have a client whose client is specifying five nines reliability. That's five minutes down per year (scheduled and unscheduled.) Presumably, they are willing to pay for the machines and infrastructure it takes to achieve this, and there is a substantial penalty for less than five nines performance.

I plan to address the five nines with:

load balancers and clustered AOLservers (thanks for the BigIP recommendation)
RAID
journaling file systems
geographically distributed primary and backup servers

What else should I consider? In particular, I am thinking of using Solaris rather than Linux/OpenBSD if only because AOL uses Solaris, so I figure, that's where it's been tested most. Do you feel that is a reasonable justification, or merely cargo cult sysadminning?

I would love to choose OpenBSD for various reasons, but "what do we know of AOLserver reliability on OpenBSD?" In reality, apart from our own personal experiences, what do "we" really know of AOLserver reliability in general on any platform?

I am also considering using different hosting companies for the primary and secondary server to reduce risks of hosting company implosion, but I wonder how that might reduce the ability to hire a megabuck network engineer to implement sophisticated (hopefully useful) network strategies.

My greatest concern is attacks, especially denial of service attacks. I don't see how we can detect and "eliminate" a DOS within five minutes. What are the best strategies in planning for and dealing with network threats? How might you design a system to withstand (be reliable and present) during a denial of service attack?

And what else am I overlooking, what else should I be doing?

2: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Jun Yamog on 11/27/01 10:02 AM

I think you can do away with the RAID and journaling file systems for the web farm. It is load balanced, so you should treat each server just like a hard disk, as a replaceable unit. Using Linux for the web farm is therefore OK; besides, I don't think Solaris has any advantage over Linux when it comes to running AOLserver.
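The "treat each server like a hard disk" idea can be put in rough numbers. What follows is only a minimal sketch: it assumes server failures are independent and that the load balancer in front of the farm never fails, neither of which is claimed anywhere in this thread, and the function names and the 99% per-server figure are purely illustrative.

```python
def farm_availability(per_server: float, servers: int) -> float:
    """Probability that at least one of `servers` identical servers is up.

    Assumes failures are independent (an idealisation).
    """
    if not 0.0 <= per_server <= 1.0:
        raise ValueError("per_server must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return 1.0 - (1.0 - per_server) ** servers


def servers_needed(per_server: float, target: float) -> int:
    """Smallest farm size whose combined availability reaches `target`."""
    if not 0.0 < per_server <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < per_server <= 1 and 0 <= target < 1")
    n = 1
    while farm_availability(per_server, n) < target:
        n += 1
    return n


# Illustrative only: web servers that are each up 99% of the time.
print(servers_needed(0.99, 0.99999))
```

Under these assumptions a handful of mediocre boxes reaches a high combined figure quickly, which is why the per-server RAID matters less than whatever sits in front of and behind the farm.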

Invest heavily in the db machine. Solaris + Oracle should do OK. Buy the baddest Sun box you can. Get the best people to keep the site in shape. Investing heavily in the initial deployment is not as important as looking at how it will run once it's deployed. Policies are important too, such as when to upgrade and how often log files are checked. People are the key component: your Sun box with an average admin can be beaten by a good admin with only Intel boxes.

Code movement/management will also be key, since new code can bring down the site.

Maybe you can check back with the client after your study. A lot of these clients demand this outrageous uptime when they really don't know what they are talking about, especially when the bottom line comes out. For example, for 10 min downtime you spend 100K; for 5 min downtime you spend 500K. Once the client sees the figures they will decide what is good for them. There is no unlimited budget, especially these days when budgets have shrunk.

3: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Bruno Mattarollo on 11/27/01 10:07 AM

Hello Jerry

On the hosting company side, I would recommend that you take a look at COLT. It's a European company, so if you are US based it might be an option for having a "mirror" there. Their SLA provides a very high degree of reliability. We, at Greenpeace, will be working with them.

4: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Bruno Mattarollo on 11/27/01 10:10 AM

Oh, one more thing: if you are going to use Oracle, then you should, without any doubt, take a look at Oracle Parallel Server (now called Oracle Real Application Clusters, I guess). It costs a lot of money, but reliability can't come from hardware alone; make sure you are using clusters with automatic failover and so on.

5: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by David Walker on 11/27/01 02:03 PM

I think DOS attacks are among the most difficult to deal with since they can
be accomplished with spoofed packets. The easiest solution is to not be
where the attack is hitting, something you might accomplish with your
geographical distribution plan.

The more bandwidth your ISP has the more options you have and the more
work that must be done to actually deny your service.

Many attacks become much easier to weather if you can deny them at a
firewall. If you discover you are under attack from ICMP packets or packets on
a non mission critical port you can deny them and that will reduce the traffic
on your network (attack packets will still come in but no replies to them will
go out).

I'd suggest you don't offer 5 nines from the launch of the site but actually
starting at a later date so you can get the kinks worked out if any come up
initially.

Definitely look at every router, firewall, or piece of equipment between the
internet and your servers and make sure it is redundant.
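One way to see why every device in the path matters: components in series multiply their availabilities, so a single non-redundant router caps the whole site. This is a rough sketch that assumes independent failures; the device names and the 99.9% figures are made up for illustration and are not measurements of any real equipment.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain in which every element must be up."""
    return prod(availabilities)


def redundant_pair(a: float) -> float:
    """Availability of two identical, independent units where one suffices."""
    return 1.0 - (1.0 - a) ** 2


# Hypothetical per-device availabilities (illustrative only).
router, firewall, balancer = 0.999, 0.999, 0.999

# One of each device between the internet and the servers.
single_path = series(router, firewall, balancer)

# Every device duplicated, as suggested above.
dual_path = series(*(redundant_pair(a) for a in (router, firewall, balancer)))

print(f"single path: {single_path:.6f}")
print(f"dual path:   {dual_path:.6f}")
```

In this toy model the unduplicated chain is worse than any one of its parts, while the duplicated chain is better than all of them.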

I'm curious how successful Oracle is at running redundant database servers.
On Postgres my strategy is to use redundant hot swappable disks and have
a backup computer available in case of trouble. Try to accomplish 5 nines
with that strategy and you need a reasonably intelligent, educated person
sitting next to the servers at all times. (Which you might need anyway)

Even if your ISP has a super-duper redundant power system (and they should
if you want this level of reliability) add your own UPS as well to cover any
incidents that may arise. I've heard one suggestion of having redundant
power supplies, each connected to a different source of power; that sounds
good to me. If one power supply is connected to your UPS and one to the
ISP's power, then you can handle a power problem from your ISP or you can
replace your UPS.

How does a dynamic site handle having 2 geographical locations? It seems
very easy to end up with unsynced sites no matter what sync method you
would choose.

Make sure the customer knows that, just because they can't see the site
doesn't necessarily mean that it is down. (But probably does if you have
redundant geographical locations)

6: DoS prevention possibility (response to 1)

Posted by good bye on 11/27/01 02:40 PM

I know that Arbor Networks is a company working on the DoS
problem. I don't know if their software is proven (or is finished)
but I do know they have some serious genius level security/
network hackers working on the system. You might want to drop
them a note to see what they offer, if DoS is your greatest
concern.

http://www.arbor.net

(note: I am not affiliated with Arbor Networks in any way)

7: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Jeff Barrett on 11/27/01 03:21 PM

A couple quick things.

Besides the BIGip take a look at the Alteon AD3, nice piece of machinery. (These load balancer companies have been going out of business lately so make sure you are working with a solid company. That goes for co-location places as well.)

Factor in the cost of setup and support by the people you co-locate with; they usually have the fastest response times. We are looking at Above.net to manage our networks (since the machines are there now), but we quickly discovered that we have to pay for them to 'set up' the network machines according to their standards. This means we get charged from a couple hundred dollars per network device to over 4,000 for some of the more advanced devices (redundant load balancer and firewall). I think for the one install we have, that was over 50k right there, and different colocation facilities deal with different hardware and different setups.

We had talked to a company (I forget their name) that was using a device called NetArcs (could be wrong on the spelling) to do active network monitoring. I think it was just a set of software that monitored the network and the machines and would 'actively adapt' to problems presented by attackers and malicious code.

I wonder how hard it will be to get two colocation companies to deal with each other on setting up the necessary infrastructure for redundancy. It is hard enough at times to get that done within one company.

8: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Adam Farkas on 11/27/01 03:30 PM

If the firm really wants five-nines, has an alarming amount of money and is willing to spend it, you may want to speak with Arbor Networks -- http://www.arbornetworks.com/standard?cid=4&tid=6

This product specifically addresses DoS attacks. It is not cheap.

9: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by mark dalrymple on 11/27/01 03:39 PM

If you look at OPS / RAC (Oracle clustering), beware that, aside from the costs and complications involved, the clustered machines need to be in close physical proximity, since they share the same disk storage (something to ponder in case someone flies a plane into your colo facility). If you're going to spread things geographically, OPS/RAC really won't help. You can do the standby database thing for that: if you can partition the databases across two machines, have each serve its own database and also be the standby for the other. If one goes down, activate the standby. I've heard of some folks that use Sun's clustering technologies rather than Oracle's for doing database failover.

5 9's is pretty serious stuff. I'd definitely find an expert in the field and pay the five grand for a day of their time.

10: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Jon Griffin on 11/27/01 07:13 PM

DDOS can't really be stopped quickly, no matter what Arbor says at this time, as each point must be stopped one at a time. Also, that is most likely the ISP's domain, as they are the ones with the routers, and therefore you are at their mercy.

Do you have to guarantee connection latency or just uptime? These are 2 very different things. DDOS is in reality an SLA problem and has nothing to do with five 9's in my eyes.

11: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Tom Jackson on 11/27/01 07:30 PM

You either have an idiot for a client, or a very wealthy and serious one. Don't invest too much time before you determine which it is. Assuming they are serious, check out the Clustra database. Their website is http://www.clustra.com/ and you will see that their database architecture is really designed for the serious client. For what you are talking about Oracle is 'old' technology. Also, Clustra will likely be cheaper than Oracle, and you can download it and use it right now!

Also you might broach the subject of what they mean by 5 nines. Do they want a web service up and 100% usable 99.999% of the time? Can the service tolerate a maintenance mode (thinking of eBay)?

Anyway, if the data is what needs to be essentially always available, then the database is what has to work flawlessly. Putting a rack full of webserver clones together is the easy part.

As far as a DOS attack, I don't see how you can control or plan for all future attacks to the point of a contract guarantee. What you can do is hire a team of network engineers to wait around for the next DDOS and respond quickly (thinking of Yahoo).

12: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Jerry Asher on 11/27/01 09:37 PM

Hey folks,

Thanks for all the responses. A few follow-ups and then an invitation.

While this will be an AOLserver project, the ACS component at this time is so small in terms of delivered functionality as to be almost non-existent. The ACS component is not involved in the five-nines, so actually this site will be using PG for the ACS, and some ODBMS to be named to deliver its primary functionality which may not even be delivered via HTTP. So we are not at this moment interested in Oracle at all, although I will take a look at Clustra.

One differentiator of ODBMS vs. RDBMS websites and brochures is that many ODBMSs claim to be specially configured for telecomm or similar applications that need "embedding", "low dba maintenance", "24x7xForever" and highly reliable and tolerant distributed architectures. I'm not sure what that means apart from brochureware, but I'll know more later. These are certainly the client's perceived requirements.

I am still not sure about the client's client. The claim is they are conservative, very serious, and reasonably well-heeled. I submit an invoice tomorrow, and that will be one test.

Invitation: I claim to be a software engineer, not a network engineer, not a database engineer, not a security specialist, and not a sysadmin. I am in charge of the servers. Part of my responsibilities is to put together a team of consultants that can put this thing together: determine requirements, design the system, participate in reviews, vet vendors, create software layers, build the thing, test it well, install it, etc. At this moment, it looks as though participants will be needed for a day or three at a time at several stages throughout the project. I may have more time for another deep, C/C++ based, AOLserver hacker, especially someone familiar with the hoops needed to get AOLserver to speak in multiple tongues simultaneously. I believe there is some serious AOLserver hacking to be performed.

If you are interested in participating anywhere from a few hours to a few days to a few weeks, please send me an email and include rate information, your specialty, and some details of your experience. I ain't promising anything of course, I am still trying to determine which of Tom's descriptions fits.

And please, do keep telling us all about what it takes to provide five-nines. I find this very interesting and educational, and I believe/hope it's an important topic for many OpenACS systems.

13: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Yon Derek on 11/28/01 09:44 AM

Not that it's terribly on-topic or helpful, but what you need the most is prayer.

If anyone is seriously claiming that they can guarantee five nines then they are seriously unserious. Not that it's impossible, but it's simply a gamble, and the house always wins, and you're probably neither the house nor Daniel Ocean. Amazon can't do it, eBay can't do it, Microsoft can't do it, and they all have loads of cash and loads of smart people, and you may bet that they try real hard.

There are too many things you don't know (like: what will be the next vulnerability found in the software you choose to implement the solution) and too many things you don't control.

14: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Jerry Asher on 11/28/01 10:33 AM

Yon,

Would you be more specific? Neither apt-get nor rpmfind return anything for prayer. Is that a windows update kind of thing?

I think you're mostly right. The actual performance cannot be predicted, and so it is mostly a gamble, but it gives some understandable guidelines to shoot for when designing a system.

In the absence of external factors such as attacks or tractors cutting through fiber, five nines should be easy. It's sad for our industry that it is not. In most industries, six nines (http://www.google.com/search?q=six+nines) is the quality goal folks shoot for. I think that having a client ask for five nines is in some sense reasonable. It shows they understand the difficulties of asking for six nines, and the importance of designing quality, reliability, and robust behaviors in from the beginning. It should offer a yardstick with which to make various design decisions, from using one piece of equipment over another to choosing one algorithm over another. Have we over engineered? Are things becoming too complex? Or not complex enough? That's hard to answer without having some SLA.

I also don't know what the alternative is. Four nines is downtime of less than an hour a year. Three nines is downtime of less than nine hours. Two nines is downtime of less than four days. One nine is downtime of a little over five weeks. These all suck. One nine is clearly unacceptable. Two nines doesn't feel right. Is that a goal I really want to design for? Three, four, and five nines all appear to be equally difficult to achieve, so which should we design for? I'd say shoot for five nines.
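The downtime budgets can be checked with a few lines of arithmetic. This is a small sketch, assuming a 365-day year and that every kind of outage (scheduled or not) counts against the budget; the function name is just for illustration.

```python
from datetime import timedelta

YEAR = timedelta(days=365)


def allowed_downtime(nines: int) -> timedelta:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return YEAR / 10 ** nines


for n in range(1, 7):
    print(f"{n} nines: {allowed_downtime(n)}")
```

By this reckoning five nines leaves roughly five and a quarter minutes per year, and six nines about half a minute.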

I've personally seen one of American Express's uptime counters in the bowels of their rad-hard, flood-proof western regional operations center. That counter, which expressed the percentage uptime of their IBM networks, showed a great many more than five nines, and that was before they moved their machine room to a dark (as in no lights), no-humans-allowed machine room. At the time, we were installing a network of 50 LISP machines that provided an expert-systems-based charge card assistance agent. As part of our six nines requirement, we provided an additional 10 hot spares, and Symbolics tossed in the onsite siting of their Phoenix maintenance tech. Man, the notion of six nines, and a real-time garbage collecting expert system based on Symbolics LISP machines. What were we smoking? Oh yeah, their cash!

Besides, for this project we aren't trying to replicate eBay, Amazon, or Microsoft in terms of traffic, database transactions, or really anything. The main goal is a relatively simple service for the client that the client wishes to have five nines reliability for. That's a very different goal than that of the other three, who wish to support maximum traffic and maximum transactions as well as ease of use of the site. Right? The client values five nines over ease of use, or over all sorts of other factors.

I am concerned that five nines appears harder than six nines. Six nines is basically uptime forever, so a client that wants that is spending money to solve every imaginable problem. Four and five nines means the client will accept downtime but wants to limit it to under an hour or under five minutes. Without the six nines bankroll, that actually seems harder to me than uptime forever.

However, shooting for five nines from the beginning, or six nines even, means we will have an adequate supply of pre-loaded hot-spares, we will have technically competent people on hand at the ISP 24x7, we will have stressed various raid controllers, we will have actually tried the failover and failure recovery systems, and in general, we will most likely spend the bucks ahead of time to build the tests we need to test and stress the system.

Signing up for five nines is a bit like a general burning the bridges behind his army. It helps ensure everyone on the team knows which battle has to be won to get home.

15: DDoS attacks (response to 1)

Posted by Petru Paler on 11/28/01 11:30 AM

Regarding the DDoS problem -- as the other folks pointed out, it's very hard (if not impossible) to fix completely.

I didn't look at the Arbornet stuff (ironically enough, their site is down), but here are a couple of things learned from experience. I live in Romania, which ranks near the top both as a source and as a target of DDoS attacks, and I do consulting for a local ISP.

First, there is a very easy and straightforward solution for stopping a DDoS attack. Unfortunately, it also isolates the target from the Internet. It's very simple and can be done either manually or automatically: whenever a DDoS attack is detected (usually by noticing that the amount of inbound traffic is much higher than usual, and that the source IPs look random), observe the attacked IP and insert a null route for it (with a /32 BGP prefix) in your local routing table (this assumes you have your own AS, or that you have a cooperative ISP/colo provider). This will propagate very quickly (BGP is a fairly low-bandwidth protocol) and, in a matter of minutes, no one on the net will be able to reach the attacked IP -- all packets will be dropped by the first router with the full BGP routing table that they reach. This happens because routers always pick the most specific prefix, and your IP/32 is as specific as it gets. So, the attacker's flood drones will go on, but the packets will be dropped by their local ISP and no one else is affected. Except for your site of course, which stays down until you remove the null route.
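The "most specific prefix wins" rule that makes the /32 null route work can be illustrated with a toy routing table. This is a sketch of longest-prefix matching only, not of BGP itself; the table layout, the next-hop labels, and the addresses (taken from the 203.0.113.0/24 documentation range) are all assumptions made for the example.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (network, next_hop) entry with the longest matching prefix.

    Returns None when no route covers the destination.
    """
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical routing table: a default route plus the colo's own /24.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "upstream"),
    (ipaddress.ip_network("203.0.113.0/24"), "colo"),
]

victim = "203.0.113.10"
before = best_route(routes, victim)

# Insert a /32 null route for the attacked address only.
routes.append((ipaddress.ip_network(f"{victim}/32"), "null"))
after = best_route(routes, victim)

# A neighbouring address in the same /24 is unaffected.
neighbour = best_route(routes, "203.0.113.11")

print(before[1], after[1], neighbour[1])
```

The attacked address now resolves to the null route while the rest of the /24 keeps its normal path, which is the trade-off described above: the flood stops hurting everyone else, and the target itself goes dark.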

To actually keep the site up, you need to have a couple of different locations (don't forget about distributed name servers!). How many? It depends on how badly people will want to take your site down. Most script kiddies can get enough flood drones to take down a 10 Mbps site. Not that many can take down a 100 Mbps site. Only a well determined group can take and keep down 4 or 5 100 Mbps sites. Of course, this implies *totally* independent sites. For example, Exodus in CA and Exodus in NY don't count as two different sites because they are both in Exodus' routing AS, so if someone attacks their routing infrastructure (or, more likely, they screw something up), both servers will go down at the same time.

Depending on how much money you have, you might consider having servers in California, NYC or DC, London or Amsterdam, Hong Kong, Tokyo and Melbourne or Sydney. That would be 6, and it would be quite a challenge to DDoS all of them at the same time.

16: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by MaineBob OConnor on 11/28/01 07:32 PM

Here is a relevant article from this week's InfoWorld:

Always-on switches
James R. Borck
Enterprise Strategies

http://www.infoworld.com/articles/op/xml/01/11/26/011126opborck.xml

Here is an interesting part:

...IBM, for example, is leveraging its clustering experience in low-cost Linux systems with the release of the Linux eServer Cluster 1300 due out this week.

The eServer Cluster comes preconfigured and pretested with Red Hat Linux, IBM's Cluster System Management software for easy administration, and a global file system. And it supports a variety of interconnect and expansion possibilities.

If yours is one of those do-it-from-scratch shops, take a look at IBM's Cluster Starter Kit for Linux, freely downloadable from the IBM alphaWorks Web site (www.alphaworks.ibm.com). Sporting the IBM Cluster System Management software, the kit allows you to configure Linux clusters with as many as six nodes....

Gee... would this work well with OpenACS?

-Bob

17: Response to Five 9s reliability, how would you do it? (response to 1)

Posted by Andrew Piskorski on 11/28/01 07:42 PM

Mark, regarding the need for clustered Oracle machines to be in close physical proximity, since they share the same disk storage: taking out the local SCSI disks and replacing them with a Fibre Channel interface to networked storage, something like Storage Networks, would let you put the machines as far apart as you want, right?