Forum OpenACS Q&A: Response to Five 9s reliability, how would you do it?

Posted by Jerry Asher on
Yon,
<p>
Would you be more specific?  Neither apt-get nor rpmfind returns anything for prayer.  Is that a Windows Update kind of thing?
<p>
I think you're mostly right.  The actual performance cannot be predicted, and so it is mostly a gamble, but it gives some understandable guidelines to shoot for when designing a system.
<p>
In the absence of external factors such as attacks or tractors cutting through fiber, five nines should be easy.  It's sad for our industry that it is not.  In most industries, <a href='http://www.google.com/search?q=six+nines'>six nines</a> is the quality goal folks shoot for.  I think that having a client ask for five nines is in some sense reasonable.  It shows they understand the difficulties of asking for six nines, and the importance of designing quality, reliability, and robust behaviors in from the beginning.  It should offer a yardstick with which to make various design decisions, from using one piece of equipment over another to choosing one algorithm over another.  Have we over-engineered?  Are things becoming too complex?  Or not complex enough?  That's hard to answer without having some SLA.
<p>
I also don't know what the alternative is.  Measured over a year, four nines is downtime of less than an hour.  Three nines is downtime of less than nine hours.  Two nines is downtime of less than four days.  One nine is downtime of a bit over five weeks.  These all suck.  One nine is clearly unacceptable.  Two nines doesn't feel right.  Is that a goal I really want to design for?  Three, four, and five nines all appear to be equally difficult to achieve, so which should we design for?  I'd say shoot for five nines.
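<p>
For anyone who wants to check those figures, here is a minimal sketch of the arithmetic.  It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function names are mine, purely for illustration.

```python
# Yearly downtime budget for a given number of nines.
# Assumption: a 365-day year, and "N nines" means availability 1 - 10**-N.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for the given number of nines."""
    return SECONDS_PER_YEAR / 10 ** nines


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {humanize(downtime_seconds(n))} of downtime per year")
```

Run it and the budget shrinks tenfold with each added nine, which is why the jump from "days" to "minutes" to "seconds" happens so quickly.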
<p>
I've personally seen one of American Express's uptime counters in the bowels of their rad-hard, flood-proof western regional operations center.  That counter, which expressed the percentage uptime of their IBM networks, showed a great many more than five nines, and that was before they moved their machines to a dark (as in no lights), no-humans-allowed machine room.  At the time, we were installing a network of 50 LISP machines that provided an expert-systems-based charge card assistance agent.  As part of our six nines requirement, we provided an additional 10 hot spares and Symbolics tossed in the on-site siting of their Phoenix maintenance tech.  Man, the notion of six nines, and a real-time garbage-collecting expert system based on Symbolics LISP machines.  What were we smoking?  Oh yeah, their cash!
<p>
Besides, for this project we aren't trying to replicate eBay, Amazon, or Microsoft in terms of traffic, database transactions, or really anything.  The main goal is a relatively simple service that the client wishes to have five nines reliability for.  That's a very different goal from that of the other three, who wish to support maximum traffic and maximum transactions as well as ease of use of the site.  Right?  The client values five nines over ease of use, or over all sorts of other factors.
<p>
I am concerned that five nines appears harder than six nines.  Six nines is basically uptime forever, so a client that wants that is spending money to solve every imaginable problem.  Four and five nines mean the client will accept downtime but wants to limit it to under an hour or to about five minutes a year.  Without the six nines bankroll, that actually seems harder to me than uptime forever.
<p>
However, shooting for five nines from the beginning, or six nines even, means we will have an adequate supply of pre-loaded hot spares, we will have technically competent people on hand at the ISP 24x7, we will have stressed various RAID controllers, we will have actually tried the failover and failure recovery systems, and in general, we will most likely spend the bucks ahead of time to build the tests we need to test and stress the system.
<p>
Signing up for five nines is a bit like a general burning the bridges behind his army.  It helps ensure that everyone on the team knows which battle has to be won to get home.