Forum OpenACS Q&A: Re: Business grade OACS hosting/managed services?

Posted by Mike Sisk on
I was reluctant to respond to this since I don't think blatant self-promotion belongs on these forums, but this is a subject that comes up from time to time and it's something we know a fair bit about.

Basically, furfly's core business is what you're asking about. We've been doing this for 5 years now and host high-bandwidth ACS sites for folks like The New York Review of Books, MIT's Archnet Project, and Edward Tufte.

We're expensive; our hosting prices start at $500 a month, depending on needs. But you can't provide enterprise-class service on the cheap.

Ok, enough self-promotion. Here are some things that, in our experience, are necessary for high-performance and high-availability hosting:

First, you gotta have a good network. It doesn't make any difference what kind of server hardware you have if your network is junk. You have to go up the food-chain as far as you can afford. Leasing space from your brother-in-law who is using space from an ISP that's using space from a bandwidth broker who's leasing from a real tier 1 host just ain't gonna work. If any of those folks in that chain can't make their monthly payment you're screwed.

We've been with Exodus (actually owned by Savvis now) since the beginning and deal directly with them. And while there's no guarantee that Exodus won't run out of money and lock the doors on any given day, there's at least some comfort in knowing that if that happens, the sites of folks like Yahoo!, Google, Slashdot, and Microsoft will go down, too. [Actually, the first time Exodus went into bankruptcy we were sent a memo that President Bush had signed off on Exodus being an "Important Infrastructure Utility" or something and that the US government would guarantee the continued operation of the datacenters.]

If you deal with a tier 1 host like Exodus, Level 3, or XO, a lot of little problems go away, too. Power will always work no matter what (the Exodus datacenter a few blocks from Ground Zero continued to operate during 9/11), and you'll have strong physical security, air conditioning, and fire suppression. The actual network is likely to be good, with multiple redundant connections.

In the 5 years we've been with Exodus (and several years of experience with them before we started furfly) we've never had a systemic power or network failure. None.

Now, after your network and physical space are taken care of, you need to look at hardware.

First, you need a good network switch if you're not being provided one. And a spare. And these need to be enterprise-class since they'll be running and loaded 24/7. Cisco is good but we've been happy with the HP ProCurve series. Don't go cheap here and get a hub from CompUSA -- you'll regret it. If you have lots of money and need more stress in your life you can get fancy and expensive highly-redundant units that monitor each other. Otherwise keep a spare onsite that you can use if the primary fails. This is important as the switch is a single point of failure -- if it dies your network is off-line.

What servers you need really depends on your workload. We like the Dell rack mounts with redundant power supplies, internal SCSI disks on hardware RAID, running Linux. Sun is good, too. I'm running one Apple OS X server right now as a test -- it shows promise.

OS doesn't matter much, either, as long as it runs the software you need and you know how to manage it. Most of ours are Red Hat Linux, the newer systems using their Enterprise offerings. FreeBSD, OpenBSD, OSX, and Solaris will all do the job.

I like to keep things simple. You can do things like load balancing, system failover, and virtual servers, but all these things add complexity -- you need to ask yourself how important they are to the task at hand and whether the added complexity is worth it. Photo.net recently upgraded from their quad-CPU Sun box to a new Dell machine with externally mounted RAID arrays. You can go over there and read how the added complexity is working for them.

Keep spare parts around, too. You never know when something minor like a fan will fail and cause a system to overheat and crash.

Monitoring is next. There is a whole range of products out there, from the disk space checking scripts in Red Hat to more in-depth packages like NetSaint (now Nagios) or Big Brother. You should pick one, use it, and have it send problems to a pager or cell phone you give to your sysadmin.
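The disk-space-checking end of that range can be sketched in a few lines. The 90% threshold and the plain print-an-alert behavior below are illustrative assumptions; a real setup would hand the alert to whatever notification path your monitoring package pages the sysadmin with:

```python
import shutil

# Illustrative threshold; tune for your environment.
THRESHOLD = 0.90

def check_disk(path="/"):
    """Return (used_fraction, over_threshold) for the given mount point."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    return used, used >= THRESHOLD

if __name__ == "__main__":
    used, over = check_disk("/")
    if over:
        # In a real deployment this is where the pager/cell-phone
        # notification would fire.
        print(f"ALERT: / is {used:.0%} full")
    else:
        print(f"OK: / is {used:.0%} full")
```

Run from cron every few minutes, even something this small catches a filling disk before it takes the database down.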

Speaking of your sysadmin, the single most important thing is having one who knows what to do. What makes a good sysadmin is a topic for another time, but in general, there seems to be an inverse proportionality between how good they are and the number of certifications they have. What's important is not what they know, per se, but how good they are at figuring out how to solve problems, especially during an emergency.

Posted by Jesse Wendel on
I agree with Mike about everything he said, and especially about, "in general, there seems to be an inverse proportionality between how good they are and the number of certifications they have.  What's important is not what they know, per se, but how good they are at figuring out how to solve problems, especially during an emergency situation."

I manage 250 servers professionally for the largest non-municipal (NYSE:PSD) power company west of the Mississippi and north of San Francisco.  And when we screw up, the lights potentially go out over a third of Washington State.

When we're hiring - and I know, because I'm the first person to read the incoming resumes - the LAST thing we care about is what certifications someone has. In fact, certain certifications, or having too many of them, actually count against you. It tells me you're all fluff and no work.

The one thing we care about most is what actual experience someone has in a large datacenter, with demonstrated competency running projects and software similar or identical to ours.  If they don't have at least two years with at least 50 servers, we toss the application right then.

After that, we're especially looking for three things:

1. The ability to deliver the goods no matter what (accountability/ownership).

2. The ability to see the big picture, as in: after appropriate training, could the other senior team members all go on vacation for a month and know that when they come back, things will still be running and everything will be okay? Can they/will they always speak truth to power? (integrity/responsibility).

3. And this is always a deal-breaker: do they fit really well into OUR existing team? There are only 8-10 of us at any given time on the operating systems team. We have each other's back. In the past, there have been a couple of times when we've hired someone who didn't quite fit in or thought s/he was too good for the rest of us; we now take exceedingly great pains to pick real team players.

A great sysadmin makes up for a lot of failures in the datacenter.  Not to say you don't want to choose a good datacenter to host your server.  But if you don't have chemistry with the team who is going to host your server, I'd look elsewhere.

Caroline mentioned earlier - https://openacs.org/forums/message-view?message_id=171458 - that she's moving a site from ETP to BCMS this weekend.  That's the site I'm producing.  We'll be going live later this evening, so in the next day or so, Caroline and I will announce what we've been up to the past 2.5 months, and invite y'all to come take a look.

In the meantime, I can tell you that I host the site at www.zill.net, and I've been very satisfied with Patrick Giagnocavo's service and performance.