Forum OpenACS Q&A: big problems with mydomain.com?

I know this is not the best place to ask, but since mydomain.com is quite a large (Open)ACS site: does anyone know what is going on over there?
Posted by Steffen Tiedemann Christensen on
You might want to look at http://216.110.167.117/viewtopic.php?t=1049

/stc

Posted by Tom Jackson on

It is apparently a firewall issue, and the engineers are working on fixing it. I wish I knew more; I have a few domains with those guys myself, and I wrote a lot of the software for that service. But how can one problem bring down all four DNS servers plus the mydomain interface? You would think things would be spread out a little more than that.

Posted by Tom Jackson on

The following explanation of the Mydomain disaster was posted on their "Problems" forum:

Ok, mydomain and namesdirect had some problems today. Here's a semi-technical, semi-reassuring account of what happened. Some details are sketchy since some cases didn't afford any data....

About 3am Pacific, a denial-of-service attack/huge influx of DNS queries bombarded our main co-lo facility in Seattle. In the following hours everything on that network became extremely slow, making most of the services provided on that network (DNS, URL forwarding, email forwarding, websites, DB) appear unavailable or really slow to anyone outside the network. This kind of activity has happened before, but never at this magnitude.

The unfortunate side effect of this activity is that it overloaded both the primary and secondary firewalls causing them to reset connections about every 2 minutes. Meanwhile, our senior network engineer was woken up and after having no luck with a remote fix headed to our co-lo facility. He arrived to find the firewalls rebooting under a large deluge of traffic. He couldn't even get information off of the firewalls about what was actually happening.

In the meantime, the downtime at our co-lo in Seattle caused all DNS to be directed to our east coast facility. That facility was also brought down by the volume of traffic. As we tried to diagnose what the problem was so that we could know what to cut off, the traffic just kept coming and the forums were on fire. The forums stayed up because they're hosted separately from all of the other servers. We couldn't even get into the mydomain website to post a notice about 'system problems'. After some conflict with the co-lo provider, they finally, at 5pm PST, filtered out all traffic destined for the mydomain nameserver in the Seattle co-lo. This immediately restored all services on that network to the outside world. While we cleaned things up, we discovered that the mydomain site and DB had seriously crashed and had to be worked on; hence the extra downtime on the mydomain site after some services appeared to be up. Unfortunately this also broke email forwarding for a while, which was eventually fixed.

So that, as they say, is that. It was an amazing experience in community (the forums, customer calls, and customer visits), technology (trying to find a Cisco PIX 525 at 4:30pm is tough), and dealing with this phenomenon called the 'Internet'. A written apology probably won't suffice. We will always try to do better. If you have any questions/concerns/rants, please direct them to flash.

Unanswered is why all four name servers are on the same network, in the same facility, etc. Why did it take 14 hours to filter the traffic? Why was the network engineer still in bed two hours after the event started? Why didn't they tell the users reading the forum that the problem was a DDoS attack?
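
As an aside, the "all name servers on one network" situation is easy to spot from the outside. Here is a minimal sketch (mine, not anything Mydomain runs) that looks up a zone's NS records with the dnspython package and warns when they all resolve into the same /24; the zone name is a placeholder:

    # Rough check of name-server diversity for a zone: if every NS resolves
    # into the same /24, one facility or one upstream filter can take them
    # all out, which is essentially what happened here. Needs dnspython.
    import dns.resolver

    def nameserver_networks(zone):
        """Map each NS host of `zone` to the /24 its first A record falls in."""
        networks = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            host = str(ns.target).rstrip(".")
            for a in dns.resolver.resolve(host, "A"):
                networks[host] = ".".join(a.address.split(".")[:3]) + ".0/24"
                break   # one address per host is enough for this rough check
        return networks

    if __name__ == "__main__":
        nets = nameserver_networks("example.com")   # placeholder zone
        for host, net in sorted(nets.items()):
            print(f"{host:30s} {net}")
        if len(set(nets.values())) == 1:
            print("WARNING: every name server resolves into the same /24")

Grouping by /24 is only a crude proxy for "same network", but it would have flagged this particular setup.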

Posted by Janine Ohmer on
I have nothing whatsoever to do with mydomain.com, but was following the festivities last night (sort of along the lines of watching a car wreck :).

My impression is that they have four name servers, three in Seattle and one in the East.  The eastern one wasn't directly affected, but folded under the load when all the DNS traffic started going to it.

Their explanation doesn't quite work, since they don't say how they ended up with ACS errors at their site (and if I recall correctly there were even some ACS install messages at one point).  The only thing I can guess is that the load from the DoS attack caused their MySQL database (I think that's what I read it was) to eat its shorts, and they had to restore everything.

Obviously not an "enterprise class" operation - but hey, it's free, so IMHO no-one should have expected any more.

Posted by Tom Jackson on

This site uses Oracle, not MySQL. The database crash is unrelated to the DNS DDoS attack; if anything, the attack would have relieved traffic on the site. At any rate, the important points to be learned are:

  • that how well a service performs under ideal conditions says nothing about how it holds up on today's internet. Holding together a service like Mydomain.com requires a big investment of time on the networking front. If you don't have everything together, things can fall apart really fast.
  • when the database goes down on an installed ACS/OpenACS site, why not bring up a maintenance page instead of the installation page, and send an email to an admin? (A sketch of this idea follows the list.)
  • have a separate network connection for reaching your machines in case of failure. A direct frontal attack on your network should not affect your ability to log in to your machines from the back end. I'm pretty sure the ISP itself wasn't affected by the DDoS, otherwise they would have responded faster to the need to filter, which implies a back door could have been available.
  • tell your customers what is going on, ASAP. Otherwise you look like a jerk who either doesn't care or can't figure out what to do. The amount your customers pay for your service is irrelevant; the fact that they are using your service at all is enough to require you to think of their needs. If you can't do that, email your customers that you are getting out of the business so they can make other arrangements.
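
To make the maintenance-page point concrete, here is a minimal sketch in Python (not OpenACS/AOLserver code, and every host name, path, and address in it is a placeholder): a cron-able check that probes the database port, drops a flag file the front end can test before serving pages, and mails an admin when the database first goes down.

    # Sketch of "maintenance page instead of installer page": probe the DB
    # port; if it is down, create a flag file the front end checks before
    # serving pages, and mail an admin. Names below are placeholders only.
    import socket
    import smtplib
    from email.message import EmailMessage
    from pathlib import Path

    DB_HOST, DB_PORT = "db.example.com", 1521    # placeholder Oracle listener
    FLAG = Path("/var/www/maintenance.flag")     # front end serves a static page if this exists
    ADMIN = "admin@example.com"

    def db_is_up(timeout=3.0):
        """True if a TCP connection to the database listener succeeds."""
        try:
            with socket.create_connection((DB_HOST, DB_PORT), timeout=timeout):
                return True
        except OSError:
            return False

    def notify_admin(subject, body):
        msg = EmailMessage()
        msg["From"] = "monitor@example.com"
        msg["To"] = ADMIN
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is listening
            smtp.send_message(msg)

    if __name__ == "__main__":
        if db_is_up():
            FLAG.unlink(missing_ok=True)         # back to normal service
        elif not FLAG.exists():                  # only alert on the transition
            FLAG.touch()
            notify_admin("Database down",
                         f"{DB_HOST}:{DB_PORT} unreachable; maintenance page enabled.")

The front end only has to check for the flag file before touching the database; if the file exists, it serves a static "back soon" page instead of the installer screen.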
Posted by Petru Paler on

"Why did it take 14 hours to filter the traffic?"

No planning, I guess... it takes about five minutes to stop a DDoS by filtering the target -- the target still loses connectivity, but at least it won't bring the whole network down.

Unfortunately, US-based ISPs don't seem to react well to DoS attacks (usually because their networks are so big that they aren't affected themselves, and they don't care if one customer goes down), so to be prepared for something like this one needs to talk to them in advance and make sure the procedure is established (and do a test lockout of one IP address).
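
For what it's worth, the "test lockout" drill is easy to verify from a vantage point outside the affected network: ask the upstream to filter one sacrificial address, then watch it stop answering. A small probe along these lines (the address and port are placeholders, and it only checks TCP, so it is a rough test for a DNS box):

    # Run from a host outside the affected network: reports whether a given
    # IP/port still answers, e.g. before and after the upstream filter is
    # applied during a test lockout. Address and port are placeholders.
    import socket
    import sys
    import time

    def reachable(host, port, timeout=3.0):
        """True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host, port = "192.0.2.10", 53   # placeholder: the sacrificial test address
        while True:
            status = "reachable" if reachable(host, port) else "unreachable (filter working?)"
            print(time.strftime("%H:%M:%S"), f"{host}:{port}", status, file=sys.stderr)
            time.sleep(10)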