Forum OpenACS Q&A: Re: big problems with mydomain.com?

Posted by Tom Jackson on

The following explanation of the Mydomain disaster was posted on their "Problems" forum:

OK, mydomain and namesdirect had some problems today. Here's a semi-technical, semi-reassuring account of what happened. Some details are sketchy since some cases didn't afford any data...

About 3am Pacific, a Denial of Service attack / HUGE influx of DNS queries bombarded our main co-lo facility in Seattle. In the following hours, everything on that network became extremely slow, making most of the services provided on that network (DNS, URL forwarding, email forwarding, websites, DB) appear unavailable or really slow to anyone outside the network. This kind of activity has happened before, but never at this scale.

The unfortunate side effect of this activity is that it overloaded both the primary and secondary firewalls, causing them to reset connections about every 2 minutes. Meanwhile, our senior network engineer was woken up and, after having no luck with a remote fix, headed to our co-lo facility. He arrived to find the firewalls rebooting under a large deluge of traffic. He couldn't even get information off the firewalls about what was actually happening.

In the meantime, the downtime at our co-lo in Seattle caused all DNS to be directed to our east coast facility. That facility was also brought down by the volume of traffic. As we tried to diagnose the problem so that we could know what to cut off, the traffic just kept coming and the forums were on fire. The forums stayed up because they're hosted separately from all of the other servers. We couldn't even get into the mydomain website to post a notice about 'system problems'. After some conflict with the co-lo provider, at 5pm PST they finally filtered out all traffic destined for the mydomain nameserver in the Seattle co-lo. This immediately restored all services on that network to the outside world. While cleaning things up, we discovered that the mydomain site and db had seriously crashed and had to be worked on; hence the extra downtime on the mydomain site after some services appeared to be up. Unfortunately, this also broke email forwarding for a while, which was eventually fixed.

So that, as they say, is that. It was an amazing experience in community (the forums, customer calls, and customer visits), technology (trying to find a Cisco PIX 525 at 4:30pm is tough), and dealing with this phenomenon called the 'Internet'. A written apology probably won't suffice. We will always try to do better. If you have any questions/concerns/rants, please direct them to flash.

Unanswered is why all four name servers are on the same network, in the same facility, etc. Why did it take 14 hours to filter the traffic? Why was the network engineer still in bed 2 hours after the event started? Why didn't they inform the users who read the forum that the problem was a DDoS attack?
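For reference, "filtering out all traffic destined for the mydomain nameserver" usually comes down to a one-line destination blackhole at the border router. A minimal sketch, assuming a Cisco IOS edge router and a hypothetical nameserver address (192.0.2.53 is a documentation address, not mydomain's real one):

    ! Discard every packet destined for the attacked nameserver at the edge
    ip route 192.0.2.53 255.255.255.255 Null0
    !
    ! Optional: don't generate ICMP unreachables for the discarded traffic
    interface Null0
     no ip unreachables

The attacked host stays dark, but everything else behind the same links comes back -- which is roughly what happened once the co-lo applied its filter at 5pm.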

Posted by Petru Paler on

"Why did it take 14 hours to filter the traffic?"

No planning, I guess... it takes about 5 minutes to stop a DDoS like this once the upstream drops traffic destined for the target address -- the target is still losing connectivity, but at least the attack won't bring the whole network down.

Unfortunately, US-based ISPs don't seem to react too well to DoS attacks (usually because their networks are so big that they aren't affected themselves, and they don't care if one customer goes down). So to be prepared for something like this, one needs to talk to them in advance and make sure the procedure is established (and do a test lockout of one IP address).
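For what it's worth, the "test lockout of one IP address" can be as simple as an access list the provider agrees to apply on its side of the link. A hedged sketch, again assuming Cisco IOS and hypothetical addresses and interface names:

    ! Applied by the upstream provider, inbound on the interface carrying the attack
    access-list 150 deny   ip any host 192.0.2.53
    access-list 150 permit ip any any
    !
    interface Serial0/0
     ip access-group 150 in

Having that (or the equivalent null route) agreed on and rehearsed in advance is the difference between a five-minute fix and a fourteen-hour outage.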