Forum OpenACS Q&A: Arsdigita slashdotted. Why doesn't the site handle the load?

Hi,

Mr Greenspun's article on managing software engineers has been "slashdot'd". I tried several times to click through to the article on the Arsdigita site from the Slashdot story. No luck. Several posters on Slashdot noted the lack of response and made snide remarks about the Arsdigita site's architecture not being able to scale up under the load and serve requests.

I'm concerned about this apparent failure. I want to understand the limits of the stuff I'm learning about. I would hate to build a site for some organization, telling them my proposed architecture (AOLserver&Co.) was better than the more popular Apache&Co. or whatever configuration, and then have their site not hold up under load.

I look to Arsdigita to learn from. I would think that if anybody could make a site hold up under load with their architecture, they could. Could someone please share what happened, and why the Arsdigita site apparently failed to hold up and serve requests under the Slashdot load?

As a professinal mainframe programmer moving to web development, I would appreciate whatever the more experienced members of the community can share about the real-world load limits and operating parameters of the Arsdigita architecture. Thank you.

Sincerely,

Louis Gabriel

OOPS!

I do know how to spell professional!

:)

Louis

I don't run arsdigita.com any more, but I used to, and I also built the highest-volume web site ever done with the ACS, away.com, so even though I'm not yet privy to what went on today, I'll offer some comments.

The thing about sudden surges in traffic is that they expose little configuration issues and glitches.  Arsdigita.com normally has pretty modest traffic in the scheme of things, and a site that doesn't get high traffic on a regular basis is unlikely to be optimally tuned and configured.  When the traffic does arrive, it's all of a sudden a problem.  Since the site seems to be running happily now, and it's the peak part of the day, I assume this is what happened, and that the problem has been corrected.

In any case, there's no rocket science going on here. Anything that uses an interpreted language and a relational database is going to have performance characteristics similar to AOLserver+Tcl+ACS+Oracle. Get more things like application servers involved and the picture gets worse, not better. You could improve performance by writing in C++ and using a database without transactions, but this is courting disaster, and it's just not necessary. Given some tuning effort and adequate hardware, our architecture scales just fine.

Adding to Mark's comments, earlier today it was refusing connections altogether (for me, at least).  This reeks of a configuration issue.  As Mark says, it is a site that normally gets little traffic since usually only aD folk, current and potential customers, and "us friends" make use of the site.  It could've been something as simple as the number of allowed AOLserver threads or Oracle database connections being set to a low value.

From internal communications, it looks like it was mostly a matter of some expensive but unnecessary request filters being turned on.  It's not surprising that these would have caused a problem with a high volume of traffic.
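
To put rough numbers on the two suspects mentioned so far (a low thread or connection cap, and an expensive request filter), here is a back-of-envelope sketch in Python. It is not AOLserver configuration and not anything taken from arsdigita.com. It assumes a very simple model in which every request holds one worker thread for its whole service time, so the ceiling on sustained throughput is threads divided by time per request. The function name and every number below are made up for illustration.

```python
def max_requests_per_second(threads: int, service_ms: int) -> float:
    """Upper bound on sustained throughput when each request holds one thread.

    Simplified model (an assumption, not a measurement): no queueing effects,
    no database contention, every request costs the same.
    """
    if threads <= 0 or service_ms <= 0:
        raise ValueError("threads and service_ms must be positive")
    return threads * 1000 / service_ms


# Hypothetical numbers, chosen only to show the shape of the problem.
BASE_MS = 50       # page generation without any extra filter
FILTER_MS = 200    # extra time added by an expensive request filter

baseline = max_requests_per_second(threads=10, service_ms=BASE_MS)
with_filter = max_requests_per_second(threads=10, service_ms=BASE_MS + FILTER_MS)
low_thread_cap = max_requests_per_second(threads=2, service_ms=BASE_MS)

print(f"baseline:       {baseline:.0f} req/s")
print(f"with filter:    {with_filter:.0f} req/s")
print(f"low thread cap: {low_thread_cap:.0f} req/s")
```

With these invented figures, the filter lowers the ceiling from 200 to 40 requests per second, which is the same as cutting the thread pool from 10 to 2. Neither matters at modest traffic, and both bite as soon as a surge arrives.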

We need to get better at operating our own site.  In the past it's been a severe case of the cobbler's children going barefoot:  few if any resources were allocated to ad.com, which was run mostly as a hobby by whoever had some spare time.  There's a staff now, and they've made a lot of improvements. I'm sure scaling issues have just been bumped up on their priority list.  Scaling really isn't about what software package you install anyway: it's about diligence, careful planning and hard work on the part of the team building and operating the site.

Which request filters, and of what type? Stock ACS 3.x or ACS 4 filters, or others developed strictly for arsdigita.com?

I'm not sure of the details, but I think someone who knows them is going to post an explanation here soon.