Forum OpenACS Q&A: OpenACS scalability and testing

Posted by Klyde Beattie on
How much traffic can an OpenACS site support?

My goal is to make a site that will take 100 hits/sec and will still
stand up to a /.ing.

What testing has been done so far?

Has anyone had PG up for more than a billion hits? 100million?

How often does the DB corrupt, and how hard is it to restore?

Basically, I want to know how long I will be able to maintain this site
(the hardware/backend) by myself (without paying anyone), working
less than 40 hrs/week.

Thanks for creating such an awesome, free community.

Posted by Don Baccus on
That's quite a site - Slashdot themselves only get 25 hits/sec
during an average day (they mentioned this after September 11,
because on that day they were getting 50-60 hits/sec and were having
trouble keeping the site up).

This site right here stood up to a slashdotting just fine when Ben
published his "Why not MySQL?" paper and it got mentioned on
Slashdot. I'm not sure what it's running on today; back then it was
a dual P400 that was also Ben's development server. It barely
broke a sweat, but there wasn't much DB activity going on.

The big thing with PG is to VACUUM your tables nightly. PG's
storage manager is a non-overwriting one, so tables grow as you
update rows and don't automatically reclaim space. VACUUM, then, is
a bit like garbage collection. VACUUM in PG 7.1 and earlier
acquires an exclusive lock on each table it VACUUMs, so it can make a
site very slow while it is in progress. Unless your content's
huge, though, a VACUUM only takes a few minutes a day.

PG 7.2 will have two forms of VACUUM: one is the current model, the
other a lighter-weight model that won't acquire exclusive locks, just
touching non-busy pages. It won't shrink files, but it will allow
recycling of dead space at much less cost than a full VACUUM.
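To make the nightly-VACUUM advice concrete, here is a minimal sketch of a cron-driven script built around the `vacuumdb` utility that ships with PostgreSQL. The database name and schedule are placeholders, and the `--full` flag assumes a 7.2-or-later `vacuumdb`, where plain VACUUM is the new lazy form and `--full` is the old exclusive-lock behavior described above:

```python
import subprocess

def vacuum_command(dbname, analyze=True, full=False):
    """Build a vacuumdb invocation for a nightly cron job.

    `full` selects the classic VACUUM that takes exclusive table locks
    and compacts the data files; the default (lazy) form only frees
    dead space for reuse and is far cheaper on a live site.
    """
    cmd = ["vacuumdb"]
    if full:
        cmd.append("--full")     # old-style VACUUM: exclusive locks, compacts files
    if analyze:
        cmd.append("--analyze")  # refresh planner statistics while we're at it
    cmd.append(dbname)
    return cmd

def nightly_vacuum(dbname):
    # Intended to be run from cron, e.g.: 15 3 * * * nightly_vacuum.py
    subprocess.run(vacuum_command(dbname), check=True)
```

Scheduled off-peak, this keeps table growth in check without anyone babysitting the database.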

SourceForge has been running on PG for about a year. Tim Perdue
(you can get his e-mail from phpbuilder.com, I think) can
probably give you some good information regarding the durability of a
PG installation.

They're apparently switching to Oracle, but that seems to be because
of their desire to sell an enhanced "Enterprise Edition" SourceForge
to corporate clients, and they think that Oracle is the only thing
that will sell there.

Posted by Patrick Giagnocavo on
The first point I would make is that 100 hits/sec of a 1k page (not a big HTML file, no images) would be 100K/second, and with overhead that in itself would pretty much use up a T1 line.  Thus, you need to plan on having, say, double that for Slashdot (or more).  In reality many pages are bigger than that, so you are really talking about 5 to 10 Mbps of network bandwidth.  That's a lot.

Some testing has been done, and the underlying software is pretty reliable.  Most sites that have heavy load seem to end up caching parts of pages, and "memoizing" expensive SQL queries (basically caching the results of an expensive SQL query by only running the query at most every x seconds).
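The "memoizing" trick described above can be sketched in a few lines. This is a generic, hand-rolled time-to-live cache, not actual OpenACS code (in AOLserver-land a proc like `util_memoize` plays roughly this role):

```python
import time

_cache = {}  # query string -> (timestamp, result)

def memoized_query(sql, run_query, max_age=60):
    """Return a cached result for `sql`, re-running the query at most
    once every `max_age` seconds.  `run_query` is whatever function
    actually talks to the database."""
    now = time.time()
    hit = _cache.get(sql)
    if hit is not None and now - hit[0] < max_age:
        return hit[1]            # fresh enough: skip the database entirely
    result = run_query(sql)
    _cache[sql] = (now, result)
    return result
```

An expensive report query that dozens of users hit per second then costs one database round trip per minute instead of one per page view.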

I have done some stress testing of PG and have found it to be very reliable.  I haven't had problems with database corruption.

Maintaining the site, in my opinion, has more to do with changes that you need to make to stay current (add content, change HTML layout, etc.) rather than worrying about the server staying up.

Posted by Stan Kaufman on
Don, in order to VACUUM safely, one needs to upgrade PG to 7.1 because of bugs in 7.0.3, right? Thanks for clarifying this!
Posted by Don Baccus on
There were some subtle concurrency issues fixed "fairly recently"
(i.e. I don't remember the exact release) but they were very rare.

Certainly my live server didn't see them, nor did openacs.org, nor
any number of other sites (including SourceForge if you're right
about the version number because Tim originally switched to the PG
7.1 *beta*, brave man!)

I'm not trying to belittle the problem (and PG 7.1.x is much, much
better than PG 7.0.x anyway, regardless of concurrency issues) but
it only cropped up for a handful of people over many years of use by
many, many users.

Things like this are why we keep nightly backups. I've had Oracle
hose itself on me but never PG. Not to knock Oracle, which has
tons of features some folks need (interMedia and document filters
for MS formats, etc., that are reliable) and is after all very, very
stable.

My point is only that even very stable RDBMS platforms can choke and
whatever you use, be prepared with a backup strategy matched to your
needs (in my own personal space, once every 24 hrs).
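As an illustration of that kind of backup policy, here is a minimal nightly `pg_dump` sketch; the backup directory and file-naming scheme are hypothetical:

```python
import datetime
import subprocess

def dump_command(dbname, backup_dir="/var/backups/pg"):
    """Build a pg_dump invocation writing a dated dump file, so each
    night's backup lands in its own file."""
    stamp = datetime.date.today().isoformat()      # e.g. 2001-12-03
    outfile = f"{backup_dir}/{dbname}-{stamp}.sql"
    return ["pg_dump", "-f", outfile, dbname], outfile

def nightly_backup(dbname):
    # Run from cron once every 24 hours, matching the policy above.
    cmd, outfile = dump_command(dbname)
    subprocess.run(cmd, check=True)
    return outfile
```

Restoring is then just feeding the most recent dump back through `psql`, which is the recovery path described elsewhere in this thread.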

Posted by Tapiwa Sibanda on
Patrick Giagnocavo wrote:
Maintaining the site, in my opinion, has more to do with changes that you need to make to stay current (add content, change HTML layout, etc.) rather than worrying about the server staying up.

I would agree with Patrick on this one. In my experience, the biggest concern for any web publisher is (or should be?) how to keep the content on the site current. Unless you have full-time people providing enough content to generate that many hits, this is where your bottleneck will be.

Once you have your site up, it might be worthwhile to get a professional to give it a quick once-over to make sure that everything is A-OK. It might also be a good idea to call in said pro every so often, just for a once-over... (we brush our teeth twice a day, but every 6 months we pay the dentist a visit)

Once your site is up and running, and you have a decent backup system and policy in place, and you keep up with the software patches, the back end tends to take care of itself.

Posted by Jeff Barrett on
I do not run OpenACS, but I run a high-volume web site that is backed by Postgres 7.1. We are ramping up our systems as I type this for a major rollout next year: 100-120 hits per second expected in bursts of 2-3 days' duration (I think a slashdotting usually only lasts for half a day?).

On just the DB side (the web application is Apache and PHP for dynamic content) we have ripped out all queries but one and replaced them with cached files. The one query/update tracks user sessions, and that has been replaced with PL/SQL, since the overhead in our environment is in bringing data sets back and forth between the database servers and the 6 load-balanced web servers (not to mention that PHP and Apache suck at closing and maintaining database connections, even with persistent connections; look at lingerd and TCP stack tweaking to fix that up a bit for Apache). So every page makes a minimum of 1 SQL call, with other, less frequent pages doing up to 10 a page, and once again as much work and logic as possible is done on the DB side to minimize the size of the data being passed between systems.

In our current testing we ran a pretty good usage-test program that simulated 500 concurrent users for three days, with each user requesting a page at random every 8-15 seconds based on several usage patterns, and Postgres held up fine (nightly vacuuming was needed). I think we grew one of the main tables from .5 million rows to 20 million during that test. Take images off the main web server and place them on a stripped-down machine with a very efficient version of Apache installed (there are some other bare-bones web servers for this same thing, but I have to use 'popular' software for web servers to keep some clients happy), and get a nice load balancer like an Alteon AD3 to do the load balancing and redirect the image requests to that machine.
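A test harness along these lines can be sketched with simple threaded clients. This is only an illustration, not the actual test program described above; `fetch_page` stands in for whatever HTTP client drives the site:

```python
import random
import threading
import time

def simulate_users(fetch_page, n_users=500, think_range=(8.0, 15.0),
                   duration=10.0):
    """Drive `fetch_page` from `n_users` threads, each sleeping a random
    'think time' between requests, for `duration` seconds.  Returns the
    total number of requests issued."""
    stop = time.time() + duration
    lock = threading.Lock()
    count = [0]

    def user():
        while time.time() < stop:
            fetch_page()                 # one simulated page view
            with lock:
                count[0] += 1
            time.sleep(random.uniform(*think_range))

    threads = [threading.Thread(target=user) for _ in range(n_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count[0]
```

Scaled down (fewer users, shorter think times), the same harness doubles as a quick smoke test before a full multi-day run.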

It has taken about a month for two people to set the servers up, and another month is expected for code tweaking. Look at Above.net for colocation; they have some nice rates, and they allow something like 36 hours of free, unpenalized bursting to whatever level you like. 36 hours would be enough time to fit a /.ing in without having to pay for some huge bandwidth allotment.

I have had the DB crash a couple of times, usually when restructuring a table; I inherited a system with no foreign keys, so I had massive cleanup to do. So that is not what I call a production problem. When it did crap out, I just dumped the table, loaded a backup back in, rebuilt some indexes, functions, and constraints, and I would be back up in an hour or so. I don't think maintenance is that much of a time-consuming activity: automate your backups and vacuums and you should be fine.

The time aspect does not seem to be that much of a problem for us, and since the dot-com bubble burst, hardware can be gotten dirt cheap (I think we got that Alteon for $3,000 - what a deal!).

Good luck with the /.ing. I would like to know how it works out for you.

-- Jeff

Posted by Don Baccus on
Jeff, those numbers are really interesting, as we've not
stress-tested our toolkit in that manner thus far (though there are
several very busy sites running ACS/Oracle, so there are some non-PG
data points).

Stress testing and automated regression testing for the OpenACS 4
platform are under development today, but until we generate some
numbers of our own using simulated users your numbers should serve
to show folks what's possible with a Postgres-based platform.

Thanks ...

Did you guys gather any data on how large your tables grew between
VACUUMs, and how long your VACUUMs took on average?

It would also be interesting to load test under PG 7.2 in order to
see how well the new "lazy VACUUM" works vs "VACUUM FULL".  The lazy
version won't remove dead trailing blocks from the data files, i.e.
it frees space for reuse but doesn't compact (which is one reason
why it only needs to lock a page at a time and therefore has much
less impact on system concurrency).  Thus one would expect that
datafiles will be a bit larger on an active site that exclusively
uses "lazy VACUUM" but AFAIK no one has solid data on how well or
poorly the new strategy will work.  There's been some testing but
nothing on the scale you're talking of (and I'm not talking about
"within the OpenACS community", here, but rather within the PG
community at large).

At least I've not seen any large-scale loading combined with
systematic "lazy VACUUMs" discussed on the PG hacker's list.

Posted by Jeff Barrett on
One more piece can now be added to making a fast site with any database and PHP/Apache: connection pooling with, in our case, SQLRelay. Gotta have connection pooling; the one persistent connection per child with Apache/PHP is/was killing our DB machine. So to add to the complexity of maintaining Apache to run high-performance load-balanced web servers, add SQLRelay, lingerd, kernel tweaks, and config file tricks, and you will be wishing for the simple days of AOLserver and any database!
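For readers who haven't met a pooler like SQLRelay: the core idea is a small, fixed set of database connections shared among many clients, instead of one persistent connection per Apache child. A toy sketch of the concept (this is not SQLRelay's API):

```python
import queue

class ConnectionPool:
    """Minimal illustration of connection pooling: N connections opened
    up front and handed out to clients on demand, so the database sees
    a bounded number of connections no matter how many web-server
    processes are running."""

    def __init__(self, connect, size=10):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())  # open all connections up front

    def acquire(self):
        return self._pool.get()        # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)           # hand the connection back
```

The win is exactly the one described above: the database holds `size` connections total, rather than one per Apache child across six load-balanced web servers.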

Don:
I think our numbers will be changing dramatically once we get SQLRelay installed and working. We did not track table size between vacuums; I did not even know that it was something to watch closely. They were run as cron jobs, and we don't track the end time of the cron job, so I don't know how long the vacuums took.

As far as moving to 7.2 goes, I don't see that happening soon. I had to kick, scream, and yell in order to get the powers that be to move to 7.1, and then we got bit by a new reserved word that we were using in a lot of our code. It was not that hard to fix, but the powers that be don't always understand. I didn't even know that 7.2 had a different vacuum scheme available. I might have to sneak it in now.

I would be interested in looking at how you will be testing the OpenACS platform. I don't know that much about OpenACS, but I have learned enough to date that little changes in a test can have huge effects on its validity. Getting a 'tool' to load test a site in a representative way is pretty difficult. Where can I see what you guys are up to?

I guess if I did not spend my days trying to get Apache/PHP to work as well as AOLServer has for me in the past I would be able to reply quicker!

Posted by Don Baccus on
I'll just provide a short answer in regard to "where we're at".  We've run into some show-stopping performance problems which have their basis in ACS 4 Classic's dependence on certain Oracle optimizations that aren't mirrored by Postgres.

Until we fix these problems - and they are fixable - scalability testing on the Postgres front, at least, is meaningless.

Your comment about needing a third tool (SQLRelay) in order to effectively implement connection pooling in the Apache/PHP environment is interesting; I didn't realize it was such a PITA in that environment.  The things we take for granted as AOLserver users :)