Forum OpenACS Q&A: Scaling, Redundancy, Multiple Machines...

1: Scaling, Redundancy, Multiple Machines...

Posted by MaineBob OConnor on 04/08/02 05:33 PM

I need some ideas for our plans to scale up an OpenACS 3.x system. Currently we use one 700 MHz single-processor machine to run RH6/openacs/aolserver/postgresql and, for mail, postfix. This is "Box 1".

There are times when our "uptime" load average goes over 2.4. This most often happens when we use openacs to send email to our ~15,000 users. At other times the load average goes up when serving many of our database-backed web pages.

We are considering getting "Box 2" at the data center. This box would handle all of the mail and bulk mail sent by aolserver from Box 1.

Also, Box 2 could have the identical openacs/aolserver/postgres setup as Box 1. Then, in the unlikely event of Box 1 failing, we could switch to Box 2 by adding Box 1's IP to Box 2.

So, what are some ways to mirror the DB on Box 2? In my simplistic thinking, and because there are only a few places where inserts and updates are done, I could add a ns_log Notice "SQL1234: Insert ...." so it would show up in the log. The logs could roll at short intervals... and be automatically copied to Box 2. Box 2 could parse the log file and update the mirror DB.... OR a better way???

We are looking for an inexpensive solution and not the megabux system that Jerry asked about in this thread:

Five 9s reliability, how would you do it?

Or would it be better to put the PG db on Box 2 and leave Box 1 with OpenACS/Aolserver? This scenario doesn't account for a possible failure.

Currently we are using an IDE drive and a box with 1/2 Gig of memory. I know that RAID would be possible, but wouldn't some variation on my scenario above be just as good and perhaps more cost effective?

As time goes on, our database gets bigger and gets relied on by more and more users, so it is important that it does not become corrupted or cause long downtimes.

TIA
-Bob

2: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Jonathan Ellis on 04/08/02 05:41 PM

unless you're definitely planning on some kind of failover setup, a single 2cpu box would be both easier to administer and cheaper at the datacenter than two single-cpu boxes, and won't break the bank for hardware costs. (I recently upgraded from a single cpu box to a dual GHz p3 for under $400.)

If you do decide to do some kind of poor man's replication with the logfile, it would be substantially easier to just turn on debug in your nsd config. pg driver will then log all statements. (Unfortunately for your purposes this means queries too but I imagine you could just grep for insert | update...) This would be far easier than tracking down all dml and adding a log statement with each.
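For what it's worth, the "grep for insert | update" step could look roughly like the Python sketch below. It is only an illustration: the log line layout and the "SQL1234:" marker (borrowed from Bob's ns_log idea) are assumptions, the function name `extract_dml` is made up, and the real nsd debug output may well look different.

```python
import re

# Matches SQL text that begins with a data-modifying keyword.
DML_START = re.compile(r"^(insert|update|delete)\b", re.IGNORECASE)

def extract_dml(log_lines, marker="SQL1234:"):
    """Return the SQL text of logged statements that modify data.

    Assumes each statement sits on a single log line and follows `marker`.
    Both the marker and the one-line layout are hypothetical.
    """
    statements = []
    for line in log_lines:
        _, found, sql = line.partition(marker)
        sql = sql.strip()
        if found and DML_START.match(sql):
            statements.append(sql)
    return statements

# Made-up sample lines, not real server output.
sample = [
    "Notice: SQL1234: Insert into users (user_id) values (42)",
    "Notice: SQL1234: select * from users",
    "Notice: serving /index.adp",
]
print(extract_dml(sample))  # → ['Insert into users (user_id) values (42)']
```

Keep in mind that replaying statements this way will not reproduce values generated at run time (nextval(), now() and the like), so the mirror can drift from the original.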

3: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by David Walker on 04/08/02 05:54 PM

Just install AOLServer and Postgres on the 2nd box, make sure it works but
point it to the postgres server on the 1st box. Make sure the 1st box backs
up the database to the 2nd box daily (You don't have to restore it, you just
need a backup of it on there.)

Now you have the 2nd box that can handle your mailing load and take over
some of the web load and, if box 1 dies you can restore the database backup
that is already on box 2, start postgres, point aolserver to it, and you're back
up and running with minimal downtime and the loss of less than a day's
worth of data.
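A rough sketch of the daily backup step, in Python. It only builds the two commands (dump on Box 1, copy to Box 2) so you can see the shape of it; the database name, the host name "box2" and the backup directory are placeholders I made up, and scp is just one possible way to move the file.

```python
import datetime

def backup_commands(db_name, box2_host, backup_dir="/var/backups/pg", today=None):
    """Build the dump and copy commands for one dated backup.

    All names and paths here are hypothetical placeholders.
    """
    today = today or datetime.date.today()
    dump_file = f"{backup_dir}/{db_name}-{today.isoformat()}.sql"
    # pg_dump writes a plain SQL dump of the database to dump_file.
    dump_cmd = ["pg_dump", "-f", dump_file, db_name]
    # scp copies the dump into the same directory on Box 2.
    copy_cmd = ["scp", dump_file, f"{box2_host}:{backup_dir}/"]
    return dump_cmd, copy_cmd

dump_cmd, copy_cmd = backup_commands("openacs", "box2", today=datetime.date(2002, 4, 8))
print(" ".join(dump_cmd))  # → pg_dump -f /var/backups/pg/openacs-2002-04-08.sql openacs
print(" ".join(copy_cmd))  # → scp /var/backups/pg/openacs-2002-04-08.sql box2:/var/backups/pg/
```

To actually run them you could pass each list to subprocess.run(cmd, check=True) from a cron job on Box 1. Running the job more often shrinks the window of data you could lose.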

4: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by MaineBob OConnor on 04/08/02 07:51 PM

Jonathan: ...a single 2cpu box would be both easier to administer and cheaper...

Does a 2-cpu box work just like a 1-cpu box, only faster, or are there other benefits that linux and processes can use, such as specifying cpu-1 for process x and cpu-2 for process y? And does this work with postgresql or aolserver or postfix?

David: Just install AOLServer and Postgres on the 2nd box, make sure it works but point it to the postgres server on the 1st box.

What do you mean, "point it to"? Does this mean that both databases get updated at the same time?

David: ...2nd box...take over some of the web load...

How does it take over the load? I assume that it is only a backup and not serving web pages.

David: ...and you're back up and running with minimal downtime and the loss of less than a day's worth of data.

or less if I do more frequent backups... Thanks...

-Bob

5: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Patrick Giagnocavo on 04/08/02 08:19 PM

When you say that your "uptime" average goes over 2.4, you are not really determining what the problem is.

Best to run "vmstat 5" to see whether the processes are CPU-bound, IO-bound, or whatnot. The vmstat man page has more info on what the numbers mean.

If the actual CPU usage is generally low, then the answer is simply to get a faster disk. You can mount /var/spool/mail on the faster disk and separate the disk operations for email from those for the web pages and database.
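If reading the vmstat columns by eye gets tedious, something like the Python sketch below could summarize a captured run. Treat it as a rough guide only: the thresholds are arbitrary assumptions of mine, the function name `classify_vmstat` is made up, and the sample numbers are invented. It reads the column names from vmstat's own header row, so it copes with builds that have no "wa" column (on older kernels I/O wait is counted as idle, which is why the "b" blocked-process column is checked too).

```python
def classify_vmstat(text, busy_threshold=80.0, blocked_threshold=1.0, wait_threshold=20.0):
    """Guess whether captured `vmstat` output looks CPU-bound or IO-bound.

    The three thresholds are illustrative assumptions, not standard values.
    """
    columns, rows = None, []
    for line in text.splitlines():
        fields = line.split()
        if "us" in fields and "id" in fields:
            columns = fields  # header row naming each column
        elif columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(columns, map(int, fields))))
    if not rows:
        raise ValueError("no vmstat samples found")

    def avg(name):
        # Columns missing from this vmstat build count as zero.
        return sum(row.get(name, 0) for row in rows) / len(rows)

    if avg("us") + avg("sy") >= busy_threshold:
        return "cpu-bound"
    if avg("b") >= blocked_threshold or avg("wa") >= wait_threshold:
        return "io-bound"
    return "neither"

# Invented sample: low CPU use, several processes blocked on I/O.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  3      0  50000  8000 200000    0    0  4000  3500  300  400  5  3 12 80  0
 0  2      0  49000  8000 200000    0    0  4200  3600  310  420  6  4 10 80  0
"""
print(classify_vmstat(sample))  # → io-bound
```

Note that the first line vmstat prints is an average since boot, so you may want to drop it before feeding the output in.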

6: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by David Walker on 04/08/02 08:40 PM

When I say "point to" I mean configuring AOLServer on that box to connect to Box 1's Postgres database rather than the one on localhost. In this configuration only the database on Box 1 will get updated. That is why the backup to Box 2 is needed.

It can serve pages if you want it to if they are using the same database. Also you can make sure you connect to it to send out your mailings. The mailing work will be done by Box 2 but it will still use the database from Box 1.

And yes, more frequent backups means you'd lose less data in case of catastrophe.

7: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Jonathan Ellis on 04/08/02 09:27 PM

no, 2 cpus is not the same as 1 faster cpu. yes, the OS will split processes among the CPUs where it can. postgres plays quite nicely this way (each backend is a separate process); don't know for sure about nsd since postgres is typically 90% of my cpu and nsd only a couple %.

david, if postgres is his bottleneck, having two servers hitting the same db won't help things which is why I suggested smp. but you're probably right that it's the mailing slowing things down.

8: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by MaineBob OConnor on 04/08/02 09:48 PM

Ok, I'm beginning to get the picture... If we scaled a bit further up we could have 2 or more aolserver (single-processor) boxes that connect to one smp box with the PG db on it. And up front, we'd need a "load balancer" to select which aolserver box to send the requests to....

And how does the load balancer work? Is it another box? Does each
aolserver box need its own IP address or maybe they get local
ip's via NAT by the load balancer....

Hey, I'm just imagining stuff here... Those that know can help me
understand. THANK YOU.

-Bob

9: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Jeff Barrett on 04/08/02 10:57 PM

Concerning the load balancer, I would suggest the use of an Alteon AD3, very nice product. Check out ebay and busted dot com sales for one real cheap. This type of load balancer works by grouping IP addresses behind one address; you can have several groups with hundreds of servers behind each group. The group consists of one public IP address and the IP addresses of machines that are load balanced for that address. Now you can do some very slick things with a load balancer, more than just distributing the load for all HTTP requests to a pool of servers. You can do things like say "index.html is served by machine1 (or group of machines) while all adp pages are served by machine2 and all .jpg, .gif etc are served by machine3". You can even go so far as to send all people with a certain cookie value to one pool of servers or another based on their value (the better the customer the faster the service or better graphics etc). The load balancer can also do some primitive detection of machine failures and take a machine out of a pool. It will also allow for overspill (you know you can handle 100pps on the 4 load balanced servers) by taking the extra load and handing it off to a stripped down HTML server or even a backup group of servers. (Side note: do not try and load balance firewalls with this machine, I heard it is incredibly difficult.)
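To make the content-based routing idea concrete, here is a toy Python model of the rule table from the example above. This is not how the Alteon is configured (that is done on the device itself); the `route` function and the "default_pool" name are hypothetical, and only the machine1/machine2/machine3 split comes from the example.

```python
import posixpath

# Each rule pairs a test on the request path with the machine (or pool) that serves it.
RULES = [
    (lambda path: posixpath.basename(path) == "index.html", "machine1"),
    (lambda path: path.endswith(".adp"), "machine2"),
    (lambda path: path.endswith((".jpg", ".gif")), "machine3"),
]

def route(path, default="default_pool"):
    """Return the target for a request path; the first matching rule wins."""
    lowered = path.lower()  # treat /LOGO.GIF the same as /logo.gif
    for matches, target in RULES:
        if matches(lowered):
            return target
    return default  # hypothetical catch-all pool for everything else

for request in ["/index.html", "/forums/thread.adp", "/img/logo.gif", "/robots.txt"]:
    print(request, "->", route(request))
```

Cookie-based routing would be one more rule of the same shape, keyed on a cookie value instead of the path.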

I would suggest having two to three network cards in all machines that are to be load balanced. The first network would be the public side that is reached via the load balancer and the second would be for communication between servers and to the DB server. The third nic would be for a failover situation with the load balancer (but if you want to do failover I suggest you just pay the cash and hire a network guru to figure that all out, that is where my knowledge ends and I personally pick up the phone and start signing checks.) I would just use some cheap cisco 2950 switches between the alteon and each group of servers and something along the same strength between the balanced servers and the DB machine(s).

When I did this setup last I was not using aolserver but apache and PHP. Be thankful you are running aolserver, it is so much simpler to configure and streamline. With apache I had to add lingerd and SQLRelay to run more than four webservers in front of a modestly sized Postgres DB machine, and that was only getting me 60-80pps with a minimum of one DB query per page.

One thing I would also recommend is getting a standard machine for your web servers so you can cookie cutter their installation and know how long it will take your supplier to get a new machine to your colocation facility. The biggest nightmare I have in scaling up is adding machines due to delays from manufacturers. Remember redundancy means redundant headaches so try and get some more people to help, the workload to implement these things needs load balancing as well.

10: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Jeff Barrett on 04/08/02 11:01 PM

"Does each aolserver box need its own IP address or maybe they get local ip's via NAT by the load balancer.... " I forgot to answer that one. The Alteon AD3 does simple routing as well from what I remember, so you can create another subnet for each group of servers if you want. It is just a more complicated configuration on the part of the alteon, and thus harder to maintain two load balancers in a failover situation. If you can get the IP addresses in one big block or two that can make life easier, but it is not necessary.

11: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by David Walker on 04/08/02 11:17 PM

We are using a load balancer as well (Cisco Local Director) but you can also
spread out the load using round robin dns. The load balancer will
automatically send requests to the machines that are up. Round robin dns
will blindly send requests to all your machines.
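The difference between the two approaches can be shown with a small Python simulation. It is only a model: the box names and the `is_up` check are made up, and a real load balancer probes its servers over the network rather than consulting a set.

```python
from itertools import cycle

def round_robin_dns(servers):
    """Hand out servers in a fixed rotation, with no idea which ones are up."""
    return cycle(servers)

def health_checked(servers, is_up):
    """Rotate through servers, skipping any that the health check reports down."""
    rotation = cycle(servers)
    while True:
        for _ in range(len(servers)):
            server = next(rotation)
            if is_up(server):
                yield server
                break
        else:
            # A full pass found nothing alive.
            raise RuntimeError("no servers are up")

servers = ["box1", "box2", "box3"]
down = {"box2"}  # pretend box2 has died

dns = round_robin_dns(servers)
print([next(dns) for _ in range(4)])  # → ['box1', 'box2', 'box3', 'box1']

balancer = health_checked(servers, lambda s: s not in down)
print([next(balancer) for _ in range(4)])  # → ['box1', 'box3', 'box1', 'box3']
```

With round robin dns, a share of the requests keeps landing on the dead box until someone pulls its record.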

12: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Stephen . on 04/08/02 11:47 PM

It will cost a couple hundred bucks and an hour or so of your time to upgrade your two year old hardware to handle these minor load spikes. Save some hair pulling and leave the load balancing to Google, enjoy the rest of the week...

13: Response to Scaling, Redundancy, Multiple Machines... (response to 1)

Posted by Barry Books on 04/13/02 06:00 PM

I second the previous comment. I removed the load balancers from my setup and my site is now faster and more reliable. If your site is too slow, get faster hardware, not more hardware.