Forum OpenACS Q&A: Advice for moving photo.net off a Solaris system

photo.net is moving out of the hosting services so generously
provided by aD in the past, and we wanted to hear about people's
experiences with Oracle/ACS installations on Linux.

We are currently configured with Oracle on 5 logical drives that are
mirrored (RAID 1), with the dbf files on two of them, archiving on
another, etc.  (Oh, philip.greenspun.com is also on one of the 5
drives.)

We were thinking of moving to Oracle 9i + SuSE Linux 7.2. Do people
have experience with this on large sites?  What RAID configurations
have people been using with success on high-volume ACS installations?

We run a modified ACS 3.2 with 8 aolservers configured with 10
threads each (as we have found this to be optimal). All of this runs
on the same box. Our CPUs are often maxed out.

Our next steps are to add a front-end box and to move the photodb
files off to a separate server. After that we plan to move Oracle as
well, replacing the Sun box entirely.

Any experience and advice you have would be appreciated. I'd heard
people were running ACS installations with Oracle in RAID 5 mode.

Posted by S. Y. on

I dunno about RAID-5 and Oracle.

All of the Oracle DBA books I've read have very specifically said that RAID-5 is exactly the wrong technology for RDBMS. This includes the big fat Oracle8 Complete Reference doorstop and the O'Reilly Oracle DBA book. If I recall correctly, somewhere on this planet, Oracle's own documentation admonishes DBAs not to use RAID-5.

Of course, only weenies read the documentation.

RAID 0+1 (a.k.a. "RAID-10") -- which is mirroring & striping -- is considered the "best" RAID configuration for RDBMS systems. Of course, if you chuck huge amounts of hardware at a poorly designed system, maybe you can make it up...

I thought Philip covered RAID technology in one of his books/articles.

Posted by S. Y. on

Visiting philip.greenspun.com and entering "RAID" into the search engine brought me to http://philip.greenspun.com/wtr/dead-trees/db-choosing.html which specifically states:

"The first thing you do is mirror all of your disks. If you don't have the entire database in RAM, this speeds up SELECTs because the disk controller can read from whichever disk is closer to the desired track. The opposite effect can be achieved if you use "RAID level 5" where data is striped across multiple disks. Then the RDBMS has to wait for four disks to seek before it can cough up a single row. So mirroring, or "RAID level 1", is what you want."

That's all I know, so if someone proves me wrong, that's not a big deal since I don't do this stuff for a living.

I admit that it's been a long time since I've personally handled Oracle running on a large site. I'm rather surprised to hear that your box is maxed out on its CPUs, though. What's eating up the CPU cycles?

My suspicion is that the 5 mbps going outbound plus the 8 aolservers are chewing up the cpu out of the one box. The box only has 4 GB of memory and the SGA is about 1.4 gigs. So moving the aolservers off to a front-end box ought to help somewhat.

RAID 10 is also very expensive, and for ACS on photo.net, which does far more reads than writes, it is unclear to me why RAID 5 would not work out well.
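
To put rough numbers on the read-heavy argument, here is a minimal sketch using the usual textbook rules of thumb: a small random write costs about 2 physical I/Os on mirrored/striped RAID 1+0 and about 4 on RAID 5 (read data, read parity, write data, write parity), while a read costs 1 on both. The disk count, per-disk IOPS and read fraction are made-up illustration values, not photo.net measurements, and a caching controller will blur the result.

```python
# Rough RAID arithmetic based on textbook rules of thumb, not measurements.
# "write_penalty" = physical I/Os needed per small random logical write.
RAID_LEVELS = {
    "RAID 0":   {"usable": lambda n: n,      "write_penalty": 1},
    "RAID 1+0": {"usable": lambda n: n // 2, "write_penalty": 2},
    "RAID 5":   {"usable": lambda n: n - 1,  "write_penalty": 4},
}


def usable_disks(level, n_disks):
    """Number of disks' worth of space left for data."""
    return RAID_LEVELS[level]["usable"](n_disks)


def effective_iops(level, n_disks, disk_iops, read_fraction):
    """Logical I/Os per second the array can sustain for a read/write mix."""
    penalty = RAID_LEVELS[level]["write_penalty"]
    raw_iops = n_disks * disk_iops
    write_fraction = 1.0 - read_fraction
    # Each logical read costs 1 physical I/O, each logical write costs `penalty`.
    return raw_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Hypothetical array: 8 disks, 100 IOPS each, 90% reads.
    for level in ("RAID 1+0", "RAID 5"):
        print(level,
              usable_disks(level, 8), "disks usable,",
              round(effective_iops(level, 8, 100, 0.9)), "logical IOPS")
```

With 8 disks at 100 IOPS each and 90% reads, this model gives RAID 5 roughly 85% of the RAID 1+0 throughput while leaving 7 disks of usable space instead of 4. At 50% reads the gap is much wider, which is the case the DBA books worry about.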

Posted by S. Y. on
"My suspicion is that the 5 mbps going outbound plus the 8 aolservers are chewing up the cpu out of the one box."

Your "suspicion", eh? Well, what does "top" or whatever it is on Solaris (prstat?) say?

I'd like to hear what Mark D. has to say about the whole situation. Between a multi-threaded web server and a fairly scalable RDBMS running on a commercial UNIX, it would be interesting to see what it is that you think is "chewing up the cpu out of the one box".

I'm rather surprised that 5 mbps outbound is burning up CPU cycles on a commercial SMP UNIX box. A single-CPU SGI workstation from 1997 could pump out uncompressed D1 (a.k.a. CCIR 601) video (about a megabyte/sec.) without batting an eyelash on the processor side.

Before Philip initiated the image-uploading services, the photo.net Q&A forum database was rather puny (50 megs or so). If photo.net implemented file uploading in the filesystem, then yes, RAID-5 might be a good solution combined with RAID-10 for the database itself.

Of course, if someone decided to cram everything into the database, well, that probably would have earned that person a spanking from Philip. Without any more specifics from the photo.net team, it's really hard to recommend anything at all.

Good luck with your database migration.

Some issues about files and images. File Storage on our site currently handles 40,000 files of up to 10MB each. What we learned the hard way: get them out of the DB and store them in the filesystem. Oracle does not seem to be good at handling this situation. So if you store the images in the database, you might think about getting them out of there and onto an external RAID 5 disk array.

P.S.: SuSE is the right choice for running Oracle, though 7.3 is pretty sleek and lets you choose from more journaling file systems (something you definitely want to use instead of ext2).

I am a little surprised that you are not looking into some kind of caching.  Maybe this is more trouble than it is worth, but I would think that using squid as a front end to cache even some documents, even for say 15 seconds between refreshes, would help a great deal.  You could also put squid on a separate machine as a front end; I would think that a P2/400 or better with 256MB RAM would do it, even using cheap IDE disks.
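
The "cache it for 15 seconds" idea can be sketched in a few lines. This is not how squid is configured or implemented; it is only a toy time-to-live cache showing why even a short lifetime cuts the number of requests reaching the back end. The `fetch` callable and the injectable `clock` are names invented for the illustration.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: serve a stored copy until it is ttl_seconds old."""

    def __init__(self, fetch, ttl_seconds=15.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable that hits the back end
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._store = {}         # url -> (time_fetched, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: the back end is not touched
        body = self._fetch(url)  # missing or stale: go to the back end once
        self._store[url] = (now, body)
        return body
```

With a 15 second lifetime, a page requested 100 times a second would reach the back end about once every 15 seconds instead of 1500 times in that window.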

Further, switching both OS and DBMS level at the same time might not be a good idea.  Either switch to SuSE with the same revision of Oracle first, or switch to 9i first and then migrate to SuSE.

I'll leave the hardware wrangling to others, but as far as software goes - Sussdorff & Roy moved aiesec.net from Solaris to Linux (with furfly assisting) and AFAIK there were no coding issues to speak of;  everything ran fine on Linux.  You do have to change things like the path to external programs like Perl, but that's really minor.

You will need multiple boxes, though;  the system we have aiesec.net on isn't really enough for it, and I believe that photo.net is even busier.

Posted by Mike Sisk on
We have Linux/Oracle servers running both RAID 5 and RAID
0+1. We haven't seen any disk problems with any of 'em. Heck,
we have one backup server running software RAID 5 on a
366-MHz Celeron with Maxtor IDE disks that transfers several GB
of data every night--its uptime as of today is 436 days.

My feeling is that with the type of sites any of us are running that
the RAID level issue isn't that big of a deal. With modern 10k+
RPM SCSI-3 disks and caching hardware RAID controllers it'd be
tough to max 'em out. You're far more likely to saturate the PCI
bus than the disks and at that point you'd have to be moving a lot
of data around.

Now, if you're building a data warehouse with a database in the
100+ GB range it's a different story. But with a database like the
one AIESEC has, at 6 GB, RAID 5 has been fine for us.

Here's a link some of you might be interested in:
http://staff.sdsc.edu/its/terafile/

It's about building a terabyte-size IDE array with performance
tips and benchmarks on various RAID levels.

I see I didn't provide enough information for Sean:
I have done that analysis and seen that putting a frontend box and separate fileserver for photos should be a good move.

1> The 8 aolservers are usually at the top of the top list, chewing up from 7% to 10% of the CPU each, and each one takes up 120-200 megs while it is running with a fair amount of caching. That is down from 300-400 megs two years ago, when Rob was fixing all the memory leaks in aolserver.  We experimented for months while those leaks were being fixed, and what we finally concluded was that you need to run maxthreads of 10 per server for the best performance, given the multi-threaded behavior of the server and the aolserver driver. Rob and markD have looked at photo.net. I am open to more suggestions.

2> Images on photo.net are stored in the filesystem, but each image-display request requires hitting the database for permissions and then spewing the file out of the filesystem. While it is doing the latter it isn't holding a db handle.
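
As a back-of-the-envelope check on the figures given in this thread (8 aolservers at 7-10% CPU and 120-200 megs each, a 1.4 gig SGA, 4 GB of RAM), here is a short sketch. It assumes that top reports each process as a share of the whole box and treats a gig as 1000 megs; both are assumptions of this sketch, not something stated above.

```python
# Figures quoted in the thread; units and interpretation are assumptions.
N_SERVERS = 8
CPU_PERCENT_EACH = (7, 10)   # low/high share of CPU per aolserver, per top
MEM_MB_EACH = (120, 200)     # low/high resident size per aolserver
SGA_MB = 1400                # "about 1.4 gigs", taking 1 gig = 1000 megs
RAM_MB = 4000                # "4gb of memory", same rounding


def total_range(per_process, count=N_SERVERS):
    """Scale a (low, high) per-process range up to all processes."""
    low, high = per_process
    return (low * count, high * count)


cpu_low, cpu_high = total_range(CPU_PERCENT_EACH)
mem_low, mem_high = total_range(MEM_MB_EACH)

# What is left for the OS, file cache and Oracle's non-SGA memory.
left_low = RAM_MB - SGA_MB - mem_high
left_high = RAM_MB - SGA_MB - mem_low

if __name__ == "__main__":
    print("aolserver CPU: %d-%d%%" % (cpu_low, cpu_high))
    print("aolserver RAM: %d-%d MB" % (mem_low, mem_high))
    print("RAM left over: %d-%d MB" % (left_low, left_high))
```

Under that reading the aolservers alone account for roughly 56-80% of the CPU and 1-1.6 GB of RAM, which fits the plan to move them to a front-end box first.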

Malte, I had read that Oracle on ReiserFS was performing 15x slower than on ext2.  Is that fixed in the 2.4.7 kernel for SuSE Linux?

Also are you using ext3 or reiserfs?

Mike, I assume you are running ext2 on your setup?

thanks

raj

Mike's boxes are running ext2 unless he upgraded them when I wasn't looking.  There have been similar reports of ReiserFS killing PostgreSQL performance, too, so I've shied away from it.  Both RDBMS's really want as little operating system interference with disk reads and writes as possible...

My home desktop system is running ext3 and thus far I've had no problems with it.  I've got another box running PG and Oracle, though, and that box runs ext2 so I've made no effort to benchmark ext3 performance with either Oracle or PG.

As far as running several instances of AOLserver limited to 10 threads each, you may want to re-benchmark under Linux since threading and multiprocessing are areas where various Unices are extremely different in their implementation.

Here's another question ... how many x86 mobos support 4 GB RAM?  Your BX/i840/AMD760 type boards support 4 slots.  Currently DDR only comes in  256MB sticks though 512MB sticks will be available shortly (actually, might be already though Crucial's website doesn't list them as available  and I know Fry's doesn't have them yet though they will "real soon now").

So at the moment building or buying an Athlon-based server restricts you to 1GB, with 2GB being possible soon.

Crucial does have 1 GB SDRAM ECC PC133 sticks for $200 so yeah, OK, you can build up to 4 GB.  Supermicro offers SMP mobos that support this and run with modern socket-based PIIIs but the chipset's a non-Intel one called the ServerWorks ServerSet III LE.  Sounds more impressive than "GX", doesn't it? :)

So I guess at the moment going for this much RAM in an SMP configuration, at least, means PIII.

(I'm intentionally ignoring RDRAM solutions and P4 solutions)

I have been running Oracle + Reiser for a long time. I don't see any performance problem, in fact it may be quicker and it sure is a lot more reliable.

Another benefit is not waiting 30 minutes for fsck when you reboot (i.e. kernel changes).

Linux kernels up to 2.4.5 (I believe) were screwy anyway, and kernel 2.4.9 had some VM performance problems. Use 2.4.12 and all should be well.

Yep, we're on ext2. Our current crop of production servers were
installed over a year ago and are based on Red Hat 6.2. The
kernels have been updated somewhat and have SMP and large
file system support but are still 2.2 based. At the time these were
set up, journaled filesystems on Linux were still iffy. I've not seen
much of a need for 'em in a datacenter machine that's rarely
rebooted and isn't likely to lose power.

I do have several servers in the office running Suse 7.2 and a
journaled filesystem (can't remember which) and they've been
no problem. Not much load on them, though. I'm going to be
upgrading some servers here soon to a more modern kernel
and will probably use Suse instead of Red Hat.

Most of these servers have the max RAM they hold--2 GB for the
Dell 2450. Memory is so cheap at the moment--even the
registered ECC stuff--there's no reason not to fill 'em up.

These servers have been very reliable. Most have uptimes of
over 100 days (installing extra RAM was the cause of the most
recent reboot) and several are over 400 days. I haven't been in too
much of a hurry to make any changes since they've been so
stable.

In answer to Don's question, Penguin Computing recently released the Altus 1240 (http://penguincomputing.com/), which has dual AMD Athlon 1.2 GHz CPUs and 3.5 GB of DRAM. It's in a 1U config and only supports 4 hard drives, but your DB may be on another box (I haven't studied this thread very well).

Although Penguin is reportedly having a rough time right now (at least that's the rumor on f**kedcompany) we have a couple of their servers and they run like a dream.

talli

If you are going to put your db on Reiser (which btw is what I'm doing) you are in essence doing double journaling, which means that every write of data to disk requires multiple head movements.

The approach I took in the PG case was to create a separate partition for the log directory / XLOG and make this partition non-journaling, or ext2 in my case.  You may want to symlink this, and I would also recommend a separate platter or disk for the XLOG files.  This allows the heads to remain ready at all times for sequential writes without having to move.

In general though if you are running a journaling database on top of a journaling filesystem I think it is beneficial to keep the database log on a separate non journaling filesystem and on its own platters.
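
The move-and-symlink step might look something like the sketch below. The function name and the example paths are hypothetical, `pg_xlog` is only used as an example directory name, and the database has to be shut down before anything like this is run against a real data directory.

```python
import os
import shutil


def relocate_with_symlink(src_dir, dest_parent):
    """Move src_dir under dest_parent and leave a symlink at the old path.

    dest_parent is meant to be a mount point on a separate disk or
    partition (for example a non-journaling one). Returns the new location.
    """
    src_dir = os.path.abspath(src_dir)
    if os.path.islink(src_dir):
        raise ValueError("%s is already a symlink" % src_dir)
    dest = os.path.join(os.path.abspath(dest_parent), os.path.basename(src_dir))
    if os.path.exists(dest):
        raise FileExistsError("%s already exists" % dest)
    shutil.move(src_dir, dest)   # copies across filesystems when needed
    os.symlink(dest, src_dir)    # old path keeps working for the database
    return dest


# Hypothetical usage, with the database stopped:
#   relocate_with_symlink("/usr/local/pgsql/data/pg_xlog", "/mnt/logdisk")
```

The symlink means nothing in the database configuration has to change; only the physical location of the log files does.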

Just my $0.02

ReiserFS only journals metadata, I thought?  So there's not really duplication of effort, as ReiserFS will be journaling changes to inode structure and stuff like that while the redo log will be journaling changes to records in the various database table files.

I could be wrong about what ReiserFS does, though.

Talli, that's interesting.  Perhaps the high-capacity DDR DIMMs haven't hit the end-user market yet but are available to OEMs?

Checking out Dell brings me to the conclusion that RAM is expensive. Adding 4 DIMMs would be $4400, which puts one chunk at $1100. Taking Don's note about the Crucial 1GB DIMM for $200, I think I know where Dell makes its money 😢.

Hi,

Maybe the best way to do it is in stages.  Move aolserver to a front-end box and keep using the Solaris box for Oracle.  Test the load and tune it.  Add RAM; RAID maybe, but maybe not, since these will be front-end boxes, so maybe just a huge amount of RAM to keep everything in cache.  After moving aolserver, Oracle can then be moved to another box tuned for an RDBMS (RAID 1, 1+0, etc.).  Test everything... and after a while upgrade Oracle.  Why upgrade Oracle?  Is there a benefit?

Since the images are on the file system, maybe another web server could serve them up.  I'm not sure, though; it will depend on how photo.net serves its images.  But have a look at www.mathopd.org; it's one of the fastest HTTP servers, secure and very light on resources.  Maybe add squid or some kind of reverse proxy.  I don't know if this will actually have benefits, as I haven't really done this.  It must have.... hehehe.

Use Intel-based servers built on a ServerWorks chipset.  I am not sure whether the LE chipsets have dual SDRAM channels, but I am sure the HE does, so it has basically 2x the memory bandwidth of other chipsets for Intel CPUs.  Stay away from BX and GX; they are getting old.  For AMD-based, I think the Altus server mentioned earlier is good.  I believe the design came from VA Linux and then made its way to Penguin Computing.  There is an extensive review of it on www.anandtech.com.  Anand uses a similar server for his Cold Fusion driven site.  Also, the AMD FSB is better than Intel's; I think it's called EV6 and came from the DEC Alpha.  From what I remember, the Intel FSB is shared among all the processors, so adding more processors reduces the bandwidth available to each.  AMD does not suffer from this problem, as each processor has its own dedicated bus to the chipset.

Has anybody tried out XFS?  I have been using it on my machine; it's OK, but I'm not sure since I have not benchmarked it.  I don't know if I am fortunate or not.  They say that Reiser is good for small files and XFS is good for big files.  Anybody have numbers on Oracle + Linux + XFS? and Pg + Linux + XFS?

Malte - at least some of Dell's newer servers use the Intel i840 chipset, which only supports RAMBUS memory.  Though prices on this have dropped, the last time I checked (which admittedly was a month or so ago) it was still about 3x as expensive as SDRAM PC 133.  Dell's memory prices have never been good, but it's not as bad as you might think from the raw comparison.

Well... it is as bad, but not just because it's Dell; you gotta factor in the "RAMBUS sucks" numbers.

I remember reading an interview with some Oracle tech about running Oracle on newer Linux kernels and the code and advice they had/will be offering. Specifically, the 2.4 kernel supports raw IO, and the Oracle tech mentioned a speed up of something around 30%. Raw IO allows Oracle to write straight to disk, bypassing the kernel buffer cache.

I haven't had a chance to try this myself so I don't know how easy this is in practice. One of the other nice things about the 2.4 kernel is the Logical Volume Manager. This would be useful in a situation where you wanted to run some journaled file systems, some raw partitions for the db, software RAID (which I've been very happy with, but my requirements don't match up to photo.net's...)

You can run more than one RAID level! Try a tablespace on a RAID 1 (mirrored) volume with raw IO for tables with data that must be preserved. Another tablespace for indexes (derived data, materialised views etc.) on a RAID 0 (striped) volume for extra speed and space efficiency. Use one of the journaling file systems for the rest, perhaps XFS considering the need for efficient io on large photos. At least RAID 1 for this, or RAID 5 if you have the disks and need the space.

You can play tricks with Oracle (9i comes with extra tricks...), for example if you carry through with the idea that indexes store non critical data and can be rebuilt, you can run that table space with the 'no logging' option. Can you even use a temporary tablespace?  Are your multikey indexes compressed? Are you using newer features like index organised tables, bitmap indexes, bitmap join indexes...

I thought the concurrency problem solved by running 10 nsd processes was suspected to be in the Oracle client libraries. It might be worth re-checking as Don mentioned, with 9i. The prefetch option was added recently to the 2.6 oradriver, which helps a little.  The driver currently doesn't support prepared statements, but (again...) a 30% speed improvement is mentioned in the Oracle docs when using this feature.

Keeping it all on one system would be less expensive, and easier from a management point of view. Modern PC hardware should be way faster than your old Sun box...

FWIW, Dell will match Crucial.com's memory price if you tell them
you're not going to buy RAM with the systems. And they're willing
to deal nowadays--guess those servers just aren't selling like
they used to.

Also, I see Red Hat 7.2 is out with ext3 as the default filesystem.

"ReiserFS only journals metadata, I thought? So there's not really duplication of effort, as ReiserFS will be journaling changes to inode structure and stuff like that while the redo log will be journaling changes to records in the various database table files."

While it's true that ReiserFS only journals metadata (in the journal located in the 8th through the 8210th 4k blocks on the Reiser partition), the problem is not duplication of effort but optimizing head movement. For instance, here is what would actually happen disk-wise if you were doing a simple "INSERT A INTO B" statement using PG 7.0 on ext2 vs PG 7.1.2 on Reiser:

  • File Data is Updated by flushing dirty page
  • Meta Data is updated (potentially at a separate place on platter)
VS
  • Reiser Log Updates Changes about to be made
  • Wal Log Updates Statements including meta data for WAL
  • File Data is changed in pg data directory
  • File MetaData is also updated in pg data directory
  • WAL checkpoints are issued at some point
  • Reiser Log is updated that transactions completed.

The point I am trying to make is that if you put the WAL files on a dedicated nonjournaling partition/platter then it reduces load on the rest of the FS and allows speedier execution. Seeks are expensive operations and the heads on the WAL files can remain where the data can immediately be flushed to disk.

I could be wrong but my initial testing seems to back up performance results and for more info people may want to check out these links:

As always I could be wrong ...

Well, I'm not an expert, but here's my take..
Are you going to take the hardware with you, or does that belong
to aD as well? Assuming you take it with you, I suggest:
Use the current hardware solely for Oracle. If I were paying
$30/CPU MHz or whatever the current Oracle price is I'd make
damn sure nothing unnecessary is running on the db box taking
cpu cycles away from oracle. This has the added bonus of
increased security, you can put an additional firewall between the
webserver(s) and the oracle box.
First priority would then be getting a front end box for the
webserver. If you don't wanna play around get for example a dual
cpu 1U box from some big vendor, say IBM (who apparently has
good linux support). If you are a little more adventurous, the dual
AMD systems are apparently very good, and certainly give better
bang for the buck than Intel stuff. See for example
http://www.anandtech.com/IT/showdoc.html?i=1514 for a review of
one such computer (apparently very good indeed).

As for choice of linux distro, this week both Red Hat 7.2 and Suse
7.3 were released, and both of them are surely solid for server
stuff. In addition, Suse sells something they call "Suse enterprise
linux 7", which is supposedly more tested and includes 1 year
support for the price of $600. This distro is certified for oracle 9i,
in case you need new db hardware as well. All of these distros
have some journaling fs as default (Red Hat ext3 and Suse reiserfs
I think), which is a good thing. But for oracle, be sure to check the
possibility of using raw i/o, which I think is in the 2.4.x kernels.

It might also be a good idea to have another server for images and
static pages. Plain apache or maybe even the new in-kernel
webserver in linux 2.4.x would do just fine. As for hard disks. For
the webserver it probably doesn't matter so much. But as most
servers have built-in SCSI you might as well get scsi disks. Two
disks mirrored with software raid (if hardware isn't available of
course!) would give some redundancy against drive failure. But for
the db definitely go for raid10. If the db isn't absolutely huge, the
price difference between raid10 and raid5 is small potatoes, when
you take into account the price of the rest of the system.
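
A rough way to see the size of that price difference is to count disks. The sketch below assumes the simple rules that RAID 10 needs two disks per data disk (with at least two data disks to stripe across) and RAID 5 needs one extra disk for parity (with at least three disks in total). The capacity and disk size in the usage lines are invented for illustration.

```python
import math


def disks_needed(level, usable_gb, disk_gb):
    """Disks required to offer usable_gb of space on disks of disk_gb each."""
    # Both layouts stripe data, so assume at least two data disks.
    data_disks = max(2, math.ceil(usable_gb / disk_gb))
    if level == "RAID 1+0":
        return 2 * data_disks    # every data disk has a mirror
    if level == "RAID 5":
        return data_disks + 1    # one disk's worth of parity
    raise ValueError("unknown RAID level: %r" % level)


if __name__ == "__main__":
    # Hypothetical: 100 GB usable on 36 GB disks.
    print("RAID 1+0:", disks_needed("RAID 1+0", 100, 36), "disks")
    print("RAID 5:  ", disks_needed("RAID 5", 100, 36), "disks")
```

For that example the difference is two disks (6 versus 4), so the extra cost of RAID 10 is the price of two drives and the slots to hold them.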

Carl's right -  I didn't consider the effects of head travel.

BTW, PG 7.0 has a transaction log, too (it's fsync'd with every transaction that does a write).  PG 7.0 also flushes every modified data page and fsyncs the table files after every transaction that changes data.

PG 7.1 only flushes the WAL and fsyncs it at transaction end - it doesn't flush data to disk, nor does it fsync the data file.  This is why inserts and updates run so much faster on PG 7.1 (data is flushed and table files fsync'd every so often - by default about every five minutes - in PG 7.1).  If there's a crash, the WAL is used to restore the data that didn't get written to the table files.
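
The 7.1 behaviour described above can be mimicked with a toy in-memory model. This is not PostgreSQL code and all the names are invented; it only shows the pattern: commit syncs the log alone, table files are synced at an occasional checkpoint, and after a crash the log is replayed.

```python
class ToyStore:
    """Toy write-ahead-log model: 'durable' parts survive crash(), the buffer does not."""

    def __init__(self):
        self.wal = []             # durable log of committed changes
        self.table_on_disk = {}   # durable table file
        self.buffer = {}          # volatile modified pages, lost on crash
        self.wal_fsyncs = 0
        self.table_fsyncs = 0

    def commit(self, key, value):
        self.wal.append((key, value))  # stands in for writing the log record
        self.wal_fsyncs += 1           # the only sync done at commit time
        self.buffer[key] = value       # table change stays in memory for now

    def checkpoint(self):
        # Done every so often: flush modified pages and sync the table file.
        self.table_on_disk.update(self.buffer)
        self.table_fsyncs += 1
        self.buffer.clear()
        self.wal.clear()               # logged changes are now in the table

    def crash(self):
        self.buffer.clear()            # memory is gone, durable parts remain

    def recover(self):
        # Replay the log to restore what never reached the table file.
        for key, value in self.wal:
            self.table_on_disk[key] = value
        self.table_fsyncs += 1
        self.wal.clear()

    def read(self, key):
        return self.buffer.get(key, self.table_on_disk.get(key))
```

In this model a thousand commits cost a thousand sequential log syncs and no table syncs until the next checkpoint, which is why keeping the log on its own fast, seek-free disk matters so much.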

So putting the WAL on an ext2 disk would get you the greatest performance gain, I should think.  That seems to fit with what you're telling us.