Forum OpenACS Q&A: Re: organize openacs.org!

6: Re: organize openacs.org! (response to 1)
Posted by Mike Sisk on
About last night:

First, a few specifics: this machine is a Dell 2550 with a 1-GHz PIII CPU, 1.5 GB RAM, and two 16-GB SCSI3 drives in hardware RAID 1. We have 7 other machines just like this one.

This machine is still running the stock Red Hat 7.1SBE and kernel 2.4.3-6 (a Dell-specific version of Red Hat). This kernel has a known bug in the RAID controller code along with other problems.

I signed this machine up for an enterprise entitlement to the Red Hat Network so we can manage it with the rest of our machines. Unfortunately I couldn't install some updates until the kernel was updated.

We do most of our server maintenance that requires a machine reboot in the Saturday evening thru Sunday morning timeframe. Usually, for kernel updates on a new machine, I drive to the datacenter and do the work from the console on an ASCII terminal in our cage. That's what I did last night.

Unfortunately, the new kernel went into a kernel panic. Instead of trying to figure it out, I jotted down some notes and rebooted the machine into its old kernel. Another problem on this machine is that one of its drives is "sticky" -- on most of our Dells it only takes a few seconds for the SCSI devices to come up to speed and get past the SCSI BIOS "settle"; this machine takes about 3 minutes from a cold restart to get past the SCSI BIOS.

Another problem once the machine booted: daemontools was no longer running. This was rather perplexing since it worked on startup when we installed the machine two weeks ago. We couldn't find any trace of it in any of the startup scripts so we eventually just installed a new copy of daemontools in the DJB-specified location and fired the sites up.

That was the cause of the extended downtime last night.

Sysadmin 101 tells us that to improve security and reliability of a server you need to restrict the folks with root access. There are too many cooks on this machine.

We can do one of two things:

1. Let the community handle sysadmining of the machine and all we do is make sure power and Ethernet are plugged into the machine, or

2. Let us handle the sysadmining of the box along with the other 20 or so we have just like it.

I'd rather we handle the sysadmin of the box. You folks need to be working on OpenACS, not hacking on Linux.

Of course, we'll still allow sudo access to a restricted and community-approved list of folks -- we'll handle the basic sysadmin parts of the box and let you folks manage the applications.

Now, we need to get this box under control. It's been hacked on by so many folks over so long that it's a mess. It is also severely lacking in disk space.

(BTW, since the reboot last night and the new daemontools install, the load on the machine has gone down remarkably. It used to have a consistent load of over 3; now it's below 0.5.)
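For anyone who wants to keep an eye on that load figure, here is a minimal sketch in Python. The function name `describe_load` and the two cut-offs are illustrative assumptions taken from the numbers above (over 3 before, below 0.5 after); they are not thresholds anyone has agreed on. `os.getloadavg()` is the standard-library call that returns the 1-, 5-, and 15-minute load averages on Unix systems.

```python
import os


def describe_load(load, high=3.0, low=0.5):
    """Classify a load average against two illustrative cut-offs.

    `high` and `low` are assumptions borrowed from the figures in this
    post, not tuned thresholds.
    """
    if load >= high:
        return "high"
    if load < low:
        return "low"
    return "moderate"


if __name__ == "__main__":
    # os.getloadavg() returns (1-min, 5-min, 15-min) averages on Unix.
    one_min, five_min, fifteen_min = os.getloadavg()
    print(f"5-minute load {five_min:.2f} is {describe_load(five_min)}")
```

The 5-minute average is used here only because it smooths out short spikes; any of the three values could be checked the same way.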

I propose this: sometime during the next few weeks we bring the services on the machine down. I'll then do a filesystem snapshot and put the contents of the entire filesystem on a NAS device. I'll then update the machine to Red Hat 9 (or whatever the community wants -- I just insist on a Red Hat version so it'll play nicely with the Red Hat Network) and put in a new set of disks to bring its capacity up to 50 GB. I'll set up qmail, daemontools and the other services the machine requires and mount the NAS partition with the old filesystem. Then someone else will need to install and configure Postgres, AOLserver and the sites.
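As a rough illustration of the "copy everything to the NAS" step, the sketch below archives a directory tree into a compressed tarball using Python's standard `tarfile` module. It is only a sketch under stated assumptions: the function name `snapshot_tree` and the example paths are hypothetical, it is a file-level copy rather than a true block-level snapshot, and it assumes the services are already stopped so files are not changing underneath it. The actual snapshot may well be done with different tools.

```python
import tarfile
from pathlib import Path


def snapshot_tree(source_dir, archive_path):
    """Archive source_dir into a gzip-compressed tarball.

    Returns the number of entries written. The archive must live
    outside source_dir, otherwise it would try to include itself.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the tree's own directory name
        # instead of the absolute path on the old machine.
        archive.add(source, arcname=source.name)
    # Re-open the archive to confirm it is readable and count entries.
    with tarfile.open(archive_path, "r:gz") as archive:
        return len(archive.getnames())


if __name__ == "__main__":
    # Hypothetical paths: a site directory and a mounted NAS share.
    count = snapshot_tree("/web/example-site", "/mnt/nas/example-site.tar.gz")
    print(f"archived {count} entries")
```

Re-reading the archive after writing it is a cheap sanity check before the old disks get pulled; it does not verify file contents.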

Unfortunately, this is a lot of work. But I think it'll make things easier in the long run.

BTW, netsaint is no longer being started because it was sending its data off to techsquare, who Ben was using to admin the box. We don't use netsaint (which isn't supported anymore -- it's now Nagios) and I haven't got it running under our monitoring yet.