Forum OpenACS Q&A: organize openacs.org!
Our maintenance of and responsibility for openacs.org appears to be utterly disorganized. First I'll try to describe some of the symptoms of the problem, then I'll make some suggestions for improvement. Cases in point:
Over the course of the evening, a bunch of us, all long-time OpenACS participants, logged onto IRC, all saying original, witty stuff like, "Heh, openacs.org is down??".
Several of us have shell accounts on openacs.org. We could see that the box had recently rebooted, and that no AOLservers were running at all. Some of us (I forget who) made other comments about various things that were or might have been broken on the box. But basically, none of us could really troubleshoot much of anything, because we have no idea how the box is really set up (and it is more complicated than most servers running OpenACS websites, trust me), nothing is documented, etc.
Furthermore, even if we had been able to figure out what was wrong, we couldn't have fixed it, because none of us have sudo on the box. Much more damningly, none of us even knew who has sudo on the box! We sort of guessed, and some folks took it upon themselves to email other OpenACS people who - they figured - might be able to do something about the problem.
Note that outages like this aren't new. One happened just a little more than a week ago on the 18th, and there have been others off and on before then.
This time, I think one of the Furfly folks eventually logged in and fixed something. Note that this outage, and any related disorganization, is emphatically not their fault. They just got the box from OpenForce not long ago, and even if they wanted or planned to figure out, overhaul, document, and improve all the openacs.org stuff on the box, I can't see how they'd have had time to do it yet. And anyway, as far as I know, Furfly volunteered to host the machine and donate bandwidth, not to provide free unlimited sysadmin support.
Here are some suggestions:
One, we really need a list of everyone who has sudo on openacs.org, so we know who might be able to fix something. This is easy: anyone who does have sudo should be able to give us that list by looking in /etc/sudoers. If someone sends it to me I'll see if I can get it put up somewhere. Like it or not, you folks with sudo are the de facto on-call list, until we come up with something better - see below.
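As a rough illustration of how someone with sudo could pull that list out of /etc/sudoers, here is a minimal Python sketch. The function name `sudo_principals` and the sample text are made up for the example, and the parsing is deliberately simplified: it ignores aliases, `Defaults` lines, line continuations and include directives, so treat the output as a starting point to check by hand, not as an authoritative audit.

```python
# Hypothetical helper: lists the users/groups that begin each rule line
# in sudoers-format text. Simplified on purpose (no alias expansion,
# no line continuations, include directives are treated as comments).

def sudo_principals(sudoers_text):
    """Return the users and %groups named at the start of each rule line."""
    skip_prefixes = ("Defaults", "User_Alias", "Runas_Alias",
                     "Host_Alias", "Cmnd_Alias")
    principals = []
    for raw in sudoers_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith(skip_prefixes):
            continue  # settings and alias definitions, not rules
        # The first field may be a comma-separated list of names.
        for name in line.split()[0].split(","):
            if name and name not in principals:
                principals.append(name)
    return principals


# Sample text for illustration only; NOT the real openacs.org file.
SAMPLE = """\
# sample sudoers
Defaults env_reset
root    ALL=(ALL) ALL
alice,bob ALL=(ALL) ALL
%wheel  ALL=(ALL) ALL
"""

if __name__ == "__main__":
    print(sudo_principals(SAMPLE))
```

On the real box the text would have to be read from /etc/sudoers by someone who already has root or sudo, since the file is normally not world-readable.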
Two, somebody needs to document just how the heck things are configured, and fix anything egregiously broken. (E.g., PostgreSQL and all AOLservers really need to start up automatically every time the box reboots...) The most obvious person for this job may be whoever already knows how things are configured, because they've been maintaining at least some stuff on the box, unbeknownst to the rest of us, up till now. Failing that, if no such person can be found, or does not want or is not up to driving this to completion, I hereby volunteer myself as a fallback, and will attempt it in my spare time, if some appropriate leadership-type person gives me the authority and ability to do so.
Three, the above docs on how openacs.org works need to go up on the website somewhere, so that it's actually feasible to troubleshoot problems. (It's just stupid for every single knowledgeable person who might be able to help to have to start from scratch with trivial, basic stuff like, "Ok, just where is the damn AOLserver error log on this box??")
Four, perhaps we then come up with some more formal system of knowing just who is responsible for stuff, is available to fix problems (when?), etc. I imagine that once we have items 1 through 3 in hand, this should become much more feasible.
You are absolutely correct. I have sudo on the openacs box. Here is the list of people with sudo access:
- Arjun, Don, ts (dunno who that is), Yon, Josh (dunno also), E Lorenzo, Ben Adida, Dan Wickstrom, Jeff Davis, Janine, Lars, Peter M, and myself.
Arjun and Yon probably should be removed from that list for the time being. I think Erick Lorenzo worked for OpenForce as well, so he might need to be taken off for now as well. And we need to find out who "ts" is.
Some things have changed since the box was moved, but here's my understanding of how the box is set up:
It's a Dell running Red Hat with kernel 2.4.3 (needs to be upgraded). When it was under techsquare supervision, the machine was not being actively updated. I noticed that we had lots of outdated packages with published vulnerabilities. I contacted techsquare but not much was done that I could see.
About 6 months or so ago (maybe a year?) I manually downloaded a bunch of RPM packages and upgraded a bunch o' stuff. I don't think the box has been updated since. During my upgrade I broke CVS for a few hours (RPM changed all the config files without warning. Thanks RPM.)
A few months ago or so I moved all AOLserver instances on the machine to be under supervise.
Supervise used to be under /var/netsaint/bin along with other djb utilities. That directory no longer exists. It seems it has been replaced with the djb-blessed /command.
It seems netsaint is not being started on boot. The symlink from /etc/init.d/ is dead:
lrwxrwxrwx 1 root root 25 Jan 31 2002 S91netsaint -> /etc/rc.d/init.d/netsaint
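A dead symlink like this is easy to miss by eye. As a sketch only, the following Python function lists symlinks in a directory whose targets no longer exist; the function name `dangling_symlinks` is invented for this example.

```python
import os

def dangling_symlinks(directory):
    """Return (link name, target) pairs for symlinks whose target is missing."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows the link, so it is False for a dead one.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append((name, os.readlink(path)))
    return broken
```

Calling `dangling_symlinks("/etc/init.d")` on the box should then report S91netsaint if the link is still dead; the same check can be pointed at any other startup directory.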
I'm not fully versed on the djb utilities yet, so I don't know how svscanboot is supposed to be started and who is supposed to start it. I think there's some script that you can run off of init to start it and it'll handle the services under it.
We have PostgreSQL 7.1 and 7.2.3 running. 7.1 is there solely for dotlrn.org I think. We should get rid of that and upgrade to 7.2.4. They run out of /usr/local/pgsql and /usr/local/pgsql-7.2.3 respectively. Both are started on boot via init scripts.
PG 7.2 is analyzed hourly via a /etc/cron.hourly script I set up from another script. Databases are backed up to /var/DB_BACKUPS via /etc/cron.daily scripts set up by techsquare.
CVS is backed up to /var/CVSROOT_BACKUPS.
Besides all the things you've mentioned (all of which I agree with), I think it's imperative that we hook up the box under the Red Hat Network foo so it is kept up-to-date. I notice that rhnsd is running, so I don't know if that's been taken care of by Furfly already (thanks for the work, Furfly!)
I was just thinking, in my "step two" above, I volunteer to document stuff regardless of whether I end up doing any sysadmin understand/fix stuff or not. I can whip up a simple HTML doc with all the areas I think we need to fill in, add Roberto's info above, add what I know, etc.
This doc will be a directory (initially just one file) of dead simple HTML files, in the style (initially) of some of the various CVS and Oracle docs I have on my personal site. It will be dead simple static HTML so that I and anyone else with CVS access to that directory can edit it as easily as possible, and so that it can easily be read offline.
On the box, I think the place to put it is /web/openacs.org/www/doc/openacs.org. That will make it show up on openacs.org, once it's committed and then checked out to the production server. Is that the best place?
Seems I already have write access to that directory (I think CVS commit too, although I'm not really sure), so once I hear that putting the doc there is a good idea I'll get started on a draft and do it.
My hope is that I can get a document started, which will both provide some useful info (see Roberto's above!) and start laying out all the areas we need to fill in... And then that the document will eventually become the first line of reference for anyone maintaining openacs.org, and that all such maintainers will contribute to it as a matter of course.
Thanks for putting together the OpenACS.org maintenance document - great initiative! I have bookmarked it now so that I can look at it next time I need to do some admin task on the box.
Minor feedback on the doc - I am missing from the sudoers list. Also, the document should mention who has root - Mike Sisk?
I think that we should make OpenACS.org a project under openacs.org/projects/openacs and link to the maintenance doc from there. Do we already have a ticket tracker for openacs.org? Instead of having a link to mailto:email@example.com at the bottom of each page, shouldn't we link to a Bug Tracker instance instead? It is obviously good if emergency instructions are not only served by openacs.org itself and the same probably goes for the Bug Tracker.
First, a few specifics: this machine is a Dell 2550 with a 1-GHz PIII CPU, 1.5 GB RAM, 2 16-GB SCSI3 drives in hardware RAID 1. We have 7 other machines just like this one.
This machine is still running the stock Red Hat 7.1SBE and kernel 2.4.3-6 (a Dell-specific version of Red Hat). This kernel has a known bug in the RAID controller code along with other problems.
I signed this machine up for an enterprise entitlement to the Red Hat Network so we can manage it with the rest of our machines. Unfortunately I couldn't install some updates until the kernel was updated.
We do most of our server maintenance that requires a machine reboot in the Saturday evening thru Sunday morning timeframe. Usually, for kernel updates on a new machine I drive to the datacenter and do the work from the console on an ASCII terminal in our cage. That's what I did last night.
Unfortunately, the new kernel went into a kernel panic. Instead of trying to figure it out I jotted down some notes and rebooted the machine into its old kernel. Another problem on this machine is that one of its drives is "sticky" -- on most of our Dells it only takes a few seconds for the SCSI devices to come up to speed and get past the SCSI BIOS "settle"; this machine takes about 3 minutes from a cold restart to get past the SCSI BIOS.
Another problem once the machine booted: daemontools was no longer running. This was rather perplexing since it worked on startup when we installed the machine two weeks ago. We couldn't find any trace of it in any of the startup scripts so we eventually just installed a new copy of daemontools in the DJB-specified location and fired the sites up.
That was the cause of the extended downtime last night.
Sysadmin 101 tells us that to improve security and reliability of a server you need to restrict the folks with root access. There are too many cooks on this machine.
We can do one of two things:
1. Let the community handle sysadmining of the machine and all we do is make sure a power-plug and Ethernet is plugged into the machine, or
2. Let us handle the sysadmining of the box along with the other 20 or so we have just like it.
I'd rather we handle the sysadmin of the box. You folks need to be working on OpenACS, not hacking on Linux.
Of course, we'll still allow sudo access to a restricted and community-approved list of folks -- we'll handle the basic sysadmin parts of the box and let you folks manage the applications.
Now, we need to get this box under control. It's been hacked on by so many folks over so long that it's a mess. It is also severely lacking in disk space.
(BTW, since the reboot last night and the new daemontools install the load on the machine has gone down remarkably. It used to have a consistent load of over 3, now it's below 0.5.)
I propose this: sometime during the next few weeks we bring the services on the machine down. I'll then do a filesystem snapshot and put the contents of the entire filesystem on a NAS device. I'll then update the machine to Red Hat 9 (or whatever the community wants -- I just insist on a Red Hat version so it'll play nicely with the Red Hat Network) and put in a new set of disks to bring its capacity up to 50-GB. I'll set up qmail, daemontools and the other services the machine requires and mount the NAS partition with the old filesystem. Then someone else will need to install and configure Postgres, AOLserver and the sites.
Unfortunately, this is a lot of work. But I think it'll make things easier in the long run.
BTW, netsaint is no longer being started because it was sending its data off to techsquare, whom Ben was using to admin the box. We don't use netsaint (which isn't supported anymore -- it's now Nagios) and I haven't got it running under our monitoring yet.
This sounds like a good plan. One thing that would be helpful in the future when you need to reboot or take down the box is to visit the IRC channel; people usually stop by whenever openacs.org is not responding.
Thanks for all your help.
I may be doing that myself in the next few days on my new machine but no promises at this point.
I still think we should have a publicly visible "How to maintain openacs.org" doc though. (Maybe Furfly already has some of that sort of stuff for their own and clients' use?) It can say at the top that Furfly is responsible for sysadmin stuff, call them, etc., or whatever. But we should still have a document with up-to-date, accurate info explaining everything someone might need to know about the box, about maintaining the website, about who is responsible for what, etc. Note that this is not at all just sysadmin stuff, although certainly a significant amount of it is, but rather, everything specific we need to know about the infrastructure to maintain, debug, etc. openacs.org. Comments?
It's reasonable that Mike would see it that way, since apparently he's the responsible sysadmin who'd just been handed someone else's big messy pile of toys he has to carefully organize, repair, and upgrade, but I personally doubt that the openacs.org box is messy because, "It's been hacked on by so many folks over so long".
In my time at aD and elsewhere, I've seen plenty of boxes that were under tight "control" (where control was defined as, taking away anyone else's ability to fix or change anything on the box), by "professional" sysadmins, but which were more or less a mess. Why? IMO, because those nominally in "control" were in fact quite disorganized; never wrote anything down, no documents, no notes, no how to fix things in an emergency, no who to call, no internal how-to or best practices info, no delegation, little or no communication with anyone else, nothing. It was the disorganization, plus the resulting lack of any real transparency or specific accountability, that was the problem, not the lack (or presence) of so-called "control" over the box. IMNSHO.
We've discussed standards quite a bit, and this is an opportunity to establish an open document for maintaining a high traffic oacs site. Joel's already done a good job on this, too.
Don't want to push furfly to do this, since they have plenty of other work on their hands. But we can probably do this as a group.
For instance the openacs.org site itself. One thing I'd like to see happen very soon is that the code on the site and the cvs copy be synch'd because last I heard they're out of whack. People have hacked files in place, or so I've heard.
Other things I don't expect Mike/Janine to handle are things like maintaining the access control lists for CVS commit rights. It would be nice to have stuff like that written down.
And basic stuff regarding starting/stopping services on the box, as I hope Mike continues to take vacations and I would hope that Janine might at times join him.
But, Andrew, Mike's diagnosis of this box is accurate. It's never really been professionally sysadmin'd. I never saw any evidence that TechSquare did much of anything to the box. Roberto, Ben, a few others and I have taken turns mucking with it in our spare time.
So, should I go ahead and commit my draft "Maintaining openacs.org" doc beneath the doc/ directory on openacs.org?
The best approach would seem to be to rebuild the server from scratch on a different box, then port over all of the content, then take the old box down. Of course that takes an extra box. How much of the content is static and how much dynamic? Can we go to core 4.6.2 or has that been patched up too?
How about a virtual journal, where anybody who touches the machine makes an entry? At a minimum each entry should have a name, a date, and a comment. Should that live as a file on the same machine, an ETP on that machine, an ETP elsewhere?
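To make the journal idea concrete, here is a minimal sketch of the "file on the same machine" variant: an append-only text file with one tab-separated line per entry holding a date, a name, and a comment. The function names and the file layout are assumptions for illustration, not an agreed format.

```python
import datetime

def format_entry(name, comment, when=None):
    """Build one journal line: ISO date, name, comment (tab-separated)."""
    when = when or datetime.date.today()
    # Collapse newlines and tabs so each entry stays on a single line.
    comment = " ".join(comment.split())
    return f"{when.isoformat()}\t{name}\t{comment}\n"

def append_entry(path, name, comment, when=None):
    """Append an entry to the journal file at `path` (created if missing)."""
    with open(path, "a", encoding="utf-8") as journal:
        journal.write(format_entry(name, comment, when))
```

A plain file like this can be read with any pager during an outage and could also be kept under CVS next to the maintenance doc; an ETP page would be nicer to browse but would not be reachable while openacs.org itself is down.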
What is the backup-recovery strategy for the box?
I am running AOLserver 3.3+ad13 + RH 9 + PG 7.3.x. RH9 comes with PG 7.3.x, although I think Lamar has made RPM packages of 7.2.4 for RH9. So far it runs, but if you are more conservative maybe you should go with RH 8 or RH 7.3. I have had to change some stuff to make my Java stuff run, and I am not sure how much AOLserver is affected by the new threading stuff. So far AOLserver seems to run fine. However, I did not try to run daemontools anymore, since I was too lazy to recompile it.
Migrating the setup is more work, and I agree with your decision that it is better in the long run. I have taken this route a couple of times too: you inherit a mess, get the needed stuff, then migrate to a known clean setup.
Also, those with root privs have to be few. Like you said, too many cooks. Andrew did point out that some or most of those don't document things, etc. But I think if you know the cook is good then stick with only 1 cook and 2 backup cooks. Adding more cooks only makes things worse, even with good cooks; more poor cooks make it even worse. Also, I was able to set things up so that developers have no root access and are still able to do whatever they want with their OpenACS sites. Something similar can be done for the openacs.org site: each site has its own user, and only the developers who need it can access that user.
At least you've got control of your own box; in my current situation I have no control of the box. The client box is a mess, and although I would like to clean it up, I don't have time to go back to sysadmin stuff.
I also think that once openacs.org is migrated to a good setup people will not have to care much about it. The real hard work of sysadmin stuff is at the start; if the sysadmin is still doing work after setup then he/she did a poor job at the start.
Andrew: yes, I've had the opportunity to work on some of the old Arsdigita machines and know what you mean. And I know all too well the type of sysadmin you're talking about--we call that sort of sysadmin the BOFH (do a google search on BOFH to read all about it).
For my own sysadmin style I like to follow the Principle of Least Astonishment. That is, I like to put things and run them in a manner which is least likely to confuse someone else who may need to sysadmin the box. This means that qmail and daemontools are installed in the default locations, even though I find DJB's directory structure to be highly bizarre. Normally I install as much as I can with rpm and leave things in their default installation directories. Services should all be set up so they're in chkconfig and controlled by the service scripts.
Generally I don't document much because you can query rpm to find out what was done, when, and where it is. And with the Red Hat Network and up2date you get Debian apt-get like functionality (i.e. "up2date postgres" will fetch the latest postgres and install it and any dependencies it needs).
Once the machine is set up and stable I don't expect to need much in the way of general sysadmining beyond occasional updates to various packages. I get daily reports on bandwidth, uptime, disk usage, and log file activity so I keep up on any problems that might develop.
Whatever Red Hat version you folks want is fine with me. Most of our servers are on either 7.2 or 7.3. But I have several on 8.0 and one machine on 9. It's probably best to stick with 7.3 or 8.0 for the OpenACS box to minimize risk.
I normally give warnings if extended downtime is expected, but I had hoped the upgrade for openacs.org would be a quick reboot and that'd be it. Besides, it's either Saturday night or Sunday depending on your location and you folks are all supposed to have lives and be away from the computer. ;) It's us sysadmin folks that get stuck working weekends.
I normally keep one spare server-class machine that can assume the identity of any of our production servers. My spare server at the moment is a Dell 2450 with a 733-MHz CPU, 1-GB RAM, and 50-GB of RAID 5 disk and Red Hat 8.
We could move openacs.org over to this box. However, Ben donated the current server to the community and I'd like to keep openacs.org on it if for no other reason than the box isn't furfly property and could be moved elsewhere if the community wanted to. We could move openacs.org to the spare server and move it back to the upgraded box but that doubles the work.
Backup recovery for the machine depends on what happens. Full filesystem backups are done nightly. If a machine has a massive hardware failure I just move the site to the spare server using the latest backups (from an online 1-TB disk array -- tapes suck). If a disk fails I have spares in the cage at Exodus and they can be quickly swapped in and the RAID array rebuilt on the fly. The Sun world spends a lot of time talking about bare-metal recovery--for Linux I find it quicker to just reload the whole OS from CD and look at it as an opportunity for an OS upgrade.
I just wanted to thank you for all your efforts and for managing our server in such a professional manner! I'm quite relieved to hear that you have such a firm grasp of the situation and solid plans laid out already.
If you do things this way you have no revision control if you change things (which makes people reluctant to change things). Also, you can easily do cvs diff on the whole tree to see exactly what the diff is. If OpenACS.org needs custom pages and code - fine, keep them in separate files and packages and do revision control on those.
If we need to change core code inline we are doing something wrong. If we can't support a site like OpenACS.org out of the box with OpenACS then I seriously think we need to reconsider which business we should be in. I like the Plone approach a lot, once you've installed it you have your own site that looks *exactly* like plone.org. That's what we should aspire to too.
Running OpenACS.org off the latest release is a great idea.
Jeff Davis suggested openacs.org should be a branch of the main CVS repository. We do make changes to the code, so version control of those changes is important. Using a branch will allow us to also easily move any bug fixes we perform on openacs.org back into the main code base. Running on the latest release is a great way to test (but not the only way we should be testing.)
The main constraint to this has been organization of all the people we needed to make this change. I think waiting until the server is reorganized would be a good idea.
Can openacs.org send notifications by e-mail about the use, extension, or change of a "maintenance window"? IRC will only work for some.
BTW: "Maintenance" notifications (and strategies for emergency handling, like hard disk failure or DDoS) are useful to document for OpenACS sites in general.
Using a branch sounds like a reasonable compromise to me. I don't quite see why we need to wait for server changes to make this change though.
I figured we might have some of the same people needed to make the changes to the code at the same time the server is reconfigured. That's the only reason I suggested it. If we can get it done sooner, no problem.
I bet this was originally an aD box. "ts" would be "TechSquare", to whom aD subcontracted some sysadmin stuff in the early days. You can remove the privileges from that account.
But this seems like "I didn't do it, so I'll just re-write it" kind of thinking.
Peter, regarding openacs.org running off current code, yes, we really need to do that.
I am in finals right now so I can't help right now (or next week, when I'll be going to Brazil and will be without net access for that first week), although I'd like to help.
We can figure out when we branched off the OpenACS CVS and figure out which packages we can upgrade via APM and which might need special attention/handling (forums will, I think).
"Do we really have to go through rebuilding the whole OS installation? I didn't think that was necessary."
I'd rather not do it, but I do think it's necessary.
One of the two drives on this machine is slow to spin up, which means it probably has a bad bearing. It has a hot-swap backplane that can hold 4 drives but it only has 2 in right now. Drive diagnostics don't report any problems with the drives so I'd have a 50% chance of selecting the right drive to swap out.
I'd rather not gamble, so I'd replace both drives with 4 good ones. I'd reconfigure the RAID from a straight mirror of the two drives to a RAID 5 stripe over all 4. This will give us 50-GB of disk space without us having to spend any money since I have a supply of 18-GB drives I've swapped out of some of our servers when I upgraded them to larger units.
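For anyone checking the arithmetic behind that capacity figure: a mirror gives the capacity of one drive, while RAID 5 gives the capacity of all drives but one, because one drive's worth of space holds parity. The sketch below uses nominal drive sizes only; the function names are invented for the example, and the gap between the nominal 54 GB and the quoted 50 GB is presumably formatting and filesystem overhead.

```python
def raid1_usable_gb(drive_gb):
    """A two-drive mirror stores one copy per drive: usable = one drive."""
    return drive_gb

def raid5_usable_gb(drive_count, drive_gb):
    """RAID 5 spends one drive's worth of space on parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drive_count - 1) * drive_gb

if __name__ == "__main__":
    print("current mirror of 16-GB drives:", raid1_usable_gb(16), "GB")
    print("RAID 5 over four 18-GB drives:", raid5_usable_gb(4, 18), "GB")
```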
Drives for these machines ain't cheap; these are SCSI3 drives with SCA backplane connectors.
I can also use this opportunity to remove the Windows partition Dell sticks on all their servers from the factory to run their mostly-worthless diagnostics.
If it were my job to do, I would do it the same way. In fact, I have done migrations on RAID the same way, keeping one of the mirrors for archives and for an emergency "everything else broke" reboot/recovery disk.
The only thing I would recommend is that a specific time that is convenient to Mike be scheduled well ahead of time, with a prominent notice on the web page and a post made to all forums (so that everybody who gets notifications can be notified even if they don't visit the site regularly). Then we all know to expect the downtime, which will probably be several hours even if everything goes perfectly.
From reading his post, I am quite comfortable with his ability to perform this.
On another issue, I am of the opinion (having sysadminned for 15 years) that there should be at most three people with administrative rights. These three people need to be in physically separate areas. These three people need root, either directly or via sudo. These three people need to be able to work well together and be able to agree on sysadmin style, etc. If Mike just wants to be the physical machine's admin, then I suggest that at most three core team members (or core team designees) have root access rights.
More 'cooks' than that is a recipe for trouble. No, I don't want it, either. I have root on too many machines already....
For specific admin access, others could be members of groups with access to directories and files for specific uses. But more than three roots can easily cause trouble.