We had another lengthy outage of openacs.org. I don't know what the
specific problem was, and for the purposes of this thread, I don't
really care, because the outage pointed to a different sort of
problem, which I want to address here:
Our maintenance of and responsibility for openacs.org appears to be
utterly disorganized. First I'll try to describe some of the symptoms
of the problem, then I'll make some suggestions for improvement.
Cases in point:
Over the course of the evening, a bunch of us, all long-time OpenACS
participants, logged onto IRC, all saying original, witty stuff like,
"Heh, openacs.org is down??".
Several of us have shell accounts on openacs.org. We could see that
the box had recently rebooted, and that no AOLservers were running at
all. Some of us (I forget who) made other comments about various
things that were or might have been broken on the box. But basically,
none of us could really troubleshoot much of anything, because we have
no idea how the box is really set up (and it is more complicated than
most servers running OpenACS websites, trust me), nothing is
documented, etc.
Furthermore, even if we had been able to figure out what was wrong, we
couldn't have fixed it, because none of us have sudo on the box. Much
more damningly, none of us even knew who has sudo on the box!
We sort of guessed, and some folks took it upon themselves to email
other OpenACS people who - they figured - might be able to do
something about the problem.
Note that outages like this aren't new. One happened just a little
more than a week ago, on the 18th, and there have been others off and
on before then.
This time, I think one of the Furfly folks eventually logged in and
fixed something. Note that this outage, and any related
disorganization, is emphatically not their fault. They just
got the box from OpenForce not long ago, and even if they wanted or
planned to figure out, overhaul, document, and improve all the
openacs.org stuff on the box, I can't see how they'd have had time to
do it yet. And anyway, as far as I know, Furfly volunteered to host
the machine and donate bandwidth, not to provide free unlimited
sysadmin support.
Here are some suggestions:
One, we really need a list of everyone who has sudo on openacs.org, so
we know who might be able to fix something. This is easy: anyone who
does have sudo should be able to give us that list by looking in
/etc/sudoers. If someone sends it to me, I'll see if I can get it put
up somewhere, maybe on the developer's page.
Like it or not, you folks with sudo are the de facto on-call list,
until we come up with something better - see below.
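
For whoever ends up pulling that list together, here is a minimal
sketch in Python of reading the rule lines out of a sudoers-style file.
The function name and the sample text are made up for illustration, and
nothing here reflects the real file on openacs.org. It assumes one
user, %group, or alias per rule, and it does not expand aliases or
follow include directives, so treat the result as a starting point for
a manual check rather than an authoritative answer.

```python
# Hypothetical helper: list who appears at the start of each sudoers rule.
ALIAS_KEYWORDS = {"User_Alias", "Runas_Alias", "Host_Alias",
                  "Cmnd_Alias", "Cmd_Alias"}


def sudo_principals(sudoers_text):
    """Return the users, %groups, and alias names that begin a rule line."""
    # Join lines that were continued with a trailing backslash.
    joined = sudoers_text.replace("\\\n", " ")
    principals = []
    for raw in joined.splitlines():
        line = raw.strip()
        # Skip blanks, comments (this also skips #include lines), @include.
        if not line or line.startswith("#") or line.startswith("@include"):
            continue
        first = line.split()[0]
        # Skip Defaults settings and alias definitions; they are not rules.
        if first.startswith("Defaults") or first in ALIAS_KEYWORDS:
            continue
        if first not in principals:
            principals.append(first)
    return principals


# Made-up sample text, NOT the contents of the real /etc/sudoers.
SAMPLE = """\
# sample only
Defaults env_reset
User_Alias ADMINS = alice, bob
root ALL=(ALL) ALL
%wheel ALL=(ALL) ALL
ADMINS ALL=(ALL) ALL
"""

if __name__ == "__main__":
    for who in sudo_principals(SAMPLE):
        print(who)
```

Entries starting with % are groups and capitalized names like ADMINS
are aliases, so someone still has to look up the actual people behind
them before posting the list.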
Two, somebody needs to document just how the heck things are
configured, and fix anything egregiously broken. (E.g., PostgreSQL
and all AOLservers really need to start up automatically
every time the box reboots...) The most obvious person for this job
may be whoever already knows how things are configured,
because they've been maintaining at least some stuff on the box,
unbeknownst to the rest of us, up till now. Failing that, if no such
person can be found, or that person does not want to or is not able to
drive this to completion, I hereby volunteer myself as a fallback, and
will attempt
it in my spare time, if some appropriate leadership-type person gives
me the authority and ability to do so.
Three, the above docs on how openacs.org works need to go up on the
website somewhere, so that it's actually feasible to troubleshoot
problems. (It's just stupid for every single knowledgeable person who
might be able to help to have to start from scratch with trivial,
basic stuff like, "Ok, just where is the damn AOLserver error log on
this box??")
Four, perhaps we then come up with some more formal system of knowing
just who is responsible for what, who is available to fix problems
(and when), etc. I imagine that once we have items one through three
in hand, this should become much more feasible.