Forum OpenACS Q&A: My production site is down

Posted by Jade Rubick on
One of my AOLserver instances seems to have died at 6:00am this morning. The last thing that happened:
209.237.238.163 - - [17/Jan/2003:06:19:21 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.164 - - [17/Jan/2003:06:24:28 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.164 - - [17/Jan/2003:06:24:28 -0800] "GET /news/item?item_id=3106 HTTP/1.0" 200 9431 "" "ia_archiver"
209.237.238.159 - - [17/Jan/2003:06:24:36 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.159 - - [17/Jan/2003:06:24:37 -0800] "GET /news/item?item_id=3120 HTTP/1.0" 200 9644 "" "ia_archiver"
209.237.238.161 - - [17/Jan/2003:06:26:06 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.161 - - [17/Jan/2003:06:26:07 -0800] "GET /news/item?item_id=3204 HTTP/1.0" 200 9381 "" "ia_archiver"
209.237.238.162 - - [17/Jan/2003:06:28:37 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.165 - - [17/Jan/2003:06:29:29 -0800] "GET /robots.txt HTTP/1.0" 404 538 "" "ia_archiver"
209.237.238.163 - - [17/Jan/2003:06:30:20 -0800] "GET /news/item?item_id=3206 HTTP/1.0" 200 9243 "" "ia_archiver"
It looks like it got hit by a robot. I'm not sure why it would have gone down, though. No errors in the error log that I can see.

I have it set up using /etc/inittab, and it was working fine up until now.

I need to get this back up and running right away. Any suggestions? I think I might just restart the entire server and see if that helps.
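
For reference, the /etc/inittab arrangement mentioned above is typically a respawn entry like the following (a sketch; the entry id, runlevels, account, and paths are illustrative, not from my actual setup):

```
nsd1:345:respawn:/usr/local/aolserver/bin/nsd -i -u nsadmin -g web -t /web/servername/nsd.tcl
```

The -i flag keeps nsd in the foreground so init can supervise it and respawn it when it dies.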

Posted by Jade Rubick on
It looks like my error log had grown to 2G in size, so deleting it will probably help.

Shouldn't init respawn the process now? I seem to remember that it takes about five minutes... We'll see.

Perhaps rolling the logs should be put into the setup guide.
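
In the meantime, a cron.daily sketch along these lines can keep the error log from ever hitting the 2G mark (the path, the threshold, and the rotate_if_big name are all hypothetical; also note AOLserver keeps the old file handle open, so you still have to restart or signal it afterwards):

```shell
#!/bin/sh
# rotate_if_big LOGFILE MAXBYTES -- move the log aside with a
# timestamp suffix once it exceeds MAXBYTES. AOLserver keeps
# writing to the old inode, so restart or HUP it afterwards.
rotate_if_big() {
    log=$1; max=$2
    [ -f "$log" ] || return 0
    size=$(wc -c < "$log")
    if [ "$size" -gt "$max" ]; then
        mv "$log" "$log.$(date +%Y%m%d%H%M)"
    fi
}

# Hypothetical path and threshold; adjust for your installation.
rotate_if_big /web/servername/log/error.log 1900000000
```

This is only a stopgap next to the proper rolling described later in the thread, but it is cheap insurance.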

Posted by Roberto Mello on
Interesting. Have you tried restarting AOLserver yet?

Instead of using inittab, you might want to look at monit (http://www.tildeslash.com/monit/). monit is "a utility for monitoring and managing daemons or similar programs running on a Unix system. It will start specified programs if they are not running and restart programs not responding".

There's also djb's supervise, but as with every other djb tool, it's non-free, hard to set up, and non-intuitive to understand.

-Roberto

Posted by Jade Rubick on
It's back up again.

Roberto, if it's an easy process to contribute to the documentation, I'm interested in helping out. I think this should be at least mentioned in the documentation (maybe it is, and I didn't see it).

Posted by Tom Jackson on

Roberto Mello wrote:

There's also djb's supervise, but as with every other djb tool, it's non-free, hard to setup and non-intuitive to understand.

I'm sorry that I can't resist. Just download and install djb's daemontools package. It is a one or two minute install process. Then to start AOLserver you create a file called run in some directory. I use /web/control/servername. The run script should set any environment variables you need and then start AOLserver. Here is a copy of mine, which works for both pg and Oracle:

#!/bin/bash

export ORACLE_HOME="/ora8/m01/app/oracle/product/8.1.7"
export ORACLE_BASE="/ora8/m01/app/oracle"
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$ORACLE_HOME/ctx/lib:/usr/local/pgsql/lib:$LD_LIBRARY_PATH
export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/ctx/lib:$PATH
export ORACLE_SID='ora8'
export ORACLE_TERM='vt100'
export ORAENV_ASK=NO
export NLS_DATE_FORMAT="YYYY-MM-DD"
export LD_ASSUME_KERNEL=2.2.5
sleep 4

exec /usr/local/aolserver/bin/nsd -it \
    /web/control/iunicycle-oracle/iunicycle-oracle-nsd.tcl \
    -u iunicycle -g web

Then you just link the directory to the /service directory:

ln -s /web/control/servername /service

Your webserver should start in a few seconds.

If your server doesn't seem to start up, or constantly restarts without logging to error.log why it is dying (usually a startup file syntax error), you can use ps axww | grep readproctitle to show the last 400 characters of output.

Starting, stopping, and restarting your server:

svc -u /service/servername
svc -d /service/servername
svc -k /service/servername
svc -o /service/servername ;# start once, don't restart

Getting a status on how long the server is up/down:

svstat /service/servername

My opinion is software is never intuitive, but it can have simple commands and good documentation.

Posted by Roberto Mello on
I've installed AOLserver and OpenACS many, many times before, so I'm quite good at it by now. Does that mean it's easy to install? Not really.

Does that make daemontools less non-free? No.

djb tools, while of high quality software-wise (or so I hear; I only briefly read the code for qmail, and he still uses K&R syntax), like to install their things in non-standard locations, hence the "hard to set up and use" part of my post.

As long as there are free, secure alternatives that try to obey standards, I'll stick to them.

-Roberto

Posted by Don Baccus on
Here's a server log roll script that works for me. There's actually a more general one floating around that I simplified to make this one. It sits in cron.daily and works fine (it's hardwired to one of my server instance names).
#!/usr/bin/perl

## Sends AOLserver an HUP signal, so it will roll its logs. Takes as its
## only argument the name of the server to roll.

## modified from the version in ACS 4

$ENV{'PATH'} = '/sbin:/bin';

## get the PIDs of all matching servers
open(PID, "/bin/ps -efwww |") || die $!;
my @pids;
while (<PID>) {
  next unless /^\s*\S+\s+(\d+).*nsd.*\/birdnotes\.tcl/;
  my $pid = $1;
  push(@pids, $pid);
}
close PID;

print "Rolling ", join(" ", @pids), "\n";
kill 'HUP', @pids;
For this to work you have to have the AOLserver parameter "roll server log on SIGHUP" set. You can roll the access log from within the AOLserver .tcl config file. Both of these are documented in the AOLserver documentation, but I agree it would be nice to have something written up in our "set up your world" docs, just as we've done with my simple backup scripts.

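
The same PID hunt can be done in plain shell with pgrep from procps (a sketch; the hup_matching name and the nsd pattern are illustrative, and this likewise relies on the roll-on-SIGHUP parameter being set):

```shell
#!/bin/sh
# hup_matching PATTERN -- send SIGHUP to every process whose command
# line matches PATTERN. With AOLserver's roll-on-SIGHUP parameter
# set, a matched nsd will roll its server log.
hup_matching() {
    pids=$(pgrep -f "$1" || true)
    if [ -z "$pids" ]; then
        echo "no matching process"
        return 1
    fi
    echo "rolling: $pids"
    kill -HUP $pids
}

# Example (the pattern must match your own nsd command line):
# hup_matching 'nsd.*birdnotes\.tcl'
```

Be careful that the pattern is specific enough not to match unrelated processes.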

9: ns_logroll (response to 1)
Posted by Andrew Piskorski on
Ah, the eternal log rolling question raises its head yet again. :) Sending a SIGHUP like Don says definitely works (and I used to do it that way), but the best (simplest, easiest, cleanest) way is to use ns_logroll. Here's how I do it on one of my old systems:
ad_proc dtk_roll_server_log {} {} {
    set proc_name {dtk_roll_server_log}
    # This param controls how many backups of the server log to keep,
    # default is 10:
    #   [ns_config -int "ns/parameters" maxbackup]
    ns_log Notice "$proc_name: Rolling the server log now..."
    ns_logroll
    ns_log Notice "$proc_name: Done rolling the server log."
}

ns_schedule_daily 00 00 dtk_roll_server_log 

For OpenACS 4.6, that will work as is, but you should use ad_schedule_proc instead of calling ns_schedule_daily directly, like this:

ad_schedule_proc -schedule_proc ns_schedule_daily \
   [list 00 00] dtk_roll_server_log 

For other alternative approaches, see these posts (and the threads they are part of) from back in July and January 2002.

Posted by Samir Joshi on

Something similar happened to our site in early beta, twice in a row, 20 minutes apart. I believe I should be able to reproduce it under heavy load, but I haven't had the opportunity to simulate it in the last two weeks. I could not post it earlier, but here is the log excerpt:

Everything normal ...
....
203.163.150.198 - - [06/Jan/2003:19:49:30 +0530] "GET /dotlrn/admin/user-new-2?user_id=20202 HTTP/1.1" 200 5905 
203.163.150.198 - - [06/Jan/2003:20:00:58 +0530] "GET /dotlrn/classes/maths/calculus/f-y-b-sc-maths/?folder%5fid=21188&n%5fpast%5fdays=99999&page%5fnum=2 HTTP/1.1" 302 389
....
Everything normal again...
The server stopped responding after 19:49:30. After 8-9 minutes I rebooted the machine in a panic. Debug mode is on, but no error is displayed there, and everything looked normal until the last entry in the access log. The surprising entries in the error log (not for their content, but for their timestamps and the delay between them) are:
Everything normal ...
...
[06/Jan/2003:19:50:05][2554.3076][-conn0-] Debug: PgBindCmd: query with bind variables substituted =
    	select file_storage__new_file (
        	'NEW',           	-- title
        	'20281',          	-- parent_id
        	'19660',            	-- creation_user
        	'203.163.162.234',        	-- creation_ip
		true			-- indb_p
		);

[06/Jan/2003:19:50:05][2554.3076][-conn0-] Notice: Querying '
    	select file_storage__new_file (
        	'NEW',           	-- title
        	'20281',          	-- parent_id
        	'19660',            	-- creation_user
        	'203.163.162.234',        	-- creation_ip
		true			-- indb_p
		);'
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 stopping
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: nssock: triggering shutdown
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: serv: stopping connection threads
[06/Jan/2003:20:00:06][1276.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 starting
....
Things back to normal again...

That is to say, though no new HTTP connection was logged after 19:49:30, at least the server's logging thread was working up to 19:50:05. After that, despite many requests from clients, there are no HTTP access log or debug log entries. But when I rebooted the machine, AOLserver logged the 'AOLserver/3.3.1+ad13 stopping' message at 19:58:21.

There is no error logged in the Postgres log. I could not think of what might have gone wrong, or what happened between 19:50 and 19:58.

Many thanks in advance for clues, hints and help !

Posted by Jason Khong on
ns_logroll seems like a very simple, elegant solution. I'm quite surprised that it hasn't been more widely publicized :)

Are there any known drawbacks to using ns_logroll instead of SIGHUP? (a.k.a. which should I use?)

Posted by David Walker on
I used both by accident on one system.  Had a little log file covering midnight to 4am (cron.daily runs at 4am) and a big log file for the rest of the day.

I don't think there is any major difference.  If anything ns_logroll is more precise since you are telling the server exactly what you want while with the other method you are just telling it to HUP.

Posted by Tom Jackson on

I am not sure that Monit actually monitors anything except the existence of a pid file. This forces the user of Monit to create a startup script that figures out the pid of a process and writes it to a file. But Monit can't guarantee that the pid is actually related to the process you want to monitor.

Monit also does a bunch of other things only slightly related to controlling the daemon process. Configuration is not easy, since it is different for each daemon you want to set up.

Monit seems to suffer the same problems as processes controlled with init.d files; check out DJB's FAQ on the subject. One deficiency in svc is the inability to control the order in which services are started or stopped, which is what init.d does well. Of course, you could use daemontools' svc to control processes from your init.d files as well.

It is difficult to pick something and call it a standard. What is standard on Linux doesn't work on FreeBSD. DJB software runs from the same location on every Unix-type system; he was forced to create his own standard for his software because of the lack of a standard between platforms.

One really difficult standard is a definition of 'free'. Both GNU software and DJB software come with a copyright. GNU, I guess, calls their stuff 'free', using their own definition of the term. If I were to guess, DJB uses his copyright to protect the software he writes from hacking by less-informed individuals, where the result would be complaints that his software doesn't work or has bugs. If that makes it un-free by the GNU definition, so what? I noticed that OpenACS.org doesn't mind using .gif images, which are un-free; how can OACS be 'free' if it contains .gif images (at least 194 of 'em)?

Posted by Jeff Davis on
For people willing to use DJB tools, you should also take a look at multilog, which can roll logs based on size. (Also worth a look is dqd_log by Rob Mayoff, which lets you use multilog for AOLserver access logs.)
Posted by Roberto Mello on

"Monit also does a bunch of other things only slightly related to controlling the daemon process. Configuration is not easy, since it is different for each daemon you want to setup."

Configuration of supervise is also not easy, since each daemon needs its own start script. To be honest, I don't have that much experience with monit, and I never meant this to be a "monit vs. supervise" war. Use whatever you like best. I just gave my personal opinion that djb tools are hard to set up and use.

"DJB software runs from the same location on every unix type system..."

Overgeneralizations are dangerous. That is only the case if you install it the djb way; it doesn't mean it will always happen that way. There are packages out there that build binaries from djb tools and install them in more sane locations.

"If I were to guess DJB uses his copyright to protect the software he writes from hacking by less informed individuals"

What you did not say is that it also discourages very informed individuals from improving those tools, since their modifications have to be distributed as patches. Allowing direct changes would mean more collaboration and better software for everybody, unless you believe djb is some sort of godly entity who is perfect and whose work is impossible to improve.

By "free" I meant "I can change it and redistribute my changes in binary and source forms". If you really need an answer to your "so what?" question, then I suppose (and I mean no offense by this) that you're missing the point of open source/free software, IMHO.

The .gifs in OpenACS will eventually be replaced; they are a left-over. The images themselves are not un-free. The LZW algorithm, which is used to encode those images, is covered by a patent.

-Roberto

Posted by Roberto Mello on
Incidentally, I'd like to mention that I find supervise to be unreliable.

paje, the #openacs bot, is monitored by supervise on the openacs.org box. While it works well most of the time, supervise will just fail to restart paje when it dies sometimes.

-Roberto

Posted by Tom Jackson on

From the looks of it, Monit doesn't restart a zombie process, although it will send you a nice email. But just to clear things up: there is no configuration associated with supervise; you just make the package and link the directory containing your run file to the /service directory.

Every program has a startup script, or at least a command you have to run to start it, so it isn't surprising that both Monit and supervise use them.

Monit requires that your script, or your program create a pid file, and that it gets deleted when the process dies or is stopped.

Monit requires a configuration section for each process, something like the following (for sshd):

 check sshd with pidfile /var/run/sshd.pid
   start program  "/etc/init.d/sshd start"
   stop program  "/etc/init.d/sshd stop"
   host 192.168.1.1 port 22 protocol ssh
   timeout(5, 5)
   alert foo@bar
   alert 123456@sms on { timeout }
Posted by Mat Kovach on
By "free" I meant "I can change it and redistribute my changes in binary and source forms". If you really need an answer to your "so what?" question, then I suppose (and I mean no offense with this) that you're missing the point about open source/free software, IMHO

I'd like to point out that if you look at the daemontools source, 90% of the code is in the public domain (which I know, since I'm repackaging that code and releasing it under a different license).

Posted by Tom Jackson on

Mat,

I was just thinking of that. A few months back I was interested in using some of djb's APIs, and I noticed that a number of authors had already gathered his library functions together in one place.

Where is your collection?

Also, just to note that two hackers can easily come to quite different opinions. Roberto's original comment that

As long as there are free secure alternatives, who try to obey to standards, I'll stick to them.
is a useful goal. The argument seems to have been over what we think 'free', 'secure', 'alternative', and 'standards' (and 'them') actually mean. Sorry to be so Clintonesque this weekend. It is easy to label something easy to use once you become familiar with its limitations.

Now let me stop blabbering and try to come up with one of 'them' 'alternatives' for browsing tcl source code!

Posted by Mat Kovach on
[Okay guys, this is slightly off topic, but what the heck]

I'm involved with a User Group in Northeast Ohio, USA.

http://www.nooss.org

I am using the public domain software to start a project within the group [announcement here].

I have a copy of the start of the code available at:

http://devel.alal.com

That code is still public domain.

The plan is to have a "ready for primetime" distribution near the end of February. I'd be open to suggestions going forward.

I am also using that code to create an ns_cdb module for AOLserver (which I need for some qmail/vpopmail/OpenACS integration).

I think this is getting a bit off topic, but feel free to contact me via email if anybody wants any more detail, has suggestions, or wants to help.

20: djblib (response to 19)
Posted by Andrew Piskorski on
Mat, hey, that sounds like a useful project. Particularly if we can have Tcl bindings to djblib. :) Sorry, no suggestions or bright ideas, but please do keep us informed.
Posted by Brian Fenton on
Coincidentally, I had to implement error log rolling today. Here's what I did, based on a working ACS 3 version written by mailto:arjun@openforce.net:

In your server's .tcl file put these 2 lines after ns_section ns/parameters:

ns_param  logroll            on
ns_param  maxbackup          30

Then create 2 files in /acs-tcl/tcl, one called logroll-init.tcl and the other called logroll-procs.tcl.

logroll-procs.tcl contains the following:
#
# logroll.tcl - Rolls the server log on a daily basis
#
# mailto:arjun@openforce.net
# May 17, 2001
# version 0.1
#
# Note: This script is for rolling the _server_ log not the _access_ log!
ad_library {
    TCL library for the rolling error log.

    @creation-date 23/01/2003
}

#
# Directions
# ----------
# 1. Set the MaxBackup ns_param in the "ns/parameters" section of your config
# file to the desired number of backups. NOTE: The AOLServer documentation
# incorrectly calls this param "LogMaxBackup"! You can verify this param is set
# correctly by seeing something like "Notice: conf: [ns/parameters]maxbackup = 6"
# in your server log.
#
#
# Further Work
# ------------
# - Verify the log got rolled, if not send email
# - Check for disk space
# - Allow for non-daily intervals
# - scp logs to a remote site(s)

# You don't have to edit anything below this line
ad_proc -public roll_server_log {} {
    Proc to roll server error log
} {
    set MaxBackupValue [ns_config -int "ns/parameters" MaxBackup]
    if {$MaxBackupValue == ""} {
        # The default is 10
        set MaxBackupValue 10
    }

    ns_log Notice "logroll.tcl: About to roll server log. ns_param MaxBackup = $MaxBackupValue"
    ns_logroll
    ns_log Notice "logroll.tcl: Just rolled server log"
}

logroll-init.tcl contains the following:
# Adjust the parameters below. Hours are between 0-23,
#  and minutes are between 0-59
set hour 16
set minute 00

ns_log Notice "Scheduling logroll"

# Schedule "roll_server_log" to run at the desired time
ns_schedule_daily $hour $minute roll_server_log

Then just bounce your AOLserver.

Posted by Jade Rubick on
Brian, thank you. That worked very well.

Shouldn't this be something we do by default?

WHEREAS:

1) unless you implement log rolling, your server will go down

2) hapless, pitiful souls who do not know they have to implement log rolling will fill our esteemed Forums with pleading requests for assistance, and bother those with better things to do with their time.

3) A solution is easy to implement, and easy to control.

SHOULD WE NOT THEREFORE:

1) add it to the distribution?

Posted by Michael Bluett on
About the size of the error log: I have OpenACS 4.6 on my home system, with the Notifications and ACS-Mail-Lite packages (necessary for the forums?) installed. After 3 hours the error log has grown by about 500,000 bytes (half a meg); that's about 4 MB per day. I think some sort of default error log rotation would be useful.
Posted by Michael Bluett on
Looking at the packages I have installed, I will probably attempt to reduce the log size by commenting out notices in packages/notifications/tcl/reply-sweep-init.tcl, packages/search/tcl/search-procs.tcl, and packages/acs-mail-lite/tcl/search-procs.tcl.
I think this is the best way to keep the error log smaller than the traffic log :)
Posted by Joel Aufrecht on
added log-rolling to the to-do list for docs - I can just add it to the default config.tcl file, right?
Posted by Tilmann Singer on
As far as I know, AOLserver can only roll the access logs automatically, not the error log, which is a pity. I would like to hear if there is a common solution for this, or whether AOLservers newer than 3.3ad13 can do it.
Posted by Bart Teeuwisse on
Tilmann,

one way to roll the access log is mentioned in https://openacs.org/forums/message-view?message_id=34256.

Here's my Tcl version:


#
# logroll.tcl.postload - Rolls the server log on the same basis as the access log.
#
# bart.teeuwisse@thecodemill.biz
# Oct 07, 2001
# version 0.1 based on prior work by
# arjun@openforce.net 
# May 17, 2001
# version 0.1 
#
# Note: This script is for rolling the _server_ log not the _access_ log!
#
# Directions
# ----------
# 1. Set the "ServerLog" in the ns/parameters section and the 
# "RollDay", "RollHour", "RollFmt" parameters in the
# ns/server/'yourserver'/module/nslog section of your config file.
#
# 2. Place this script in a Tcl directory sourced at server startup
#
# Further Work
# ------------
# - Verify the log got rolled, if not send email
# - Check for disk space
# - scp logs to a remote site(s)

# Roll the server log and give it an extension of the current date and time.

proc roll_server_log {serverlog rollfmt} {
    ns_log Notice "logroll.tcl: About to roll server log."
    ns_logroll
    set date [clock format [clock seconds] -format $rollfmt]
    if {[file exists "$serverlog.000"]} {
	file rename "$serverlog.000" "$serverlog.$date"
	ns_log Notice "logroll.tcl: Just rolled server log into $serverlog.$date"
    } else {
	ns_log Warning "logroll.tcl: Just rolled server log but couldn't move it to $serverlog.$date"
    }
}

# Create argument list

set args [list]

# Find out where the log is stored.

lappend args [ns_config "ns/parameters" ServerLog]

# Roll the log when the access log is being rolled.

set rollday [ns_config "ns/server/[ns_info server]/module/nslog" RollDay]
set rollhour [ns_config -int "ns/server/[ns_info server]/module/nslog" RollHour]
set rollminute 0

# Use the same roll format as the access log.

lappend args [ns_config "ns/server/[ns_info server]/module/nslog" RollFmt]

if {$rollday == "*"} {

    # Schedule "roll_server_log" to run at the desired time

    ns_schedule_daily $rollhour $rollminute roll_server_log $args
} else {

    # Schedule "roll_server_log" to run only on RollDay days.

    ns_schedule_weekly $rollday $rollhour $rollminute roll_server_log $args
}

Place this file in the /web/'yourserver'/tcl directory, follow the directions in the header, and restart the server. Et voila!

/Bart

Posted by Randy O'Meara on
An alternate method would be to use 'multilog', which is part of 'daemontools'. It will roll logs based on various conditions, one of which is file size. I have used multilog for quite some time without a problem. I particularly like the ability to use the 'svc -a /service/<servicename>/log' command to roll the logs on demand.

Another benefit in using multilog is that I can (and do) send package-specific log entries to a separate log file. This allows me to look at just the messages related to the package/module that I'm working on at the moment without having to weed through voluminous non-related log entries. There is some setup and a small amount of coding required to do this. For me, it's worth the trouble. It essentially involves adding a unique prefix string to messages as they are logged, and then using multilog to stream these prefixed messages to a separate file.

If there's interest, I will provide an example of this type of setup.

Randy

Posted by Bart Teeuwisse on
Randy,

please do post an example setup for package specific log files using multilog.

/Bart

Posted by Andrew Piskorski on
Tilmann, please read the earlier posts above in this same thread! AOLserver is quite capable of rolling both the access and the error/server logs itself. (AFAIK this applies to all versions, but definitely applies to AOLserver 3.3.) The methods for the two are gratuitously different, unfortunately, but they're both very easy. And both the easiest and second easiest ways to roll the error log (ns_logroll, SIGHUP) are both explained above, particularly in my Jan. 2003 post.
31: default OpenACS solution (response to 25)
Posted by Andrew Piskorski on
"added log-rolling to the to-do list for docs - I can just add it to the default config.tcl file, right?"

Joel, you're on the right track but it's not quite that simple.

First, here's all the stuff I have in my AOLserver config.tcl related to both the server and access logs. You may want to change some of the settings for the OpenACS default config.tcl (I'm not sure), but you should at least look at what all of these do:

ns_section ns/parameters 
 
  ns_param serverlog $error_log_full_filename($which) 
  # Number of days to keep old error logs around: 
  ns_param maxbackup 7 
 
ns_section ns/server/$server_name/module/nslog 
 
  ns_param File $access_log_full_filename 
  # Number of days to keep old access logs around: 
  ns_param maxbackup 7 
 
  ns_param EnableHostnameLookup Off 
  ns_param LogRefer Off 
  ns_param LogUserAgent Off 
  ns_param maxbackup 7 
  ns_param RollDay * 
  ns_param RollFmt %Y-%m-%d-%H:%M 
  ns_param RollHour 0 
  ns_param RollOnSignal On 
  ns_param RollLog On 
  ns_param ExtendedHeaders X-User-Tracking 

The settings above will take care of rolling the access log without doing anything else. But to roll the server log, you need to schedule ns_logroll (usually once per night, at midnight). See my Jan. 18, 2003 definition of the dtk_roll_server_log proc above.

I think using a simple helper proc like that is a bit nicer than scheduling ns_logroll directly, because it writes a notice to the log before and after telling you what's going on. So you just need a proc like that (probably named "ad_roll_server_log"), and the call to ad_schedule_proc.

Hm, I don't see anywhere better to put it, so ad_roll_server_log should probably just go into packages/acs-tcl/tcl/utilities-procs.tcl, and the call to ad_schedule_proc should go into the matching utilities-init.tcl file.

Once that's all done... then the only thing to put into the docs is pointers to those places, and telling people how to change it in case they need to.

Posted by Randy O'Meara on
I will begin with the assumption that daemontools has been installed and configured to run the target site. I will fill in more of the basic daemontools details if requested. Rest assured that this is not as complicated as it looks. This could probably be explained with far fewer words than I use below.

If you think this idea is useful, it could be generalized and incorporated into the system logging procedures very easily. This would allow selectively logging any package's messages to a separate file.

Do yourself a favor and install the daemontools man pages if you have not already done so. Get them from http://smarden.org/pape/djb/manpages/daemontools-0.76-man.tar.gz.

So, you should already have a /service directory, and there should also be a symlink /service/site that points at the site's root directory. Following this symlink and listing the directory contents should show that a 'run' file exists. This 'run' file contains the daemontools startup script for your site.

BASIC MULTILOG CONFIGURATION

Create a symlink to your physical logging location at the root of your site, and configure the log directory and run script:

    # cd /service/site
    # ln -s /var/log/site log
    # mkdir log/error
    # cd log
    # touch run
    # chmod 775 run
    # emacs run

======== run file content ========

#!/bin/sh
exec setuidgid aolserverUsr \
            multilog s3999999 ./error

(note that aolserverUsr is the account you've configured to run the site service. Older install docs suggest nsadmin:web, but the newer docs suggest a more secure setup)

Restart the site, set supervise ownership, and verify that basic logging is working as expected:

    # svc -d /service/site
    # svc -u /service/site
    # chown -R aolserverUsr.aolserverGrp .
    # chmod -R g+rwx supervise

You should now check the contents of the log/error directory. The 'current' file will grow with error log content and will automatically roll at around 3999999 bytes. By default, multilog keeps the latest 10 log files, but this number is configurable.

You can roll the logs manually, if you wish:

    # svc -a /service/site/log
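
For reference, the retention count is set with an n action next to the size action in the run script (a sketch; the count of 20 and the aolserverUsr account name are illustrative):

```
#!/bin/sh
# keep up to 20 files of ~4 MB each in ./error
exec setuidgid aolserverUsr multilog s3999999 n20 ./error
```

The actions are applied in order, so s and n both need to appear before the directory they govern.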

PACKAGE-SPECIFIC MULTILOG EXTENSION

This setup will allow you to log package-specific messages to a separate log file. This is in addition to the main log file configured above. The key to making this work dependably is to format your log messages with a unique prefix string so that multilog can then find them and place them in a separate file. This is extremely useful in developing a package since you don't have to peruse voluminous, unrelated log messages looking for the ones you've logged from your package. My setup includes the following.

Two package parameters:

        debug      default = "1"
        logPrefix  default = "!pkgmain"

nsv accessor procedures:

        pkg::debug
        pkg::logPrefix

At package initialization time, I place the debug and logPrefix values into package-specific nsv variables so that they can be changed dynamically at runtime.

A log procedure:

        pkg::log

I wrap my log/debug messages inside a check for debug:

    if {[pkg::debug]} {pkg::log -msg ...}

Note that, if you wish, you can turn debug logging on/off and make the prefix unique for *each* instance of a mounted package by changing the parameters for any mounted instance in the site-map.

Change the log/run file:

======== run file content ========
#!/bin/sh
exec setuidgid aolserverUsr \
            multilog s3999999 +* ./error \
            -* +*!pkg*:* ./pkg

(note that this will log all messages containing the wildcard string "!pkg*:" to the directory log/pkg. Active logging will be to file 'current', and these logs will roll just like the log/error directory does.)

SUPPORT CODE

========= pkg-init.tcl =========

nsv_set pkg_global logPrefix [pkg::parameter -name logPrefix -default "!pkgmain"]
nsv_set pkg_global debug [pkg::parameter -name debug -default 0]

========= pkg-procs.tcl =========

ad_proc -public pkg::log {
    {-msg:required}
} {
    Logs pkg messages with a specified <b>logPrefix</b>,
    where <b>logPrefix</b> is a pkg package parameter settable at the Site Map
    package mount point.<br>

    NOTE: if <b>!pkg</b> is used as the first part of <b>logPrefix</b> and multilog (part of
    Daemontools package) is being used to manage the service error log, then a line like the
    following can be used to append all pkg log messages to separate "pkg" log:<br>

    exec setuidgid nsadmin multilog s3999999 +* ./error -* +*!pkg*:* ./pkg

    @param msg the string to be logged.
    @return nothing.
} {
    ns_log notice "[nsv_get pkg_global logPrefix]: $msg"
}

ad_proc -public pkg::debug {
} {
    Returns the state of the global pkg debug boolean (actually stored in the pkg_global nsv).
    This value is used throughout the pkg application to turn on verbose debug logging.
    It can be set through the APM interface, as 'debug' is a pkg parameter.

    @return true (1) or false (0) depending on the pkg 'debug' parameter setting.
} {
    return [nsv_get pkg_global debug]
}

ad_proc -public pkg::parameter {
    {-name:required}
    {-default ""}
} {

} {
    return [parameter::get -package_id [get_package_id] -parameter $name -default $default]
}