Forum OpenACS Q&A: OpenACS polling webcrawler and confusing my local system.

I have OpenACS 3.2.4 running on my local RedHat 6.1 computer.

    I have been playing with it, trying all sorts of modules and the host
computer is my main Linux box too.

    OpenACS or AOLServer is tangling with my existing mail cron job and
my name-resolution, internet, or mail setup. It seems to be trying to do things on
the internet when I am not connected. It seems to be getting caught
when my cron job fetches mail and then the cron job drops the ppp link
while OpenACS must be struggling with a name problem or no response.

    What I notice is screwy behavior when I start Netscape (locally).
Netscape never comes back from some startup process. I don't know how
to find what is hung up. Resolv.conf only has two IP addresses for my
ISP's nameservers.

    So far, simply killing netscape or other recently started processes
doesn't seem to work. Usually I end up stopping AOLServer and
restarting GNOME.

    I have received error message emails where OpenACS is trying to
access http://info.webcrawler.com/mak/projects/robots/active/all.txt

    I have grepped and found this URL in a couple places. Is the lookup
performed when OpenACS reads that main configuration file when it
starts?

    It seems to me OpenACS and AOLServer are running into either a name
lookup problem or my intermittent internet connection is causing
something to hang incomplete.

    Can you point me to OpenACS modules to study and troubleshooting
tools to use?

    At one point I set the localhost entry in /etc/hosts to 127.0.0.0, and
that caused the most amazing error messages!

    Thanks for your suggestions.
It is trying to read a public list of known robots/bulk downloading
programs to load into the database.  It uses that list to block
attempts to download your entire site by such robots.  On a mature ACS
Classic or OpenACS installation, the volume of questions and answers in
bboard forums, news, ecommerce pages, etc. can be very large.  When a
bulk downloader takes a swing through your system, the load can grow to
the point where in effect you're suffering a denial of service (some of
these downloaders download your site by chasing links in parallel rather
than serially).

And if you pay for bandwidth consumption, which is common at some
ISPs, this can also become expensive for the site owner.

So, short story is that the kind people at webcrawler.com publish a
list of robots which dive into sites and ignore the robots.txt file
that describes which parts of a site should and shouldn't be traversed
by a robot.  ACS/OpenACS use that list to protect your server.
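
As a rough illustration of the idea (not the actual ACS code, which is
Tcl and loads the list into the database), here is a minimal Python
sketch.  It assumes the list is made of "field: value" lines with a
"robot-useragent" field in each record; that layout, the sample data
and both function names are assumptions for the example only.

```python
def parse_robot_agents(text):
    """Collect the robot-useragent values from 'field: value' lines.

    The field name is an assumption about the list's layout.
    """
    agents = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "robot-useragent":
            value = value.strip()
            if value:  # skip records that leave the field empty
                agents.append(value)
    return agents


def is_listed_robot(user_agent, agents):
    """True if any listed agent string occurs in the request's User-Agent."""
    ua = user_agent.lower()
    return any(agent.lower() in ua for agent in agents)


# Made-up sample records, for illustration only.
SAMPLE = """\
robot-id: examplebot
robot-name: ExampleBot
robot-useragent: ExampleBot/1.0

robot-id: other
robot-name: Other Crawler
robot-useragent: OtherCrawler
"""

agents = parse_robot_agents(SAMPLE)
```

A server would run a check like is_listed_robot() against each request's
User-Agent header and refuse the request on a match.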

You can turn this off, but you'll have to dig through the code a bit to
find out where.  My active servers and at-home machines have full-time
connections, so I've never had to do so myself.

  There's also a spam (e-mailing) daemon that wants to run
periodically, but if you don't queue alerts or user messages it
shouldn't actually do anything other than peek at the database tables
looking for messages.

    Lee> 	What I notice is screwy behavior when I start Netscape
    Lee> (locally).  Netscape never comes back from some startup
    Lee> process. I don't know how to find what is hung
I've seen this problem when not connected to the internet. I assume that Netscape is doing a name lookup so that it can talk to netscape.com. I found that if I used netscape-navigator instead of netscape-communicator the problem went away. I think that netscape-communicator is linked to netscape by default on Red Hat systems.
    Lee> 	I have received error message emails where OpenACS is
    Lee> trying to access
    Lee> http://info.webcrawler.com/mak/projects/robots/active/all.txt
To get around this problem while working unconnected to the internet, I just set up a separate instance of aolserver to act as the info.webcrawler site. The URL for this is configurable in the ad.tcl file, so I just set it to http://localhost:8001/all.txt.
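
If you'd rather not configure a second aolserver just for this, any small local web server that answers for /all.txt should do.  Here is a minimal sketch that swaps in Python's built-in http.server; the function name and the idea of keeping a saved copy of all.txt in a directory are my assumptions, and port 8001 simply matches the URL above.

```python
import functools
import http.server
import threading


def start_stub_server(directory, port=8001):
    """Serve the files in `directory` on 127.0.0.1:`port` in a background thread.

    Put a saved copy of all.txt in `directory` so that
    http://localhost:8001/all.txt answers without an internet connection.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Call server.shutdown() and server.server_close() to stop it.
    return server
```

Because the thread is a daemon, the stub only lives as long as the Python process that started it, so keep that process running while OpenACS is up.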
Your problem with Netscape can be solved by getting a DNS cache.  Try
installing something like dnsmasq or pdnsd.  I found pdnsd to be more
stable.
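
To check whether name lookups really are what stalls when the link is
down, you can time one yourself.  This is only a sketch: the function
name and the 5-second limit are arbitrary choices of mine, and it uses
nothing beyond Python's standard socket module.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def timed_lookup(hostname, timeout=5.0):
    """Resolve `hostname` in a worker thread and report how it went.

    Returns (status, seconds), where status is "resolved", "failed" or
    "timed out".  A lookup that times out keeps running in its thread
    until the system resolver gives up; we just stop waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(socket.getaddrinfo, hostname, None)
    try:
        future.result(timeout=timeout)
        status = "resolved"
    except TimeoutError:
        status = "timed out"
    except socket.gaierror:
        status = "failed"
    finally:
        pool.shutdown(wait=False)
    return status, time.monotonic() - start
```

Run it against an outside host name with the ppp link up and again with
it down.  If the second run reports "timed out", programs that look up
names at startup will stall in the same way.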