Forum OpenACS Q&A: Search Engine cloaking / IP delivery

I am looking at a method to detect an IP and redirect to a specific
page in the case of search engines. Technically this is simple to
do, but the challenge is obtaining a list of search engine IPs.
This company provides such a service, but uses a closed-source Perl
script: http://www.ip-delivery.com/
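
For illustration, here is the kind of thing I mean: a minimal AOLserver Tcl sketch that bounces requests from known engine IPs to a crawler-friendly page. The IP addresses, the proc name, and the /robot-index page are all hypothetical; the hard part is still getting a real, maintained list.

    # Sketch only: the IPs below are placeholders, not a published list.
    proc from_search_engine_p {} {
        set engine_ips [list 192.0.2.10 192.0.2.11]
        return [expr { [lsearch -exact $engine_ips [ns_conn peeraddr]] != -1 }]
    }

    # At the top of a page that should serve crawler-specific content:
    if { [from_search_engine_p] } {
        ns_returnredirect "/robot-index"
        return
    }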

I would like to hear what the forum says about this.

I am of the opinion that it would benefit search engines and the
searching public to have these IPs published. Views?

Posted by John Sequeira on
Well-behaved search engines send a distinctive user agent to the web server. I think it'd be much easier to find the official list of search engine user agents than a list of their IP addresses.

I suggest you do a google search on 'search engine robot user agents'.

FWIW, this link looks pretty complete and has IP addresses for a few.
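
As a rough sketch of the user-agent approach under AOLserver (the patterns below are only illustrative, not the official list):

    # Return 1 if the User-Agent header looks like a known crawler.
    # The patterns are examples only; consult a maintained robots list.
    proc robot_user_agent_p {} {
        set ua [string tolower [ns_set iget [ns_conn headers] "User-Agent"]]
        foreach pattern {*googlebot* *slurp* *scooter* *lycos*} {
            if { [string match $pattern $ua] } {
                return 1
            }
        }
        return 0
    }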

Posted by Tilmann Singer on
Are you sure that you want to show different content to search engines than to normal visitors? This contradicts what the search engines are interested in, and I'd suspect that the more sophisticated ones check indexed sites from time to time by visiting them from a non-published IP with a standard user-agent header, to make sure they get the same, or at least not totally different, content that way as they do when visiting the sites "officially".
Posted by MaineBob OConnor on

I'm also interested in this because we have three different user states:

  • Visitor -- No Cookie, Not logged on
  • Guest -- Has Cookie, logged on as Guest
  • Member -- Has Cookie, logged on paying Member

At this point, all robots are Visitors and we show visitors intro stuff. But much of our content is available only to members and a smaller subset to Guests.

For example, if a member tries to send a visitor a link to member only content in the bboards, the visitor is redirected to a logon page or a page saying that they need to become a member.
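
(For context, here is a stripped-down sketch of how a page can dispatch on those three states. ad_verify_and_get_user_id is the standard ACS session check; member_p stands in for our own paying-member lookup and is hypothetical.)

    # Simplified three-way dispatch; member_p is a hypothetical helper
    # that checks the paying-member flag for this user.
    set user_id [ad_verify_and_get_user_id]

    if { $user_id == 0 } {
        # Visitor: no cookie, not logged on; show intro content
        ad_returnredirect "/intro"
        return
    } elseif { [member_p $user_id] } {
        # Member: full content
    } else {
        # Guest: logged on but not paying; show the smaller subset
    }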

So, we want the search engines to index our stuff, and I've thought of having some robot-only pages that can do graceful redirects for real users who come in from the SEs.

Other thoughts on how to handle this?

Thank you
-Bob

Posted by C. R. Oldham on
Bob,

But Google caches pages, so Google's cache will contain the content that you want to restrict to your members, and non-members can see it for free.

Posted by Matthew Terenzio on
Search engines cache pages for indexing. They can't legally serve full pages of your content, or have I missed something? Otherwise I'll declare myself a search engine and start serving up NBC news.
Posted by Jade Rubick on
Matthew, Google caches pages for display. I'm not sure what the
legal issues are with this, but it's awfully handy when a site is no
longer there.
Posted by James Harris on
Google definitely caches pages for display.  It was very handy when ArsDigita did their Orwellian rewriting of the bulletin board content during their dispute with PhilG.  Although they removed content from their site, it was still visible in Google's cache.
Posted by MaineBob OConnor on

C.R. wrote:

    Bob, But Google caches pages, so Google will contain the content that you want to restrict to your members, and non-members can see it for free.

This is not true. If they are not a member, they are redirected to another page. In other words, for the same URL there are three DIFFERENT pages seen: Visitor (no cookie), Guest (with cookie), and Member (with cookie). These are dynamic pages requiring a database lookup.

Spiders don't have a cookie, so they are "Visitors" and would see the intro content or the page that lets them register as a Guest or Member.

-Bob

Posted by Malcolm Silberman on
Of course ACS has robot detection: http://serverspace.com/doc/robot-detection . However, I fear that relying on the easy-to-fake USER_AGENT variable could be a problem.
Allan Regenbaum kindly sent me a repair to robot-detection:


Repair of the robots facility on 3.x.

A couple of changes are required to make robot detection work.

First, some user agents are too long, so:

SQL> alter table robots modify ( robot_useragent varchar(200) );

Second, in /tcl/ad-robot-defs.tcl, the call that fetches the robots file
in ad_replicate_web_robots_db needs to change from

    set result [ns_geturl $web_robots_db_url headers]

to

    set result [ns_httpget $web_robots_db_url]


The URL for the list of robots has also changed, per the response to Malcolm's
post. In your service.ini:

[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file (note: this is the new URL)
WebRobotsDB=http://www.robotstxt.org/wc/active/all.txt
; which URLs ad_robot_filter should check (uncomment to turn the system on);
; this example causes a robot check on any visit to /ecommerce
FilterPattern=/ecommerce/*
; FilterPattern=/members-only-stuff/*
; the URL where robots should be sent; create this directory with
; pages which suit the robots
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
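
Conceptually, what the filter does with those settings is roughly the following. This is a simplified sketch, not the actual package code; the proc name and query are illustrative.

    # Simplified sketch of the robot check: if the User-Agent matches a
    # row in the robots table, send the request to the robot pages.
    proc robot_check_sketch {} {
        set ua [ns_set iget [ns_conn headers] "User-Agent"]
        # Does any robots-table entry appear in the incoming User-Agent?
        set is_robot_p [db_string robot_exists_p {
            select count(*) from robots
            where upper(:ua) like '%' || upper(robot_useragent) || '%'
        }]
        if { $is_robot_p > 0 } {
            # Matches RedirectURL in the .ini above
            ns_returnredirect "/robot-heaven/"
            return 1
        }
        return 0
    }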
Posted by Matthew Terenzio on
So if I want to take some of my content off the net, I can't if it has
been cached by Google. They control my content. Also, do they only
serve cached content when the site is unavailable, since all the links
go to the actual sites?
Posted by Cathy Sarisky on
Google offers the cache (on the last line of each listing is a "Cached" link) regardless of whether or not the site is up or has new content.  The direct link to the site is certainly more obvious, and will probably be followed by the naive user over the cached link.  (The cached page includes a note that it may be stale and a link to the current content on your site.)  If you're getting requests for images without their corresponding pages, you might be seeing someone looking at Google's cache of a page (since it will grab images from your site normally).  Or of course someone might have linked to one of your images for use on their website.  Isn't bandwidth theft fun?

I see no sign of any attempt by Google to determine if your site is up before offering the cache.  (Really, would that be feasible?  It wouldn't be fast!)

As soon as you let caching robots skip registration, you are allowing any savvy user to do the same (at least for READING your content) by using Google's caching feature.  Old content which isn't available any more does seem to disappear from Google.  And content which is changed on my site does seem to turn up in changed form on Google in a month or less, as expected.

Posted by Jade Rubick on
I don't remember the exact address, but there is a website/search engine/spider that keeps historical snapshots of websites. It's pretty interesting to go back and check how your webpages evolved.

I don't remember the address, unfortunately. I think Scott G. published it here once.

Posted by David Cohen on
You're talking about the Internet Wayback Machine, a spin-off of Brewster Kahle's Alexa project: web.archive.org

(FWIW, I may have been the first one to mention it on openacs.org when I described how I found my ACS pages (Problem Set Zero, etc.) after even Google's cached version of them had vanished).

Posted by Jerry Asher on
Could this help solve your problems?

From: http://www.google.com/webmasters/3.html#B2

2. I don't want Google to keep a cached version of my page.
Google automatically takes a "snapshot" of each page it crawls and caches it. This enables us to show the search terms highlighted on text heavy pages so users can find relevant information quickly, and to retrieve pages for users if the site's server temporarily fails. Users can access the cached version by choosing the "Cached" link on the search results page. If you do not want your content to be accessible through Google's cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

This tag will tell robots not to archive the page. Google will continue to index and follow links from the page, but will not present cached material to users.

If you want to allow other robots to archive your content, but prevent Google's robots from caching, you can use the following tag:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

Note that the change will occur the next time Google crawls the page containing the NOARCHIVE tag (typically at least once per month). If you want the change to take effect sooner than this, the site owner must contact us and request immediate removal of archived content. Also, the NOARCHIVE directive only controls whether the cached page is shown. To control whether the page is indexed, use the NOINDEX tag; to control whether links are followed, use the NOFOLLOW tag. See the Robots Exclusion page for more information.
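
If it helps, here is a minimal illustration of emitting that tag only on member-only pages from a Tcl page; the member_only_p flag and the page skeleton are hypothetical.

    # Illustration only: add NOARCHIVE to pages whose content should be
    # indexed but never served from Google's cache.
    set head_extra ""
    if { $member_only_p } {
        append head_extra {<meta name="ROBOTS" content="NOARCHIVE">}
    }
    set html "<html><head><title>Article</title>$head_extra</head>"
    append html "<body>...article body...</body></html>"
    ns_return 200 text/html $html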

Posted by Richard Hamilton on
Jerry,

I had no idea that you worked at Google!

It seems then that you are the perfect person to ask for a definitive answer about the robot detection module.

Originally PhilG described in his book how the robots module would detect a web crawler and serve customised textual content to it instead of the full or password protected page.

Does Google in fact revisit pages anonymously and compare them to the pages retrieved by the Googlebot to detect masquerading? If so, am I correct in assuming that the robot detection package, though still in OpenACS 4.6.3, is of no use any more?

Regards
Richard

Posted by Don Baccus on
Actually we had to remove a couple of openacs.org pages from Google's cache about three months ago; I won't go into details, but it was a legitimate request.  To add to Jerry's information, you can request that they reindex your page (thereby seeing the meta tag and doing the removal) via the web, so it's easy to do.

Robot detection was working somewhat in 4.6, though I had to track down the new home for the information (which may no longer be the best home).