Forum OpenACS Q&A: Search Engine cloaking / IP delivery

I am looking at a method to detect an IP and redirect to a specific
page in the case of search engines. Technically this is simple to
do, but the challenge is obtaining a list of search engine IPs.
This company provides such a service, but uses a closed-source Perl
script: http://www.ip-delivery.com/
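
For illustration, here is the kind of thing I mean: a minimal AOLserver Tcl sketch that bounces requests from known engine IPs to a crawler-friendly page. The IP addresses, the proc name, and the /robot-index page are all hypothetical; the hard part is still getting a real, maintained list.

    # Sketch only: the IPs below are placeholders, not a published list.
    proc from_search_engine_p {} {
        set engine_ips [list 192.0.2.10 192.0.2.11]
        return [expr { [lsearch -exact $engine_ips [ns_conn peeraddr]] != -1 }]
    }

    # At the top of a page that should serve crawler-specific content:
    if { [from_search_engine_p] } {
        ns_returnredirect "/robot-index"
        return
    }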

I would like to hear what the forum says about this.

I am of the opinion that it would benefit search engines and the
searching public to have these IPs published. Views?

Posted by John Sequeira on
Well-behaved search engines send a distinctive user agent to the web server. I think it'd be much easier to find the official list of search engine user agents than a list of their IP addresses.

I suggest you do a google search on 'search engine robot user agents'.

FWIW, this link looks pretty complete and has IP addresses for a few.
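
As a rough sketch of the user-agent approach under AOLserver (the patterns below are only illustrative, not the official list):

    # Return 1 if the User-Agent header looks like a known crawler.
    # The patterns are examples only; consult a maintained robots list.
    proc robot_user_agent_p {} {
        set ua [string tolower [ns_set iget [ns_conn headers] "User-Agent"]]
        foreach pattern {*googlebot* *slurp* *scooter* *lycos*} {
            if { [string match $pattern $ua] } {
                return 1
            }
        }
        return 0
    }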

Posted by Tilmann Singer on
Are you sure that you want to show different content to search engines than to normal visitors? This contradicts what the search engines are interested in, and I'd suspect that the more sophisticated ones check indexed sites from time to time by visiting them from a non-published IP with a standard user-agent header, to make sure they get the same, or at least not totally different, content that way as they do when visiting the sites "officially".
Posted by MaineBob OConnor on

I'm also interested in this because we have three different user states:

  • Visitor -- No Cookie, Not logged on
  • Guest -- Has Cookie, logged on as Guest
  • Member -- Has Cookie, logged on paying Member

At this point, all robots are Visitors and we show visitors intro stuff. But much of our content is available only to members and a smaller subset to Guests.

For example, if a member tries to send a visitor a link to member only content in the bboards, the visitor is redirected to a logon page or a page saying that they need to become a member.
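
(For context, here is a stripped-down sketch of how a page can dispatch on those three states. ad_verify_and_get_user_id is the standard ACS session check; member_p stands in for our own paying-member lookup and is hypothetical.)

    # Simplified three-way dispatch; member_p is a hypothetical helper
    # that checks the paying-member flag for this user.
    set user_id [ad_verify_and_get_user_id]

    if { $user_id == 0 } {
        # Visitor: no cookie, not logged on; show intro content
        ad_returnredirect "/intro"
        return
    } elseif { [member_p $user_id] } {
        # Member: full content
    } else {
        # Guest: logged on but not paying; show the smaller subset
    }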

So, we want the search engines to index our stuff, and I've thought of having some robot-only pages that can do graceful redirects for real users who come in from the SEs.

Other thoughts on how to handle this?

Thank you
-Bob

Posted by C. R. Oldham on
Bob,

But Google caches pages, so Google's cache will contain the content that you want to restrict to your members, and non-members can see it for free.

Posted by Matthew Terenzio on
Search engines cache pages for indexing. They can't legally serve full pages of your content, or have I missed something? Otherwise I'll declare myself a search engine and start serving up NBC news.
Posted by Jade Rubick on
Matthew, Google caches pages for display. I'm not sure what the
legal issues are with this, but it's awfully handy when a site is no
longer there.
Posted by James Harris on
Google definitely caches pages for display.  It was very handy when ArsDigita did their Orwellian rewriting of the bulletin board content during their dispute with PhilG.  Although they removed content from their site, it was still visible in Google's cache.
Posted by MaineBob OConnor on

C.R. wrote:

    Bob, But Google caches pages, so Google will contain the content that you want to restrict to your members, and non-members can see it for free.

This is not true. If they are not a member, they are redirected to another page. In other words, for the same URL there are three DIFFERENT pages seen: Visitor (no cookie), Guest (with cookie), and Member (with cookie). These are dynamic pages requiring a database lookup.

Spiders don't have a cookie, so they are "Visitors" and would see the intro content or the page that lets them register as a Guest or Member.

-Bob

Posted by Malcolm Silberman on
Of course ACS has robot detection: http://serverspace.com/doc/robot-detection . However, I fear that relying on the easy-to-fake USER_AGENT variable could be a problem.
Allan Regenbaum kindly sent me a repair to robot-detection:


Repair of the robots facility on 3.x.

A couple of changes are required to make robot detection work.

First, some user agents are too long, so:

SQL> alter table robots modify ( robot_useragent varchar(200) );

Second, in /tcl/ad-robot-defs.tcl, the call that fetches the robots file
in ad_replicate_web_robots_db needs to change from

    set result [ns_geturl $web_robots_db_url headers]

to

    set result [ns_httpget $web_robots_db_url]


The URL for the list of robots has also changed, per the response to Malcolm's
post. In your service.ini:

[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file (note: this is the new URL)
WebRobotsDB=http://www.robotstxt.org/wc/active/all.txt
; which URLs ad_robot_filter should check (uncomment to turn the system on);
; this example causes a robot check on any visit to /ecommerce
FilterPattern=/ecommerce/*
; FilterPattern=/members-only-stuff/*
; the URL where robots should be sent; create this directory with
; pages which suit the robots
RedirectURL=/robot-heaven/
; How frequently (in days) the robots table
; should be refreshed from the Web Robots DB
RefreshIntervalDays=30
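
Conceptually, what the filter does with those settings is roughly the following. This is a simplified sketch, not the actual package code; the proc name and query are illustrative.

    # Simplified sketch of the robot check: if the User-Agent matches a
    # row in the robots table, send the request to the robot pages.
    proc robot_check_sketch {} {
        set ua [ns_set iget [ns_conn headers] "User-Agent"]
        # Does any robots-table entry appear in the incoming User-Agent?
        set is_robot_p [db_string robot_exists_p {
            select count(*) from robots
            where upper(:ua) like '%' || upper(robot_useragent) || '%'
        }]
        if { $is_robot_p > 0 } {
            # Matches RedirectURL in the .ini above
            ns_returnredirect "/robot-heaven/"
            return 1
        }
        return 0
    }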
Posted by Matthew Terenzio on
So if I want to take some of my content off the net, I can't if it has
been cached by Google. They control my content. Also, do they only
serve cached content when the site is unavailable, since all the links
go to the actual sites?
Posted by Cathy Sarisky on
Google offers the cache (on the last line of each listing is a "Cached" link) regardless of whether or not the site is up or has new content.  The direct link to the site is certainly more obvious, and will probably be followed by the naive user over the cached link.  (The cached page includes a note that it may be stale and a link to the current content on your site.)  If you're getting requests for images without their corresponding pages, you might be seeing someone looking at Google's cache of a page (since it will grab images from your site normally).  Or of course someone might have linked to one of your images for use on their website.  Isn't bandwidth theft fun?

I see no sign of any attempt by Google to determine if your site is up before offering the cache.  (Really, would that be feasible?  It wouldn't be fast!)

As soon as you let caching robots skip registration, you are allowing any savvy user to do the same (at least for READING your content) by using Google's caching feature.  Old content which isn't available any more does seem to disappear from Google.  And content which is changed on my site does seem to turn up in changed form on Google in a month or less, as expected.

Posted by Jade Rubick on
I don't remember the exact address, but there is a website/search engine/spider that keeps historical snapshots of websites. It's pretty interesting to go back and check how your webpages evolved.

I don't remember the address, unfortunately. I think Scott G. published it here once.

Posted by David Cohen on
You're talking about the Internet Wayback Machine, a spin-off of Brewster Kahle's Alexa project: web.archive.org

(FWIW, I may have been the first one to mention it on openacs.org when I described how I found my ACS pages (Problem Set Zero, etc.) after even Google's cached version of them had vanished).

Posted by Jerry Asher on
Could this help solve your problems?

From: http://www.google.com/webmasters/3.html#B2

2. I don't want Google to keep a cached version of my page.
Google automatically takes a "snapshot" of each page it crawls and caches it. This enables us to show the search terms highlighted on text heavy pages so users can find relevant information quickly, and to retrieve pages for users if the site's server temporarily fails. Users can access the cached version by choosing the "Cached" link on the search results page. If you do not want your content to be accessible through Google's cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

This tag will tell robots not to archive the page. Google will continue to index and follow links from the page, but will not present cached material to users.

If you want to allow other robots to archive your content, but prevent Google's robots from caching, you can use the following tag:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">

Note that the change will occur the next time Google crawls the page containing the NOARCHIVE tag (typically at least once per month). If you want the change to take effect sooner than this, the site owner must contact us and request immediate removal of archived content. Also, the NOARCHIVE directive only controls whether the cached page is shown. To control whether the page is indexed, use the NOINDEX tag; to control whether links are followed, use the NOFOLLOW tag. See the Robots Exclusion page for more information.
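
If it helps, here is a minimal illustration of emitting that tag only on member-only pages from a Tcl page; the member_only_p flag and the page skeleton are hypothetical.

    # Illustration only: add NOARCHIVE to pages whose content should be
    # indexed but never served from Google's cache.
    set head_extra ""
    if { $member_only_p } {
        append head_extra {<meta name="ROBOTS" content="NOARCHIVE">}
    }
    set html "<html><head><title>Article</title>$head_extra</head>"
    append html "<body>...article body...</body></html>"
    ns_return 200 text/html $html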

Posted by Richard Hamilton on
Jerry,

I had no idea that you worked at Google!

It seems then that you are the perfect person to ask for a definitive answer about the robot detection module.

Originally PhilG described in his book how the robots module would detect a web crawler and serve customised textual content to it instead of the full or password protected page.

Does Google in fact revisit pages anonymously and compare them to the pages retrieved by the Googlebot to detect masquerading? If so, am I correct in assuming that the robot detection package, though still in OpenACS 4.6.3, is of no use any more?

Regards
Richard

Posted by Don Baccus on
Actually we had to remove a couple of openacs.org pages from Google's cache about three months ago; I won't go into details, but it was a legitimate request.  To add to Jerry's information, you can request that they reindex your page (thereby seeing the meta tag and doing the removal) via the web, so it's easy to do.

Robot detection was working somewhat in 4.6, though I had to track down the new home for the information (which may no longer be the best home).