Forum OpenACS Q&A: Search Engine cloaking / IP delivery
The idea is to serve a different page in the case of search engines (IP delivery). Technically this is simple to do, but the challenge is obtaining a list of search engine IPs. This company provides such a service, but uses a closed-source Perl script.
I would like to hear what the forum says about this. I am of the opinion that it would benefit both search engines and the searching public to have these IPs published. Views?
I suggest you do a Google search on 'search engine robot user agents'.
FWIW, this link looks pretty complete and has IP addresses for a few.
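For a rough idea of what the user-agent approach looks like in AOLserver Tcl (a sketch only; the substring list below is a small sample, nowhere near complete):

    # Sample only: detect a few well-known crawlers by User-Agent substring.
    set ua [string tolower [ns_set iget [ns_conn headers] "User-Agent"]]
    set is_robot 0
    foreach needle {googlebot slurp msnbot teoma scooter} {
        if { [string first $needle $ua] >= 0 } {
            set is_robot 1
            break
        }
    }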
I'm also interested in this because we have three different user states:
- Visitor -- No Cookie, Not logged on
- Guest -- Has Cookie, logged on as Guest
- Member -- Has Cookie, logged on paying Member
At this point, all robots are Visitors and we show visitors intro stuff. But much of our content is available only to members and a smaller subset to Guests.
For example, if a member tries to send a visitor a link to member only content in the bboards, the visitor is redirected to a logon page or a page saying that they need to become a member.
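To make the three states concrete, here is a rough Tcl sketch of the check involved; ad_verify_and_get_user_id is the ACS 3.x API, while member_p stands in for whatever paying-member lookup you have (a hypothetical name, not a real API):

    # Rough sketch of classifying a request into the three states above.
    # ad_verify_and_get_user_id returns 0 when there is no valid login cookie.
    proc user_state {} {
        set user_id [ad_verify_and_get_user_id]
        if { $user_id == 0 } {
            return "visitor"
        } elseif { [member_p $user_id] } {
            # member_p: hypothetical paying-member lookup
            return "member"
        }
        return "guest"
    }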
So, we want the search engines to index our stuff, and I've thought of having some robot-only pages that can do graceful redirects for real users who arrive via the SEs (sketched below).
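One possible shape for that (all names here are hypothetical, not an existing OpenACS API): register a filter on the robot-only area that lets detected robots through and bounces humans to registration with a return URL pointing at the real content:

    # Hypothetical filter for /robot-heaven/*: robots pass through,
    # humans get sent to registration and land on the real page after login.
    proc robot_page_filter { why } {
        if { ![robot_request_p] } {
            # assume robot pages mirror the members-only paths
            set real_url [string map {/robot-heaven /members} [ns_conn url]]
            ad_returnredirect "/register/?return_url=$real_url"
            return filter_return
        }
        return filter_ok
    }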
Other thoughts on how to handle this?
-Bob
But Google caches pages, so Google will contain the content that you want to restrict to your members, and non-members can see it for free.
I don't know what the legal issues are with this, but it's awfully handy when a site is no longer available.
Bob, you wrote: "But Google caches pages, so Google will contain the content that you want to restrict to your members, and non-members can see it for free."
This is not true. If they are not a member, they are redirected to another page. In other words, if you go to the same URL, there are three different pages you might see: Visitor (no cookie), Guest (with cookie), and Member (with cookie). These are dynamic pages requiring a database lookup.
Spiders don't have a cookie, so they are "Visitors" and would see the intro content or the page that lets them register as a Guest or Member.
Allan Regenbaum kindly sent me a repair of the robots facility on 3.x. A couple of changes are required to make robots work.

First, some user agents are too long, so widen the column:

    SQL> alter table robots modify ( robot_useragent varchar(200) );

Second, in /tcl/ad-robot-defs.tcl, the call that fetches the file in ad_replicate_web_robots_db needs to change from

    set result [ns_geturl $web_robots_db_url headers]

to

    set result [ns_httpget $web_robots_db_url]

The URL for getting a list of robots has also changed, per the response to Malcolm's post. In your service.ini:

    [ns/server/yourservername/acs/robot-detection]
    ; the URL of the Web Robots DB text file (this is the new URL)
    WebRobotsDB=http://www.robotstxt.org/wc/active/all.txt
    ; which URLs should ad_robot_filter check (uncomment to turn the system on);
    ; the first pattern will cause a robot check on any visit to /ecommerce (as an example)
    FilterPattern=/ecommerce/*
    ; FilterPattern=/members-only-stuff/*
    ; the URL where robots should be sent; create this directory with pages that suit the robots
    RedirectURL=/robot-heaven/
    ; how frequently (in days) the robots table should be refreshed from the Web Robots DB
    RefreshIntervalDays=30
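As a side note, here is a sketch of what the corrected fetch could look like with a little error handling around it (the catch/log wrapper is my addition, not part of Allan's patch; ns_httpget returns the page body as a string):

    # Sketch only: wrap the corrected ns_httpget call so a network failure
    # doesn't abort the robots refresh.
    proc fetch_web_robots_db { url } {
        if { [catch { set result [ns_httpget $url] } errmsg] } {
            ns_log Error "robot-detection: could not fetch $url: $errmsg"
            return ""
        }
        return $result
    }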
I don't like that my pages have been cached by Google. They control my content. Also, do they only serve cached content if the site is unavailable, since all the links are to the actual sites?
I see no sign of any attempt by Google to determine if your site is up before offering the cache. (Really, would that be feasible? It wouldn't be fast!)
As soon as you let caching robots skip registration, you are allowing any savvy user to do the same (at least for READING your content) by using Google's caching feature. Old content which isn't available any more does seem to disappear from Google. And content which is changed on my site does seem to turn up in changed form on Google in a month or less, as expected.
I don't remember the address, unfortunately. I think Scott G. published it here once.
(FWIW, I may have been the first one to mention it on openacs.org when I described how I found my ACS pages (Problem Set Zero, etc.) after even Google's cached version of them had vanished).
2. I don't want Google to keep a cached version of my page.
Google automatically takes a "snapshot" of each page it crawls and caches it. This enables us to show the search terms highlighted on text heavy pages so users can find relevant information quickly, and to retrieve pages for users if the site's server temporarily fails. Users can access the cached version by choosing the "Cached" link on the search results page. If you do not want your content to be accessible through Google's cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
This tag will tell robots not to archive the page. Google will continue to index and follow links from the page, but will not present cached material to users.
If you want to allow other robots to archive your content, but prevent Google's robots from caching, you can use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
Note that the change will occur the next time Google crawls the page containing the NOARCHIVE tag (typically at least once per month). If you want the change to take effect sooner than this, the site owner must contact us and request immediate removal of archived content. Also, the NOARCHIVE directive only controls whether the cached page is shown. To control whether the page is indexed, use the NOINDEX tag; to control whether links are followed, use the NOFOLLOW tag. See the Robots Exclusion page for more information.
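If you only want to suppress caching on restricted pages, something along these lines could work in AOLserver Tcl when generating the page head (page_is_restricted_p is a hypothetical helper, not a real API):

    # Sketch: emit the Google-specific NOARCHIVE tag only on restricted pages.
    if { [page_is_restricted_p [ns_conn url]] } {
        ns_adp_puts {<meta name="GOOGLEBOT" content="NOARCHIVE">}
    }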
I had no idea that you worked at Google!
It seems then that you are the perfect person to ask about the robot detection module for a definitive answer.
Originally PhilG described in his book how the robots module would detect a web crawler and serve customised textual content to it instead of the full or password-protected page.
Does Google in fact revisit pages anonymously and compare them to pages retrieved by the Googlebot to detect masquerading? If so, am I correct in assuming that the robot detection package, though still present in OpenACS 4.6.3, is of no use any more?
Robot detection was working somewhat in 4.6, though I had to track down the new home for the information (which may no longer be the best home).