Forum OpenACS Q&A: Re: Google & Co on dynamic content

Posted by Jeff Davis on
I think the lack of Google indexing on openacs.org has more to do with the robots.txt file:
User-agent: *
Disallow: /
I added this a few months ago, before the new memory was added, since every time the site was spidered it would fall over (related to the memory leak, I think).

We can change it back and see if spidering is OK, but someone will need to keep an eye on things.
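
If we want a middle ground rather than an all-or-nothing switch, one option is a robots.txt that allows spidering in general but keeps crawlers away from the heaviest pages and asks them to slow down. A rough sketch only (the disallowed paths below are illustrative, not the real openacs.org layout, and Crawl-delay is honoured by some spiders such as Slurp and msnbot but not by Googlebot):

User-agent: *
Disallow: /search/
Disallow: /register/
Crawl-delay: 10

Since Googlebot ignores Crawl-delay, that alone would not limit Google's load on the server, so someone would still need to keep an eye on things.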

Posted by Richard Hamilton on
Quite a lot of info in this thread, so can I just clarify a couple of points, please?

1) Are we saying that the robots detection module that Philip Greenspun wrote about in 'The Book' has no place in the ACS anymore, because cloaking pages is considered underhand and will result in your site being blacklisted?
2) Are we also concluding that some of the assumptions Philip made about content that search engines would not index are no longer correct (such as the impact of frames and other pretty content), and that search engines will now follow links with extensions such as .tcl, .cgi and others and do their best to index the content, thereby rendering the ACS robots module redundant?
3) Someone earlier suggested a link to all postings somewhere visually inconspicuous on the site. Is this a good idea, or can anyone think of a better way to do it?

Regards
Richard
Posted by Tilmann Singer on
Regarding the cloaking issue, I remember that a competitor's website once did that. They registered a bunch of bogus domains and created hundreds of sub-domains, which all redirected to the competitor's main page normally, but when called with a Google user agent (I actually tried it myself) they returned a list of links to all the other bogus websites instead, thus trying to fool the PageRank algorithm.
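
For anyone who hasn't seen it done, that kind of cloaking is mechanically trivial: it is just a branch on the User-Agent request header. Here is a minimal Python sketch, purely to illustrate the behaviour Google objects to, not anything that site actually ran:

from http.server import BaseHTTPRequestHandler, HTTPServer

NORMAL_PAGE = b"<html><body>Ordinary landing page for human visitors.</body></html>"
LINK_FARM = b"<html><body><!-- hundreds of links to the other bogus domains --></body></html>"

class CloakingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Serve the link farm only to requests that look like Googlebot,
        # and the normal page to everyone else; this discrepancy is exactly
        # what search engines try to detect and punish as cloaking.
        body = LINK_FARM if "Googlebot" in ua else NORMAL_PAGE
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), CloakingHandler).serve_forever()

Catching it from the outside amounts to doing what I did: fetching the same URL with and without a Googlebot User-Agent and comparing the responses.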

I mailed Google and they said they were working on techniques to automatically detect and ban such sites, and they banned the offending one manually. That was a few years ago, and it is the source of my assumption that returning different content based on the Googlebot user-agent header might be a bad idea. It may well be, though, that they have a way of distinguishing between sites that try to fool the PageRank mechanism and those that only return more search-engine-friendly content, although I can't imagine how that would work 100% reliably.

Anyway, speculation about Google's behaviour could be continued endlessly, I guess, but that is not my intention. OK, one last one: I think that if we remove the restriction in robots.txt (and the site doesn't fall over when being indexed), Google will index the full site, including all postings, after some time, and neither query variables nor the paginator on the forums index page will scare it away.