Forum OpenACS Q&A: Google & Co on dynamic content

Posted by Christof Spitz on
Could some wise person tell me whether search engines like Google find dynamically created content? That would presumably mean they somehow "hit" the *.tcl script and scan the resulting HTML page?

I'm asking because a participant in our online community wondered whether robots can somehow extract email addresses from an OpenACS installation. It would also be interesting to know whether our dynamic content appears in a Google search, because if it doesn't, that could be a disadvantage if you want your site to be found.

(Sorry, I am quite ignorant about the technical background.)

Posted by David Cotter on
Christof

Google just browses your site, following hyperlinks that appear on pages. So on the OpenACS homepage, for example, there are links to messages on the forums, which Google can retrieve.

Google can't login to the system though so it can only see content that is available to non-registered users.

If you click on the name of a poster on this forum you will not see their email address if you're not logged in and neither will Google.

As a developer you can detect whether the visitor is Google and either block it or serve it different content, etc.

Posted by Tilmann Singer on
_As a developer you can detect if the visitor is Google and either block it or serve it different content etc._

This is called cloaking and will most likely get you blocked from Google's index entirely. From time to time they test with non-detectable user agents and IP addresses to see whether you are serving the same content to the indexer as to your users, and if not, they assume you are trying to cheat them.

Posted by Tom Jackson on

You used to have to worry about having query strings in your public URLs, because search engines would not index them, for fear of falling into an infinitely deep sub-web.

Google now indexes any link on your page, even javascript links. However, they only go to a certain depth on any site now. They believe that any content of importance is within a few levels of an index page.

I haven't figured out how they know what an index page is, but it is interesting to listen to them pontificate on their bot technology. Put up a large site and eventually googlebot will visit like _Attack of the Clones_.

This might have implications for using pages like index.vuh, where the same page will occur at different depths, or under many subdirectories in a package. I don't think robots.txt allows wildcards in the path portion of a url.

Posted by James Thornton on
_Google now indexes any link on your page, even javascript links._

That's the first I've heard that they have started following JavaScript.

Google indexes dynamic content, up to a point. The fewer parameters in the URL, the more likely Google is to index it. Also, if the links to the dynamic pages come from a static page (i.e., one with no parameters), there is a better chance Google will index them, and Google's capabilities in this regard are improving all the time.

Posted by Brian Fitzgearld on
I'm disappointed in the response to this and other threads about Google and OpenACS.  In our experience at Greenpeace, Google is doing an appalling job of indexing the site since we switched to OpenACS.  It's a massive drawback, and I would think a major barrier to wider acceptance of OpenACS.

While I accept that it may be a problem generic to dynamic content sites, I don't know many major sites that would put up with their content living out there in the googleless void for long -- there must be solutions, and if there's one that applies to OpenACS sites, let's hear it. I've yet to see a post from anyone adequately acknowledging this problem or suggesting a fix. Tilmann's suggestion is the closest thing to a workaround I've seen, but if it gets you banned from Google, you might as well be banned from the web.

This is a biggy.

--b

(P.S. Aw jeepers, my first post to this forum, and it has to be a rant. Howdy, y'all)

Posted by Dirk Gomez on
Someone at the OpenACS social said that most search engines would just cut off the query string...

What about offering an alternative URL format, e.g.

http://openacs.org/forums/message-view/100650

That should be more search-engine friendly, right?

Posted by Dave Bauer on
Brian,

I agree with Dirk. Setting up index.vuh files to interpret the URLs is reasonably easy and could be used on the greenpeace web site.

I haven't personally seen any problems with my personal site; checking the logs, Google and every other search engine on the planet have visited and indexed the whole thing. But that might have to do with the fact that it is not very deep.

Posted by Andrew Herne on
Like Brian a first poster here. I've been lurking too long. Hi all.

Introducing myself. I'm IT Director (and yes that means programmer too!) at National Extension College, a distance learning not-for-profit in the UK.

http://www.nec.ac.uk

We've been running a very modified ACS 3.4 since summer 2001. Always heads down and no time to chat. [I know, lame excuse.]

This may or may not be relevant. I think it's the idea not the detail that may be helpful. We had big problems with Google mid 2002, losing all presence. Tracking down the problem proved tortuous, and we never really got to the bottom of it.

We realised that user sessions were implicated, and in particular the usca_p query string appended to ACS 3.4 ecommerce pages. We use a modified ecommerce module for most of our public pages as a rudimentary CMS.

We were able to recover by hacking a standard proc in ecommerce-defs.tcl:

ec_create_new_session_if_necessary

At the top of that proc we now list known spider user agents (googlebot, scooter, slurp, etc.) and match against the return value of util_GetUserAgentHeader. If we have a match, we set the user_session_id to 0 and quit the proc. It's my belief that Google gets stuck in a session loop which is obviously not relevant to it but led to it failing to spider the site.
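Roughly, the guard looks like this (a simplified sketch, not our exact code -- the real proc takes different arguments, and the spider list here is only illustrative):

```tcl
# Simplified sketch of the guard at the top of
# ec_create_new_session_if_necessary in ecommerce-defs.tcl.
proc ec_create_new_session_if_necessary {} {
    set user_agent [string tolower [util_GetUserAgentHeader]]
    foreach spider {googlebot scooter slurp} {
        if { [string match "*${spider}*" $user_agent] } {
            # Known robot: give it the null session and bail out,
            # so no usca_p query string gets appended to its URLs.
            set ::user_session_id 0
            return
        }
    }
    # ... normal session-creation logic follows ...
}
```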

Posted by David Cotter on
Dave

>Setting up index.vuh files to interpret the URLs is
>reasonably easy and could be used on the greenpeace web
>site.
I don't understand how this would help (I've never used a .vuh). It would be possible to get a .vuh to interpret

http://openacs.org/forums/message-view/100650

as

http://openacs.org/forums/message-view?message_id=100650

but how do you present these modified URLs to Google -- wouldn't it require quite a bit of work?
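For what it's worth, a minimal index.vuh along those lines might look something like this (a sketch only -- the procs are from the standard request processor, but I haven't tested this against forums):

```tcl
# forums/message-view/index.vuh (sketch)
# The request processor hands us the part of the URL after this
# file's directory, so /forums/message-view/100650 arrives with
# path_info "100650".
set message_id [ad_conn path_info]

# Accept only plain integers.
if { ![regexp {^[0-9]+$} $message_id] } {
    ns_returnnotfound
    ad_script_abort
}

# Pass the id to the real page as if it had been a query variable,
# then serve that page internally -- the client never sees a "?".
rp_form_put message_id $message_id
rp_internal_redirect /forums/message-view
```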

Posted by Dave Bauer on
David,

I am not sure of the exact amount of work. Generally I think we want to move toward building in support for URLs like this.  We added support for the URLs without ? for forums, but I don't know if the work was finished so that all links that are presented use these "pretty" URLs instead of the standard ones.

So the adp templates that display links would need to be modified to support the new URLs.

Posted by David Cotter on
I was not aware that these types of URL were considered non-standard. Is this mostly because of search engine problems, or are there other reasons why they are undesirable?
Posted by Jerry Asher on
Getting back to the original question about harvesting email addresses, I will note that it turns out that most OpenACS pages (but not all) that show email addresses only do so if the user logs in. The ones that leak this info are usually a bug. (I found a lot of these in various ticket tracker bug report modules.)

That said, from time to time I've advocated an email mangling API to be used whenever an email address is presented. It might obscure the email entirely, or display it in a JPEG, or display the @ as a GIF or something like that when no user is logged in, and it might present the email address as a mailto, or something different, when a user is logged in.
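Something as simple as this would do for a first cut (ad_present_email is a hypothetical name -- no such proc exists in the toolkit, this is just a sketch of the idea):

```tcl
# Hypothetical email-presentation proc -- not an existing API.
proc ad_present_email { email } {
    if { [ad_conn user_id] != 0 } {
        # Logged-in user: a normal mailto link.
        return "<a href=\"mailto:$email\">$email</a>"
    }
    # Anonymous visitor (possibly a robot): spell out the "@"
    # so simple harvesters don't recognize the address.
    regsub {@} $email { (at) } obscured
    return $obscured
}
```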

I am amused by how often a google search turns up results that include statements about how google should be using a frames compliant browser.

Posted by Tom Jackson on

Jerry: great idea about mangling the email addresses (where they show up). Probably your proposed metadata api for AOLserver would be useful for that purpose. Personally I would like the jpeg solution, although that leaves non-visual/non-graphic UAs at a disadvantage.

I have run a large dynamic site for a number of years (saleonall.com). This is based on the ecommerce package, but I have moved the navigation and display of product information to a 'static looking' setup. Also, the parts that should not be visited by robots are in separate directories, so they are easy to exclude. Look at robots.txt .
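For example, if the cart and checkout pages live in their own directories (the directory names here are illustrative, not my actual layout), robots.txt can exclude them wholesale:

```
User-agent: *
Disallow: /shopping-cart/
Disallow: /checkout/
Disallow: /admin/
```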

Still, Google fails to index the entire content. However they do index some of the site, closer to the top of the directory structure, probably around 10-20k pages. When they do, they send in an army of 30 or so bots, each one more brain dead than the last. Because of this I actually create a static copy of the 150k or more pages and give those to GoogleBot.

I think I am going to create a shallow (by product_id) hierarchy so Google will be happy. Probably products will be displayed at a URL like http://saleonall.com/cat/01/012345/product.html . This way the entire site is within two hops of an index page. The 00-99 pages could be rebuilt each time the database is turned over, so they stay static.
Posted by Don Baccus on
Brian ... have you forgotten that we've implemented a solution that presents a URL without a "?" that can be turned on for Planet?  The first cut was implemented by Elephant 7 but didn't work.  I fixed it and at one time it was working but since "google indexes sites with '?' variables" it was decided to leave it turned off.

You folks might do some testing on a development server to make sure it still works, and if it does you might turn it on for Planet to see if google then does a better job.

For the rest of you, the solution was very simple: it just uses a simple algorithm to mangle the URL with a "?" into a simple string; then, when the request processor processes a hit, the URL is unmangled back to the original URL.

No client code needs to be written.  No index.vuh files need to be written.  URLs just need to be filtered through the mangle proc rather than placed directly in templates.

But since google supposedly doesn't care about "?" this shouldn't be necessary ...
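To illustrate the idea (this is not the actual Planet code -- just a hypothetical scheme handling a single message_id variable):

```tcl
# Hypothetical mangle/unmangle pair: turn
# ".../message-view?message_id=100650" into
# ".../message-view/id-100650.html" and back.
proc url_mangle { url } {
    regsub {\?message_id=([0-9]+)$} $url {/id-\1.html} url
    return $url
}
proc url_unmangle { url } {
    regsub {/id-([0-9]+)\.html$} $url {?message_id=\1} url
    return $url
}
```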

Also, Brian ... you guys should try to analyze what google's finding and not finding on Planet.  It may be that now that you have a dynamic content system you're changing content more frequently, and that this causes Google to miss content because it expires before the site's reindexed.  Or that archive links are broken or something along those lines.

Posted by Andrew Piskorski on
Don, Brian, is that URL mangling/demangling code in the OpenACS toolkit? If not, would Greenpeace be willing to contribute it?
Posted by Don Baccus on
It's not in the toolkit but the entire Planet code base is available if you want it.  Most of the code's an absolute mess (Peter, Lars and I did a bunch of repair/clean-up last spring to make it functional but we left a lot of it in the awful state we inherited it in) which is why you don't find pointers to the source in flashing bright lights all over the internet. :)
Posted by Michael Bluett on
I have a couple of points to make on this, though I haven't actually run an OpenACS installation:
The extraction of emails by robots from an OpenACS installation should be a small problem, as only the maintainer has their email address available to view. Users' email addresses are hidden until visitors log in, which robots don't. I believe the Directory package is vulnerable to robots, though, as it features a list of users and their email addresses.

Google will extract the data from a website to a certain depth, dependent upon the PageRank (PR) of the website. My advice is to read the many articles on indexing somewhere like WebmasterWorld.

Many of the problems that Google has with dynamic sites are with things like session ids being stored in the URL (for example as "id=" with PHP). This makes it more reluctant to follow URLs that look like they have session variables in them (i.e. exactly "id=").

Greenpeace sounds like it needs better referencing for Google: pages designed primarily not for the user, but for Google to navigate to all content on the site.

Posted by Brian Fitzgearld on
Hey Don,

Alas, even the pages that don't contain a "?" in the URL fail to be googled, and that includes pages that have not changed since launch. See, for example, http://www.greenpeace.org/aboutus/

It contains the phrase "As one of the longest banners we've ever made" and has contained it since Greenpeace Planet's inception. Google is aware of this phrase only at an alternate site that quotes it.

As to what gets googled and what doesn't, our experience *sounds* closer to Tom's: some top-of-the-directory indexing, and then the brainless bots get tired and go away. I've seen articles indexed as part of the aggregating menus (News, Features, Press) but absent themselves from Google, so it may just be a question of how many hops down the tree. There's a fairly detailed bug report on this in the GP Planet Bugzilla about what's getting hit and what's not, if you're keen.

I'll ask Bruno and Alex to drop by this thread and call attention to the VUH idea and Michael's suggestions -- all look sound, but me, I only know enough to ask the question, not evaluate the answers. ;-)

Great to see the community response to this.  Thanks all.

--b

Posted by Dirk Gomez on
This is with respect to http://openacs.org/forums/message-view?message_id=94374 -- in short, providing a short (and stable) URL that redirects to the "proper" page.

How do search engines - google in particular - handle redirects? Dozens of redirects on one page?

Posted by Torben Brosten on
My understanding is that Google and other search engines don't index URLs that redirect. I think it's a way they minimize duplicate data in their indexes. Noncompliant HTML HEAD code is another critical factor that stops googlebot.

Tired bots? My experience with statically generated websites containing thousands of pages and 5 levels suggests that the bots don't get tired of specific, unique information.

The OpenACS static-pages package changes the page depending on the requesting client. As pointed out earlier in this thread, some search engines, such as Google, attribute this behavior to manipulation and may decline to index the page or site accordingly.

Is it possible to configure static-pages so that URLs are presented as static pages in all cases? Perhaps the quick and dirty method would be to tweak the registered robots list to none? If not possible, then a package needs to be written to address these issues.

The package should interpret static-looking pages and convert/present static-looking URLs for the domain served. Also, in a best-case scenario, there should be a way to store requested, cached dynamic pages as static files, so that the server can keep up with bot requests. Not delivering the pages at static-page speed may result in lost opportunities for indexing, given that there is limited time between index updates.

Posted by brijesh singh on
Can I ask how difficult it is to convert a .tcl link to an .html link? As search engines give preference to HTML links, will it work if we remove the .tcl extension (my site has .tcl extensions) and keep URLs without any extension? Will my site still get the same attention?
Posted by Robert Locke on
Hi Brian,

> See, for example, http://www.greenpeace.org/aboutus/
> Contains the phrase "As one of the longest banners we've
> ever made" and has contained it since Greenpeace Planet
> inception.  Google is aware of this phrase only at an
> alternate site that quotes it.

I did a search on Google as follows:
    "as one of the longest banners we've" site:www.greenpeace.org

and www.greenpeace.org/aboutus/ appeared as the only result.

I'm guessing you didn't see it because Google filters out redundant results, but Google is definitely aware of the page.  Click on the "repeat the search with the omitted results included" link to see all results.

I checked a few of the "deeper" pages in the Greenpeace site and then ran a search on Google for a phrase within each page, adding "site:www.greenpeace.org" (to limit the search results). And Google appeared to be aware of them (at least the ones I checked). Google also seemed to be aware of the various versions of the pages (e.g., /ships/ship-detail?ship... and /international_en/ships/ship-detail?ship...).

Perhaps one of the problems is your ranking within Google, which is a separate issue.  I know there are companies/software which can supposedly help in that department, but I don't know if they are reliable.

Posted by Don Baccus on
Ahhh ... so content is getting googled on Planet ... Brian, you guys need to dig deeper.  I think, as Robert's suggesting, that you may have a ranking issue but that's another can of worms altogether.
Posted by Michael Bluett on
There look to be "about 17,700 pages" from Greenpeace spidered on Google according to this search (This search finds 17,900 for some reason). I'm not sure how many pages you have and how this compares...

Some posters on WebmasterWorld have suggested that static looking URLs are better indexed by Google, a typical thread goes like this, with WebGuerrilla putting forward the "better as a static page" comment. Point 14 of the Deepbot and Freshbot FAQ covers the point I made above on what Google might find difficult with dynamic URLs ("id=").

On another point, in a past post I pointed out that Google prefers 301 redirects to ad_redirect's 302 redirects.
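Returning a 301 by hand from AOLserver is not much code. A sketch (permanent_redirect is a hypothetical helper name; ns_returnredirect itself always sends a 302):

```tcl
# Send "301 Moved Permanently" instead of AOLserver's default 302,
# so Google treats the new URL as canonical.
proc permanent_redirect { url } {
    ns_set update [ns_conn outputheaders] Location $url
    ns_return 301 text/html \
        "<html><body>The document has moved permanently
         <a href=\"$url\">here</a>.</body></html>"
}
```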

Posted by Dirk Gomez on
We definitely need to dig up the cause of that. It also seems that OpenACS forum postings are indexed much worse than ArsDigita bboard postings were back then. E.g., back then I would enter something like "Oracle ORA-1555" and the odds of an aD bboard thread popping up in the first ten results were decent -- well, it happened once in a while.

(These days, the old ArsDigita content is completely static and was nicely indexed by Google at some point.)

Exchanging 302 for 301 should be straightforward, right? What side effects could it have?

Posted by Dirk Gomez on
Here's an example that I recently tried on Google: "Hrvoje Niksic noquote" gives a result on ccm.redhat.com, and it should give a few results on openacs.org, but doesn't.
Posted by Dirk Gomez on
Ah, it probably has to do with forums' forum-view page. The crawler would have to dig *really* deep and long to find 2-year-old forum postings on openacs.org.

Maybe we should add an "All Messages" link to forum-view?

Posted by Jeff Davis on
I think the lack of google indexing on openacs.org has more to do with the robots.txt file:
User-agent: *
Disallow: /
I added this a few months ago, before the new memory was added, since every time the site was spidered it would fall over (related to the memory leak, I think).

We can change it back and see if spidering is ok but someone will need to keep an eye on things.

Posted by Richard Hamilton on
Quite a lot of info in this thread, so can I just clarify a couple of points please?

1) Are we saying that the robots detection module that Philip Greenspun wrote about in 'The Book' has no place in the ACS anymore, because cloaking pages is considered underhand and will result in your site being blacklisted?
2) Are we also concluding that some of the assumptions Philip made about content that search engines would not index are no longer correct (such as the impact of frames and other pretty content), and that search engines will now follow links with extensions such as .tcl, .cgi and others and do their best to index the content, thereby rendering the ACS robots module redundant?
3) Someone earlier suggested a link to all postings somewhere visually inconspicuous on the site. Is this a good idea or can anyone think of a better way to do it?

Regards
Richard
Posted by Dave Bauer on
Dirk

I would hope that the 2 year old threads would have been indexed when they were new.

Posted by Tilmann Singer on
Regarding the cloaking issue, I remember that a competitor's website once did that. They registered a bunch of bogus domains and created hundreds of sub-domains, which all redirected to the competitor's main page normally, but when called with a Google user agent (I actually tried it myself), they returned a list of links to all the other bogus websites instead, thus trying to fool the PageRank algorithm.

I mailed Google and they said they were working on techniques to automatically detect and ban such sites, and banned the offending one manually. That was a few years ago, and it is the source of my assumption that returning different content based on the googlebot user-agent header might be a bad idea. It might well be, though, that they have a way of distinguishing between sites that try to fool the PageRank mechanism and those that only return more search-engine-friendly content, although I can't imagine how that would work 100% reliably.

Anyway, speculation about Google's behaviour could be continued endlessly, I guess, but that is not my intention. OK, a last one: I think if we remove the restriction in robots.txt (and the site doesn't fall over when being indexed) then Google will index the full site, including all postings, after some time, and neither query variables nor the paginator on the forums index page will scare it away.

Posted by Michael Bluett on
Regarding Richard's points:
1) Yes.
2) Search engines are not born equal; some are worse at finding content on sites. Google will follow links that have extensions such as .tcl. It does, at least, worry about query strings where a value of "id" is set, as that is possibly a session id.
3) For a site such as greenpeace.org, and possibly openacs.org, it is probably worth having pages dedicated to pointing search engines at the content you want seen on the site (a site map for search engines).

Tilmann:
Google has a spam reporting page.

Posted by Andrew Piskorski on
Dirk, adding a "show all threads" link to the forum-view page sounds like a great idea to try. Has anyone volunteered to do it?
Posted by Bjorn Thor Jonsson on
Just stumbled on this:

What Google Leaves Out
http://www.microdocs-news.info/newsGoogle/2003/05/10.html

Regarding email mangling, here's a JavaScript snippet I've used:

// Assemble the address at page-load time so it never appears
// verbatim in the HTML source that harvesters scan.
var sUsername = "user";
var sDomain = "example.org";
var sSeparator = "@";
var concat = sUsername + sSeparator + sDomain;
document.write(concat.link("mailto:" + concat));

Maybe a JPEG is more useful, but I don't get spam :) (though I recently received a couple of scam messages at the address I use solely at openacs.org, so maybe it was harvested by a human)

Posted by Chris Davies on
really, what are you expecting?

Google looks at your entry page and gets a Location redirect to a host that doesn't exist. NOTE the Location: header that is sent to Google's bot. Also note that the href on the page is invalid as well.

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 302 Found
Set-Cookie: ad_session_id=38906532%2c0%20%7b296%201061178092%20AB09A976A688F9EB525E86919C2AC1E120F4861B%7d; Path=/; Max-Age=1200
Location: http://greenpeace-01.fra.de.colt-isc.net/homepage
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:21:32 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 348
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://greenpeace-01.fra.de.colt-isc.net/homepage">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>

mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns1.de.colt.net
greenpeace-01.fra.de.colt-isc.net A record currently not present at ns1.de.colt.net
mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns0.de.colt.net
greenpeace-01.fra.de.colt-isc.net does not exist at ns0.de.colt.net (Authoritative answer)

I'm not surprised it doesn't follow properly.

So, you try HTTP/1.1

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.1
Host: greenpeace.org

HTTP/1.0 302 Found
Location: http://www.greenpeace.org/international_en/
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:24:53 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 342
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://www.greenpeace.org/international_en/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>
Connection closed by foreign host.

And this time is presented with a valid redirection and a valid URL in the HREF.

Google's initial crawl uses HTTP/1.0 -- these are some hits pulled from my logs on another site.

Log.20030801:218.145.25.78 - - [01/Aug/2003:06:11:06 -0300] "GET /robots.txt HTTP/1.0" 404 8092 "-" "GoogleBot"
Log.20030801:218.145.25.78 - - [01/Aug/2003:06:36:47 -0300] "GET /robots.txt HTTP/1.0" 404 8084 "-" "GoogleBot"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /robots.txt HTTP/1.0" 404 8085 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /robots.txt HTTP/1.0" 404 8083 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /robots.txt HTTP/1.0" 404 8097 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /Tools/Unix/index.html HTTP/1.0" 200 11363 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Note that Google's requests are HTTP/1.0. It might be worthwhile making sure your site works for clients that are not HTTP/1.1 compliant.

I would bet this has much more to do with Google not spidering your site than the dynamic content does.

While ? and & are inherently evil giveaways of dynamic content, and Google seems to penalize slightly for them -- or flags the site as dynamic and sets a maximum number of pages to spider -- I have dynamic sites (not using AOLserver) that are 40,000+ pages and indexed in Google.

With that said, 302 redirects are evil for entry pages. I had a site that appeared to be banned specifically because I used some session tracking code that checked whether the cookie was set and, if not, munged the URL -- so when Google didn't respond with the cookie, a double redirect and a munged URL resulted. It took 14 months of convincing Google that I wasn't doing anything stealthy, and a big rewrite of code, to fix things.

So, check that HTTP/1.0 response you're handing out, fix it, and I'll bet you get back into Google after a few crawls.

Posted by Brian Fitzgearld on
Whoa. A belated THANKS, Chris.  Fantastic bit of forensics, much obliged.
Posted by Klyde Beattie on
_and if not then they assume that you are trying to cheat them._

They have actual people do this, and if they see that you are only trying to serve the page in a way that is more Google-friendly, they will most likely not blacklist you.