Forum OpenACS Q&A: Re: Google & Co on dynamic content

Posted by Chris Davies on
Really, what are you expecting?

Google requests your entry page and gets a 302 redirect to a host that doesn't exist. Note the Location: header that is sent to Google's bot, and note that the HREF in the page body points at the same invalid host.

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 302 Found
Set-Cookie: ad_session_id=38906532%2c0%20%7b296%201061178092%20AB09A976A688F9EB525E86919C2AC1E120F4861B%7d; Path=/; Max-Age=1200
Location: http://greenpeace-01.fra.de.colt-isc.net/homepage
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:21:32 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 348
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://greenpeace-01.fra.de.colt-isc.net/homepage">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>

mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns1.de.colt.net
greenpeace-01.fra.de.colt-isc.net A record currently not present at ns1.de.colt.net
mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns0.de.colt.net
greenpeace-01.fra.de.colt-isc.net does not exist at ns0.de.colt.net (Authoritative answer)

I'm not surprised Googlebot doesn't follow that redirect properly.

So, try HTTP/1.1, this time with a Host: header:

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.1
Host: greenpeace.org

HTTP/1.0 302 Found
Location: http://www.greenpeace.org/international_en/
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:24:53 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 342
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://www.greenpeace.org/international_en/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>
Connection closed by foreign host.

This time the client is presented with a valid redirect and a valid URL in the HREF.

Google's initial crawl uses HTTP/1.0 -- these are some hits pulled from my logs on another site.

Log.20030801:218.145.25.78 - - [01/Aug/2003:06:11:06 -0300] "GET /robots.txt HTTP/1.0" 404 8092 "-" "GoogleBot"
Log.20030801:218.145.25.78 - - [01/Aug/2003:06:36:47 -0300] "GET /robots.txt HTTP/1.0" 404 8084 "-" "GoogleBot"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /robots.txt HTTP/1.0" 404 8085 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /robots.txt HTTP/1.0" 404 8083 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /robots.txt HTTP/1.0" 404 8097 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /Tools/Unix/index.html HTTP/1.0" 200 11363 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Note that Google's requests are HTTP/1.0. It might be worthwhile making sure your site works for clients that are not HTTP/1.1 compliant, i.e. ones that may not send a Host: header, as in the first telnet session above.
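If you want to check your own logs the same way, here is a rough Python sketch that counts which HTTP versions Googlebot used. It assumes Combined Log Format like the excerpt above; the function name googlebot_versions is made up for illustration, and the regex is only as strict as these sample lines need.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "METHOD path HTTP/x.y" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_versions(lines):
    """Count HTTP protocol versions used by requests whose agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        # finditer rather than match: a grep-style "filename:" prefix ends up
        # inside the host field, and several entries on one physical line
        # are still found one after another.
        for m in LOG_RE.finditer(line):
            if "googlebot" in m.group("agent").lower():
                counts[m.group("version")] += 1
    return counts
```

Feed it an open log file (a file object iterates line by line) and look at the resulting Counter. If every Googlebot hit is HTTP/1.0, test your entry page with an HTTP/1.0 request.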

I would bet this has much more to do with Google not spidering your site than dynamic content does.

While ? and & in a URL are inherently evil giveaways of dynamic content, and Google seems either to penalize them slightly or to flag the site as dynamic and cap the number of pages it will spider, I have dynamic sites (not using AOLserver) with 40000+ pages that are indexed in Google.

With that said, 302 redirects are evil for entry pages. I had a site that appeared to be banned specifically because my session-tracking code checked whether the cookie was set and, if not, munged the URL. So when Google's bot didn't send the cookie back, the result was a double redirect and a munged URL. It took 14 months of convincing Google that I wasn't doing anything stealthy, plus a big rewrite of the code, to fix things.
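I haven't described what the rewrite looked like, so take this only as an illustrative Python sketch of one way to avoid the problem: serve the entry page with a 200 whether or not the cookie came back, and just set the cookie again. The cookie name, the function name and the placeholder page body are all hypothetical.

```python
SESSION_COOKIE = "session_id"  # hypothetical cookie name


def respond_to_entry_request(cookies, new_session_id):
    """Return (status, headers, body) for the entry page.

    A client that sends no session cookie (a crawler, typically) still gets
    the page itself with a 200. There is no 302 and no session id munged
    into the URL; the cookie is simply offered again.
    """
    headers = [("Content-Type", "text/html; charset=iso-8859-1")]
    if SESSION_COOKIE not in cookies:
        headers.append(
            ("Set-Cookie", f"{SESSION_COOKIE}={new_session_id}; Path=/")
        )
    body = "<html><body>entry page goes here</body></html>"  # placeholder
    return 200, headers, body
```

The trade-off is that clients which never accept cookies get a fresh session on every hit, which is usually harmless for a crawler.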

So, check the HTTP/1.0 response that you're handing out, fix that, and I'll bet you get back into Google after a few crawls.
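For anyone who would rather script the telnet test than type it, here is a minimal Python sketch of the same check: send a bare HTTP/1.0 GET with no Host: header, read the response headers, and see whether the host named in Location: resolves. All function names are my own invention, and it uses your default resolver rather than querying the authoritative nameservers the way the host commands above do.

```python
import socket
from urllib.parse import urlsplit


def build_request(path="/", version="1.0", host=None):
    """Build a bare GET request; leave host=None to send no Host: header."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        lines.append(f"Host: {host}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_headers(server, request, port=80, timeout=10.0):
    """Send the request and read until the end of the response headers (or EOF)."""
    buf = b""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(request)
        while b"\r\n\r\n" not in buf:
            data = sock.recv(4096)
            if not data:
                break
            buf += data
    return buf


def parse_redirect(raw):
    """Return (status_code, location); location is None if no Location: header."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    status = int(lines[0].split()[1])
    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status, location


def host_resolves(hostname):
    """True if the local resolver can find an address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_entry_page(server, path="/"):
    """Return (status, location, ok); ok is False if the redirect host is unresolvable."""
    status, location = parse_redirect(fetch_headers(server, build_request(path)))
    host = urlsplit(location).hostname if location else None
    # A missing or relative Location has no host to look up, so it passes.
    ok = host is None or host_resolves(host)
    return status, location, ok
```

Call check_entry_page with your own server name. To compare against the HTTP/1.1 behaviour, build the request with version="1.1" and host set, and pass it to fetch_headers yourself.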

Posted by Brian Fitzgearld on
Whoa. A belated THANKS, Chris.  Fantastic bit of forensics, much obliged.