Forum OpenACS Q&A: Re: Google & Co on dynamic content

Posted by Brian Fitzgearld on
I'm disappointed in the response to this and other threads about Google and OpenACS.  In our experience at Greenpeace, Google has been doing an appalling job of indexing the site since we switched to OpenACS.  It's a massive drawback, and I would think a major barrier to wider acceptance of OpenACS.

While I accept that it may be a generic problem for dynamic content sites, I don't know many major sites that would put up with their content living out there in the Googleless void for long -- there must be solutions, and if there's one that applies to OpenACS sites, let's hear it.  I have yet to see a post from anyone adequately acknowledging this problem or suggesting a fix. Tillman's suggestion is the closest thing to a workaround I've seen, but if it gets you banned from Google, you might as well be banned from the web.

This is a biggy.

--b

(P.S. Aw jeepers, my first post to this forum, and it has to be a rant. Howdy, y'all)

Posted by Chris Davies on
Really, what are you expecting?

Google requests your entry page and gets a 302 redirect to a host that doesn't exist.  Note the Location: header that is sent to Google's bot.  Also note that the href in the page body is invalid as well.

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 302 Found
Set-Cookie: ad_session_id=38906532%2c0%20%7b296%201061178092%20AB09A976A688F9EB525E86919C2AC1E120F4861B%7d; Path=/; Max-Age=1200
Location: http://greenpeace-01.fra.de.colt-isc.net/homepage
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:21:32 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 348
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://greenpeace-01.fra.de.colt-isc.net/homepage">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>

mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns1.de.colt.net
greenpeace-01.fra.de.colt-isc.net A record currently not present at ns1.de.colt.net
mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns0.de.colt.net
greenpeace-01.fra.de.colt-isc.net does not exist at ns0.de.colt.net (Authoritative answer)

I'm not surprised it doesn't follow properly.
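The same check can be scripted. A minimal sketch in Python, assuming only the standard library; the names `redirect_host` and `host_resolves` are made up for illustration and are not part of OpenACS or AOLserver. It pulls the hostname out of a Location value and asks a resolver whether that name exists.

```python
import socket
from urllib.parse import urlparse


def redirect_host(location):
    """Return the hostname part of a Location header value, or None."""
    return urlparse(location).hostname


def host_resolves(host, resolver=socket.getaddrinfo):
    """Return True if the resolver can find an address for host.

    The resolver argument is there so the check can be exercised without
    touching the network; by default the system resolver is used.
    """
    try:
        resolver(host, 80)
        return True
    except socket.gaierror:
        return False
```

Calling `host_resolves(redirect_host(location))` on the Location value from the HTTP/1.0 response above would be the scripted equivalent of the `host` lookups, though it asks the local resolver rather than a specific name server.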

So, you try HTTP/1.1:

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.1
Host: greenpeace.org

HTTP/1.0 302 Found
Location: http://www.greenpeace.org/international_en/
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:24:53 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 342
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://www.greenpeace.org/international_en/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>
Connection closed by foreign host.

And this time you are presented with a valid redirect and a valid URL in the HREF.
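The difference between the two sessions is only the request line and the Host: header. A rough sketch in Python of building both requests and reading the redirect out of a raw response; `build_request` and `parse_redirect` are hypothetical helper names, and actually sending the bytes over a socket is left out.

```python
def build_request(path, version, host=None):
    """Build a raw GET request; HTTP/1.0 clients may omit the Host header."""
    lines = ["GET %s HTTP/%s" % (path, version)]
    if host is not None:
        lines.append("Host: %s" % host)
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_redirect(raw_response):
    """Return (status_code, location) from a raw HTTP response string.

    location is None when the response carries no Location header.
    """
    # Keep only the header block, tolerating either line-ending style.
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = head.split("\n")
    status = int(lines[0].split()[1])
    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
    return status, location
```

Comparing the Location returned for `build_request("/", "1.0")` with the one returned for `build_request("/", "1.1", "greenpeace.org")` reproduces the telnet comparison above.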

Google's initial crawl uses HTTP/1.0 -- these are some hits pulled from my logs on another site.

Log.20030801:218.145.25.78 - - [01/Aug/2003:06:11:06 -0300] "GET /robots.txt HTTP/1.0" 404 8092 "-" "GoogleBot"
Log.20030801:218.145.25.78 - - [01/Aug/2003:06:36:47 -0300] "GET /robots.txt HTTP/1.0" 404 8084 "-" "GoogleBot"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /robots.txt HTTP/1.0" 404 8085 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /robots.txt HTTP/1.0" 404 8083 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /robots.txt HTTP/1.0" 404 8097 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /Tools/Unix/index.html HTTP/1.0" 200 11363 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Note that Google's requests are HTTP/1.0. It might be worthwhile to make sure your site works for clients that are not HTTP/1.1 compliant.
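To check this against your own logs, something like the following Python sketch would tally the HTTP version of each Googlebot hit. It assumes access-log entries shaped like the ones quoted above (request, status, size, referrer, user agent, each quoted field in double quotes); the regular expression and the name `googlebot_versions` are mine, so adjust them to your log format.

```python
import re
from collections import Counter

# Matches the request and user-agent fields of a log entry like those above.
ENTRY = re.compile(
    r'"(?:GET|HEAD|POST) \S+ (HTTP/\d\.\d)" \d{3} \d+ "[^"]*" "([^"]*)"'
)


def googlebot_versions(log_text):
    """Count Googlebot requests in log_text by HTTP version."""
    counts = Counter()
    for version, agent in ENTRY.findall(log_text):
        if "googlebot" in agent.lower():
            counts[version] += 1
    return counts
```

If every count lands under HTTP/1.0, the crawler is seeing whatever your server hands to clients that may not send a Host: header.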

I would bet this has much more to do with Google not spidering your site than dynamic content does.

While ? and & in a URL are inherently evil giveaways of dynamic content, and Google seems either to penalize slightly for them or to flag the site as dynamic and cap the number of pages it will spider, I have dynamic sites (not using AOLserver) with 40000+ pages that are indexed in Google.

With that said, 302 redirects are evil for entry pages. I had a site that appeared to be banned specifically because I used some session-tracking code that checked whether the cookie was set and, if not, munged the URL. Since Google's bot never sent the cookie back, the result was a double redirect and a munged URL.  It took 14 months of convincing Google that I wasn't doing anything stealthy, plus a big rewrite of the code, to fix things.

So, check the HTTP/1.0 response that you're handing out, fix that, and I'll bet you get back into Google after a few crawls.

Posted by Brian Fitzgearld on
Whoa. A belated THANKS, Chris.  Fantastic bit of forensics, much obliged.