Forum OpenACS Development: Re: OpenACS sucks on Google!
As Eric Wolfram observes in http://wolfram.org/writing/howto/3_1.html, directory index pages rank better and are indexed more frequently by Google. I can attest to that too. About 2 weeks ago I added a 'GNU Arch' section to 'the Code Mill' (http://www.thecodemill.biz/services/arch/). At the same time I started hosting the #arch IRC logs (http://www.thecodemill.biz/services/arch/irc/).
A fortnight later these pages rank within the top 10 Google search results for 'GNU Arch IRC' (http://www.google.com/search?&q=gnu+arch+irc). At the same time, the latest daily log (not an index page) that Google appears to know about is over a week old.
Changing the forums package to use .vuh files instead of forum-view or message-view would probably make quite a difference.
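To make the suggestion concrete, here is a minimal sketch of the mapping such a handler would perform, written in Python rather than the Tcl a real .vuh file would use. The path layout (`/forums/forum/<id>`, `/forums/message/<id>`) and the parameter names `forum_id` and `message_id` are assumptions for illustration; only the page names forum-view and message-view come from the post above.

```python
import re

# Hypothetical path-style URLs mapped onto the existing query-style pages.
# The path layout and the parameter names are assumptions, not the
# forums package's actual scheme.
ROUTES = [
    (re.compile(r"^/forums/forum/(\d+)/?$"), "/forums/forum-view?forum_id={0}"),
    (re.compile(r"^/forums/message/(\d+)/?$"), "/forums/message-view?message_id={0}"),
]


def resolve(path):
    """Return the internal query-style URL for a path-style URL, or None."""
    for pattern, target in ROUTES:
        match = pattern.match(path)
        if match:
            return target.format(match.group(1))
    return None
```

A crawler would then only ever see URLs such as `/forums/message/123`, which look like static pages, while the existing pages keep doing the work behind the scenes.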
So either something's changing or the query parameter is not a 100% fatal problem.
Also noteworthy is that Bart's page returns a 200, not a 301 or 302, and that it requires a Host: header. Can someone explain to me why a Host: header is required and how my browser knows to send it? (I had to use "View HTTP Headers" in the Mozilla Web Development Toolbar to find the difference between a successful Mozilla page view and a failed telnet :80 page view.)
When I googled for 'gnu arch irc' a few hours prior to your search, the first 2 hits in the thecodemill.biz domain were /services/arch and /services/arch/irc, NOT the blog entry that tops the search result set now.
The reason you have to include the Host header is that thecodemill.biz is behind a hostname-based proxy: pound. In other words, thecodemill.biz is a virtual server.
As for indexing pages with query variables, Google definitely does, although it limits the number of pages it gets this way, according to their information for webmasters:
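A small sketch of why the telnet attempt fails while the browser succeeds: the browser copies the host name from the URL into a Host header, and a name-based proxy uses that header to choose a backend. This is not pound's actual code or configuration. The `BACKENDS` table, its addresses, and the function names are made up for illustration.

```python
def build_request(host, path="/", include_host=True):
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = ["GET {} HTTP/1.1".format(path)]
    if include_host:
        # A browser fills this in from the host part of the URL.
        lines.append("Host: {}".format(host))
    lines.append("Connection: close")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Hypothetical routing table for a name-based proxy (assumed values).
BACKENDS = {
    "www.thecodemill.biz": ("127.0.0.1", 8001),
    "thecodemill.biz": ("127.0.0.1", 8001),
}


def pick_backend(raw_request):
    """Return the backend matching the request's Host header, or None."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("ascii")
    for line in head.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            # Drop an optional ":port" suffix before the lookup.
            hostname = value.strip().lower().split(":")[0]
            return BACKENDS.get(hostname)
    return None  # no Host header: the proxy cannot tell which site is meant
```

A bare `GET / HTTP/1.1` typed into telnet corresponds to `build_request(..., include_host=False)`, for which `pick_backend` finds nothing to route to.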
1. Reasons your site may not be included.
Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index.
Isn't it the case that your browser always has to send it when it makes an HTTP/1.1 request? See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23:
"A client MUST include a Host header field in all HTTP/1.1 request messages . If the requested URI does not include an Internet host name for the service being requested, then the Host header field MUST be given with an empty value. An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain an appropriate Host header field that identifies the service being requested by the proxy. All Internet-based HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message which lacks a Host header field."