Forum OpenACS Development: Re: OpenACS sucks on Google!

Posted by Bart Teeuwisse on
I wouldn't be surprised if the poor ranking, of the forums in particular, is due to the use of pages that require query parameters. Forums, for example, uses forum-view and message-view, both of which take an ID to locate the forum or message to display.

As Eric Wolfram observes, directory index pages rank better and are more frequently indexed by Google. I can attest to that too. About 2 weeks ago I added a 'GNU Arch' section to 'the Code Mill'. At the same time I started hosting the #arch IRC logs.

A fortnight later these pages rank within the top 10 Google search results for 'GNU Arch IRC'. At the same time, the latest daily log (not an index page) that Google appears to know about is over a week old.

Changing the forums package to use .vuh files instead of forum-view or message-view would probably make quite a difference.
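
As an illustration of the idea only: .vuh handlers in OpenACS are written in Tcl, so the Python below is not forums code. It sketches how a path-style URL could be mapped back to the existing pages. The /forums/forum/<id> and /forums/message/<id> layout and the forum_id and message_id parameter names are assumptions made for the example.

```python
import re

# Hypothetical path layout; a real .vuh handler could choose a different one.
_ROUTES = [
    (re.compile(r"^/forums/forum/(\d+)/?$"), "forum-view", "forum_id"),
    (re.compile(r"^/forums/message/(\d+)/?$"), "message-view", "message_id"),
]


def resolve_virtual_url(path):
    """Map a path-style URL to (page, params); return None if nothing matches."""
    for pattern, page, param in _ROUTES:
        match = pattern.match(path)
        if match:
            # The numeric path segment plays the role of the old query parameter.
            return page, {param: int(match.group(1))}
    return None
```

With a mapping like this, a crawler would see /forums/message/123 rather than a message-view URL with a query string, while the same page still does the work.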


Posted by Joel Aufrecht on
Hmm - the number one link at the moment on Google for that search is a page with a query parameter in the URL:

So either something's changing or the query parameter is not a 100% fatal problem.

Also noteworthy is that Bart's page returns a 200, not a 301 or 302, and that it requires a host: tag. Can someone explain to me why a host: tag is required and how my browser knows to use it? (I had to use "view HTTP Headers" in the Mozilla Web Development Toolbar to find out the difference between a successful Mozilla page view and a failed telnet :80 page view.)

joel@joel-desktop joel: telnet 80
Connected to
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0

HTTP/1.0 503 Service Unavailable
Content-Type: text/html
Content-Length: 169

<html><head><title>503 Service Unavailable</title></head><body><h1>503 Service Unavailable</h1><p>The service is not available. Please try again later.</p></body></html>Connection closed by foreign host.

joel@joel-desktop joel.import: telnet 80
Connected to
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0

HTTP/1.0 200 OK
Set-Cookie: ad_session_id=6485003%2c0 %7b934 1075884795 4C0E03284109752C174B64754F47C1B355011360%7d; Path=/; Max-Age=1200
MIME-Version: 1.0
Date: Wed, 04 Feb 2004 08:33:15 GMT
Server: AOLserver/4.1
Content-Type: text/html
Content-Length: 14286
Connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
    <title>the Code Mill | GNU Arch IRC logs on-line</title>
    <link href="/style/sheets/thecodemill.css" rel="stylesheet" type="text/css">
    <style type="text/css" media="screen">@import "/publications/photos/photo.css";</style>
    <script src="/scripts/searchhi.js" type="text/javascript"></script>


(rest of page omitted)


Posted by Bart Teeuwisse on
Interesting Joel,

when I googled for 'gnu arch irc' a few hours prior to your search the first 2 hits in domain were /services/arch and /services/arch/irc, NOT the blog entry that tops the search result set now.

The reason you have to include the Host header is that the site is behind a hostname-based proxy: Pound. In other words, it is a virtual server.


Posted by Jeff Davis on
Joel, Bart's site is vhosted (I think he uses Pound, but I am sure he will chime in). When a site is vhosted and you don't send the Host header, the proxy has no way to tell which server you want to talk to.
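
To make the vhosting point concrete, here is a rough Python sketch of name-based dispatch. It is not how Pound is implemented; the hostnames and backend addresses are made up for the example, and a real proxy decides for itself what to answer when nothing matches.

```python
# Hypothetical hostname -> backend table; not taken from any real proxy config.
BACKENDS = {
    "www.example.com": ("127.0.0.1", 8001),
    "blog.example.com": ("127.0.0.1", 8002),
}


def parse_host(raw_request):
    """Return the lower-cased Host header value (port stripped), or None."""
    lines = raw_request.split("\r\n")
    for line in lines[1:]:      # skip the request line
        if line == "":          # a blank line ends the header section
            break
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip().lower()
            # Drop an optional ":port" suffix (IPv6 literals are not handled).
            return host.split(":")[0] or None
    return None


def choose_backend(raw_request):
    """Pick a backend by Host header, or None when no virtual host matches."""
    return BACKENDS.get(parse_host(raw_request))
```

A request without a Host header gives `choose_backend` nothing to look up, so it returns None. That is the situation in which a proxy has to refuse the request or fall back to some default.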

As for indexing pages with query variables, Google definitely does index them, although it limits the number of pages it picks up this way, according to its information for webmasters:

1. Reasons your site may not be included.
Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index.


Posted by Steve Manning on
"Can someone explain to me why a host: tag is required and how my browser knows to use it."

Isn't it the case that your browser always has to send it if it is sending an HTTP/1.1 request:

"A client MUST include a Host header field in all HTTP/1.1 request messages . If the requested URI does not include an Internet host name for the service being requested, then the Host header field MUST be given with an empty value. An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain an appropriate Host header field that identifies the service being requested by the proxy. All Internet-based HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message which lacks a Host header field."