Forum OpenACS Development: Re: OpenACS sucks on Google!

Posted by Joel Aufrecht on
Hmm - the number one link at the moment on Google for http://www.google.com/search?&q=gnu+arch+irc is a page with a query parameter in the URL:

www.thecodemill.biz/publications/blog/one-entry?entry_id=9769

So either something's changing or the query parameter is not a 100% fatal problem.

Also noteworthy is that Bart's page returns a 200, not a 301 or 302, and that it requires a host: tag. Can someone explain to me why a host: tag is required and how my browser knows to use it? (I had to use "view HTTP Headers" in the Mozilla Web Development Toolbar to find out the difference between a successful Mozilla page view and a failed telnet :80 page view.)

joel@joel-desktop joel: telnet www.thecodemill.biz 80
Trying 66.92.28.174...
Connected to dsl092-028-174.sfo4.dsl.speakeasy.net.
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0

HTTP/1.0 503 Service Unavailable
Content-Type: text/html
Content-Length: 169

<html><head><title>503 Service Unavailable</title></head><body><h1>503 Service Unavailable</h1><p>The service is not available. Please try again later.</p></body></html>Connection closed by foreign host.

joel@joel-desktop joel.import: telnet www.thecodemill.biz 80
Trying 66.92.28.174...
Connected to dsl092-028-174.sfo4.dsl.speakeasy.net.
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0
Host: www.thecodemill.biz

HTTP/1.0 200 OK
Set-Cookie: ad_session_id=6485003%2c0 %7b934 1075884795 4C0E03284109752C174B64754F47C1B355011360%7d; Path=/; Max-Age=1200
MIME-Version: 1.0
Date: Wed, 04 Feb 2004 08:33:15 GMT
Server: AOLserver/4.1
Content-Type: text/html
Content-Length: 14286
Connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
 <html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
    <title>the Code Mill | GNU Arch IRC logs on-line</title>
    <link href="/style/sheets/thecodemill.css" rel="stylesheet" type="text/css">
    <style type="text/css" media="screen">@import "/publications/photos/photo.css";</style>
    <script src="/scripts/searchhi.js" type="text/javascript"></script>

  </head>

(rest of page omitted)
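
The two telnet sessions above differ only in the Host line. As a rough sketch, the same pair of requests could be built in Python like this. `build_request` is a hypothetical helper name, not part of any library, and the commented-out socket call only shows where the bytes would go; it is not run here.

```python
from typing import Optional


def build_request(path: str, host: Optional[str] = None) -> bytes:
    """Build a minimal HTTP/1.0 GET request, optionally with a Host header."""
    lines = [f"GET {path} HTTP/1.0"]
    if host is not None:
        lines.append(f"Host: {host}")
    # A blank line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


PATH = "/publications/blog/one-entry?entry_id=9769"

# First telnet session: request line only.
without_host = build_request(PATH)
# Second telnet session: same request line plus the Host header.
with_host = build_request(PATH, host="www.thecodemill.biz")

# To actually send one of these, something along these lines should work:
#   import socket
#   with socket.create_connection(("www.thecodemill.biz", 80)) as s:
#       s.sendall(with_host)
#       print(s.recv(4096).decode("latin-1"))
```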
Posted by Bart Teeuwisse on
Interesting, Joel.

When I googled for 'gnu arch irc' a few hours prior to your search, the first two hits in the thecodemill.biz domain were /services/arch and /services/arch/irc, NOT the blog entry that tops the search result set now.

You have to include the Host header because thecodemill.biz is behind a hostname-based proxy: Pound. In other words, thecodemill.biz is a virtual server.

/Bart

Posted by Jeff Davis on
Joel, Bart's site is vhosted (I think he uses Pound, but I am sure he will chime in). When a site is vhosted and you don't send the Host header, the front end has no way to tell which server to hand the request to.
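
A toy illustration of that point, in Python: a name-based front end looks up the backend by the Host header and has nothing to go on when the header is absent. The backend table, the addresses, and the `choose_backend` function are made up for illustration. This is not how Pound is configured or implemented.

```python
from typing import Dict, Optional

# Hypothetical mapping from host name to backend address (illustration only).
BACKENDS: Dict[str, str] = {
    "www.thecodemill.biz": "127.0.0.1:8001",
    "www.example.org": "127.0.0.1:8002",
}


def parse_host(raw_request: str) -> Optional[str]:
    """Return the Host header value from a raw request, or None if absent."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return value.strip()
    return None


def choose_backend(raw_request: str) -> Optional[str]:
    """Pick a backend by Host header; None means no backend can be chosen."""
    host = parse_host(raw_request)
    if host is None:
        return None
    # Ignore any port suffix and letter case when matching the name.
    return BACKENDS.get(host.split(":")[0].lower())


no_host = "GET / HTTP/1.0\r\n\r\n"
with_host = "GET / HTTP/1.0\r\nHost: www.thecodemill.biz\r\n\r\n"

print(choose_backend(no_host))    # no Host header, so nothing to route on
print(choose_backend(with_host))  # routed by name
```

A front end in this position would typically answer with an error when `choose_backend` returns None, which is consistent with the 503 in Joel's first telnet session.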

As for indexing pages with query variables, Google definitely does, although it limits the number of pages it gets this way, according to their information for webmasters:

1. Reasons your site may not be included.
Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index.
Posted by Steve Manning on
"Can someone explain to me why a host: tag is required and how my browser knows to use it."

Isn't it the case that your browser always has to send it if it is sending an HTTP/1.1 request? See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23

"A client MUST include a Host header field in all HTTP/1.1 request messages. If the requested URI does not include an Internet host name for the service being requested, then the Host header field MUST be given with an empty value. An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain an appropriate Host header field that identifies the service being requested by the proxy. All Internet-based HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message which lacks a Host header field."
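
The last sentence of that quote can be sketched as a small check, here in Python. `status_for` is a hypothetical helper and covers only the one rule quoted above: an HTTP/1.1 request without a Host header gets a 400. HTTP/1.0 requests are let through, since the quoted requirement is stated for HTTP/1.1 only.

```python
def status_for(raw_request: str) -> int:
    """Return 400 for an HTTP/1.1 request lacking a Host header, else 200."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    # The request line looks like: METHOD SP request-target SP HTTP-version
    version = lines[0].rsplit(" ", 1)[-1]
    has_host = any(
        ":" in line and line.partition(":")[0].strip().lower() == "host"
        for line in lines[1:]
    )
    if version == "HTTP/1.1" and not has_host:
        return 400
    return 200


print(status_for("GET / HTTP/1.1\r\n\r\n"))
print(status_for("GET / HTTP/1.1\r\nHost: www.thecodemill.biz\r\n\r\n"))
print(status_for("GET / HTTP/1.0\r\n\r\n"))
```

Joel's telnet requests used HTTP/1.0 on the request line, so this particular rule would not have applied to them.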

    Steve