Forum OpenACS Development: OpenACS sucks on Google!

20: OpenACS sucks on Google! (response to 1)
Posted by Andrew Piskorski on
Last I checked, Google does an incredibly lousy job of indexing openacs.org. AFAICT none of the Forums threads show up at all - not one! And no, I'm not necessarily volunteering to fix this, but I suggest that making openacs.org Google-friendly would be vastly more productive than re-writing the home page.

Eric Wolfram has some good info on that, and he posted more of it here in the Forums. And note the link at the top of that post to the previous big Google discussion, too.

If I remember correctly, back when openacs.org was running OpenACS 3.x, it was much better indexed by Google. Maybe the big "no forums threads" snafu happened when cutting over from the 3.x BBoard to the 4.x Forums package (which of course was a long time ago), but I don't really know.

Posted by Bart Teeuwisse on
I wouldn't be surprised if the poor ranking of openacs.org (the forums in particular) is due to the use of pages that require query parameters. Forums, for example, uses forum-view and message-view, both of which take an ID to locate the forum/message to display.

As Eric Wolfram observes in http://wolfram.org/writing/howto/3_1.html, directory index pages rank better and are indexed more frequently by Google. I can attest to that too. About 2 weeks ago I added a 'GNU Arch' section to 'the Code Mill' (http://www.thecodemill.biz/services/arch/). At the same time I started hosting the #arch IRC logs (http://www.thecodemill.biz/services/arch/irc/).

A fortnight later these pages rank within the top 10 Google search results for 'GNU Arch IRC' (http://www.google.com/search?&q=gnu+arch+irc). At the same time, the latest daily log (not an index page) that Google appears to know about is over a week old.

Changing the forums package to use .vuh files instead of forum-view or message-view would probably make quite a difference.
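
To make the idea concrete, here is a minimal sketch of the URL mapping such a change implies. It is plain Python for illustration only, not OpenACS code, and the path layout (/forums/message/123) and the parameter names forum_id and message_id are assumptions, not what the Forums package actually uses.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping: page name -> (path segment, ID parameter name).
PRETTY = {
    "forum-view": ("forum", "forum_id"),
    "message-view": ("message", "message_id"),
}


def to_path_url(url: str) -> str:
    """Rewrite /forums/message-view?message_id=123 as /forums/message/123."""
    parts = urlsplit(url)
    base, _, page = parts.path.rpartition("/")
    if page not in PRETTY:
        return url  # leave URLs we do not recognize untouched
    segment, param = PRETTY[page]
    ids = parse_qs(parts.query).get(param)
    if not ids:
        return url  # no ID in the query string, nothing to rewrite
    return f"{base}/{segment}/{ids[0]}"


def to_query_url(path: str) -> str:
    """Inverse mapping: turn a path-style URL back into the query-style one,
    roughly the job a virtual URL handler would have to do before dispatching."""
    for page, (segment, param) in PRETTY.items():
        match = re.fullmatch(rf"(.*)/{segment}/(\d+)", path)
        if match:
            return f"{match.group(1)}/{page}?{param}={match.group(2)}"
    return path
```

With this, links on the site would be emitted via to_path_url(), and incoming path-style requests would be translated back with to_query_url() so the existing pages keep working unchanged.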

/Bart

Posted by Joel Aufrecht on
Hmm - the number one link at the moment on Google for http://www.google.com/search?&q=gnu+arch+irc is a page with a query parameter in the URL:

www.thecodemill.biz/publications/blog/one-entry?entry_id=9769

So either something's changing or the query parameter is not a 100% fatal problem.

Also noteworthy is that Bart's page returns a 200, not a 301 or 302, and that it requires a host: tag. Can someone explain to me why a host: tag is required and how my browser knows to use it? (I had to use "view HTTP Headers" in the Mozilla Web Development Toolbar to find the difference between a successful Mozilla page view and a failed telnet :80 page view.)

joel@joel-desktop joel: telnet www.thecodemill.biz 80
Trying 66.92.28.174...
Connected to dsl092-028-174.sfo4.dsl.speakeasy.net.
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0

HTTP/1.0 503 Service Unavailable
Content-Type: text/html
Content-Length: 169

<html><head><title>503 Service Unavailable</title></head><body><h1>503 Service Unavailable</h1><p>The service is not available. Please try again later.</p></body></html>Connection closed by foreign host.

joel@joel-desktop joel.import: telnet www.thecodemill.biz 80
Trying 66.92.28.174...
Connected to dsl092-028-174.sfo4.dsl.speakeasy.net.
Escape character is '^]'.
GET /publications/blog/one-entry?entry_id=9769 HTTP/1.0
Host: www.thecodemill.biz

HTTP/1.0 200 OK
Set-Cookie: ad_session_id=6485003%2c0 %7b934 1075884795 4C0E03284109752C174B64754F47C1B355011360%7d; Path=/; Max-Age=1200
MIME-Version: 1.0
Date: Wed, 04 Feb 2004 08:33:15 GMT
Server: AOLserver/4.1
Content-Type: text/html
Content-Length: 14286
Connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
 <html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
    <title>the Code Mill | GNU Arch IRC logs on-line</title>
    <link href="/style/sheets/thecodemill.css" rel="stylesheet" type="text/css">
    <style type="text/css" media="screen">@import "/publications/photos/photo.css";</style>
    <script src="/scripts/searchhi.js" type="text/javascript"></script>

  </head>

(rest of page omitted)
Posted by Bart Teeuwisse on
Interesting Joel,

when I googled for 'gnu arch irc' a few hours prior to your search the first 2 hits in thecodemill.biz domain were /services/arch and /services/arch/irc, NOT the blog entry that tops the search result set now.

You have to include the Host header because thecodemill.biz is behind a hostname-based proxy: Pound. In other words, thecodemill.biz is a virtual server.

/Bart

Posted by Jeff Davis on
Joel, Bart's site is vhosted (I think he uses Pound, but I am sure he will pipe up). When a site is vhosted and you don't send the Host header, the proxy has no way to tell which server to talk to.
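
As a rough illustration of that dispatch step, here is a small Python sketch of name-based routing. It is not how Pound is implemented; the backend table, addresses, and function names are made up for the example. The point is only that the Host header is the lookup key, so a request without it cannot be matched to any backend.

```python
from typing import Dict, Optional, Tuple

# Hypothetical routing table: Host header value -> backend (address, port).
BACKENDS: Dict[str, Tuple[str, int]] = {
    "www.thecodemill.biz": ("127.0.0.1", 8001),
    "www.example.org": ("127.0.0.1", 8002),
}


def parse_host(raw_request: str) -> Optional[str]:
    """Return the Host header value of a raw HTTP request, or None if absent."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            # Host names are case-insensitive; drop an optional ":port" suffix.
            return value.strip().lower().split(":")[0]
    return None


def route(raw_request: str) -> Optional[Tuple[str, int]]:
    """Pick a backend by Host header. None means the proxy has nothing to
    forward to and can only answer with an error of its own."""
    host = parse_host(raw_request)
    if host is None:
        return None
    return BACKENDS.get(host)
```

Joel's first telnet request carries no Host line, so route() would return None for it, which matches the 503 in his transcript; the second request names www.thecodemill.biz and gets a backend.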

As for indexing pages with query variables, google definitely does, although it limits the number of pages it gets this way according to their information for webmasters:

1. Reasons your site may not be included.
Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index.
Posted by Steve Manning on
"Can someone explain to me why a host: tag is required and how my browser knows to use it."

Isn't it the case that your browser always has to send it if it is sending an HTTP/1.1 request? See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.23:

"A client MUST include a Host header field in all HTTP/1.1 request messages . If the requested URI does not include an Internet host name for the service being requested, then the Host header field MUST be given with an empty value. An HTTP/1.1 proxy MUST ensure that any request message it forwards does contain an appropriate Host header field that identifies the service being requested by the proxy. All Internet-based HTTP/1.1 servers MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message which lacks a Host header field."

    Steve
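
The last sentence of that RFC passage can be sketched as a small check. This is an illustrative Python function with a made-up name, not code from any real server, and it only models the one rule quoted above: an HTTP/1.1 request without a Host header gets a 400, while an HTTP/1.0 request is not rejected on that basis.

```python
def status_for(raw_request: str) -> int:
    """Return 400 for an HTTP/1.1 request that lacks a Host header, else 200."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    request_line = lines[0].split()
    # The protocol version is the last token of the request line.
    version = request_line[-1] if request_line else ""
    has_host = any(line.lower().startswith("host:") for line in lines[1:])
    if version == "HTTP/1.1" and not has_host:
        return 400
    return 200
```

This also shows why the telnet test earlier in the thread did not produce a 400: it used HTTP/1.0, where the header is optional as far as the protocol is concerned, so the failure came from the vhosting proxy rather than from this rule.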