Forum OpenACS Q&A: Google does NOT index BBoard?

Posted by Andrew Piskorski on
It appears that all the individual posts under http://openacs.org/bboard/ are not indexed by Google at all.

For example, Google for site:openacs.org TUX. You'll find all the openacs.org/shared/community-member.tcl?user_id= pages for everyone who posted to a thread with "TUX" in the title, and at the bottom, Google will even give you a link to a BBoard search of "apache vs%": openacs.org/bboard/search-entire-system.tcl?query_string=apache%20vs%25

But Google will NOT give you a link to the actual "request for review: TUX and nsd document" thread on BBoard!

Then try some variations on the above Google search, designed to match exactly the BBoard thread itself:

Posted by Tom Jackson on

Maybe look into what .vuh files will do. Maybe you can turn q-and-a-fetch-msg.tcl?msg_id=0006Mv&topic_id=11&topic=OpenACS into /bboard/OpenACS/q-and-a-fetch-msg/11/006Mv/ or something similar.
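
As a rough illustration of the mapping suggested above, here is a minimal sketch in Python. A real .vuh handler in OpenACS would be written in Tcl; the function name `to_path_url`, the `/bboard` prefix default, and the segment order are assumptions made only for this example.

```python
from urllib.parse import parse_qs, quote, urlsplit


def to_path_url(url, prefix="/bboard"):
    """Turn page.tcl?msg_id=..&topic_id=..&topic=.. into a path-style URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Page name without any directory part or the .tcl extension.
    page = parts.path.rsplit("/", 1)[-1]
    if page.endswith(".tcl"):
        page = page[: -len(".tcl")]
    # Assumed segment order: topic name, page, topic_id, msg_id.
    segments = [
        query["topic"][0],
        page,
        query["topic_id"][0],
        query["msg_id"][0],
    ]
    return prefix + "/" + "/".join(quote(s, safe="") for s in segments) + "/"


print(to_path_url("q-and-a-fetch-msg.tcl?msg_id=0006Mv&topic_id=11&topic=OpenACS"))
# prints /bboard/OpenACS/q-and-a-fetch-msg/11/0006Mv/
```

The same table of segment names would have to be used on the server side to turn the path back into the variables the page expects.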

Posted by Dave Bauer on
Google definitely does not exclude links that use query parameters.
Posted by Jeff Davis on
Google does index pages with query variables, and there are 50 or 60 threads that got indexed, just not all of them. It might be that the server fell over when it got spidered.
Posted by Dave Bauer on
Interestingly, the sdm pages all seem to be indexed.
Posted by Jonathan Ellis on
I just checked and the carnageblender bboard is indexed.  Like Jeff says, maybe the server fell over, or maybe openacs.org runs some code to block out spiders.
Posted by Tom Jackson on

I've had Google request over two thousand pages in under three minutes before. Sometimes 30 Googlebots grab stuff at one time. Googlebot is a very unfriendly spider for sites with lots of pages but limited hardware. Also, despite its ability to grab URLs with query vars, it seems to lose interest in a site without grabbing everything. I wonder if it limits depth, maybe using the number of query vars as a substitute for depth.

Posted by Vadim Makarov on
I guess that Google gives lower indexing priority to pages with query variables, because it thinks they are less likely to contain stuff of long-term value. At least one Russian search engine says this explicitly in its FAQ, adding that it will index more "?"-containing URLs on a site with a higher linking rank. Another search engine says it ignores all dynamically generated pages: *.asp*, *.php*, *.pl*, */cgi-bin/* etc.

Search engines think that "plain HTML" URLs are usually authored by hand and are better literature to search for.

Also, URLs without parameters tend to be shorter and more human-readable. I can understand that

http://dev.openacs.org/forums/forum-view?forum_id=14013
makes sense to those who program, but to the rest of the users
http://dev.openacs.org/forums/OpenACS/
or at least
http://dev.openacs.org/forums/1/
would look way more logical.

Posted by Lars Pind on
That's why I've been advocating developing packages where one forum is one package instance, then you'd get URLs like /forums/openacs/.

The other advantage is that you get a URL that you can actually type that takes you directly to where you want to be, instead of to a page where you have to click to get to where you want to be.

/Lars

Posted by Jon Griffin on
Lars,
Do you have some more info on your idea?

I would be interested in doing something like that.

Posted by Lars Pind on
Look at bug-tracker and lars-blogger, they work that way. It's just a question of *not* writing another layer of indirection.

/Lars

Posted by Jonathan Ellis on
If the bboard package only knew about a single forum's messages, it would be difficult to allow admins to move messages to a more appropriate forum...
Posted by Lars Pind on
True, but it's still just as hard to move a message to a more appropriate forum if that forum happens to belong to another instance of the "forums" package. So there's still a more general problem left to solve.

This particular problem could be solved by allowing admins to move a message to another forum that sits at the same level in the site map. That would be easy to implement, and we'd be back to the status quo.

The tricky part is when you want to move to another instance in another place in the site map, because you have to show the context. Say they're all called "Forum", but one sits below "Project A" and another below "Project B". You'd have to include that context for people to figure out which one they want.

This is not a super-complex problem by any means. It's just not *quite* as trivial as when we're staying within one context.

/Lars

Posted by Andrew Piskorski on
I think this URL format issue is more general than, and mostly orthogonal to, the question of whether each BBoard forum should be a separate package instance or not.

Maybe that change alone would get our BBoard threads indexed by Google. But maybe not. And even if it does, there is no guarantee that Google - or any other search engine - will continue to handle URLs the same way they do right now.

Ideally, anything that we want to look like static content to a search engine should be presented with a static-looking URL. That is, no ? or & in the URL; the query variables are instead embedded implicitly between / symbols.
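
To make "embedded implicitly between / symbols" concrete, here is a minimal sketch in Python of how a server could recover the variables from such a URL. The layout /bboard/<topic>/<page>/<topic_id>/<msg_id>/ and the names `VAR_NAMES` and `path_to_vars` are assumptions for illustration only, not an existing toolkit API.

```python
from urllib.parse import unquote

# Hypothetical layout: /bboard/<topic>/<page>/<topic_id>/<msg_id>/
VAR_NAMES = ("topic", "page", "topic_id", "msg_id")


def path_to_vars(path, prefix="/bboard/"):
    """Map a static-looking path back to the variables a page expects.

    Returns None when the path does not match the assumed layout.
    """
    if not path.startswith(prefix):
        return None
    # Drop empty pieces so a trailing slash does not add a segment.
    segments = [unquote(s) for s in path[len(prefix):].split("/") if s]
    if len(segments) != len(VAR_NAMES):
        return None
    # The position of each segment decides which variable it fills.
    return dict(zip(VAR_NAMES, segments))


print(path_to_vars("/bboard/OpenACS/q-and-a-fetch-msg/11/0006Mv/"))
```

Because position alone identifies each variable, this only works for pages whose variables are all required and have a fixed order; optional variables would need some other convention.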

Now, if you have that new "static content" URL scheme working for, say, BBoard, it makes sense to use only the new "static content" URLs for BBoard. There's no good reason to have the human users see one sort of URL and search engine robots see another, because if they do, then when a person manually links to something from their homepage, they'll be using the URL format that the search engine doesn't like - not good.

Whatever this tool to eliminate query variables from the URL is, should it not be powerful enough to use for the entire toolkit if we so desire, even if we choose to use it only in certain specific targeted applications? Has this been discussed/designed before?

Posted by Andrew Piskorski on
Let me back up for a minute: All that stuff about URL formats is assuming that the format of the URL, whether it looks "static" or "dynamic", really is important to whether or not real search engines out there index the content.

Some posts above imply that, at least for Google, that may not actually be true, or may only be partially true.

Whatever else, other than the URL format, affects whether or not Google indexes stuff (like the server maybe falling over under massive Googlebot load) is at least as important as, and probably more important than, the URL format issue itself.

But if it's likely that the URL will always be at least a piece of the puzzle, it would be very nice to have a tool to present whatever URL format we want, and that's worth discussing.

Posted by Lars Pind on
Andrew,

For the record, I wasn't concerned about the Google thing at all here, just talking about the single-forum vs. multiple-forums issue.

I think ridding the URLs of some of the query vars, like you're talking about, is a good thing. It's been discussed before. It's just one of those things that a patient and thorough hacker with enough time on his or her hands needs to go ahead and get done :)

/Lars

Posted by Robert Locke on
I'm guessing Google limits the number of documents it crawls under a given URL (e.g., openacs.org/bboard in this case).

Maybe it does this as a precaution against "infinite" URLs. For instance, imagine crawling through our very own calendar application. If you followed all the links, it would never end, since you would be crawling through all the years (2003, 2004, ...).

That's my theory anyways. =)
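
The theory above can be sketched in Python as a crawler with a page budget. This is purely illustrative: nothing here is known about how Googlebot actually works, and `crawl`, `calendar_links`, `max_pages`, and the /calendar/?year= URL are made-up names for the example. A single budget for the whole crawl stands in for a per-URL-prefix limit.

```python
from collections import deque


def crawl(start, links_of, max_pages=50):
    """Breadth-first crawl that gives up after max_pages documents."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for nxt in links_of(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return fetched


def calendar_links(url):
    # Hypothetical calendar: every year's page links to the next year, forever.
    year = int(url.rsplit("=", 1)[-1])
    return [f"/calendar/?year={year + 1}"]


pages = crawl("/calendar/?year=2003", calendar_links, max_pages=5)
print(pages[-1])  # prints /calendar/?year=2007
```

Without the budget the loop would never terminate on the calendar; with it, the crawler also stops early on a large but finite bboard, which would match the 50 or 60 threads that did get indexed.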

Posted by Dave Bauer on
Here is some information and links. It appears that Google did change its system.

http://diveintomark.org/archives/2002/10/03.html#when_an_engineer_flaps_his_wings