Forum OpenACS Q&A: Google & Co on dynamic content

Posted by Christof Spitz on
Could some wise person tell me whether search engines like Google find dynamically created content? That would presumably mean they somehow "hit" the *.tcl script and scan the resulting HTML page?

I'm asking because a participant in our online community wondered whether robots can somehow extract email addresses from an OpenACS installation. It would also be interesting to know whether our dynamic content appears in a Google search, because if it doesn't, that could be a disadvantage if you want your site to be found.

(Sorry, I am quite ignorant about the technical background.)

Posted by David Cotter on
Christof

Google just browses your site, following hyperlinks that appear on pages. So on the OpenACS homepage, for example, there are links to messages on the forums, which Google can retrieve.

Google can't login to the system though so it can only see content that is available to non-registered users.

If you click on the name of a poster on this forum you will not see their email address if you're not logged in and neither will Google.

As a developer you can detect whether the visitor is Google and either block it or serve it different content, etc.

Posted by Tilmann Singer on
_As a developer you can detect if the visitor is Google and either block it or serve it different content etc._

This is called cloaking and will most likely get you blocked from Google's index entirely. From time to time they test with non-detectable user agents and IP addresses to see whether you are serving the same content to the indexer as to your users, and if not, they assume you are trying to cheat them.

Posted by Tom Jackson on

You used to have to worry about having query strings in your public URLs, because search engines would not index them, for fear of falling into an infinitely deep sub-web.

Google now indexes any link on your page, even javascript links. However, they only go to a certain depth on any site now. They believe that any content of importance is within a few levels of an index page.

I haven't figured out how they know what an index page is, but it is interesting to listen to them pontificate on their bot technology. Put up a large site and eventually googlebot will visit like _Attack of the Clones_.

This might have implications for using pages like index.vuh, where the same page will occur at different depths, or under many subdirectories in a package. I don't think robots.txt allows wildcards in the path portion of a url.

Posted by James Thornton on
_Google now indexes any link on your page, even javascript links._

That's the first I've heard that they have started following JavaScript.

Google indexes dynamic content, up to a point. The fewer parameters in the URL, the more likely Google is to index it. Also, if the links to the dynamic pages come from a static page (i.e., one with no parameters), there is a better chance Google will index them, and Google's capabilities in this regard are improving all the time.

Posted by Brian Fitzgearld on
I'm disappointed in the response to this and other threads about Google and OpenACS.  In our experience at Greenpeace, Google is doing an appalling job of indexing the site since we switched to OpenACS.  It's a massive drawback, and I would think a major barrier to wider acceptance of OpenACS.

While I accept that it may be a problem generic to dynamic content sites, I don't know many major sites that would put up with their content living out there in the googleless void for long -- there must be solutions, and if there's one that applies to OpenACS sites, let's hear it. I've yet to see a post from anyone adequately acknowledging this problem or suggesting a fix. Tilmann's suggestion is the closest thing to a workaround I've seen, but if it gets you banned from Google, you might as well be banned from the web.

This is a biggy.

--b

(P.S. Aw jeepers, my first post to this forum, and it has to be a rant. Howdy, y'all)

Posted by Dirk Gomez on
Someone at the OpenACS social said that most search engines would just cut off the query string...

What about offering an alternative URL format, e.g.

http://openacs.org/forums/message-view/100650

That should be more search-engine friendly, right?

Posted by Dave Bauer on
Brian,

I agree with Dirk. Setting up index.vuh files to interpret the URLs is reasonably easy and could be used on the greenpeace web site.

I haven't personally seen any problems with my personal site; checking the logs, Google and every other search engine on the planet have visited and indexed the whole thing. But that might have to do with the fact that it is not very deep.

Posted by Andrew Herne on
Like Brian a first poster here. I've been lurking too long. Hi all.

Introducing myself. I'm IT Director (and yes that means programmer too!) at National Extension College, a distance learning not-for-profit in the UK.

http://www.nec.ac.uk

We've been running a very modified ACS 3.4 since summer 2001. Always heads down and no time to chat. [I know, lame excuse.]

This may or may not be relevant. I think it's the idea not the detail that may be helpful. We had big problems with Google mid 2002, losing all presence. Tracking down the problem proved tortuous, and we never really got to the bottom of it.

We realised that user sessions were implicated, and in particular the usca_p query string appended to ACS 3.4 ecommerce pages. We use a modified ecommerce module for most of our public pages as a rudimentary CMS.

We were able to recover by hacking a standard proc in ecommerce-defs.tcl:

ec_create_new_session_if_necessary

At the top of that proc we now list known spider user agents (googlebot, scooter, slurp, etc.) and match against the return value of util_GetUserAgentHeader. If we have a match, we set the user_session_id to 0 and quit the proc. It's my belief that Google gets stuck in a session loop which is obviously not relevant to it but led to it failing to spider the site.
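Roughly, the guard looks like this (a simplified sketch, not our exact code -- the real proc takes different arguments, and the spider list here is only illustrative):

```tcl
# Simplified sketch of the guard at the top of
# ec_create_new_session_if_necessary in ecommerce-defs.tcl.
proc ec_create_new_session_if_necessary {} {
    set user_agent [string tolower [util_GetUserAgentHeader]]
    foreach spider {googlebot scooter slurp} {
        if { [string match "*${spider}*" $user_agent] } {
            # Known robot: give it the null session and bail out,
            # so no usca_p query string gets appended to its URLs.
            set ::user_session_id 0
            return
        }
    }
    # ... normal session-creation logic follows ...
}
```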

Posted by David Cotter on
Dave

>Setting up index.vuh files to interpret the URLs is
>reasonably easy and could be used on the greenpeace web
>site.
I don't understand how this would help (I've never used a .vuh). It would be possible to get a .vuh to interpret

http://openacs.org/forums/message-view/100650

as

http://openacs.org/forums/message-view?message_id=100650

but how do you present these modified URLs to Google -- wouldn't it require quite a bit of work?
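For what it's worth, a minimal index.vuh along those lines might look something like this (a sketch only -- the procs are from the standard request processor, but I haven't tested this against forums):

```tcl
# forums/message-view/index.vuh (sketch)
# The request processor hands us the part of the URL after this
# file's directory, so /forums/message-view/100650 arrives with
# path_info "100650".
set message_id [ad_conn path_info]

# Accept only plain integers.
if { ![regexp {^[0-9]+$} $message_id] } {
    ns_returnnotfound
    ad_script_abort
}

# Pass the id to the real page as if it had been a query variable,
# then serve that page internally -- the client never sees a "?".
rp_form_put message_id $message_id
rp_internal_redirect /forums/message-view
```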

Posted by Dave Bauer on
David,

I am not sure of the exact amount of work. Generally I think we want to move toward building in support for URLs like this.  We added support for the URLs without ? for forums, but I don't know if the work was finished so that all links that are presented use these "pretty" URLs instead of the standard ones.

So the adp templates that display links would need to be modified to support the new URLs.

Posted by David Cotter on
I was not aware that these types of URL were considered non-standard. Is this mostly because of search engine problems, or are there other reasons why they are undesirable?
Posted by Jerry Asher on
Getting back to the original question about harvesting email addresses, I will note that it turns out that most OpenACS pages (but not all) that show email addresses only do so if the user logs in. The ones that leak this info are usually a bug. (I found a lot of these in various ticket tracker bug report modules.)

That said, from time to time I've advocated an email mangling API to be used whenever an email address is presented. It might obscure the email entirely, or display it in a JPEG, or display the @ as a GIF or something like that when no user is logged in, and it might present the email address as a mailto, or something different, when a user is logged in.
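Something as simple as this would do for a first cut (ad_present_email is a hypothetical name -- no such proc exists in the toolkit, this is just a sketch of the idea):

```tcl
# Hypothetical email-presentation proc -- not an existing API.
proc ad_present_email { email } {
    if { [ad_conn user_id] != 0 } {
        # Logged-in user: a normal mailto link.
        return "<a href=\"mailto:$email\">$email</a>"
    }
    # Anonymous visitor (possibly a robot): spell out the "@"
    # so simple harvesters don't recognize the address.
    regsub {@} $email { (at) } obscured
    return $obscured
}
```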

I am amused by how often a google search turns up results that include statements about how google should be using a frames compliant browser.

Posted by Tom Jackson on

Jerry: great idea about mangling the email addresses (where they show up). Probably your proposed metadata api for AOLserver would be useful for that purpose. Personally I would like the jpeg solution, although that leaves non-visual/non-graphic UAs at a disadvantage.

I have run a large dynamic site for a number of years (saleonall.com). This is based on the ecommerce package, but I have moved the navigation and display of product information to a 'static looking' setup. Also, the parts that should not be visited by robots are in separate directories, so they are easy to exclude. Look at robots.txt .
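For example, if the cart and checkout pages live in their own directories (the directory names here are illustrative, not my actual layout), robots.txt can exclude them wholesale:

```
User-agent: *
Disallow: /shopping-cart/
Disallow: /checkout/
Disallow: /admin/
```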

Still, Google fails to index the entire content. However they do index some of the site, closer to the top of the directory structure, probably around 10-20k pages. When they do, they send in an army of 30 or so bots, each one more brain dead than the last. Because of this I actually create a static copy of the 150k or more pages and give those to GoogleBot.

I think I am going to create a shallow (by product_id) hierarchy so Google will be happy. Probably products will be displayed at a URL like http://saleonall.com/cat/01/012345/product.html . This way the entire site is within two hops of an index page. The 00-99 pages could be rebuilt each time the database is turned over, so they stay static.
Posted by Don Baccus on
Brian ... have you forgotten that we've implemented a solution that presents a URL without a "?" that can be turned on for Planet?  The first cut was implemented by Elephant 7 but didn't work.  I fixed it and at one time it was working but since "google indexes sites with '?' variables" it was decided to leave it turned off.

You folks might do some testing on a development server to make sure it still works, and if it does you might turn it on for Planet to see if google then does a better job.

For the rest of you, the solution was very simple: it just uses a simple algorithm to mangle the URL with a "?" into a simple string; then, when the request processor processes a hit, the URL is unmangled back to the original URL.

No client code needs to be written.  No index.vuh files need to be written.  URLs just need to be filtered through the mangle proc rather than placed directly in templates.

But since google supposedly doesn't care about "?" this shouldn't be necessary ...
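To illustrate the idea (this is not the actual Planet code -- just a hypothetical scheme handling a single message_id variable):

```tcl
# Hypothetical mangle/unmangle pair: turn
# ".../message-view?message_id=100650" into
# ".../message-view/id-100650.html" and back.
proc url_mangle { url } {
    regsub {\?message_id=([0-9]+)$} $url {/id-\1.html} url
    return $url
}
proc url_unmangle { url } {
    regsub {/id-([0-9]+)\.html$} $url {?message_id=\1} url
    return $url
}
```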

Also, Brian ... you guys should try to analyze what google's finding and not finding on Planet.  It may be that now that you have a dynamic content system you're changing content more frequently, and that this causes Google to miss content because it expires before the site's reindexed.  Or that archive links are broken or something along those lines.

Posted by Andrew Piskorski on
Don, Brian, is that URL mangling/demangling code in the OpenACS toolkit? If not, would Greenpeace be willing to contribute it?
Posted by Don Baccus on
It's not in the toolkit but the entire Planet code base is available if you want it.  Most of the code's an absolute mess (Peter, Lars and I did a bunch of repair/clean-up last spring to make it functional but we left a lot of it in the awful state we inherited it in) which is why you don't find pointers to the source in flashing bright lights all over the internet. :)
Posted by Michael Bluett on
I have a couple of points to make on this, though I haven't actually run an OpenACS installation:
The extraction of emails by robots from an OpenACS installation should be a small problem, as only the maintainer has their email address available to view. Users' email addresses are hidden until visitors log in, which robots don't. I believe the Directory package is vulnerable to robots, though, as it features a list of users and their email addresses.

Google will extract the data from a website to a certain depth, dependent upon the PageRank (PR) of the website. My advice is to read the many articles on indexing somewhere like WebmasterWorld.

Many of the problems that Google has with dynamic sites are with things like session ids being stored in the URL (for example as "id=" with PHP). This makes it more reluctant to follow URLs that look like they have session variables in them (i.e. exactly "id=").

Greenpeace sounds like it needs better referencing for Google: pages designed primarily not for the user, but for Google to navigate to all content on the site.

Posted by Brian Fitzgearld on
Hey Don,

Alas, even the pages that don't contain a "?" in the URL fail to be googled, and that includes pages that have not changed since launch. See, for example, http://www.greenpeace.org/aboutus/

It contains the phrase "As one of the longest banners we've ever made" and has contained it since Greenpeace Planet's inception. Google is aware of this phrase only at an alternate site that quotes it.

As to what gets googled and what doesn't, our experience *sounds* closer to Tom's: some top-of-the-directory indexing, and then the brainless bots get tired and go away. I've seen articles indexed as part of the aggregating menus (News, Features, Press) but absent themselves from Google, so it may just be a question of how many hops down the tree. There's a fairly detailed bug report on this in the GP Planet Bugzilla about what's getting hit and what's not, if you're keen.

I'll ask Bruno and Alex to drop by this thread and call attention to the VUH idea and Michael's suggestions -- all look sound, but me, I only know enough to ask the question, not evaluate the answers. ;-)

Great to see the community response to this.  Thanks all.

--b

Posted by Dirk Gomez on
This is with respect to http://openacs.org/forums/message-view?message_id=94374 -- in short, providing a short (and stable) URL that redirects to the "proper" page.

How do search engines - google in particular - handle redirects? Dozens of redirects on one page?

Posted by Torben Brosten on
My understanding is that Google and other search engines don't index URLs that redirect. I think it's a way they minimize duplicate data in their indexes. Noncompliant HTML HEAD code is another critical factor that stops googlebot.

Tired bots? My experience with statically generated websites containing thousands of pages and 5 levels suggests that the bots don't get tired of specific, unique information.

The OpenACS static-pages package changes the page depending on the requesting client. As pointed out earlier in this thread, some search engines, such as Google, attribute this behavior to manipulation and may decline to index the page or site accordingly.

Is it possible to configure static-pages so that URLs are presented as static pages in all cases? Perhaps the quick and dirty method would be to tweak the registered robots list to none? If not possible, then a package needs to be written to address these issues.

The package should interpret static-looking pages and convert/present static-looking URLs for the domain served. Also, in a best-case scenario, there should be a way to store requested, cached dynamic pages as static files, so that the server can keep up with bot requests. Not delivering the pages at static-page speed may result in lost opportunities for indexing, given that there is limited time between index updates.

Posted by brijesh singh on
Can I ask how difficult it is to convert a .tcl link to an .html link? As search engines give preference to HTML links, will it work if we remove the .tcl extension (my site has .tcl extensions) and keep URLs without any extension? Will my site still get the same attention?
Posted by Robert Locke on
Hi Brian,

> See, for example, http://www.greenpeace.org/aboutus/
> Contains the phrase "As one of the longest banners we've
> ever made" and has contained it since Greenpeace Planet
> inception.  Google is aware of this phrase only at an
> alternate site that quotes it.

I did a search on Google as follows:
    "as one of the longest banners we've" site:www.greenpeace.org

and www.greenpeace.org/aboutus/ appeared as the only result.

I'm guessing you didn't see it because Google filters out redundant results, but Google is definitely aware of the page.  Click on the "repeat the search with the omitted results included" link to see all results.

I checked a few of the "deeper" pages in the Greenpeace site and then ran a search on Google for a phrase within each page, adding "site:www.greenpeace.org" (to limit the search results). And Google appeared to be aware of them (at least the ones I checked). Google also seemed to be aware of the various versions of the pages (e.g., /ships/ship-detail?ship... and /international_en/ships/ship-detail?ship...).

Perhaps one of the problems is your ranking within Google, which is a separate issue.  I know there are companies/software which can supposedly help in that department, but I don't know if they are reliable.

Posted by Don Baccus on
Ahhh ... so content is getting googled on Planet ... Brian, you guys need to dig deeper.  I think, as Robert's suggesting, that you may have a ranking issue but that's another can of worms altogether.
Posted by Michael Bluett on
There look to be "about 17,700 pages" from Greenpeace spidered on Google according to this search (This search finds 17,900 for some reason). I'm not sure how many pages you have and how this compares...

Some posters on WebmasterWorld have suggested that static looking URLs are better indexed by Google, a typical thread goes like this, with WebGuerrilla putting forward the "better as a static page" comment. Point 14 of the Deepbot and Freshbot FAQ covers the point I made above on what Google might find difficult with dynamic URLs ("id=").

On another point, in a past post I pointed out that Google prefers 301 redirects to ad_redirect's 302 redirects.
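Returning a 301 by hand from AOLserver is not much code. A sketch (permanent_redirect is a hypothetical helper name; ns_returnredirect itself always sends a 302):

```tcl
# Send "301 Moved Permanently" instead of AOLserver's default 302,
# so Google treats the new URL as canonical.
proc permanent_redirect { url } {
    ns_set update [ns_conn outputheaders] Location $url
    ns_return 301 text/html \
        "<html><body>The document has moved permanently
         <a href=\"$url\">here</a>.</body></html>"
}
```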

Posted by Dirk Gomez on
We definitely need to dig up the cause of that. It also seems that OpenACS forum postings are indexed much worse than ArsDigita bboard postings were back then. E.g., back then I would enter something like "Oracle ORA-1555" and the odds of an aD bboard thread popping up in the first ten results were decent -- well, it happened once in a while.

(These days, the old ArsDigita content is completely static and was nicely indexed by Google at some point.)

Exchanging 302 for 301 should be straightforward, right? What side effects could it have?

Posted by Dirk Gomez on
Here's an example that I recently tried on Google: "Hrvoje Niksic noquote" gives a result on ccm.redhat.com, and it should give a few results on openacs.org, but doesn't.
Posted by Dirk Gomez on
Ah, it probably has to do with forums' forum-view page. The crawler would have to dig *really* deep and long to find 2-year-old forum postings on openacs.org.

Maybe we should add an "All Messages" link to forum-view?

Posted by Jeff Davis on
I think the lack of google indexing on openacs.org has more to do with the robots.txt file:
User-agent: *
Disallow: /
I added this a few months ago, before the new memory was added, since every time the site was spidered it would fall over (related to the memory leak, I think).

We can change it back and see if spidering is ok but someone will need to keep an eye on things.

Posted by Richard Hamilton on
Quite a lot of info in this thread, so can I just clarify a couple of points please?

1) Are we saying that the robots detection module that Philip Greenspun wrote about in 'The Book' has no place in the ACS anymore, because cloaking pages is considered underhand and will result in your site being blacklisted?
2) Are we also concluding that some of the assumptions Philip made about content that search engines would not index are no longer correct (such as the impact of frames and other pretty content), and that search engines will now follow links with extensions such as .tcl, .cgi and others and do their best to index the content, thereby rendering the ACS robots module redundant?
3) Someone earlier suggested a link to all postings somewhere visually inconspicuous on the site. Is this a good idea or can anyone think of a better way to do it?

Regards
Richard
Posted by Dave Bauer on
Dirk

I would hope that the 2 year old threads would have been indexed when they were new.

Posted by Tilmann Singer on
Regarding the cloaking issue, I remember that a competitor's website once did that. They registered a bunch of bogus domains and created hundreds of sub-domains, which all redirected to the competitor's main page normally, but when called with a Google user agent (I actually tried it myself), they returned a list of links to all the other bogus websites instead, thus trying to fool the PageRank algorithm.

I mailed Google and they said they were working on techniques to automatically detect and ban such sites, and banned the offending one manually. That was a few years ago, and it is the source of my assumption that returning different content based on the googlebot user-agent header might be a bad idea. It might well be, though, that they have a way of distinguishing between sites that try to fool the PageRank mechanism and those that only return more search-engine-friendly content, although I can't imagine how that would work 100% reliably.

Anyway, speculation about Google's behaviour could be continued endlessly, I guess, but that is not my intention. OK, a last one: I think if we remove the restriction in robots.txt (and the site doesn't fall over when being indexed) then Google will index the full site, including all postings, after some time, and neither query variables nor the paginator on the forums index page will scare it away.

Posted by Michael Bluett on
Regarding Richard's points:
1) Yes.
2) Search engines are not born equal; some are worse at finding content on sites. Google will follow links that have extensions such as .tcl. It does, at least, worry about query strings where a value of "id" is set, as that is possibly a session id.
3) For a site such as greenpeace.org, and possibly openacs.org, it is probably worth having pages dedicated to pointing search engines at the content you want seen on the site (a site map for search engines).

Tilmann:
Google has a spam reporting page.

Posted by Andrew Piskorski on
Dirk, adding a "show all threads" link to the forum-view page sounds like a great idea to try. Has anyone volunteered to do it?
Posted by Bjorn Thor Jonsson on
Just stumbled on this:

What Google Leaves Out
http://www.microdocs-news.info/newsGoogle/2003/05/10.html

Regarding email mangling, here's a JavaScript snippet I've used:

// Assemble the address at page-load time so it never appears
// verbatim in the HTML source that harvesters scan.
var sUsername = "user";
var sDomain = "example.org";
var sSeparator = "@";
var concat = sUsername + sSeparator + sDomain;
document.write(concat.link("mailto:" + concat));

Maybe a JPEG is more useful, but I don't get spam :) (though I recently received a couple of scam messages at the address I use solely at openacs.org, so maybe it was harvested by a human)

Posted by Chris Davies on
really, what are you expecting?

Google looks at your entry page and gets a Location redirect to a host that doesn't exist. NOTE the Location: header that is sent to Google's bot. Also note that the href on the page is invalid as well.

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 302 Found
Set-Cookie: ad_session_id=38906532%2c0%20%7b296%201061178092%20AB09A976A688F9EB525E86919C2AC1E120F4861B%7d; Path=/; Max-Age=1200
Location: http://greenpeace-01.fra.de.colt-isc.net/homepage
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:21:32 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 348
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://greenpeace-01.fra.de.colt-isc.net/homepage">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>

mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns1.de.colt.net
greenpeace-01.fra.de.colt-isc.net A record currently not present at ns1.de.colt.net
mcd@mcdlp:~$ host greenpeace-01.fra.de.colt-isc.net ns0.de.colt.net
greenpeace-01.fra.de.colt-isc.net does not exist at ns0.de.colt.net (Authoritative answer)

I'm not surprised it doesn't follow properly.

So, you try HTTP/1.1

telnet greenpeace.org 80
Trying 213.61.48.245...
Connected to www.greenpeace.org.
Escape character is '^]'.
GET / HTTP/1.1
Host: greenpeace.org

HTTP/1.0 302 Found
Location: http://www.greenpeace.org/international_en/
Content-Type: text/html; charset=iso-8859-1
MIME-Version: 1.0
Date: Mon, 18 Aug 2003 03:24:53 GMT
Server: AOLserver/3.3.1+ad13
Content-Length: 342
Connection: close

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Redirection</TITLE>
</HEAD>
<BODY>
<H2>Redirection</H2>
<A HREF="http://www.greenpeace.org/international_en/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/3.3.1+ad13 on http://greenpeace-01.fra.de.colt-isc.net</I></SMALL></P>

</BODY></HTML>
Connection closed by foreign host.

And this time is presented with a valid redirection and a valid URL in the HREF.

Google's initial crawl uses HTTP/1.0 -- these are some hits pulled from my logs on another site.

Log.20030801:218.145.25.78 - - [01/Aug/2003:06:11:06 -0300] "GET /robots.txt HTTP/1.0" 404 8092 "-" "GoogleBot"
Log.20030801:218.145.25.78 - - [01/Aug/2003:06:36:47 -0300] "GET /robots.txt HTTP/1.0" 404 8084 "-" "GoogleBot"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /robots.txt HTTP/1.0" 404 8085 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:10:51:35 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /robots.txt HTTP/1.0" 404 8083 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:11:49:03 -0300] "GET /index.html HTTP/1.0" 200 12312 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /robots.txt HTTP/1.0" 404 8097 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
Log.20030801:crawl16.googlebot.com - - [01/Aug/2003:18:52:26 -0300] "GET /Tools/Unix/index.html HTTP/1.0" 200 11363 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Note that Google's requests are HTTP/1.0. It might be worthwhile making sure your site works for clients that are not HTTP/1.1 compliant.

I would bet this has much more to do with Google not spidering your site than the dynamic content does.

While ? and & are inherently evil giveaways of dynamic content, and Google seems to penalize slightly for them -- or flags the site as dynamic and sets a maximum number of pages to spider -- I have dynamic sites (not using AOLserver) that are 40,000+ pages and indexed in Google.

With that said, 302 redirects are evil for entry pages. I had a site that appeared to be banned specifically because I used some session tracking code that checked whether the cookie was set and, if not, munged the URL -- so when Google didn't respond with the cookie, a double redirect and a munged URL resulted. It took 14 months of convincing Google that I wasn't doing anything stealthy, and a big rewrite of code, to fix things.

So, check that HTTP/1.0 response you're handing out, fix it, and I'll bet you get back into Google after a few crawls.

Posted by Brian Fitzgearld on
Whoa. A belated THANKS, Chris.  Fantastic bit of forensics, much obliged.
Posted by Klyde Beattie on
_and if not then they assume that you are trying to cheat them._

They have actual people do this, and if they see that you are only trying to serve the page in a way that is more Google-friendly, they will most likely not blacklist you.