Forum OpenACS Q&A: Approach to Google-optimizing
1) Measure current Google rankings
- for 10 pages on the site, including 5 that will be changed and 5 that won't
- For each page, google search for the title, first 5 words, last 5 words, and 5 words in the middle, of the main text.
- How to measure Google ranking? Look through the search results and count the highest appearance of the site? Can this be automated through the Google API? What is acceptable use of Google?
2) Roll out a change
3) Repeat measurements 1 day, 5 days, 10 days, 30 days, 60 days after change
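The measurement step above could be sketched roughly like this. It is only a sketch: `probe_queries` and `best_rank` are made-up names, and it assumes you already have the ordered list of result URLs for a query from somewhere (by hand or through whatever access Google permits), since how to fetch them acceptably is still an open question.

```python
from urllib.parse import urlparse


def probe_queries(title, body, n=5):
    """Return the four search strings for one page:
    the title, then the first, middle and last n words of the main text."""
    words = body.split()
    mid = max((len(words) - n) // 2, 0)
    return [
        title,
        " ".join(words[:n]),
        " ".join(words[mid:mid + n]),
        " ".join(words[-n:]),
    ]


def best_rank(result_urls, site_host):
    """Return the 1-based position of the first result on site_host
    (or one of its subdomains), or None if the site does not appear."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).netloc.lower()
        if host == site_host or host.endswith("." + site_host):
            return position
    return None
```

Recording `best_rank` for each of the four queries on each of the 10 pages, at each of the measurement dates, would give a small table to compare before and after a change.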
Planned changes, one at a time:
1) Put article titles into page <title> tags
2) Move article titles into <h?> tags from <div>
3) Implement pretty URLs (foo.org/article/1) and construct index pages that link to articles directly (current index pages show 10 at a time)
4) Add meta tags to form edit/add modes so that they don't get indexed
If you are willing to get dirty, the "professionals" have some nasty tricks to increase Google rankings. One guy I know maintains a special front page for each of his client sites that is only served to Googlebot (based on IP address I presume). This special page, which is sometimes (but often not) visible through Google Cache, contains links to popular sites, a search form (this is apparently a plus), as well as links to all of his other clients.
This way he can exploit googlejuice between his clients without explicitly linking (they're often completely unrelated commercial sites).
And what did you optimize if PageRank goes up from 8/10 to 9/10?
A nice reference for this is http://selfpromotion.com/improve.t
You will screw yourself bigtime if you try to trick googlebot. Here are some links; read these first. I'm launching a campaign to redo a few sites myself.
First, is the Greenpeace page you are interested in ranking actually in Google? After that point, page rank depends on the keywords you choose. You shouldn't judge yourself, or allow Greenpeace to judge you, based on how a page ranks, but you can ask yourself: what are the keywords I want Google to index? Are those words used on the page as a main theme? How relevant are other pages on the net when you search for those keywords? Also, with keywords, what are users typing into Google? You really want Google to return Greenpeace on subjects where they believe they have some authority, but you cannot choose what users will type.
Bottom line is there are no tricks, only sound writing skills and webmastering. Maybe one exception: hopefully a search for "Greenpeace" will return their home page...
- there is a really great application for Mac OS X called Advanced Web Ranking. It allows you to monitor your search results by keyword on hundreds of search engines, month by month or week by week, and it displays graphs and reports -- very effective for showing your client what you're trying to do.
- there was a great thread on openacs.org that discussed how to score better on Google; it referred to this great link:
One of the most important things, it seems, is to have a lot of meaningful links, and to make the site useful to other people, so that other people will link to it.
After that, directory naming seems very important. I notice that my rubick.com pages are found iff I name the directories something meaningful.
would be much better than
because people would most likely search for openacs ad_form
My idea for bug tracker would be to only index pages when no state variables are set and for photo-album to make the medium noindex nofollow so that the large image is not indexed.
After reading all the notes and some of the linked items, I'm wondering:
- Is it worth it to try and monitor results? greenpeace.org gets many google hits every day, but not every page is hit every day, and some authors claim that pages go months without re-indexing. Maybe we should just make the obvious fixes and leave it alone, or check back in 6 months.
- Should I put any effort into better pretty urls - not just /article/145 but putting a keyword into the pretty url? We do the foundation work for this in some parts of openacs, where short_name is a locally unique string suitable for a url. This is nicer for users, certainly - how standard can we make this in OpenACS? Is it worth trying to retrofit this to old apps that just have ids, by creating a short-name field and populating it?
- Where else should we be setting noindex,nofollow? So far:
- in edit and add mode of form-builder
- in packages with duplicates. Are the duplicates a bigger problem than the possibility of not getting indexed at all if we block some pages from indexing and the "intended to be indexed" pages don't get hit? Maybe we're better off trusting the search engines' ability to hide duplicates.
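To make the noindex,nofollow rules so far concrete, here is a rough sketch of the decision. Everything named here is an assumption for illustration: the mode strings ("edit", "add", "display") and the state-variable names (`orderby`, `filter`, `page`) are placeholders, not what form-builder or bug tracker actually use.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mode names; the real form-builder modes may differ.
NON_INDEXED_MODES = {"edit", "add"}


def robots_content(url, form_mode="display",
                   state_vars=("orderby", "filter", "page")):
    """Return the robots directive for a page.

    Pages in edit/add mode, or with any state variable set in the
    query string, are kept out of the index; everything else is indexed.
    """
    query = parse_qs(urlparse(url).query)
    if form_mode in NON_INDEXED_MODES:
        return "noindex,nofollow"
    if any(name in query for name in state_vars):
        return "noindex,nofollow"
    return "index,follow"


def robots_meta_tag(url, form_mode="display"):
    """Render the directive as a meta tag for the page head."""
    return '<meta name="robots" content="%s">' % robots_content(url, form_mode)
```

With this, the plain bug tracker index page stays indexable while every sorted or filtered variant of it does not, which is the "only index pages when no state variables are set" idea from earlier in the thread.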
One simple way to get a more descriptive url is to convert the item title into the cr_items.name using util_text_to_url.
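For anyone who wants to see what that kind of conversion amounts to, here is a minimal sketch in Python. It is not the actual util_text_to_url code, just an illustration of the idea; the function name and the `max_length` parameter are assumptions, and the caller would still have to make the result locally unique.

```python
import re
import unicodedata


def text_to_url(title, max_length=60):
    """Turn an item title into a lowercase, hyphen-separated URL fragment."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    # Truncate, then avoid leaving a trailing hyphen behind.
    return slug[:max_length].rstrip("-")
```

So a title like "Approach to Google-optimizing" would become `approach-to-google-optimizing`.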
1. Reasons your site may not be included.
Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index.
So since they limit the amount of dynamic content they spider, you want to make what they do spider unique, to increase coverage (and to lower the burden on your own server).
Changing everything to have pretty urls would remedy the spidering scope issue (although it would still leave google pulling down an order of magnitude too many pages for things like bug tracker).
As discussed in When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics, a paper written by two Google researchers, authority sites often link to other authority sites. This paper describes an algorithm for ranking "expert" sites.
A paper entitled Authoritative Sources in a Hyperlinked Environment, written by Jon Kleinberg at Cornell, distinguishes between hubs and authority sites. Hubs have many outgoing links, ideally to related authority pages, and authority pages have many incoming links, ideally from related authority pages.
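The hub/authority distinction can be illustrated with a small sketch of the iterative scoring the Kleinberg paper is known for (usually called HITS). This is a toy version over an in-memory link graph, assuming a dict of page to outgoing links; it says nothing about what Google actually runs.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links maps each page to the list of pages it links to.
    Returns two dicts: (hub_scores, authority_scores).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalise both vectors so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

A page that only links out ends up with a hub score and no authority, and a page that is only linked to ends up the other way round.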
Improved Algorithms for Topic Distillation in a Hyperlinked Environment describes a query-based approach that ranks the interconnectivity of pages linked to and from the other top 1000 results for given query.
It is thought that Google has recently modified its algorithm to incorporate some or all of these techniques. In January I optimized a bank's site based on the algorithms discussed in these papers. The site launched in October, and I noticed substantial improvement in rankings after the Google February update.
It does not appear that modifying a page's outgoing links will have any immediate effect. It appears that site connectivity rankings are calculated once a month at the same time PageRank is calculated. In addition, the effects from tweaking the content of a page aren't as noticeable on a day to day basis. Until November, you could change the number of times you repeated a phrase in a page, and the next day you could notice a significant adjustment in the SERPs.
Also, in April 2003, Google acquired Applied Semantics. Recently it appears that it is more effective to use related keywords on a page/site than it is to optimize a page for a particular phrase. Patterns in Unstructured Data discusses the concept of latent semantic indexing. Use Google's Keyword Suggestion Tool to find a list of keywords Google identifies as related to your target phrase.
You can find links to all of the above papers, and ~40 others on my website: Search Engine Research Papers.
1) redirects -- if someone links to http://www.greenpeace.org/, and it redirects to something else, google's engine used to treat the 302 as a 404 and then spider the resulting content. Not a huge problem until you realize that you get no PR transferred to the domain from the cache of links pointing at the site.
2) keyworded URLs. Recently it seems that google is penalizing .php, .phtml, .shtml, .shtm as 'dynamic'. I've tested this numerous times with two clients and every time we check, the conclusion is the same. ? and & in the url are also dynamic triggers and one of my biggest pet peeves. Yes, if you can, put keywords in the directory path so that the pages have some chance at a higher relevance.
3) Http 1.0/googlebot requests without the Host Header. I don't know that google still does this, but they used to have a bot that would do checks without sending a host header. If I recall, greenpeace's website pointed surfers to a non-existent host when that happened.
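If you want to see what your own server does with such a request, something like the following would do. It is a sketch only: the function names are made up, and it simply sends a bare HTTP/1.0 GET with no Host header to an IP address you supply and returns the status line.

```python
import socket


def build_http10_request(path="/"):
    """Build a bare HTTP/1.0 GET request that carries no Host header."""
    return ("GET %s HTTP/1.0\r\n\r\n" % path).encode("ascii")


def status_line_without_host(ip_address, port=80, timeout=10):
    """Send the bare request to ip_address and return the status line."""
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(build_http10_request())
        response = b""
        # Read only until the first line of the response is complete.
        while b"\r\n" not in response:
            chunk = conn.recv(4096)
            if not chunk:
                break
            response += chunk
    return response.split(b"\r\n", 1)[0].decode("latin-1")
```

A redirect to a host name that doesn't resolve, or a default virtual host with unrelated content, would both show up here.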
cloaking. There are things that you can do that will help google, that are not specifically cloaking. Yes, they do have some bot that checks whether the page looks similar and contains similar elements, however, you can unfold menus, present navbars that allow google to spider more efficiently, etc.
content location. I've had a theory for many years that google seems to put more weight on the first 5120 bytes of a page. Thus, when you design a page that contains css, menus, headers and comments, etc, you are pushing the important page content 'lower' on the page to what google sees. This in turn affects the relevance to other sites.
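A quick way to check a page against that theory is to measure how many bytes precede the real content. This is a sketch under the assumption that you can name a piece of text (`content_marker`) that marks the start of the main content; the 5120 limit is just the figure from the theory above, not a documented Google value.

```python
def content_offset(html, content_marker, encoding="utf-8"):
    """Return the byte offset at which content_marker first appears,
    or None if the marker is not in the page."""
    position = html.encode(encoding).find(content_marker.encode(encoding))
    return None if position < 0 else position


def starts_within(html, content_marker, limit=5120):
    """True if the main content begins inside the first `limit` bytes."""
    offset = content_offset(html, content_marker)
    return offset is not None and offset < limit
```

Moving inline css and scripts into external files is the obvious fix when the offset comes out too high.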
keyword relevance. Google seems to take notice of particular phrases in the <a> container.
For instance, if you link to Nike as:
you bump the keyword relevance for Nike. However, a better keyword relevance might be:
<a href="http://nike.com/">Running Shoes</a>
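To audit which phrases your own pages put in their <a> containers, a small script using Python's standard HTML parser is enough. This is just a sketch; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchor_texts(html):
    """Return the list of (href, anchor text) pairs found in html."""
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Links that come back with empty text are the ones wrapped around images only, which give a crawler nothing to go on.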
A few other things I've learned along the way:
404s are the devil's bane. If you put content online, leave it there. Disk is cheap. :)
Just some random thoughts.
You read the first few paragraphs and then decide on whether it makes sense to continue through the rest, so you rate the first bytes higher.
You look at the URL and decide upon whether it is dodgy or trustworthy.
You look at the bold and big letters. Hence a search engine should rate h2 and h3 higher.
You don't care about meta tags, hence a good search engine will silently ignore them as well.
I wouldn't even be astonished if average response time per transferred byte were a metric. The slower the site, the worse it usually appears.
How much of the site appears to be original content, to what extent is it just a metasite - original content being a ton more interesting. e.g. the features section in Greenpeace links to a whole lot of different sites and gives the uninitiated bot the impression that the *major* navigation bar links to other sites. It knows that this is the major navigation bar because most sites that have links to the left use it for navigation.
Then: what do you want to be indexed? What are people looking for when they look for Greenpeace? greenpeace.org or some particular content...what would be ten search terms where Greenpeace should be ranked prominently. Which story or page seems to deserve a high ranking for any of these terms?
If we then look at the application - page - we might ponder why it doesn't get the rating it may deserve.
(All of this is assumption. Remember that google said 2 years ago that they have more than 100 heuristics per page. :))
My site is http://www.winnipegdatarecovery.com and I would like to optimize for "winnipeg data recovery" and "winnipeg file recovery"
My email address is email@example.com
First, I don't know how this question relates to OpenACS. Second, DO NOT CROSS-POST. It kills the slightest desire for answering. Third, did you at all read the thread you first posted to (i.e. the post I'm answering to)? It contains a lot of good tips and links that should be helpful to you. There's no silver bullet.
Your page title goes "WinnipegDataRecovery.com offers Winnipeg a Data Recovery Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents".
Wonder why it ranks #1 with "winnipegdatarecovery" but not with "winnipeg data recovery"? Finding the answer is left as an exercise ;)
"WinnipegDataRecovery.com offers Winnipeg a Data Recovery Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents".
"Winnipeg Data Recovery . com offers Solution for hard drives, Digital Camera's, CD's, ZIP and Floppy Disks, Password Recovery, File Repairs for Office Documents"
Do you think this may help?
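The difference between the two titles is easy to see with a naive word splitter. This is only an illustration: the tokenizer below is an assumption (lowercase, split on anything that is not a letter, digit or apostrophe) and certainly not how Google breaks up a title.

```python
import re


def tokens(text):
    """Split text into lowercase words; punctuation and spaces separate words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_phrase(title, phrase):
    """True if the words of phrase appear next to each other, in order, in title."""
    title_tokens = tokens(title)
    phrase_tokens = tokens(phrase)
    width = len(phrase_tokens)
    return any(
        title_tokens[i:i + width] == phrase_tokens
        for i in range(len(title_tokens) - width + 1)
    )
```

In the first title "WinnipegDataRecovery.com" is a single word, and the only other "Winnipeg" is followed by "a", so the exact phrase "winnipeg data recovery" never occurs; in the second title it is the first three words.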
And related to Google for this site, one thing immediately comes to mind: the use of images. Most of the links on this site that point to other major sections of the site are images, which I believe is totally unnecessary. The menus on the left and at the top could all be created using CSS. Images make it tough for Google to guess what the linked page is all about, and do not make adding and editing content easy!
I will be back with more as I get to study the site more! :)
The infrastructure package can be found at
The code that generates the sitemap for lars-blogger is in
For this to work, you will also need the directory "google-sitemaps" in the server root, which must be writable by the Web server.
The generated sitemaps can be downloaded at [ad_url]/google-sitemaps/index.xml. See
for an example.
If you want to retrieve a copy of the code, you can do so with Subversion and the URL http://www.clasohm.com/svn/clasohm.com/trunk/
"For this to work, you will also need the directory "google-sitemaps" in the server root, which must be writable by the Web server."
According to the Sitemaps docs, a sitemap can only refer to pages below it in the site hierarchy, so a map at http://example.com/google-sitemaps/index.xml will only be used for other pages below http://example.com/google-sitemaps/. So for this to actually help, it looks like the sitemap file needs to be placed directly in the site root directory.
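The location rule can be expressed as a small check. This is a sketch of my reading of the rule (same scheme and host, and the page path must start with the directory that holds the sitemap file); the function name is made up.

```python
import posixpath
from urllib.parse import urlparse


def sitemap_covers(sitemap_url, page_url):
    """True if page_url lies at or below the directory containing sitemap_url."""
    sitemap = urlparse(sitemap_url)
    page = urlparse(page_url)
    # A sitemap cannot speak for another scheme or host.
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return (page.path or "/").startswith(base_dir)
```

So a sitemap at http://example.com/google-sitemaps/index.xml does not cover http://example.com/blog/one-entry, while one at http://example.com/index.xml does.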