Forum OpenACS Q&A: Re: Approach to Google-optimizing

Posted by Chris Davies on
There are a few other things that I think originally hurt Greenpeace:

1) Redirects -- if someone links to http://www.greenpeace.org/ and it redirects to something else, Google's engine used to treat the 302 as a 404 and then spider the resulting content.  Not a huge problem until you realize that you get no PR transferred to the domain from the mass of links pointing at the site.
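To make the risk concrete, here's a minimal sketch (my own, not from the post) of how a crawler might classify a server's status line -- the 302 case is the one that bit Greenpeace:

```python
def check_redirect(status_line):
    # A raw status line looks like: b"HTTP/1.1 302 Found"
    code = int(status_line.split()[1])
    if code == 301:
        return code, "permanent redirect: link equity should follow"
    if code in (302, 303, 307):
        return code, "temporary redirect: crawlers may not transfer PR"
    return code, "no redirect"
```

The practical takeaway: if the canonical home page must redirect, make it a 301.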

2) Keyworded URLs.  Recently it seems that Google is penalizing .php, .phtml, .shtml, and .shtm as 'dynamic'.  I've tested this numerous times with two clients, and every time we check, the conclusion is the same.  ? and & in the URL are also dynamic triggers, and one of my biggest pet peeves.  So yes, if you can, put keywords in the directory path so that the pages have some chance at higher relevance.
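As an illustration of the keywords-in-the-path advice, here's a hypothetical rewrite helper (the slug rules here are my assumption, not anything Google documents):

```python
import re

def keyword_path(base, *keywords):
    # Turn e.g. "Running Shoes" into "running-shoes" and join the
    # slugs into a static-looking path with no ? or & triggers.
    slugs = [re.sub(r"[^a-z0-9]+", "-", k.lower()).strip("-") for k in keywords]
    return base.rstrip("/") + "/" + "/".join(slugs) + "/"
```

So instead of serving /shop?cat=5&item=99, you could serve keyword_path("http://example.com", "Running Shoes", "Nike"), i.e. http://example.com/running-shoes/nike/.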

3) HTTP/1.0 Googlebot requests without the Host header.  I don't know whether Google still does this, but they used to have a bot that would do checks without sending a Host header.  If I recall, Greenpeace's web server pointed surfers to a non-existent host when that happened.
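For the curious, this is what the two kinds of request look like on the wire (a sketch; the host name is just illustrative). A name-based virtual host has no way to tell which site is wanted in the first case:

```python
def build_request(path, host=None):
    # HTTP/1.0 clients commonly sent no Host header at all;
    # HTTP/1.1 requires it.
    if host is None:
        return ("GET %s HTTP/1.0\r\n\r\n" % path).encode()
    return ("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n"
            % (path, host)).encode()
```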

Other notes:

Cloaking.  There are things you can do that will help Google without being cloaking in the strict sense.  Yes, they do have some bot that checks whether the page looks similar and contains similar elements; however, you can unfold menus, present navbars that allow Google to spider more efficiently, etc.

Content location.  I've had a theory for many years that Google puts more weight on the first 5120 bytes of a page.  Thus, when you design a page that leads with CSS, menus, headers, comments, etc., you are pushing the important page content 'lower' in what Google sees.  This in turn affects the page's relevance relative to other sites.
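If you buy the 5120-byte theory, a quick self-check is to measure how deep into the page your real content starts. A sketch (the 5120 figure is the author's guess, not anything documented, and using the first h1 as "the important content" is my simplification):

```python
def content_offset(html, marker="<h1"):
    # Byte offset of the first occurrence of the content marker.
    return html.encode("utf-8").find(marker.encode())

page = ("<html><head><style>/* lots of css */</style></head><body>"
        + "<!-- menus, comments -->" * 20
        + "<h1>The story</h1></body></html>")
```

content_offset(page) < 5120 then tells you whether the headline still falls inside the hypothesized window.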

Keyword relevance.  Google seems to take notice of the particular phrases inside the <a> element.

For instance, if you link to Nike as:

<a href="http://nike.com/">Nike</a>

you bump the keyword relevance for Nike.  However, better keyword relevance might come from:

<a href="http://nike.com/">Running Shoes</a>

A few other things I've learned along the way:
If at all possible, use no inline JavaScript or CSS -- Google will try to index it as content.  Use alt attributes that describe what is in the picture (rather than alt="picture1").
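A quick way to audit a page for placeholder alt text (the notion of "non-descriptive" here is my own heuristic, nothing official):

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    # Flags <img> tags whose alt text is missing or looks like a
    # placeholder such as "picture1" or "image3".
    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        alt = attrs.get("alt", "")
        if not alt or alt.lower().rstrip("0123456789") in ("picture", "image", "img"):
            self.flagged.append(attrs.get("src", "?"))

checker = AltChecker()
checker.feed('<img src="a.jpg" alt="picture1">'
             '<img src="b.jpg" alt="Rainbow Warrior at sea">')
```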

404s are the devil's bane.  If you put content online, leave it there.  Disk is cheap.  :)

Just some random thoughts.

Posted by Dirk Gomez on
A good search engine should try to behave like an experienced web surfer. How do YOU rate a page?

You read the first few paragraphs and then decide on whether it makes sense to continue through the rest, so you rate the first bytes higher.

You look at the URL and decide upon whether it is dodgy or trustworthy.

You look at the bold and big letters. Hence a search engine should rate h2 and h3 higher.

You don't care about meta tags, hence a good search engine will silently ignore them as well.

I wouldn't even be astonished if average response time per transferred byte were a metric. The slower the site, the worse it usually appears.

How much of the site appears to be original content, and to what extent is it just a metasite? Original content is a ton more interesting. E.g. the features section on Greenpeace links to a whole lot of different sites and gives the uninitiated bot the impression that the *major* navigation bar links to other sites. It assumes that this is the major navigation bar because most sites that have links on the left use that column for navigation.
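That original-content-vs-metasite heuristic can be caricatured in a few lines (entirely my construction, just to make the idea concrete):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCounter(HTMLParser):
    # Counts how many of a page's links stay on-site vs. point off-site.
    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        host = urlparse(dict(attrs).get("href", "")).netloc
        if host and host != self.site_host:
            self.external += 1
        else:
            self.internal += 1

c = LinkCounter("www.greenpeace.org")
c.feed('<a href="/features">features</a><a href="http://other.org/">story</a>')
```

A page whose most prominent link block is mostly external would look like a metasite to such a counter.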

Then: what do you want to be indexed? What are people looking for when they search for Greenpeace -- greenpeace.org or some particular content? What would be ten search terms for which Greenpeace should be ranked prominently? Which story or page seems to deserve a high ranking for any of these terms?

If we then look at a particular application page, we might ponder why it doesn't get the rating it may deserve.

(All this is assumption. Remember that Google said two years ago that they apply more than 100 heuristics per page. :))