
Apache Web Server Forum

    
Bing SEO toolkit says I have canonical prob on homepage
Boulder90 (5+ Year Member)
Msg#: 4069463 posted 12:51 am on Jan 28, 2010 (gmt 0)

Here is the Bing violation report:

The page with URL "http://www.example.com/" can also be accessed by using URL "http://www.example.com/index.htm".
Search engines identify unique pages by using URLs. When a single page can be accessed by using any one of multiple URLs, a search engine assumes that there are multiple unique pages. Use a single URL to reference a page to prevent dilution of page relevance. You can prevent dilution by following a standard URL format.

I have tried to address this by adding this to the .htaccess:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/
RewriteRule ^(.*)index\.html$ http://www.example.com/$1 [R=301,L]

Yet Bing is still saying I have that canonical issue with index.html

Here is what my entire htaccess looks like:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html\ HTTP/
RewriteRule ^(.*)index\.html$ http://www.example.com/$1 [R=301,L]

thanks for any tips!

 

g1smd (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 1:23 am on Jan 28, 2010 (gmt 0)

With the order you have now, a non-www index request passes through a double redirect, a redirect chain. You need to avoid that.

The index redirect, being more specific, must be listed before the non-www to www canonical redirect. That will fix the issue.
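To illustrate (a sketch based on the .htaccess you posted): with the hostname rule first, a request for http://example.com/index.html takes two hops, while with the index rule first, the rule's target already carries the www hostname, so a single hop fixes both problems:

# Hostname rule first (as posted): two redirects
#   http://example.com/index.html --301--> http://www.example.com/index.html
#   http://www.example.com/index.html --301--> http://www.example.com/
# Index rule first: one redirect, since the target is already www
#   http://example.com/index.html --301--> http://www.example.com/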

Your index redirect can be coded a bit more efficiently. The .* part should be replaced with a better pattern. Luckily the code has been posted hundreds of times in this forum.

Is the report a 'live' report? I wouldn't think so. I would allow at least a week or more for the status to update.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 1:34 am on Jan 28, 2010 (gmt 0)

Also, you say the report mentions "index.htm", while your code is for "index.html".

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 1:39 am on Jan 28, 2010 (gmt 0)

g1smd -

Thank you for your response. What would you suggest as a better pattern in place of the ".*"? I have reversed the code:

RewriteEngine On

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.htm\ HTTP/
RewriteRule ^(.*)index\.htm$ http://www.example.com/$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Also, this Bing SEO scan is some new program that you install on your computer, and then it does a real-time scan of your site looking for "SEO" errors. I found out about this app here at WebmasterWorld the other day.

I am assuming that since I can make changes and then rescan my site live, any changes should show up with the next batch of results.

[edited by: Boulder90 at 1:41 am (utc) on Jan. 28, 2010]

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 1:40 am on Jan 28, 2010 (gmt 0)

jd - thanks for pointing that out. Wow.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 3:54 am on Jan 28, 2010 (gmt 0)

Making the pattern a bit more specific, fixing an escaping problem, and fortifying the domain canonicalization in light of recent changes in the security landscape, I'd suggest:

RewriteEngine On
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.htm$ http://www.example.com/$1 [NC,R=301,L]
#
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
RewriteCond %{HTTP_HOST} !^(www\.example\.com|192\.168\.0\.2)?$
RewriteRule ^ - [F]

Note that the third rule is not required (and serves no purpose) unless you have a dedicated IP address for your site (or are planning to get one). If not applicable, feel free to omit it. Similarly, the first RewriteCond of the second rule can be omitted if you don't have a dedicated IP address.

If you do have a dedicated IP address, then put it into the third rule's RewriteCond with the literal periods escaped as shown above.

I'm not sure why (specifically) you said "Wow," but that ".htm error-report versus .html pattern" issue is a good example of something we repeat fairly often around here: mod_rewrite is utterly unforgiving, and one little typo can cause a rule not to work or effectively knock your server offline. Worse, it can slowly and quietly eat away at your search rankings through some unexpected and hard-to-detect side effect. So intense concentration and attention to detail are critical.

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 4:53 am on Jan 28, 2010 (gmt 0)

Jim, thank you so much. That is greatly appreciated. These sorts of responses are pushing me towards signing up for the paid forum areas here.

I do have a dedicated IP for the site that I pay $2 a month for. I am curious about the IP rewrite. What does it do?

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 1:44 pm on Jan 28, 2010 (gmt 0)

As coded above, there is no IP rewrite. Rather, it is an exclusion that allows access by IP address.

The code will redirect to canonicalize any non-canonical request which *does* include the correct 'base' domain, and reject all other requests unless the request is by IP address or the HTTP Host request header is blank (as it will be for true HTTP/1.0 requests). Since name-based virtual hosts cannot be accessed by true HTTP/1.0 clients (because name-based hosting requires the Host header to work) or by IP address (because the address is shared among many sites), these provisions aren't needed if you don't have a dedicated IP. (Read that carefully; it's a rather dense statement.)

For name-based servers, with the lines I mentioned omitted, the code simply reverts to saying, "If the requested hostname isn't exactly www.example.com, then redirect to www.example.com".
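A quick sketch of how the dedicated-IP version above behaves for various Host headers (with 192.168.0.2 standing in for your dedicated address):

# Host: www.example.com        -> served normally
# Host: example.com            -> 301 to http://www.example.com/...
# Host: 192.168.0.2            -> served normally (the IP-address exclusion)
# blank Host (true HTTP/1.0)   -> served normally
# Host: anything else          -> 403 Forbidden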

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 4:39 am on Jan 30, 2010 (gmt 0)

Thanks Jim. So far so good.

I have a quick question for the readers: is this an appropriate robots.txt file?

User-agent: *
Disallow:

Disallow: /forums/index.php?action=help*
Disallow: /forums/index.php?action=search*
Disallow: /forums/index.php?action=login*
Disallow: /forums/index.php?action=register*
Disallow: /forums/index.php?action=admin*
Disallow: /forums/index.php?action=post*
Disallow: /forums/index.php?action=who*
disallow: /forums/index.php?action=printpage

I've had a big problem getting my pages into Google's non-supp (non-supplemental) results lately, even with really good content. Not sure what is going on.

g1smd (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 7:37 am on Jan 30, 2010 (gmt 0)

The trailing * is not needed.

You only need a wildcard if the wildcard is at the beginning or in the middle, e.g.

Disallow: /*action=post
Disallow: /forums/*action=who

Be aware that the pattern matches from the left, so this will match any action...

Disallow: /*action=
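For what it's worth, a cleaned-up sketch of your file might look like the following, as a single record. Note that in your original, the blank line after the empty "Disallow:" actually ends the User-agent: * record, leaving the later Disallow lines outside any record:

User-agent: *
Disallow: /forums/index.php?action=help
Disallow: /forums/index.php?action=search
Disallow: /forums/index.php?action=login
Disallow: /forums/index.php?action=register
Disallow: /forums/index.php?action=admin
Disallow: /forums/index.php?action=post
Disallow: /forums/index.php?action=who
Disallow: /forums/index.php?action=printpage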

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 10:26 am on Feb 1, 2010 (gmt 0)

Thank you g1smd.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 1:54 pm on Feb 1, 2010 (gmt 0)

> problem with getting my pages in Google's non-supp lately, even with really good content.

One of the often-overlooked factors in this is that each page needs a *unique* title and description, and both must be relevant to the page's contents.

There are hundreds of other on- and off-page ranking factors, of course, but this one seems to get overlooked fairly often.

Note that query string and wild-card robots.txt Disallows are not supported by all search engines. It would be a good idea to check all of the robots that are important to you -- visiting their "Webmaster info" pages, and verifying that they support these extensions to the Standard for Robot Exclusion. You may need to add explicit policy records in your robots.txt file for those that do not support these extensions.
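For instance, a sketch only (ExampleBot is a hypothetical name, not a real crawler): a robot that ignores wildcard patterns could be given its own record using plain path prefixes, though note that a bare prefix like this blocks every /forums/index.php URL for that robot, which may be broader than you want:

# Hypothetical record for a crawler that does not support wildcard
# or query-string patterns; plain path prefixes are the only portable form.
User-agent: ExampleBot
Disallow: /forums/index.php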

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 2:17 pm on Feb 1, 2010 (gmt 0)

Jim -

My commercial site is outdoors related, and it covers national parks and forests. I then break it down into camping, fishing, etc., so each page is not going to be exactly its own thing. For example, my page title would be something like "san juan national forest camping", with a meta description about that, and the next page would be "san juan national forest fishing" as the title, with a meta description about that. Another section of my site would cover, say, the Roosevelt National Forest, with page titles like "roosevelt national forest camping". They are unique pages with different content, and I add the national forest part because if I just put "camping" as the title, I would have tons of pages with the same title.

Google is killing me, though. It doesn't seem to be a problem for Bing, which has 80% of my images indexed and 80% of my pages in their non-supp. Google? 25 images indexed and 50 pages in the non-supp. Scary. I'm not copying anyone else's content like my competitors do; it's all super unique: my own writing, research, and site-specific images. Despite releasing over 70 new pages of hard-fought content in the past month, Google has added nothing from my site to its non-supps. It's frustrating.

Thx for the tip on the wild-card.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 2:54 pm on Feb 1, 2010 (gmt 0)

The more different you make the titles and descriptions, the better. These are very important in escaping what we used to call "supplemental hell." Use synonyms to break up the monotony of the titles -- camping/camp-outs/campsites/campers, etc. Change the word order.

Descriptions should be written in a manner that is as far from boiler-plate as possible. Change the words, change the word order, change the sentence order, change the 'tone' -- especially between pages with similar titles and content. "San Juan National Forest Camping" vs. "Camping facilities in San Juan National Forest" or "Campsites in San Juan National Forest," for example. Explore the "keyword spaces" that you have available to you, and change them up!

Your logs and stats can be quite useful here, as the search phrases of visitors landing on your site can and do vary -- and can inform your titling and description-writing decisions.

To be clear, if you're wondering whether I'm recommending that you manually compose unique titles and descriptions for each and every page, the answer is "Yes." If this sounds like too much, then pick a section of your site that you're having trouble with, and try it on a limited basis. If it works to pop those pages out of limbo, then you can decide whether it's worth doing on a wider scale. If not, then at least you'll know that some other factor is likely more important in keeping those pages from performing.

If you're only waiting a month to evaluate new pages' rankings, that's not long enough. Although G returns results almost instantaneously, that's not the case for indexing and ranking updates. The time required to get a page indexed and ranked will vary according to your current pages' ranking and the nature and effectiveness of your on-site linking strategy, but 30 days is only long enough if you're a top-ranked site.

70 pages of new content per month also raises a flag: Make sure these pages are not "thin" with fewer than six full paragraphs of information -- Six is not a magic number here, I'm just trying to delineate what I mean by "thin." If your pages are thin, then either fatten them up a bit (with more useful, unique info), or consider combining multiple smaller pages into fewer larger pages based on region, activity, and type of facility (park, forest, monument, trail, etc.).

Also, do be sure that for any given 'page' of content of your site, it can be reached with one and only one canonical URL, with all variations in protocol, domain, subdomain, URL-path, and query strings 301-redirected to that one unique URL. Otherwise, you have the classic "duplicate content" situation, and your multiple URLs will compete with each other for links, traffic, and ranking.
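As a sketch of one such redirect (the "ref" tracking parameter here is hypothetical, purely for illustration): the trailing "?" on the substitution drops the query string, folding those variations back into the one canonical URL:

# Strip a hypothetical "ref=" tracking query string; the trailing "?"
# in the target removes the query string from the redirected URL.
RewriteCond %{QUERY_STRING} ^ref=
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]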

You may read of "duplicate-content penalties" here and in other webmaster/SEO-related forums. Except in the most egregious cases of intentional content duplication, there is scant evidence for any actual penalties imposed by search engines, but the self-competition described in the previous paragraph can indeed be self-defeating.

For more information on supplementals, indexing/ranking cycles, and optimization of on- and off-page ranking factors, I'll commend to you the Google Search forum, its library, and the list of threads pinned at its top.

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 9:27 pm on Feb 3, 2010 (gmt 0)

Jim -

Thank you for the incredible answer. I greatly appreciate it! The suggestions on the title tags make a lot of sense. For some reason I thought that doing that would reduce coherency, organizationally speaking, for my readers/viewers, but I guess not. I was trying to design my site in a way that made sense to the user, but I guess I should be designing it for Google instead.

I'm working on a new 17-page section of the site right now and will implement the strategy you have outlined. I am also adding my company name at the end of the title (good or bad?) to increase click-through. I will also immediately apply your suggestion to a few of the floundering pages, but ones that have a good amount of text as well, so I am covering both bases.

The 70 pages I mentioned were really about two months' worth of work. I update my site in large "dumps" rather than every day. The forum stuff gets updated of course, but the static pages require research (and for me to visit various places around the U.S.). The pages do contain lots of text and images. I have combined pages in my new section, though, to see if that does the trick. Like you said, better one huge page that gets traffic than a bunch of lesser pages that don't. But then again, to be perfectly honest, it makes more sense for users to put pictures and text on separate pages. Not everyone wants to scroll down huge amounts to get content that could easily be displayed right at the top of the page. And again, it seems like I'm having to design for Google rather than the user, which is the exact opposite of what Google mandates.

As for the canonical issue, this should shore things up for the entire site, right?

RewriteEngine On
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.htm$ http://www.example.com/$1 [NC,R=301,L]
#
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
RewriteCond %{HTTP_HOST} !^(www\.example\.com|000\.000\.000\.000)?$
RewriteRule ^ - [F]


Thanks a million for that reply. Very helpful.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 12:17 am on Feb 4, 2010 (gmt 0)

Put the access-control code first -- There's no use wasting a perfectly-good 301 redirect on an unwelcome visitor... :)

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 2:29 pm on Feb 6, 2010 (gmt 0)

Jim, you mean like this?


RewriteEngine On

RewriteCond %{HTTP_HOST} !^(www\.example\.com|000\.000\.000\.000)?$
RewriteRule ^ - [F]
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.htm$ http://www.example.com/$1 [NC,R=301,L]
#
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]



Thanks.

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 2:42 pm on Feb 6, 2010 (gmt 0)

Yes, although you will need to allow for non-www:

RewriteCond %{HTTP_HOST} !^((www\.)?example\.com|000\.000\.000\.000)?$

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 2:54 pm on Feb 6, 2010 (gmt 0)

Thx Jim. It doesn't make sense to me unless I view it as the entire piece... does this look OK?

RewriteEngine On

RewriteCond %{HTTP_HOST} !^((www\.)?example\.com|000\.000\.000\.000)?$
RewriteCond %{HTTP_HOST} !^(www\.example\.com|000\.000\.000\.000)?$
RewriteRule ^ - [F]
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.htm$ http://www.example.com/$1 [NC,R=301,L]
#
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

jdMorgan (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 4069463 posted 3:42 pm on Feb 6, 2010 (gmt 0)

The second RewriteCond is redundant, as the first will accept the domain name with or without "www.". Delete the second RewriteCond.

You can slightly improve your third RewriteCond by also excluding the space character from the negated class, so the pattern cannot match past the end of the URL-path in THE_REQUEST. Change it from

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.htm\ HTTP/ [NC]

to

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*index\.htm\ HTTP/ [NC]

Jim

Boulder90 (5+ Year Member)
Msg#: 4069463 posted 6:37 pm on Feb 6, 2010 (gmt 0)

RewriteEngine On

RewriteCond %{HTTP_HOST} !^((www\.)?example\.com|000\.000\.000\.000)?$
RewriteRule ^ - [F]
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*index\.htm\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.htm$ http://www.example.com/$1 [NC,R=301,L]
#
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{HTTP_HOST} !=www.example.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]



Ok, I think I got it....
