
Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Canonical URL Issues - including some new ones
tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 4:22 am on Aug 8, 2008 (gmt 0)

There's a potential canonical URL issue that we've not touched on often, if ever. It's the kind of thing that might cause indexing issues or split PageRank into different "piles" - and even, potentially, generate duplicate URL problems.

This canonical problem comes from adding a period to the end of a domain name - http://www.example.com. - and that can trigger a cascade of potential problems. If the trailing period is at the end of the domain name and the site's navigation uses relative urls, then the extra period gets carried forward, and forward, and forward, through succeeding links.

There's a new thread in our Apache Forum that touches on the issue, and it also shares a fix - [webmasterworld.com...] As moderator jdMorgan observes, even google.com. has this problem!

This kind of link can be generated innocently enough by forum software that automatically creates links for text strings that look like urls but are at the end of a sentence. And many servers will not have a problem resolving that url with an extra period.
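A minimal sketch of how an auto-linker might avoid creating such links. The URL pattern is deliberately simplified and `find_urls` is a hypothetical helper, not taken from any real forum package: it matches greedily, then strips sentence punctuation and stray quotes from the end of each match.

```python
import re

# Simplified pattern for illustration only; real auto-linkers handle many more cases.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")

# Characters that usually belong to the surrounding sentence, not to the URL.
TRAILING_JUNK = ".,;:!?\"')"

def find_urls(text):
    """Return URLs found in text with trailing punctuation stripped."""
    return [match.rstrip(TRAILING_JUNK) for match in URL_PATTERN.findall(text)]

print(find_urls("See http://www.example.com. It also lives at http://example.com/page.html, I think."))
# ['http://www.example.com', 'http://example.com/page.html']
```

Stripping a closing parenthesis this way would also damage URLs that legitimately end in one, so a production rule would need to be more careful.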

So, for the sake of a complete reference, I'd like to collect the potential canonical url issues all in one place.

Canonical URL Issues
  1. Different domain names serving the same content (302 redirects can make this kind of mess)
  2. Different hostnames within one domain, such as "with-www" and "no-www" versions
  3. With and without "index.html" for the domain root or a subdirectory root
  4. Different protocols - https and http
  5. Trailing period on the domain name
  6. Double forward slash in the filepath - http://example.com//page.html
  7. Swapping the order of query string parameters
  8. URL rewrite that allows typos for the "keyworded" virtual directory name
  9. Any forum software or CMS that generates alternate urls for the same content
  10. URLs that include session parameters, clickpath tracking, etc.
  11. Adding a port number to the domain name: example.com:443
  12. URLs with unneeded query strings or extra parameters in the query string
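Several of the items above (2, 3, 5, 6, 7, 10 and 11) can be illustrated with a small normalizer. This is only a sketch under assumed site policy: the host alias table, the index filenames and the session parameter names are all hypothetical, and it does not attempt items such as alternate domains, protocol choice or CMS-generated duplicates.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical site policy; every value here is an assumption for illustration.
HOST_ALIASES = {"example.com": "www.example.com"}   # no-www -> with-www
DEFAULT_PORTS = {"http": 80, "https": 443}
INDEX_FILES = ("index.html", "index.php")
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonicalize(url):
    """Return one preferred spelling for a handful of the URL variants listed above."""
    parts = urlsplit(url)
    scheme = parts.scheme  # urlsplit already lowercases the scheme

    # .hostname is lowercased and excludes the port; drop a trailing period.
    host = (parts.hostname or "").rstrip(".")
    host = HOST_ALIASES.get(host, host)

    # Keep a port only when it is not the default for the scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Collapse repeated slashes and strip a directory index filename.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    for index_file in INDEX_FILES:
        if path.endswith("/" + index_file):
            path = path[: -len(index_file)]
            break

    # Drop session parameters and sort the rest so parameter order is stable.
    pairs = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in SESSION_PARAMS]
    query = urlencode(sorted(pairs))

    return urlunsplit((scheme, netloc, path, query, ""))

print(canonicalize("HTTP://Example.com.:80//shop/index.html?b=2&a=1&sid=xyz"))
# http://www.example.com/shop/?a=1&b=2
```

A function like this could be used to decide whether a requested URL differs from its preferred form and therefore needs a redirect.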

 

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 6:49 am on Aug 8, 2008 (gmt 0)

Whew, that's twelve of them. Combine them in one big pile and you have "12 factorial" - that's 479,001,600 - possible URLs for the exact same content!

Have I missed any?
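As a quick check on the arithmetic: 12! is indeed 479,001,600, though a factorial counts orderings of twelve items. If each issue is instead modelled as an independent on/off switch (an assumption; several of them, such as query string variations, allow far more than two states), the count is 2^12. A minimal sketch of both figures:

```python
import math

ISSUE_COUNT = 12

orderings = math.factorial(ISSUE_COUNT)   # ways to order twelve distinct items
on_off_combinations = 2 ** ISSUE_COUNT    # each issue independently present or absent

print(orderings)            # 479001600
print(on_off_combinations)  # 4096
```

Either way, the number of URLs for a single page grows quickly once the issues combine.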

Robert Charlton

WebmasterWorld Administrator robert_charlton us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3718246 posted 7:29 am on Aug 8, 2008 (gmt 0)

I'd consider the IP address a special case to be noted.

[edited by: Robert_Charlton at 7:36 am (utc) on Aug. 8, 2008]

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 7:51 am on Aug 8, 2008 (gmt 0)

Ah yes - factorial 13!

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 10:46 pm on Aug 8, 2008 (gmt 0)

... and there's more

* Domain name versus IP address of server.

* Your main domain name versus a named subdomain/folder of your hosting company's domain name.

* Trailing slash versus no trailing slash on folder names.

* Infinite wildcard subdomains.

* Trailing quotes on requests that have been auto-linked from forum and blog posts, or from sites that have botched the HTML code in their link to you.

* Trailing question mark on end of URL, but no parameters present.

* Differing capitalisation within URLs (mostly affects IIS).

* 'Fake' parameters (that are not processed) on the end of a URL for a non-dynamic site, or that are ignored on a dynamic site.

* Extra parameters on sideways links, like the "nextoldest" and "nextnewest" links in forums such as vBulletin and phpBB.

* Differing drill-down paths within a website, or via internal search, where the content does not have a specific URL, but the URL is "built" using the path you took to get there.

* URLs where only part of the URL is needed and the rest is fluff. Affects blogs with SEF linking with keywords in the URL, but those are not used to pull a specific record from the database, hence yourdomain.com/blog/34567-blue-widgets.html and yourdomain.com/blog/34567-this-site-is-run-by-spammers-and-idiots-do-not-buy-this-junk.html will show the same content.

* URLs where there are extra parameters within that are used to "build" the navigational links on that page out to other related content.

* "Page one" problems. This is where a site has a sub-section with numbered pages, and where "page one" has a different URL depending on whether you get to it from a section index or from "page two".

* Moving pagination. This is where new content on &page=1 today, is moved to &page=2 tomorrow and new content appears on &page=1. The next day, content on &page=1 moves to &page=2 and content on &page=2 moves to &page=3 etc. The oldest content is forever appearing at a new URL each day, with a page number that is one greater than yesterday.


I use "daily" here as an example. It could be weekly, monthly, or random intervals, depending on how quickly new content is posted.
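For the "fluff" URL case in the list above, one possible approach is to look the record up by its ID and redirect any request whose keyword part does not match the stored one. The sketch below assumes a hypothetical `CANONICAL_SLUGS` table standing in for the database and a hypothetical `/blog/<id>-<slug>.html` URL layout modelled on the example URLs.

```python
import re

# Hypothetical lookup table standing in for a database: record ID -> canonical slug.
CANONICAL_SLUGS = {34567: "blue-widgets"}

BLOG_PATH = re.compile(r"^/blog/(\d+)-([^/]*)\.html$")

def redirect_target(path):
    """Return the canonical path to 301-redirect to, or None if no redirect applies."""
    match = BLOG_PATH.match(path)
    if not match:
        return None
    record_id = int(match.group(1))
    slug = CANONICAL_SLUGS.get(record_id)
    if slug is None:
        return None  # unknown record; a real site would presumably serve a 404 here
    canonical = f"/blog/{record_id}-{slug}.html"
    return None if path == canonical else canonical

print(redirect_target("/blog/34567-anything-at-all.html"))  # /blog/34567-blue-widgets.html
print(redirect_target("/blog/34567-blue-widgets.html"))     # None
```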

activeco

10+ Year Member



 
Msg#: 3718246 posted 12:28 am on Aug 9, 2008 (gmt 0)

This canonical problem comes from adding a period to the end of a domain name

It is hard to believe that this can cause any problems, since in the DNS a dot at the end of a domain designates a fully-qualified domain name.
So a relative link ./filename called from example.com./dir should translate into example.com./dir/filename without any problems.

Do you have any negative experience with this?
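A small sketch using Python's standard `urljoin` (the base URL with a trailing slash on the directory is an assumption) suggests the resolution does work as described. It also shows that the period stays in every resolved link, which is how it gets carried forward through relative navigation.

```python
from urllib.parse import urljoin

# Assumed base: the page was reached via a hostname with a trailing period.
base = "http://example.com./dir/"

print(urljoin(base, "./filename"))   # http://example.com./dir/filename
print(urljoin(base, "/other.html"))  # http://example.com./other.html
```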

Lorel

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3718246 posted 12:48 am on Aug 9, 2008 (gmt 0)

Golly sakes! The redirects for all these must be enormous.

Marcia

WebmasterWorld Senior Member marcia us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 1:11 am on Aug 9, 2008 (gmt 0)

>>Double forward slash in the filepath - http://example.com//page.html

Inbound links that look like this:

http://example.com//

CainIV

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3718246 posted 3:01 am on Aug 9, 2008 (gmt 0)

Nice find Ted. I always just redirected anything behind the trailing slash of the root url to the exact root url to prevent this or any trailing character problem.

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 10:26 pm on Aug 10, 2008 (gmt 0)

Here are some reference threads with more detail on some of these canonical url problems:

Duplicate Content [webmasterworld.com] - get it right or perish
Why "www" & "no-www" Are Different [webmasterworld.com] - the canonical duplicate issue
HTTPS versus HTTP [webmasterworld.com] - one more duplicate area
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Custom Error Pages [webmasterworld.com] - beware the server header status code
Vbulletin [webmasterworld.com] & Wordpress [webmasterworld.com] - duplicate content pitfalls

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 5:48 pm on Aug 12, 2008 (gmt 0)

Here's one more. The following URL is valid, but non-canonical:

http://www.WebmasterWorld.com/http://www.WebmasterWorld.com/

[added] The default server behaviour is to use the domain in the "local URL-path" part, as long as it's valid. [/added]

Jim

[edited by: jdMorgan at 5:51 pm (utc) on Aug. 12, 2008]

activeco

10+ Year Member



 
Msg#: 3718246 posted 7:06 pm on Aug 12, 2008 (gmt 0)

Jim, that's really strange. Neither the dot at the end nor that example should produce any problems.
Any working example?

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3718246 posted 7:26 pm on Aug 12, 2008 (gmt 0)

The problem is not whether FQDN URLs or any of these other variations function. The problem is that they create more than one URL for a unique resource. This thread only makes sense in that context, and this is the reason for the thread title.

Google and most of the other search engines use back-end processes to "de-duplicate" or "canonicalize" URLs, and often infer the "correct" single URL for a resource. But we have many cases posted here of webmasters who say that the "wrong URL" is showing up in search results. Or they post that their home page in "www" is PR4, while their non-www home page is PR3, indicating that PageRank has been 'split' across these two domains.

They're not understanding that preventative server-side measures are required.

Then there's the issue of exploits. A site could potentially respond on both HTTP and HTTPS, on www and non-www, with a trailing dot on the hostname, with a trailing port number on the hostname, and at "index.php" as well as "/". It might also use 'virtual subdirectories' for SEO-friendly keyword-in-URL URLs without enforcing a particular 'closed set' of values for that URL-path-part, and accept practically-infinite query string variations. Combined, the number of URLs that could be used to reach a particular resource can indeed grow to be practically infinite -- limited only by the server settings which limit the length of the HTTP request header. A malicious competitor could potentially dilute the PageRank of your important pages with a bit of "creative linking."

In cases where large numbers of URLs resolve to a single resource, there are several dangers:

  • The "wrong" URL listed in search results.
  • PageRank/Link-popularity split among various URLs, reducing the rank of 'the' page.
  • So-called "duplicate-content penalties" --really a filter, IMO-- applied to resources with "too many" URLs.
  • User confusion (e.g. broken on-page visited-link highlighting).

    So the whole point of this thread (and several others that Tedster cited above) is that one resource (e.g. one "page") should have one and only one URL by which it is accessible, and all other "valid" variants of that URL should be redirected to that single canonical URL.

    Jim
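To get a feel for how quickly the variants described in this post multiply, here is a rough sketch that enumerates only five of the two-way choices for a single home page. The hostnames, the explicit ports and the index filename are illustrative assumptions; query string variations are left out because they are effectively unbounded.

```python
from itertools import product

SCHEMES = ["http", "https"]
HOSTS = ["example.com", "www.example.com"]
DOTS = ["", "."]                                   # trailing dot on the hostname
PATHS = ["/", "/index.php"]
EXPLICIT_PORTS = {"http": ["", ":80"], "https": ["", ":443"]}

variants = [
    f"{scheme}://{host}{dot}{port}{path}"
    for scheme, host, dot, path in product(SCHEMES, HOSTS, DOTS, PATHS)
    for port in EXPLICIT_PORTS[scheme]
]

print(len(variants))  # 32
```

Five binary choices already give 32 distinct URLs for one page, before any path or query string variation is considered.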

    activeco

    10+ Year Member



     
    Msg#: 3718246 posted 8:38 am on Aug 13, 2008 (gmt 0)

    OK, I got the point :), but I am still not convinced by the two mentioned cases. I have never seen any technical or canonical problem either with the TLD's trailing dot or with the domain name in the directory or file structure.
    I have noted a few listed links where someone tried to abuse the default handling of the query string, but I couldn't see any effects in the SERPs.
    Something like [webmasterworld.com...]

    g1smd

    WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3718246 posted 8:03 pm on Aug 13, 2008 (gmt 0)

    I have fixed three "trailing dot" issues in the last year or so. I see it rarely, but the trend is upwards. All have been caused by auto-linking of URLs in forum or blog posts, where the designer of the forum or blog was sloppy in their URL-parsing rules when selecting what to auto-link.

    pageoneresults

    WebmasterWorld Senior Member pageoneresults us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3718246 posted 8:18 pm on Aug 13, 2008 (gmt 0)

    http://www.example.com.

    Hmmm, that can't be good. Off to do some digging on my own. I want to find out exactly what purpose that dot is serving in this instance. It appears to affect every site I've visited. You can append that dot to any URI and it will return a 200. At least the ones I've tested so far.

    How many times do you see sentences ending with URI references and no trailing forward slash?

    Something like this http://www.example.com. I see it all the time.

    I always add a trailing forward slash to my written references. Whew, covers me there. But, that whole period thing is a concern. I need to dig up the RFCs and read, read, read. :)

    activeco

    10+ Year Member



     
    Msg#: 3718246 posted 8:34 pm on Aug 13, 2008 (gmt 0)

    The trailing dot speeds up name resolution and it actually should be a good practice to link to it!
    I don't think that can confuse bots in any way, including duplicate issues.

    pageoneresults

    WebmasterWorld Senior Member pageoneresults us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3718246 posted 8:42 pm on Aug 13, 2008 (gmt 0)

    I don't think that can confuse bots in any way, including duplicate issues.

    I wouldn't be too certain about that just yet.

    I do notice that a different set of cookies is set when adding the dot. That can't be a good thing in this scenario, can it?

    activeco

    10+ Year Member



     
    Msg#: 3718246 posted 8:48 pm on Aug 13, 2008 (gmt 0)

    There was an exploit regarding trailing dot and cookies, which basically allowed reading all the cookies on the machine. But it was not the failure of the dot, but rather a browser. I think it is closed by now.

    P.S. It was a triple trailing dot.

    g1smd

    WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3718246 posted 8:52 pm on Aug 13, 2008 (gmt 0)

    *** I don't think that can confuse bots in any way, including duplicate issues. ***

    You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

    activeco

    10+ Year Member



     
    Msg#: 3718246 posted 9:16 pm on Aug 13, 2008 (gmt 0)

    You would be wrong, as I have seen Google list the same URL twice - both with and without the dot.

    Are you sure it was "tld./" and not "tld/."?

    [edited by: Robert_Charlton at 7:18 am (utc) on Aug. 14, 2008]

    jdMorgan

    WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 3718246 posted 10:03 pm on Aug 13, 2008 (gmt 0)

    That'd be nice to know from an academic standpoint. But the fact remains that unless we take steps on the server side to absolutely prevent multiple URLs from accessing content, we hand over the 'welfare' of our PageRank and link-popularity to an 'extra step' of back-end de-duplication processing by the search engines.

    That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.

    If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."

    Jim
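One way to picture the server-side measure recommended here is a thin wrapper that answers any request for a non-canonical hostname with a 301. The sketch below uses the standard WSGI calling convention; `CANONICAL_HOST`, `canonical_host_middleware` and `demo_app` are hypothetical names, and only the hostname (including a trailing dot or port) is handled, not the other variants discussed in this thread.

```python
CANONICAL_HOST = "www.example.com"  # assumed preferred hostname

def canonical_host_middleware(app):
    """Wrap a WSGI app so requests for any other Host get a 301 to the canonical host."""
    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", "")
        # Only redirect when a Host header is present and differs from the canonical one.
        if host and host.lower() != CANONICAL_HOST:
            scheme = environ.get("wsgi.url_scheme", "http")
            location = f"{scheme}://{CANONICAL_HOST}{environ.get('PATH_INFO', '/')}"
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return wrapper

def demo_app(environ, start_response):
    """Placeholder application used only to exercise the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

app = canonical_host_middleware(demo_app)
```

On Apache the same effect is usually achieved with rewrite rules rather than application code; the wrapper is shown only to make the logic explicit.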

    theBear

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 3718246 posted 12:24 am on Aug 22, 2008 (gmt 0)

    #1 Server Response: http://www.example.com./google/3718246-3-10.htm.
    HTTP Status Code: HTTP/1.1 200 OK
    Date: Fri, 22 Aug 2008 00:23:11 GMT

    Oops !

    [edited by: tedster at 1:15 am (utc) on Aug. 22, 2008]
    [edit reason] switch to example.com - it can never be owned [/edit]

    activeco

    10+ Year Member



     
    Msg#: 3718246 posted 1:12 pm on Aug 22, 2008 (gmt 0)

    #1 Server Response: http://www.example.com./google/3718246-3-10.htm.

    Probably rewriting to QS which allows non-existent strings. The dot does not make the difference.

    theBear

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 3718246 posted 12:12 am on Aug 24, 2008 (gmt 0)

    activeco,

    It really doesn't make a difference. The end result is multiple names for the same information.

    I can show the same thing for many sites. I just wanted to point out that even WebmasterWorld has issues in the area of canonical names.

    keto

    5+ Year Member



     
    Msg#: 3718246 posted 5:41 pm on Aug 27, 2008 (gmt 0)

    Added the following rewrite rule and it seems to be doing the trick.

    # Rewrite domains with a trailing period
    RewriteCond %{HTTP_HOST} ^(.*)\.$
    RewriteRule ^(.*)$ [%1...] [R=301,L]
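To show what the `RewriteCond` pattern above captures as `%1`, here is a minimal sketch applying the same regular expression in Python. The function name `strip_trailing_dot` is hypothetical, and the sketch does not reproduce the redirect target, which is obscured in the rule as posted.

```python
import re

TRAILING_DOT_HOST = re.compile(r"^(.*)\.$")  # same pattern as the RewriteCond above

def strip_trailing_dot(host):
    """Return the host without its trailing period, or None if there is none."""
    match = TRAILING_DOT_HOST.match(host)
    return match.group(1) if match else None

print(strip_trailing_dot("www.example.com."))  # www.example.com
print(strip_trailing_dot("www.example.com"))   # None
```

Note that a Host header carrying a port after the dot, such as `www.example.com.:80`, does not end in a period and so would not match this pattern.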


    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved