This canonical problem comes from adding a period to the end of a domain name - http://www.example.com. - and that can trigger a cascade of potential problems. If the trailing period is at the end of the domain name and the site's navigation uses relative urls, then the extra period gets carried forward, and forward, and forward, through succeeding links.
There's a new thread in our Apache Forum that touches on the issue, and it also shares a fix - [webmasterworld.com...] As moderator jdMorgan observes, even google.com. has this problem!
This kind of link can be generated innocently enough by forum software that automatically creates links for text strings that look like urls but are at the end of a sentence. And many servers will not have a problem resolving that url with an extra period.
So, for the sake of a complete reference, I'd like to collect the potential canonical url issues all in one place.
Canonical URL Issues
- Different domain names serving the same content (302 redirects can make this kind of mess)
- Different hostnames within one domain, such as "with-www" and "no-www" versions
- With and without "index.html" for the domain root or a subdirectory root
- Different protocols - https and http
- Trailing period on the domain name
- Double forward slash in the filepath - http://example.com//page.html
- Swapping the order of query string parameters
- URL rewrite that allows typos for the "keyworded" virtual directory name
- Any forum software or CMS that generates alternate urls for the same content
- URLs that include session parameters, clickpath tracking, etc.
- Adding a port number to the domain name: example.com:443
- URLs with unneeded query strings or extra parameters in the query string
... and there's more
* Domain name versus IP address of server.
* Your main domain name versus a named subdomain/folder of your hosting company's domain name.
* Trailing slash versus no trailing slash on folder names.
* Infinite wildcard subdomains.
* Trailing quotes on requests that have been auto-linked from forum and blog posts, or from sites that have botched the HTML code in their link to you.
* Trailing question mark on end of URL, but no parameters present.
* Differing capitalisation within URLs (mostly affects IIS).
* 'Fake' parameters (that are not processed) on the end of a URL for a non-dynamic site, or that are ignored on a dynamic site.
* Extra parameters on sideways links, like the "nextoldest" and "nextnewest" links in forums such as vBulletin and phpBB.
* Differing drill-down paths within a website, or via internal search, where the content does not have a specific URL, but the URL is "built" using the path you took to get there.
* URLs where only part of the URL is needed and the rest is fluff. Affects blogs with SEF linking with keywords in the URL, but those are not used to pull a specific record from the database, hence yourdomain.com/blog/34567-blue-widgets.html and yourdomain.com/blog/34567-this-site-is-run-by-spammers-and-idiots-do-not-buy-this-junk.html will show the same content.
* URLs where there are extra parameters within that are used to "build" the navigational links on that page out to other related content.
* "Page one" problems. This is where a site has a sub-section with numbered pages, and where "page one" has a different URL depending on whether you get to it from a section index or from "page two".
* Moving pagination. This is where new content on &page=1 today, is moved to &page=2 tomorrow and new content appears on &page=1. The next day, content on &page=1 moves to &page=2 and content on &page=2 moves to &page=3 etc. The oldest content is forever appearing at a new URL each day, with a page number that is one greater than yesterday.
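Several of the variants listed above can be collapsed mechanically. Here is a minimal Python sketch of that idea, not anyone's production code. The canonical scheme and hostname, the session parameter names, and the index file names are all assumptions chosen for illustration; a real site would set its own policy. It does not touch capitalisation in the path, trailing slashes on folders, or pagination, which need site-specific knowledge.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical policy choices, not taken from the thread:
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # assumed parameter names
INDEX_FILES = ("index.html", "index.php")           # assumed index file names


def canonicalize(url: str) -> str:
    """Collapse common variants of a URL into one canonical form."""
    parts = urlsplit(url)

    # Hostname: .hostname is already lowercased and has the port removed;
    # strip any trailing dot(s), then fold no-www into the canonical host.
    host = (parts.hostname or "").rstrip(".")
    if host in ("example.com", "www.example.com"):
        host = CANONICAL_HOST

    # Path: collapse repeated slashes and drop a trailing index file name.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]

    # Query: drop session parameters and sort the rest, so parameter order
    # no longer matters. A bare trailing "?" yields an empty query.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(pairs))

    return urlunsplit((CANONICAL_SCHEME, host, path, query, ""))


print(canonicalize("http://www.example.com.//index.html?b=2&a=1"))
# http://www.example.com/?a=1&b=2
```

A function like this is useful for auditing logs or link lists to see which requested URLs differ from their canonical form; the actual fix still has to be a redirect issued by the server.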
This canonical problem comes from adding a period to the end of a domain name
It is hard to believe that this can cause any problems, since in the DNS a dot at the end of a domain name simply designates a fully-qualified domain name.
So, some relative link ./filename called from example.com./dir should translate into example.com./dir/filename without any problems.
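A quick way to check that claim is to resolve the relative link with Python's standard `urllib.parse.urljoin`, which follows the generic URI resolution rules. This is only a sketch; the base URL is written with a trailing slash on /dir/ so that it is treated as a directory, which is an assumption about the example.

```python
from urllib.parse import urljoin, urlsplit

# Page reached through the trailing-dot hostname (directory URL assumed).
base = "http://example.com./dir/"

resolved = urljoin(base, "./filename")
print(resolved)                     # http://example.com./dir/filename
print(urlsplit(resolved).hostname)  # example.com.
```

The link does resolve without error, but the trailing dot stays in the hostname, so every relative link followed from that page produces another URL on "example.com." rather than "example.com".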
Do you have any negative experience with this?
Duplicate Content [webmasterworld.com] - get it right or perish
Why "www" & "no-www" Are Different [webmasterworld.com] - the canonical duplicate issue
HTTPS versus HTTP [webmasterworld.com] - one more duplicate area
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Custom Error Pages [webmasterworld.com] - beware the server header status code
vBulletin [webmasterworld.com] & WordPress [webmasterworld.com] - duplicate content pitfalls
[added] The default server behaviour is to use the domain in the "local URL-path" part, as long as it's valid. [/added]
[edited by: jdMorgan at 5:51 pm (utc) on Aug. 12, 2008]
Google and most of the other search engines use back-end processes to "de-duplicate" or "canonicalize" URLs, and often infer the "correct" single URL for a resource. But we have many cases posted here of webmasters who say that the "wrong URL" is showing up in search results. Or they post that their home page in "www" is PR4, while their non-www home page is PR3, indicating that PageRank has been 'split' across these two domains.
They're not understanding that preventative server-side measures are required.
Then there's the issue of exploits. A site could potentially answer on both HTTP and HTTPS, on www and non-www, with a trailing dot on the hostname, with a trailing port number on the hostname, and on "index.php" vs. "/". It might also use 'virtual subdirectories' for SEO-friendly keyword-in-URL URLs without enforcing a particular 'closed set' of values for that URL-path-part, and accept practically-infinite query string variations. The number of URLs that could be used to reach a particular resource can then grow to be practically infinite -- limited only by the server settings which limit the length of the HTTP request header. A malicious competitor could potentially dilute the PageRank of your important pages with a bit of "creative linking."
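To get a feel for how quickly the variants multiply, here is a small Python sketch that enumerates just the hostname-level and index-file variants for a home page, before any query strings or virtual-directory typos are added. The specific lists are assumptions for illustration.

```python
from itertools import product

DEFAULT_PORT = {"http": "80", "https": "443"}


def home_page_variants():
    """Enumerate URL variants that could all serve the same home page."""
    schemes = ["http", "https"]
    hosts = ["example.com", "www.example.com"]
    dots = ["", "."]                 # without / with trailing dot
    paths = ["/", "/index.php"]      # assumed index file name
    urls = []
    for scheme, host, dot, path in product(schemes, hosts, dots, paths):
        # Without a port, and with the scheme's default port spelled out.
        for port in ("", ":" + DEFAULT_PORT[scheme]):
            urls.append(f"{scheme}://{host}{dot}{port}{path}")
    return urls


variants = home_page_variants()
print(len(variants))  # 32
```

Five two-way choices already give 32 URLs for one page; each additional ignored query parameter multiplies that again.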
In cases where large numbers of URLs resolve to a single resource, there are several dangers:
So the whole point of this thread (and several others that Tedster cited above) is that one resource (e.g. one "page") should have one and only one URL by which it is accessible, and all other "valid" variants of that URL should be redirected to that single canonical URL.
Hmmm, that can't be good. Off to do some digging on my own. I want to find out exactly what purpose that dot is serving in this instance. It appears to affect every site I've visited. You can append that dot to any URI and it will return a 200. At least the ones I've tested so far.
How many times do you see sentences ending with URI references and no trailing forward slash?
Something like this http://www.example.com. I see it all the time.
I always add a trailing forward slash to my written references. Whew, covers me there. But, that whole period thing is a concern. I need to dig up the RFCs and read, read, read. :)
I don't think that can confuse bots in any way, including duplicate issues.
I wouldn't be too certain about that just yet.
I do notice that a different set of cookies is set when adding the dot. That can't be a good thing in this scenario, can it?
P.S. It was a triple trailing dot.
That de-duplication process might have a bug -- now or later. Or perhaps it can't always be run for all URLs before a new index is deployed, and the URLs from your site might not get processed.
If you care about your ranking or if it's important to your revenue, I advise putting the preventative measures in place on your server so that any particular piece of content can be accessed by one and only one URL, and all other variants --whether caused by human or machine error, and regardless of technical validity-- be 301-redirected to the single canonical URL. In this way, you control your own destiny, rather than relying on the search engines to "figure it out."
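As a rough illustration of that advice at the hostname level, here is a Python WSGI middleware sketch. It is not the Apache fix referenced earlier in the thread; the function name and the canonical hostname are hypothetical, and it assumes the site runs on the default port for its scheme. Any request whose Host header is not exactly the canonical hostname (trailing dot, explicit port, different case, no-www, IP address) gets a 301 to the same path on the canonical host.

```python
from urllib.parse import quote

CANONICAL_HOST = "www.example.com"  # assumed canonical hostname


def canonical_host_middleware(app, canonical_host=CANONICAL_HOST):
    """Wrap a WSGI app so non-canonical Host headers are 301-redirected."""

    def wrapper(environ, start_response):
        if environ.get("HTTP_HOST", "") == canonical_host:
            return app(environ, start_response)

        # Rebuild the requested path and query on the canonical hostname.
        path = quote(
            environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        ) or "/"
        location = f"{environ['wsgi.url_scheme']}://{canonical_host}{path}"
        if environ.get("QUERY_STRING"):
            location += "?" + environ["QUERY_STRING"]

        start_response(
            "301 Moved Permanently",
            [("Location", location), ("Content-Length", "0")],
        )
        return [b""]

    return wrapper
```

Path-level variants (index files, double slashes, stray parameters) would need their own rules on top of this; the point is only that the server, not the search engine, decides which URL survives.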
[edited by: tedster at 1:15 am (utc) on Aug. 22, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]