Forum Moderators: Robert Charlton & goodroi
Someone places a link pointing to your site on a forum or CMS page. The site software parses URLs and automatically turns them into clickable links.
As posted, the URL has punctuation immediately after its end. Some of those auto-linking systems will incorrectly include the punctuation within the link, as if it were part of the URL for content on your site.
You end up getting requests for example.com. and example.com, and example.com/thispage! and so on. Some of those requests can be fulfilled by your server, creating a Duplicate URL for the content. Others simply return a 404 error, and in those cases you lose both the visitor and the power of the inbound link.
Starting at least a year ago, and increasing rapidly in recent months, I see a fair number of sites that internally link using # as the very last character of the URL in the link.
Since that URL is then displayed, with the #, in the URL bar of the browser after it is clicked, any copy and paste action on the URL itself will still include the # mark.
If there's any trailing junk appended to these links on other sites, that junk can no longer cause Duplicate Content issues, because search engines generally ignore everything after the # when determining URLs.
I know that for some sites this is an unintended consequence of using AJAX features which use the # for their own purposes, but in other cases it may have been implemented deliberately to counteract the 'trailing junk' problem.
Any other reasons why the sites are doing this? Is it merely another unintended consequence? In any case it is a neat method for (at least partially) fixing this type of Duplicate Content problem without having to install several 301 redirect rules. Of course, I'd still install the redirects in case the URLs are posted sans # mark.
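For anyone wanting the redirect-rule version: a minimal .htaccess sketch along those lines, assuming Apache with mod_rewrite enabled. The punctuation class and target are illustrative, not a drop-in rule:

```apache
RewriteEngine On
# 301-redirect any request whose path ends in stray punctuation
# (dot, comma, exclamation point, semicolon, colon) to the clean URL.
RewriteRule ^(.*?)[.,!;:]+$ /$1 [R=301,L]
```

Note that a rule this broad would also redirect any legitimate URL that really does end in punctuation, so in practice you'd want to scope it to the patterns you actually see in your logs.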
cache:http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html
It looks like Google likes some identifiers after that "#".
I was quite puzzled by this, but there doesn't seem to be any problem with the web pages being loaded.
If Google is using the trailing "#" themselves, I don't think their search engine would ignore the data after the "#". After all, AJAX is a legitimate technique.
There was a proposal for making ajax crawlable...
Google blog [googlewebmastercentral.blogspot.com]
Our discussion [webmasterworld.com]
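For reference, the scheme in that proposal maps a "#!" fragment onto a query parameter the crawler can actually request - roughly like this (URLs illustrative):

```
What the user sees and links to:
  http://www.example.com/page#!state=1

What the crawler fetches instead:
  http://www.example.com/page?_escaped_fragment_=state=1
```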
However, IE, Mozilla, Opera, and other members of the 'major' browser families strip the fragment identifier and send only the URL+query string to the server.
And being an old server-side-only Luddite, I also haven't investigated how the fragment identifier is handled by AJAX, or whether the observed browser behavior changes if an exclamation point is appended to the "#" (as recently proposed by Google [webmasterworld.com] to denote AJAX state names).
Jim
Would adding a canonical tag be a more reliable way...
Theoretically, perhaps - but "reliable" really means "in practice," so it's hard to say. There haven't been any horror stories of the canonical tag going wrong, but I've also not read much about it providing a way out of a tough situation, where a site improved its rankings and traffic.
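For completeness, the tag itself is a single line in the <head> of each duplicate or variant page, pointing at the preferred URL (example.com used as a placeholder):

```html
<link rel="canonical" href="http://www.example.com/thispage" />
```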
see: [webmasterworld.com...]
I have problems identifying this string in my script. A "split domain by '.'" in Perl leads nowhere, because I want the identifier (the dot itself), not the result, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete URL. How do you combat the trailing dot (also in other programming languages, and perhaps apart from htaccess)? I've found no solution in a day of searching.
I don't write Perl, but in PHP:
<?php
// Show the Host header as received.
echo $_SERVER['HTTP_HOST']."<br />";
// If the last character of the host is a dot, strip it and show the result.
if (strlen($_SERVER['HTTP_HOST']) - 1 === strrpos($_SERVER['HTTP_HOST'], ".")) {
    echo $NoTrailingDot = preg_replace("/\.$/", "", $_SERVER['HTTP_HOST']);
}
?>
OR
<?php
// Show the Host header as received.
echo $_SERVER['HTTP_HOST']."<br />";
// If the host is anything other than the canonical one, fall back to it.
if ($_SERVER['HTTP_HOST'] != 'www.example.com' && $_SERVER['HTTP_HOST'] != '') {
    echo $NewHost = 'www.example.com';
}
?>
The first is more flexible.
The second might be more efficient.
Dunno for sure... Haven't tested.
www.example.com -> HTTP_HOST = www.example.com
www.example.com. -> HTTP_HOST = www.example.com
It should be the same outcome in htaccess, but I haven't tried it yet.
Again: how do I detect the dot (without htaccess)?
Thanks, but as I see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in Perl on my Apache 2) has the same result in both cases, as shown above.
The PHP example I posted was tested prior to posting, so either there's a difference in Perl or in your Apache version. To see where the difference lies, try the code I posted on your server. When I tested it, the first 'echo' displayed the . (dot) on the end of the HOST, and the second didn't.
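Coming back to the "without htaccess" question: one PHP-side sketch is to clean the request path and 301 if anything changed. The helper name strip_trailing_punct is my own invention, and this is untested against the trailing-dot-in-host case, since as noted above the server may normalize HTTP_HOST before PHP ever sees it:

```php
<?php
// Remove stray trailing punctuation (.,!;:) from a path.
// Helper name is invented for this sketch.
function strip_trailing_punct($path) {
    return preg_replace('/[.,!;:]+$/', '', $path);
}

// Usage in a live page might look like this (illustrative, not a drop-in):
// $clean = strip_trailing_punct($_SERVER['REQUEST_URI']);
// if ($clean !== $_SERVER['REQUEST_URI']) {
//     header('Location: http://www.example.com' . $clean, true, 301);
//     exit;
// }
```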