
Combat a type of duplicate content by adding a blank hash mark "#"

     

g1smd

12:23 am on Oct 20, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



We have discussed the 'trailing junk on the end of URLs' Duplicate Content issue several times in the past 4 or 5 years.

Someone places a link pointing to your site on a page of some forum or CMS. The site software parses URLs in the text and automatically turns them into clickable links.

As posted, the URL has punctuation immediately after its end. Some of those auto-linking systems will incorrectly include that punctuation within the link, as if it were part of the URL for content on your site.

You end up getting requests for

example.com.
and
example.com,
and
example.com/thispage!

and so on. Some of those requests can be fulfilled by your server, creating a Duplicate URL for the content. Others simply return a 404 error, and in those cases you lose both the visitor and the power of the inbound link.

Starting at least a year ago, and increasing rapidly in recent months, I see a fair number of sites that internally link using # as the very last character of the URL in the link.

Since that URL is then displayed, with the #, in the URL bar of the browser after it is clicked, any copy and paste action on the URL itself will still include the # mark.

If any trailing junk is included on the end of these links on other sites, that junk can no longer cause Duplicate Content issues, because search engines generally ignore everything after the # when determining the URL.
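The pattern itself is just an ordinary link with an empty fragment on the end. A minimal illustration (www.example.com is a placeholder):

<a href="http://www.example.com/thispage#">anchor text</a>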

I know that for some sites this is an unintended consequence of using AJAX features which use the # for their own purpose, but in other cases the implementation reason could possibly be for counteracting the 'trailing junk' problem.

Are there any other reasons why sites are doing this? Is it merely another unintended consequence? In any case, it's a neat method for (at least partially) fixing this type of Duplicate Content problem without having to install several 301 redirect rules. Of course, I'd still install the redirects in case the URLs are posted sans the # mark.
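For reference, a minimal .htaccess sketch of the kind of redirect rules being described, with www.example.com as a placeholder. Whether a trailing dot on the hostname is still visible to these rules can depend on the server configuration, so test before relying on it:

RewriteEngine On

# Any request where the Host header is not exactly the canonical
# hostname (which includes "www.example.com." with a trailing dot)
# gets a 301 redirect to the canonical hostname.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]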

tedster

6:19 pm on Oct 20, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Interesting observation, g1smd. I've never heard of anyone intentionally doing this, but I can see how it might help. You're probably right that the examples you see are usually side effects of AJAX.

g1smd

10:17 pm on Oct 21, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Thanks Tedster!

Could this be the shortest thread I've ever started?

Receptional Andy

10:24 pm on Oct 21, 2009 (gmt 0)



Google have been making moves to encourage the # as a part of the URL to be processed by the server, rather than the client. What you're seeing may be an unintended consequence of that, alongside AJAX, as you mention.

2clean

7:11 am on Oct 23, 2009 (gmt 0)

5+ Year Member



I'm guessing that you can strip out any junk via mod_rewrite and clean things up.
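A minimal sketch of that idea, assuming the junk characters are limited to trailing dots, commas, and exclamation marks (untested, and the pattern would need tuning if any legitimate URLs on the site end in those characters):

RewriteEngine On

# Strip trailing . , ! from the requested path and issue a 301
# redirect to the clean URL.
RewriteRule ^(.*[^.,!])[.,!]+$ /$1 [R=301,L]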

joelgreen

7:27 am on Oct 23, 2009 (gmt 0)

5+ Year Member



There was a proposal for making AJAX crawlable, but it looks like the post has been removed for some reason (maybe just a glitch on my end). It can still be seen via the cache:

cache:http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html

It looks like Google likes some identifiers after that "#".
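As the cached proposal describes it, a stateful AJAX URL written with "#!" is fetched by the crawler in an escaped form, roughly like this (www.example.com is a placeholder):

www.example.com/page#!mystate

is requested from the server as

www.example.com/page?_escaped_fragment_=mystate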

alexchia01

8:50 am on Oct 23, 2009 (gmt 0)

5+ Year Member



I discovered this trailing "#" at Google's own Knol site. It seems that every time you type in a Knol URL, Google automatically adds the "#" to the URL.

I was quite puzzled over this, but there doesn't seem to be any problem with the web pages being loaded.

If Google is using the trailing "#" themselves, I don't think their search engine would ignore the data after the "#". After all, AJAX is a legitimate technique.

tedster

1:27 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There was a proposal for making ajax crawlable...

Google blog [googlewebmastercentral.blogspot.com]
Our discussion [webmasterworld.com]

pageoneresults

1:38 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'd tend to lean more towards what Receptional Andy describes above. According to the protocol, the hash symbol and everything after it are supposed to be handled by the user-agent, and never sent to the server. Apparently Google have changed that. I guess they thought the protocol was limited and decided to start referencing those # points of entry. Hey, when you run the Internet, you can do what you want. ;)

jdMorgan

1:39 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If anyone's contemplating intentionally implementing this technique, be aware that in the context of a click on a text link of the form <a href="url">, only Apple Safari actually sends the fragment identifier to the server. Since Google's Chrome shares part of the same code-base, it may do this as well, but I haven't tested Chrome.

However, IE, Mozilla, Opera, and other members of the 'major' browser families strip the fragment identifier and send only the URL+query string to the server.

And being an old server-side-only Luddite, I also haven't investigated how the fragment identifier is handled by AJAX, or whether the observed browser behavior changes if an exclamation point is appended to the "#" (as recently proposed by Google [webmasterworld.com] to denote AJAX state names).

Jim
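A quick way to check this for yourself, as a minimal PHP sketch (hypothetical filename test.php; request it as /test.php?a=1#fragment in each browser you care about):

<?php

// If the browser strips the fragment before sending the request,
// as described above, neither value below will contain "#fragment".
echo $_SERVER['REQUEST_URI']."<br />";
echo $_SERVER['QUERY_STRING'];

?>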

arieng

4:27 pm on Oct 23, 2009 (gmt 0)

5+ Year Member



Would adding a canonical tag be a more reliable way to do away with duplicate content from trailing junk?
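For reference, the canonical tag is a single line in the <head> of the page, pointing at the clean URL (www.example.com is a placeholder):

<link rel="canonical" href="http://www.example.com/thispage" />

Any "thispage!" or "thispage," duplicate the server happens to fulfil would then declare the clean URL as the original.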

mirrornl

9:57 pm on Oct 23, 2009 (gmt 0)

5+ Year Member



This does not apply to JavaScript links, I suppose?
Like
<a href="#" onclick="javascript: window.open...etc."> ?
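A hedged completion of that snippet (the window.open target is hypothetical): the "javascript:" label inside an onclick attribute is unnecessary, and returning false stops the browser from following the "#" href and appending it to the current URL.

<a href="#" onclick="window.open('http://www.example.com/popup'); return false;">open</a>

Since the href here is only "#", nothing after it is ever sent to the server, so a link like this can't create a duplicate URL by itself.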

tedster

12:11 am on Oct 24, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Would adding a canonical tag be a more reliable way...

Theoretically perhaps - but "reliable" sort of means "in practice", so it's hard to say. There haven't been any horror stories of the canonical tag going wrong, but I've also not read much about it providing a road out of a tough situation - a case where a site then improved its rankings and traffic.

moTi

2:49 am on Oct 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow, thanks for reminding us of the "dot example dot com dot" problem - never thought of that. I'm having trouble identifying this string in my script: a "split domain by '.'" in Perl leads nowhere, because I want the trailing dot itself, not the split result, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete URL. How do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? I've been looking for a solution for a day now.

bwakkie

12:03 pm on Oct 26, 2009 (gmt 0)

5+ Year Member



moTi: a rewrite rule should do it I guess

see: [webmasterworld.com...]

TheMadScientist

2:59 am on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



I'm having trouble identifying this string in my script: a "split domain by '.'" in Perl leads nowhere, because I want the trailing dot itself, not the split result, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete URL. How do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? I've been looking for a solution for a day now.

I don't write Perl, but in PHP:

<?php

// Show the Host header exactly as received.
echo $_SERVER['HTTP_HOST']."<br />";

// If the last character of the host is a dot, strip it off.
if (strlen($_SERVER['HTTP_HOST'])-1 === strrpos($_SERVER['HTTP_HOST'],".")) {
    echo $NoTrailingDot = preg_replace("/\.$/","",$_SERVER['HTTP_HOST']);
}

?>

OR

<?php

// Show the Host header exactly as received.
echo $_SERVER['HTTP_HOST']."<br />";

// If the host is anything other than the canonical hostname
// (which includes "www.example.com." with a trailing dot),
// fall back to the canonical hostname.
if ($_SERVER['HTTP_HOST'] != 'www.example.com' && $_SERVER['HTTP_HOST'] != '') {
    echo $NewHost = 'www.example.com';
}

?>

The first is more flexible.
The second might be more efficient.
Dunno for sure... Haven't tested.

moTi

2:40 am on Nov 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. But as I see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in Perl, and on my Apache 2) has the same result in both cases:

www.example.com -> HTTP_HOST = www.example.com
www.example.com. -> HTTP_HOST = www.example.com

It should be the same outcome in htaccess, but I haven't tried yet.
Again: how do you detect the dot (without htaccess)?

TheMadScientist

3:38 am on Nov 5, 2009 (gmt 0)

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Thanks. But as I see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in Perl, and on my Apache 2) has the same result in both cases:

The PHP example I posted was tested prior to posting, so either there's a difference with Perl or with your Apache version. To find where the difference is, try the code I posted on your server... When I tested it, the first 'echo' displayed the . (dot) on the end of the HOST, and the second didn't.

 
