
Combat a type of duplicate content by adding a blank hash mark "#"

     

g1smd

12:23 am on Oct 20, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



We have discussed the 'trailing junk on the end of URLs' Duplicate Content issue several times in the past 4 or 5 years.

Someone places a link pointing to your site on a page of some forum or CMS. The site software parses URLs in the text and automatically turns them into clickable links.

As posted, the URL has punctuation immediately after its end. Some of those auto-linking systems will incorrectly include that punctuation within the link, as if it were part of the URL for content on your site.

You end up getting requests for

example.com.
and
example.com,
and
example.com/thispage!

and so on. Some of those requests can be fulfilled by your server, creating a Duplicate URL for the content. Others simply return a 404 error, and in those cases you lose both the visitor and the power of the inbound link.

Starting at least a year ago, and increasing rapidly in recent months, I see a fair number of sites that internally link using # as the very last character of the URL in the link.

Since that URL is then displayed, with the #, in the URL bar of the browser after it is clicked, any copy and paste action on the URL itself will still include the # mark.

If any trailing junk is included on the end of these links on other sites, that junk can no longer cause Duplicate Content issues, because search engines generally ignore everything after the # when determining the URL.
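The pattern itself is just an ordinary link with an empty fragment on the end. A minimal illustration (www.example.com is a placeholder):

<a href="http://www.example.com/thispage#">anchor text</a>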

I know that for some sites this is an unintended consequence of using AJAX features which use the # for their own purpose, but in other cases the implementation reason could possibly be for counteracting the 'trailing junk' problem.

Are there any other reasons why sites are doing this? Is it merely another unintended consequence? In any case, it's a neat method for (at least partially) fixing this type of Duplicate Content problem without having to install several 301 redirect rules. Of course, I'd still install the redirects in case the URLs are posted sans the # mark.
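For reference, a minimal .htaccess sketch of the kind of redirect rules being described, with www.example.com as a placeholder. Whether a trailing dot on the hostname is still visible to these rules can depend on the server configuration, so test before relying on it:

RewriteEngine On

# Any request where the Host header is not exactly the canonical
# hostname (which includes "www.example.com." with a trailing dot)
# gets a 301 redirect to the canonical hostname.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]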

tedster

6:19 pm on Oct 20, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Interesting observation, g1smd. I've never heard of anyone intentionally doing this, but I can see how it might help. You're probably right that the examples you see are usually side effects of AJAX.

g1smd

10:17 pm on Oct 21, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Thanks Tedster!

Could this be the shortest thread I've ever started?

Receptional Andy

10:24 pm on Oct 21, 2009 (gmt 0)



Google have been making moves to encourage the # as a part of the URL to be processed by the server, rather than the client. What you're seeing may be an unintended consequence of that, alongside AJAX, as you mention.

2clean

7:11 am on Oct 23, 2009 (gmt 0)

5+ Year Member



I'm guessing that you can strip out any junk via mod_rewrite and clean things up.
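A minimal sketch of that idea, assuming the junk characters are limited to trailing dots, commas, and exclamation marks (untested, and the pattern would need tuning if any legitimate URLs on the site end in those characters):

RewriteEngine On

# Strip trailing . , ! from the requested path and issue a 301
# redirect to the clean URL.
RewriteRule ^(.*[^.,!])[.,!]+$ /$1 [R=301,L]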

joelgreen

7:27 am on Oct 23, 2009 (gmt 0)

5+ Year Member



There was a proposal for making AJAX crawlable, but it looks like the post has been removed for some reason (maybe just a glitch on my end). It can still be seen via the cache:

cache:http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html

It looks like Google likes some identifiers after that "#".
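As the cached proposal describes it, a stateful AJAX URL written with "#!" is fetched by the crawler in an escaped form, roughly like this (www.example.com is a placeholder):

www.example.com/page#!mystate

is requested from the server as

www.example.com/page?_escaped_fragment_=mystate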

alexchia01

8:50 am on Oct 23, 2009 (gmt 0)

5+ Year Member



I discovered this trailing "#" at Google's own Knol site. It seems that every time you type in a Knol URL, Google automatically adds the "#" to the URL.

I was quite puzzled over this, but there doesn't seem to be any problem with the web pages being loaded.

If Google is using the trailing "#" themselves, I don't think their search engine would ignore the data after the "#". After all, AJAX is a legitimate technique.

tedster

1:27 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There was a proposal for making ajax crawlable...

Google blog [googlewebmastercentral.blogspot.com]
Our discussion [webmasterworld.com]

pageoneresults

1:38 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'd tend to lean more towards what Receptional Andy describes above. According to the protocol, the hash symbol and everything after it are supposed to be handled by the user-agent, and never sent to the server. Apparently Google have changed that. I guess they thought the protocol was limited and decided to start referencing those # points of entry. Hey, when you run the Internet, you can do what you want. ;)

jdMorgan

1:39 pm on Oct 23, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If anyone's contemplating intentionally implementing this technique, be aware that in the context of a click on a text link of the form <a href="url">, only Apple Safari actually sends the fragment identifier to the server. Since Google's Chrome shares part of the same code-base, it may do this as well, but I haven't tested Chrome.

However, IE, Mozilla, Opera, and other members of the 'major' browser families strip the fragment identifier and send only the URL+query string to the server.

And being an old server-side-only Luddite, I also haven't investigated how the fragment identifier is handled by AJAX, or whether the observed browser behavior changes if an exclamation point is appended to the "#" (as recently proposed by Google [webmasterworld.com] to denote AJAX state names).

Jim
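A quick way to check this for yourself, as a minimal PHP sketch (hypothetical filename test.php; request it as /test.php?a=1#fragment in each browser you care about):

<?php

// If the browser strips the fragment before sending the request,
// as described above, neither value below will contain "#fragment".
echo $_SERVER['REQUEST_URI']."<br />";
echo $_SERVER['QUERY_STRING'];

?>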

arieng

4:27 pm on Oct 23, 2009 (gmt 0)

5+ Year Member



Would adding a canonical tag be a more reliable way to do away with duplicate content from trailing junk?
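For reference, the canonical tag is a single line in the <head> of the page, pointing at the clean URL (www.example.com is a placeholder):

<link rel="canonical" href="http://www.example.com/thispage" />

Any "thispage!" or "thispage," duplicate the server happens to fulfil would then declare the clean URL as the original.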

mirrornl

9:57 pm on Oct 23, 2009 (gmt 0)

5+ Year Member



This does not apply to JavaScript links, I suppose?
Like
<a href="#" onclick="javascript: window.open...etc."> ?
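A hedged completion of that snippet (the window.open target is hypothetical): the "javascript:" label inside an onclick attribute is unnecessary, and returning false stops the browser from following the "#" href and appending it to the current URL.

<a href="#" onclick="window.open('http://www.example.com/popup'); return false;">open</a>

Since the href here is only "#", nothing after it is ever sent to the server, so a link like this can't create a duplicate URL by itself.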

tedster

12:11 am on Oct 24, 2009 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Would adding a canonical tag be a more reliable way...

Theoretically perhaps - but "reliable" sort of means "in practice", so it's hard to say. There haven't been any horror stories of the canonical tag going wrong, but I've also not read much about it providing a road out of a tough situation - a case where a site then improved its rankings and traffic.

moTi

2:49 am on Oct 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow, thanks for reminding us of the "dot example dot com dot" problem - never thought of that. I'm having trouble identifying this string in my script: a "split domain by '.'" in Perl leads nowhere, because I want the trailing dot itself, not the split result, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete URL. How do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? I've been looking for a solution for a day now.

bwakkie

12:03 pm on Oct 26, 2009 (gmt 0)

5+ Year Member



moTi: a rewrite rule should do it I guess

see: [webmasterworld.com...]

TheMadScientist

2:59 am on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



I'm having trouble identifying this string in my script: a "split domain by '.'" in Perl leads nowhere, because I want the trailing dot itself, not the split result, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete URL. How do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? I've been looking for a solution for a day now.

I don't write Perl, but in PHP:

<?php

// Show the Host header exactly as received.
echo $_SERVER['HTTP_HOST']."<br />";

// If the last character of the host is a dot, strip it off.
if (strlen($_SERVER['HTTP_HOST'])-1 === strrpos($_SERVER['HTTP_HOST'],".")) {
    echo $NoTrailingDot = preg_replace("/\.$/","",$_SERVER['HTTP_HOST']);
}

?>

OR

<?php

// Show the Host header exactly as received.
echo $_SERVER['HTTP_HOST']."<br />";

// If the host is anything other than the canonical hostname
// (which includes "www.example.com." with a trailing dot),
// fall back to the canonical hostname.
if ($_SERVER['HTTP_HOST'] != 'www.example.com' && $_SERVER['HTTP_HOST'] != '') {
    echo $NewHost = 'www.example.com';
}

?>

The first is more flexible.
The second might be more efficient.
Dunno for sure... Haven't tested.

moTi

2:40 am on Nov 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. But as I see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in Perl, and on my Apache 2) has the same result in both cases:

www.example.com -> HTTP_HOST = www.example.com
www.example.com. -> HTTP_HOST = www.example.com

It should be the same outcome in htaccess, but I haven't tried yet.
Again: how do you detect the dot (without htaccess)?

TheMadScientist

3:38 am on Nov 5, 2009 (gmt 0)

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Thanks. But as I see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in Perl, and on my Apache 2) has the same result in both cases:

The PHP example I posted was tested prior to posting, so either there's a difference with Perl or with your Apache version. To find where the difference is, try the code I posted on your server... When I tested it, the first 'echo' displayed the . (dot) on the end of the HOST, and the second didn't.

 
