
Google SEO News and Discussion Forum

    
Combat a type of duplicate content by adding a blank hash mark "#"
g1smd
msg:4009767 · 12:23 am on Oct 20, 2009 (gmt 0)

We have discussed the 'trailing junk on the end of URLs' Duplicate Content issue several times in the past 4 or 5 years.

Someone places a link pointing to your site on the page of some forum or CMS. The site software parses URLs and automatically turns them into clickable links.

As posted, the URL has punctuation immediately after its end. Some of those auto-link systems incorrectly include that punctuation within the link, as if it were supposed to be part of the URL for content on your site.

You end up getting requests for "example.com." and "example.com," and "example.com/thispage!" and so on. Some of those requests can be fulfilled by your server, creating a Duplicate URL for the content. Others simply return a 404 error, and in those cases you lose both the visitor and the power of the inbound link.

Starting at least a year ago, and increasing rapidly in recent months, I see a fair number of sites that internally link using # as the very last character of the URL in the link.

Since that URL is then displayed, with the #, in the URL bar of the browser after it is clicked, any copy and paste action on the URL itself will still include the # mark.

If any trailing junk is included on the end of these links on other sites, it can no longer cause Duplicate Content issues, because search engines generally ignore everything after the # when determining the URL.

I know that for some sites this is an unintended consequence of AJAX features that use the # for their own purposes, but in other cases the # may well have been added deliberately to counteract the 'trailing junk' problem.

Are there any other reasons why sites are doing this? Is it merely another unintended consequence? In any case, it is a neat method for (at least partially) fixing this type of Duplicate Content problem without having to install several 301 redirect rules. Of course, I'd still install the redirects in case the URLs are posted sans the # mark.
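For reference, the redirect rules I mean would look something like this (a minimal, untested .htaccess sketch; the [.,!;] character class is only a guess at which punctuation the auto-linkers typically swallow):

RewriteEngine On
# 301-redirect requests that end in stray punctuation back to the clean URL
RewriteRule ^(.*[^.,!;])[.,!;]+$ /$1 [R=301,L]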

 

tedster
msg:4010238 · 6:19 pm on Oct 20, 2009 (gmt 0)

Interesting observation, g1smd. I've never heard of anyone intentionally doing this, but I can see how it might help. You're probably right that the examples you see are usually side effects of AJAX.

g1smd
msg:4011101 · 10:17 pm on Oct 21, 2009 (gmt 0)

Thanks Tedster!

Could this be the shortest thread I've ever started?

Receptional Andy
msg:4011108 · 10:24 pm on Oct 21, 2009 (gmt 0)

Google have been making moves to encourage the # to be treated as part of the URL that is processed by the server, rather than the client. What you're seeing may be an unintended consequence of that, alongside AJAX, as you mention.

2clean
msg:4012064 · 7:11 am on Oct 23, 2009 (gmt 0)

I'm guessing that you can strip out any junk via mod_rewrite and clean things up.

joelgreen
msg:4012068 · 7:27 am on Oct 23, 2009 (gmt 0)

There was a proposal for making AJAX crawlable, but it looks like the post has been removed for some reason (maybe just a glitch on my end). It can still be seen via cache:

cache:http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html

It looks like Google likes some identifiers after that "#".
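As I read the cached post, the idea is roughly this (my own PHP sketch of the scheme, not code from the proposal; render_snapshot is a made-up helper name):

<?php

// The proposal: a stateful AJAX URL like /page#!news would be
// requested by the crawler as /page?_escaped_fragment_=news,
// so the server can return a static HTML snapshot of that state.
if (isset($_GET['_escaped_fragment_'])) {
    $state = $_GET['_escaped_fragment_'];
    // render_snapshot() stands in for whatever builds the full
    // HTML for that AJAX state on your own site
    echo render_snapshot($state);
    exit;
}

?>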

alexchia01
msg:4012088 · 8:50 am on Oct 23, 2009 (gmt 0)

I discovered this trailing "#" at Google's own Knol site. It seems that every time you type in a Knol URL, Google automatically adds the "#" to the URL.

I was quite puzzled by this, but there doesn't seem to be any problem with the web pages being loaded.

If Google is using the trailing "#" themselves, I don't think their search engine would ignore the data after the "#". After all, AJAX is a legitimate technology.

tedster
msg:4012163 · 1:27 pm on Oct 23, 2009 (gmt 0)

There was a proposal for making AJAX crawlable...

Google blog [googlewebmastercentral.blogspot.com]
Our discussion [webmasterworld.com]

pageoneresults
msg:4012168 · 1:38 pm on Oct 23, 2009 (gmt 0)

I'd tend to lean more towards what Receptional Andy describes above. According to the protocol, the hash symbol and everything after it are supposed to be dereferenced by the user-agent alone, never sent to the server. Apparently Google have changed that. I guess they thought the protocol was limited and decided to start referencing those # points of entry. Hey, when you run the Internet, you can do what you want. ;)

jdMorgan
msg:4012172 · 1:39 pm on Oct 23, 2009 (gmt 0)

If anyone's contemplating intentionally implementing this technique, be aware that in the context of a click on a text link of the form <a href="url">, only Apple Safari actually sends the fragment identifier to the server. Since Google's Chrome shares part of the same code-base, it may do this as well, but I haven't tested Chrome.

However, IE, Mozilla, Opera, and other members of the 'major' browser families strip the fragment identifier and send only the URL+query string to the server.
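To illustrate: for a click on <a href="http://www.example.com/page#top">, those browsers put only this on the wire:

GET /page HTTP/1.1
Host: www.example.com

The "#top" part never appears in the request, so it never shows up in your server logs either.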

And being an old server-side-only Luddite, I also haven't investigated how the fragment identifier is handled by AJAX, or whether the observed browser behavior changes if an exclamation point is appended to the "#" (as recently proposed by Google [webmasterworld.com] to denote AJAX state names).

Jim

arieng
msg:4012261 · 4:27 pm on Oct 23, 2009 (gmt 0)

Would adding a canonical tag be a more reliable way to do away with duplicate content from trailing junk?
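That is, something like this in the <head> of each page (with www.example.com/thispage standing in for whatever the preferred URL is):

<link rel="canonical" href="http://www.example.com/thispage" />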

mirrornl
msg:4012433 · 9:57 pm on Oct 23, 2009 (gmt 0)

This doesn't apply to JavaScript links, I suppose?
Like
<a href="#" onclick="window.open(...)"> ?

tedster
msg:4012481 · 12:11 am on Oct 24, 2009 (gmt 0)

Would adding a canonical tag be a more reliable way...

Theoretically, perhaps - but "reliable" sort of means "in practice", so it's hard to say. There haven't been any horror stories of the canonical tag going wrong, but I've also not read much about it providing a road out of a tough situation, one where the site then improved its rankings and traffic.

moTi
msg:4012532 · 2:49 am on Oct 24, 2009 (gmt 0)

wow, thanks for reminding us of the 'dot example dot com dot' problem. never thought of that. i have trouble identifying this string in my script: a 'split domain by .' in perl leads nowhere, because i want to detect the trailing dot itself, not the result of the split, which in this case would be "" (empty, as if there were no dot at the end) for the part after the complete url. how do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? i've been looking for a solution for a day now..

bwakkie
msg:4013453 · 12:03 pm on Oct 26, 2009 (gmt 0)

moTi: a rewrite rule should do it I guess

see: [webmasterworld.com...]
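Something along these lines, perhaps (an untested sketch; %1 is the Host header captured without its trailing dot):

RewriteEngine On
# If the Host header arrives with a trailing dot, 301 to the dotless host
RewriteCond %{HTTP_HOST} ^(.+)\.$
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]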

TheMadScientist
msg:4017918 · 2:59 am on Nov 3, 2009 (gmt 0)

i have trouble identifying this string in my script: a 'split domain by .' in perl leads nowhere, because i want to detect the trailing dot itself, not the result of the split. how do you combat the trailing dot (in other programming languages too, and maybe apart from htaccess)? i've been looking for a solution for a day now..

I don't write Perl, but in PHP:

<?php

// Show the raw Host header first
echo $_SERVER['HTTP_HOST']."<br />";

// If the last character of the Host header is a dot,
// strip it and echo the cleaned host
if (strlen($_SERVER['HTTP_HOST']) - 1 === strrpos($_SERVER['HTTP_HOST'], ".")) {
    echo $NoTrailingDot = preg_replace("/\.$/", "", $_SERVER['HTTP_HOST']);
}

?>

OR

<?php

// Show the raw Host header first
echo $_SERVER['HTTP_HOST']."<br />";

// If the Host header is anything other than the canonical host
// (a trailing-dot variant included), fall back to the canonical host
if ($_SERVER['HTTP_HOST'] != 'www.example.com' && $_SERVER['HTTP_HOST'] != '') {
    echo $NewHost = 'www.example.com';
}

?>

The first is more flexible.
The second might be more efficient.
Dunno for sure... Haven't tested.

moTi
msg:4019405 · 2:40 am on Nov 5, 2009 (gmt 0)

thanks. but as i see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in perl and on my apache 2) has the same result in both cases:

www.example.com -> HTTP_HOST = www.example.com
www.example.com. -> HTTP_HOST = www.example.com

should be the same outcome in htaccess, but i haven't tried yet.
again: how to detect the dot (without htaccess)?

TheMadScientist
msg:4019418 · 3:38 am on Nov 5, 2009 (gmt 0)

thanks. but as i see it, the problem is that $_SERVER['HTTP_HOST'] (or at least $ENV{'HTTP_HOST'} in perl and on my apache 2) has the same result in both cases:

The PHP example I posted was tested prior to posting, so either there's a difference in Perl or in your Apache version. To see where the difference is, try the code I posted on your server... When I tested it, the first 'echo' displayed the . (dot) on the end of the HOST, and the second didn't.
