Forum Moderators: open
I couldn't find anything about this, although I'm sure there is.
I have several inbound links from client sites all with tracking URLs so I know where they've come from (i.e. /?source=widgets.com).
Googlebot now spiders (and lists) my page with both the tracked and standard URL. So, if this page is listed twice in the SERPs, isn't it in danger of being seen as duplicate content, even though it's the same page?
Many Thanks
When I add a new page, Google will guess at the PR, and it always seems to be the same PR as the pages I have excluded.
I have got a similar situation with links to several template versions of each page (e.g. www.....com/page.php?template=blue etc.). To avoid getting duplicate content into the SERPs, I add the robots noindex meta tag server-side with PHP in each template (blue, red, grey and the printer-friendly one) whenever the 'template' variable appears in the URL.
I can see the SE robots are crawling every single URL - including the ones with the ?template=... parameter. And, as I wished, those pages do not occur in the search engines.
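A minimal sketch of that server-side noindex approach (the helper name is hypothetical, and the "index,follow" default for the untemplated URL is an assumption - the original post only says the tag is added when 'template' appears):

```php
<?php
// Sketch only: emit a robots noindex meta tag whenever the 'template'
// parameter appears in the query string, so the blue/red/grey/
// printer-friendly variants stay out of the index.
function robots_meta_for_request(array $get): string
{
    if (isset($get['template'])) {
        return '<meta name="robots" content="noindex,follow">';
    }
    // Default for the canonical (untemplated) URL - an assumption here.
    return '<meta name="robots" content="index,follow">';
}

// In the <head> of each template:
echo robots_meta_for_request($_GET);
```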
Googlebot now spiders (and lists) my page with both the tracked and standard URL. So, if this page is listed twice in the SERPs, isn't it in danger of being seen as duplicate content, even though it's the same page?
Have your pages already appeared twice in the SERPs? Google recognizes duplicate content and can handle it quite well. The method described above is mainly for other SEs (e.g. AlltheWeb).
NN
Sounds like another good method.. thanks for the insight.. Unfortunately one of my tracking URLs has already been indexed by G, so I'm just hoping it won't have any effect for now.. especially as my business is basically dependent on 1 site alone!
Anni - add "Disallow: /?source=widgets" to your robots.txt file
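For what it's worth, the record might look like this (note this assumes the crawler matches the Disallow value as a prefix of the path plus query string, which Googlebot does - the original robots.txt standard only talks about path matching, so it may not hold for every SE):

```
User-agent: *
Disallow: /?source=
```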
[pre]
<?php
/* Start a session only if the UA is *not* a search engine,
   to avoid duplicate content issues from SIDs propagating into URLs */
$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Atomz", "Scooter");
$is_search_engine = 0;
foreach ($searchengines as $val) {
    if (strstr($_SERVER['HTTP_USER_AGENT'], $val)) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) { // Not a search engine
    /* Anything that needs to be hidden from
       search engines goes in here */
    session_start();
} else { // Is a search engine
    /* Anything meant only for search engines goes in here */
    $foo = $bar;
}
?>
[/pre]
Nick
Like, for example, if you use Disallow: /help - then help.html would also be disallowed.
This is how I remember it anyway, but it may be worth checking first. Do a search for robots.txt to make sure.
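To illustrate that prefix behaviour, a record like this:

```
User-agent: *
Disallow: /help
```

would block /help, /help.html and /helpdesk/index.html alike, since the standard compares the Disallow value as a simple prefix of the URL.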
Technically, couldn't a search engine misperceive that as cloaking and automatically penalize for it?
I know it is far from the nefariousness of real cloaking, but technically it is cloaking, and I would think that if search engines have sniffer scripts to detect cloaking [i.e., do a character comparison of the same URL when called from different IPs or with different user agents], wouldn't that set off the cloaking flag?
A character-by-character comparison would yield a difference. Now, one would hope it would be smart enough to identify a session ID, but that isn't necessarily the case. I've seen e-commerce sites where product IDs are long alphanumerics and could resemble a session ID. Plus, if you write your own session wrapper code (because session IDs in PHP are known to have inherent limitations), you may cause further character differences.
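For illustration, normalising away a session ID before any such character comparison might look like this - a sketch assuming PHP's default session name; the function name is hypothetical, and a custom session wrapper would need its own parameter name here:

```php
<?php
// Strip the session ID from a URL so two crawls of the same page
// can be compared character by character.
function strip_session_id(string $url, string $name = 'PHPSESSID'): string
{
    // Remove '?PHPSESSID=...' or '&PHPSESSID=...' from the query string.
    $url = preg_replace('/([?&])' . $name . '=[^&]*&?/', '$1', $url);
    // Tidy up any trailing '?' or '&' left behind.
    return rtrim($url, '?&');
}
```

Usage: strip_session_id('/page.php?PHPSESSID=abc123&id=5') gives '/page.php?id=5'.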
This might sound like an extremist view, but in an era where dynamic DB-driven websites are becoming commonplace, search engines are going to have to take steps to prevent mirroring while also keeping an eye out for cloaking.
Personally, I don't see a way to do it successfully.
ie, (search engines) do a character comparison of the same URL when called from different IPs or with different user agents
I think doing this is quite useless, because many sites show random content like text ads, article teasers etc. Those sites deliver a different page on every page view. And what about sites that show the date and time? How will the software (and a search engine robot is nothing else) ever know whether a webmaster wants to cheat or is just showing the time?
One of my web projects has shown random content since '98 and it has never had problems with search engines.
Additionally, I think the robots are a little too busy to crawl every site twice. Instead of crawling 2,400,000,000 pages, Googlebot would have to crawl 4,800,000,000 pages.
And as long as I see loads of sites with hidden text and other spammy things in the SERPs, and as long as Google needs users to report spamming via a webform on google.com, I think of a search engine as a machine made of hardware and software that sees a website only as a checksum and a set of characters. ;)
NN