Forum Moderators: open
I couldn't find anything about this, although I'm sure there is.
I have several inbound links from client sites all with tracking URLs so I know where they've come from (i.e. /?source=widgets.com).
Googlebot now spiders (and lists) my page with both the tracked and standard URL. So, if this page is listed twice in the SERPs, isn't it in danger of being seen as duplicate content, even though it's the same page?
Many Thanks
When I add a new page, Google will guess at the PR, and it always seems to be the same PR as the pages I have excluded.
I have got a similar situation with links to several template versions of each page (e.g. www.....com/page.php?template=blue etc.). To avoid getting duplicate content into the SERPs, I add the robots noindex meta tag server-side with PHP in each template (blue, red, grey and the printer-friendly one) whenever the 'template' variable appears in the URL.
I can see the SE robots are crawling every single URL - including the ones with the ?template=... parameter. And, as I wished, those pages do not occur in the search engines.
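A minimal sketch of that server-side noindex approach (the helper name is hypothetical, and the "index,follow" default for the untemplated URL is an assumption - the original post only says the tag is added when 'template' appears):

```php
<?php
// Sketch only: emit a robots noindex meta tag whenever the 'template'
// parameter appears in the query string, so the blue/red/grey/
// printer-friendly variants stay out of the index.
function robots_meta_for_request(array $get): string
{
    if (isset($get['template'])) {
        return '<meta name="robots" content="noindex,follow">';
    }
    // Default for the canonical (untemplated) URL - an assumption here.
    return '<meta name="robots" content="index,follow">';
}

// In the <head> of each template:
echo robots_meta_for_request($_GET);
```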
Googlebot now spiders (and lists) my page with both the tracked and standard URL. So, if this page is listed twice in the SERPs, isn't it in danger of being seen as duplicate content, even though it's the same page?
Have your pages already appeared twice in the SERPs? Google recognizes duplicate content and can handle it quite well. The method described above is mainly for other SEs (e.g. AlltheWeb).
NN
Sounds like another good method.. thanks for the insight.. Unfortunately one of my tracking URLs has already been indexed by G, so I'm just hoping it won't have any effect for now.. especially as my business is basically dependent on 1 site alone!
Anni - add "Disallow: /?source=widgets" to your robots.txt file
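For what it's worth, the record might look like this (note this assumes the crawler matches the Disallow value as a prefix of the path plus query string, which Googlebot does - the original robots.txt standard only talks about path matching, so it may not hold for every SE):

```
User-agent: *
Disallow: /?source=
```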
[pre]
<?php
/* Start a session only if the UA is *not* a search engine,
   to avoid duplicate content issues from SIDs propagating into URLs */
$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Atomz", "Scooter");
$is_search_engine = 0;
foreach ($searchengines as $val) {
    if (strstr($_SERVER['HTTP_USER_AGENT'], $val)) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) { // Not a search engine
    /* Anything that needs to be hidden from
       search engines goes in here */
    session_start();
} else { // Is a search engine
    /* Anything meant only for search engines goes in here */
    $foo = $bar;
}
?>
[/pre]
Nick
Like, for example, if you use Disallow: /help - then help.html would also be disallowed.
This is how I remember it anyway, but it may be worth checking first. Do a search for robots.txt to make sure.
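To illustrate that prefix behaviour, a record like this:

```
User-agent: *
Disallow: /help
```

would block /help, /help.html and /helpdesk/index.html alike, since the standard compares the Disallow value as a simple prefix of the URL.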
Technically, couldn't a search engine misperceive that as cloaking and automatically penalize for it?
I know it is far from the nefariousness of real cloaking, but technically it is cloaking, and I would think that if search engines have sniffer scripts to detect cloaking [i.e., do a character comparison of the same URL when called from different IPs or with different user agents], wouldn't that set off the cloaking flag?
A character-by-character comparison would yield a difference. Now, one would hope it would be smart enough to identify a session ID, but that isn't necessarily the case. I've seen e-commerce sites where product IDs are long alphanumerics and could resemble a session ID. Plus, if you write your own session wrapper code (because session IDs in PHP are known to have inherent limitations), you may cause further character differences.
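For illustration, normalising away a session ID before any such character comparison might look like this - a sketch assuming PHP's default session name; the function name is hypothetical, and a custom session wrapper would need its own parameter name here:

```php
<?php
// Strip the session ID from a URL so two crawls of the same page
// can be compared character by character.
function strip_session_id(string $url, string $name = 'PHPSESSID'): string
{
    // Remove '?PHPSESSID=...' or '&PHPSESSID=...' from the query string.
    $url = preg_replace('/([?&])' . $name . '=[^&]*&?/', '$1', $url);
    // Tidy up any trailing '?' or '&' left behind.
    return rtrim($url, '?&');
}
```

Usage: strip_session_id('/page.php?PHPSESSID=abc123&id=5') gives '/page.php?id=5'.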
This might sound like an extremist view, but in an era where dynamic DB-driven websites are becoming commonplace, search engines are going to have to take steps to prevent mirroring while also keeping an eye out for cloaking.
Personally, I don't see a way to do it successfully.
ie, (search engines) do a character comparison of the same URL when called from different IPs or with different user agents
I think doing this is quite useless, because many sites show random content like text ads, article teasers etc. Those sites deliver a different page on every page view. And what about sites that show the date and time? How will the software (and a search engine robot is nothing else) ever know whether a webmaster wants to cheat or is just showing the time?
One of my web projects has shown random content since '98 and it has never had problems with search engines.
Additionally, I think the robots are a little too busy to crawl every site twice. Instead of crawling 2,400,000,000 pages, Googlebot would have to crawl 4,800,000,000 pages.
And as long as I see loads of sites with hidden text and other spammy things in the SERPs, and as long as Google needs users to report spamming via a webform on google.com, I think of a search engine as a machine made of hardware and software that sees a website only as a checksum and a set of characters. ;)
NN