Forum Moderators: Robert Charlton & goodroi


Will 301 redirect cure this duplicate content problem?

duplicate content problem - how to fix?

         

greyhound4334

6:08 pm on Apr 10, 2007 (gmt 0)

10+ Year Member



Hi gang,

First post, so be merciful!

I work with a site that uses ad-serving technology that serves up ads with links that include a timestamp. For example, the ad link looks something like http://www.example.com/adserver.php?adsite=#*$!x,utime=12345678 and that link is then 302-redirected to the advertiser site.

The timestamp is generated each time the page is served, and the ads appear on each page of the site. So when the site gets crawled, Google finds thousands of instances of this URL -- there are only a handful of different ad sites, but each page viewed has a different timestamp, so each is seen as a separate URL. As a result, if you do a Google search for allinurl:mydomain adsite, there are thousands of URLs listed, whereas there are only a handful of unique pages involved.
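To make the effect concrete, here is a minimal Python sketch of why a crawl ends up with so many URLs. The ad identifiers (`sponsor-a`, `sponsor-b`) and the helper name `ad_key` are made up for illustration, and the comma-separated query format is assumed from the example link above.

```python
from urllib.parse import urlsplit

def ad_key(url):
    """Return the adsite value of an ad link, ignoring the utime timestamp."""
    query = urlsplit(url).query
    for part in query.split(","):
        key, sep, value = part.partition("=")
        if sep and key == "adsite":
            return value
    return None

# Hypothetical links as a crawler might collect them from three page views.
crawled = [
    "http://www.example.com/adserver.php?adsite=sponsor-a,utime=12345678",
    "http://www.example.com/adserver.php?adsite=sponsor-a,utime=12345999",
    "http://www.example.com/adserver.php?adsite=sponsor-b,utime=12346000",
]

distinct_urls = set(crawled)                 # every timestamp yields a new URL
distinct_ads = {ad_key(u) for u in crawled}  # only the adsite value matters

print(len(distinct_urls), "distinct URLs,", len(distinct_ads), "distinct ads")
```

Each page view adds one more distinct URL per ad, while the set of distinct ads stays the same size.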

Here's another wrinkle. They've disallowed /adsite/ in their robots.txt, yet these URLs are still in the Google index. Not sure how they got there in the first place (might be due to some lazy scraping?), but they're there.

So, this seems like a bit of a duplicate content problem (and there are others on this site that we're working through, so even if this isn't, inherently, a big problem I would still like to clean it up).

The question, at last:

If I switch to 301 redirects, and redirect anything beginning with http://www.example.com/adserver.php?adsite=#*$!x, regardless of the timestamp, to the target site, will this eventually remove the duplicate pages from the Google index?

The part I'm worried about is that each URL that's already in the index is unique because of the timestamp and won't be served up again, so Google will never see that link again to recrawl it and discover that it's been 301'd.

If you've read this far, thanks, and I look forward to your comments!

[edited by: tedster at 10:13 pm (utc) on April 10, 2007]
[edit reason] switch to example.com - it can never be owned [/edit]

g1smd

7:23 pm on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The 302 redirect is the problem here. Make sure that you change it to a 301 redirect.

>> Google will never see that link again to re-crawl it and discover that it's been 301'd. <<

Once Google has a URL in their database, they re-crawl the web from their list of URLs and so they will see the 301 redirect just fine. However, the old URLs will continue to show in Google's index for a year before they are deleted. You can't control that, so don't worry about it. Just make the changes to your site and look back every few months to see that progress is being made. It will take maybe a year for all of the old URLs to fade away.
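As a rough illustration of the 301 approach, the sketch below picks the redirect target from the adsite value alone, so an old URL with any timestamp gets the same permanent redirect when it is re-crawled. It assumes the comma-separated query format from the example link; the `AD_TARGETS` mapping, the advertiser URLs, and the function names are hypothetical, and the real ad server (a PHP script) will look different.

```python
from urllib.parse import urlsplit

# Hypothetical mapping of ad identifiers to advertiser landing pages.
AD_TARGETS = {
    "sponsor-a": "http://advertiser-a.example/",
    "sponsor-b": "http://advertiser-b.example/",
}

def parse_ad_query(query):
    """Split 'adsite=...,utime=...' into a dict (comma-separated, as in the example link)."""
    params = {}
    for part in query.split(","):
        key, sep, value = part.partition("=")
        if sep:
            params[key] = value
    return params

def ad_redirect(url):
    """Return (status, headers) for an ad link; the utime value is ignored."""
    params = parse_ad_query(urlsplit(url).query)
    target = AD_TARGETS.get(params.get("adsite"))
    if target is None:
        return 404, {}
    return 301, {"Location": target}

# Two different timestamps, same permanent redirect.
print(ad_redirect("http://www.example.com/adserver.php?adsite=sponsor-a,utime=12345678"))
print(ad_redirect("http://www.example.com/adserver.php?adsite=sponsor-a,utime=99999999"))
```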

greyhound4334

8:24 pm on Apr 10, 2007 (gmt 0)

10+ Year Member



That's very helpful advice, thank you g1smd.

I've given this some further thought, and have some more comments/questions you might have some thoughts on:

1. Do I really have a duplicate problem here? There are definitely many duplicate URLs in the index, however they point (via the 302) to pages off-site. As I think about this, it seems to me that what we've got is multiple URLs pointing to the same page (duplicate content, for sure), but the content is off-site, so it's not actually *my* content, per se, that's being duplicated.

And I can't imagine it's a duplicate problem for the target site either (if so, then one could easily launch a duplicate content "attack" using this technique of off-site linking).

So *is* it a problem? And if so, whose is it?

2. I do have a general understanding of 302 vs. 301, but someone has suggested to me that 302s are often used for this sort of thing, and may have some advantage over 301s for tracking purposes. I don't have direct control of this site, and it's proven hard to get a solid technical answer as to why they've used 302s instead of 301s, so I'm just trying to educate myself before pushing hard on switching to 301s. Is there a conceivable benefit (vis-a-vis tracking ads served) to using 302s instead?

Thanks for your help!

g1smd

9:09 pm on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This sort of 302 redirect was a very big problem for Google about 2 years ago, but is less of a problem today. However, that is not to say that the problem will stay fixed forever. Of course, it may occur with other search engines at some time.

One correction/addition to what I wrote above. I meant to say that the old URLs will continue to show in Google's index as Supplemental Results for a year before they are deleted.

jd01

1:28 am on Apr 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You might also consider adding rel="nofollow" to the links.

E.g. <a href="http://www.example.com/the-link-stuff" rel="nofollow">ad link text</a>

Otherwise just follow the advice of g1smd.

Justin

BTW

Welcome to WebmasterWorld!
(The Posting Portion Anyway.)

greyhound4334

7:45 pm on Apr 11, 2007 (gmt 0)

10+ Year Member



Thanks Justin. Sounds like a good idea, and thanks for the warm welcome.

Cheers,
john

g1smd

7:50 pm on Apr 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The rel="nofollow" attribute does not stop search engines from following links and indexing the content found at the target URL. It merely signals that you do not want to "vote" for that content.

jonrichd

8:55 pm on Apr 11, 2007 (gmt 0)

10+ Year Member



> They've disallowed /adsite/ in their robots.txt, yet these URLs are still in the Google index.

If the robots.txt is indeed set up the way you describe above, then it's not going to stop bots from crawling those links, because /adserver.php does not sit under /adsite/.

You want to disallow adserver.php instead.

(Or possibly you just made a typing mistake.) You could also use the robots.txt tool in Google Webmaster Tools to make sure your robots.txt is set up the way you want it to be.
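For a quick offline check of the same point, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents are assumed from the description in this thread, the `sponsor-a` identifier and the `is_blocked` helper are made up, and the second URL (with the script placed under /adsite/) is a hypothetical added for comparison. Python's parser is not Googlebot, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt lines disallow the URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(agent, url)

# Assumed robots.txt, based on the "disallowed /adsite/" description.
robots_txt = ["User-agent: *", "Disallow: /adsite/"]

# The ad link as originally posted: the script sits at the site root.
root_link = "http://www.example.com/adserver.php?adsite=sponsor-a,utime=12345678"
# Hypothetical variant with the script inside the /adsite/ directory.
nested_link = "http://www.example.com/adsite/adserver.php?adsite=sponsor-a,utime=12345678"

print(is_blocked(robots_txt, root_link))    # False: rule does not match the path
print(is_blocked(robots_txt, nested_link))  # True: path starts with /adsite/
```

The Disallow rule is a path-prefix match, so it only covers URLs whose path begins with /adsite/; the adsite= parameter in the query string has no bearing on it.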

jd01

1:02 am on Apr 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I thought removing the "vote" for "ever-changing" affiliate links running through a bounce redirect might be a good idea, but didn't really think about it too much. Just thought I would throw the suggestion out there in case the idea hadn't been thought through by greyhound4334.

(Probably should have included your post and the preceding in my original post. Thanks for pointing it out g1smd.)

Justin

greyhound4334

11:32 pm on Apr 13, 2007 (gmt 0)

10+ Year Member



jonrichd,

Good catch. Unfortunately, it's a typo on my part. The ad link *really* looks like http://www.example.com/adsite/adserver.php?adsite=#*$!x,utime=12345678

So I think the disallow of /adsite/ should be directed properly at the adserver script, right? I don't think that's the problem.

I'm pretty sure robots.txt is functioning properly. It looks like these buggers got into the index in some other way (possibly before the robots.txt was put up? Possibly there are other links to the adserver, though that's really hard to fathom?).

Thanks for the great comments. Any other ideas?

Cheers,
john