Forum Moderators: Robert Charlton & goodroi


Google followed many duplicate links, how do I solve the problem?

Ma2T

10:43 pm on Sep 17, 2006 (gmt 0)

10+ Year Member



Hey there,

I'm hoping you can offer some advice. I have been trying out a tagging system, but decided not to use it, as I felt it would just lead to duplicate content.

But now, disaster: I didn't realise that some links to these tags were created automatically, and now Google has been all over them!

In the last 24 hours, I checked my logs and Google has crawled about 200 of these links.

All like this, e.g.:
/index.php?tag=word1
/index.php?tag=word2
/index.php?tag=word3

I have removed all links to these URLs, but Google has already crawled them.

I do NOT want these in Google! I can remove the tag system, so all of these links would go to a 404 page. Or should I set up some kind of redirect to the main page?

Or maybe add something to the robots.txt file? Although would that be too late, as Google has already been over these links and active pages?

Many thanks for your advice.
Matt

g1smd

11:00 pm on Sep 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whatever you do, it will now take up to a year for Google to drop them, but that will not be a problem if you follow the right steps.

The problem will continue if all of the alternative URLs for the content are allowed to be indexed again and again.

There are three ways to fix this; use whichever one suits you best (a rough sketch of the first two follows the list). Either:
- modify the script so that it detects the requested URL and forces a 301 redirect to the canonical form, or
- modify the script so that it detects the requested URL and adds a meta robots noindex tag to all unwanted formats, or
- add a Disallow rule to the robots.txt file to block the unwanted forms.
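
For illustration only, a rough, untested PHP sketch of the first two options, assuming the script is index.php and the canonical page is the tagless URL (the domain is a placeholder):

<?php
// Untested sketch: requests arrive as /index.php?tag=word1 etc.
if (isset($_GET['tag'])) {
    // Option 1: force a 301 redirect to the canonical, tagless URL.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com/index.php');
    exit;
}
// Option 2, instead of the redirect: serve the page as normal, but
// print this inside <head> whenever the tag parameter is present:
//   <meta name="robots" content="noindex">
?>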

You must use one of the above options to tell Google to no longer index the alternative URLs.

What will happen next is that they will get marked as Supplemental Results (they may actually fall out of the normal index within weeks, and then reappear as a Supplemental Result in a few months time) and will hang around in the index for another year.

As long as the unwanted URLs return either a noindex meta tag or a 301 redirect, you will cure the problem. Fix it with one of those, then forget about it.

Ma2T

11:50 pm on Sep 17, 2006 (gmt 0)

10+ Year Member



Thanks for your reply g1smd.

In my panic I read up on the robots info, and did much of what you recommend.

On the pages I do not want indexed, I have added the following meta tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW, NOARCHIVE, NONE">

But these pages will have to be recrawled, which will take some time, so I also added the following to my robots.txt:

User-agent: *
Disallow: /*?tag=*

I checked this with the Google Sitemaps tool, and it seems to work correctly.

You mention that the pages will still be indexed, probably as Supplemental Results. That is still bad, no, as they are duplicate results?

The pages are still physically there; I could also add a redirect from these pages to the index or something. Would that help too?

Any way to eliminate the supplemental listings?

Thanks again for your reply and help :)

g1smd

12:01 am on Sep 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



To be clear, you cannot immediately fix Supplemental Results. Google hangs on to them for one year after you make the changes on your site.

Use the meta tag to get the alternative URLs deindexed. Ensure that the noindex meta tag does NOT appear on the version that you do want to still show up in the index.

If you use the robots.txt file, then you cannot use the meta tag too. The meta tag will never be seen if robots.txt says that Google should not re-crawl the file. Use one or the other, but not both.

In robots.txt, the Disallow: /*somename notation (with a * in it) can only be used with the Googlebot User-agent. Do not use it with the User-agent: * wildcard.

If there is a User-agent: Googlebot section, Google will totally ignore anything in the User-agent: * section. Make sure that everything that Google should be doing is all in the User-agent: Googlebot section.
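
For example, reusing the pattern posted above (illustrative only), the Googlebot section would look like this; any rules Google should also obey must be repeated inside it, since it will ignore the general section:

User-agent: Googlebot
Disallow: /*?tag=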

However, I would use the noindex meta tags, as this usually ensures complete delisting from the index. URLs that are merely disallowed by robots.txt still often appear in the SERPs as URL-only entries.


As for the duplicate content problem: this is only a problem if two or more URLs return "200 OK" for the same content, and all of them are allowed to be indexed. Very soon after you have tagged the alternative URLs to no longer be indexed, or have installed a redirect (a matter of a few weeks, I feel), they are no longer treated as duplicates.

Ma2T

12:14 am on Sep 18, 2006 (gmt 0)

10+ Year Member



Thanks,

I have made the changes to the robots.txt file.

I added the robots.txt rule as I assumed it would be quicker than waiting for Google to re-visit all of these pages (if it ever does, as all links to them have been removed). Although if you say the meta tag is better, I will consider removing the rule from robots.txt.

Also, one last question: I could physically remove these pages/URLs from my site completely so they no longer exist (they would go to 404). Would this be wise, or would it be smarter to just leave them in place with the noindex meta tag?

Thanks a lot for your help g1smd, very much appreciated!

...

Thanks for clearing up the information about duplicate content; this is nice to hear :) as it was my main worry! Cheers.


GuinnessGuy

12:15 am on Sep 18, 2006 (gmt 0)

10+ Year Member



Cheers,

Assuming that these pages are duplicates of the page without the query strings, why can't he just use the removal tool? Of course, if there are too many of these already it's going to take too long, but assuming there aren't, can't he just remove them almost immediately through the removal tool?

After doing this, if he is unsure about future tags (perhaps these are affiliate tags?), can't he then do an ISAPI rewrite (or the equivalent for Apache) so that all future tags are 301'd to the tagless URL?

GuinnessGuy

g1smd

12:22 am on Sep 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> Assuming that these pages are duplicates of the page without the query strings, why can't he just use the removal tool? <<

You could use the removal tool... but you need to be aware that it is not a "removal" tool. The URLs are merely "hidden" for 180 days and then they re-appear as if nothing had ever happened.

So you need the robots directives to get them deindexed, and you are better off letting them go Supplemental and fade away on their own. While they are Supplemental they may well rank for some phrases, and will still deliver a few visitors.

>> After doing this, if he is unsure about future tags (perhaps these are affiliate tags?) can't he then do an ISAPI rewrite (or the equivalent for Apache) so that all future tags are 301'd to the tagless URL? <<

Yes, the 301 redirect is another way to do this. There are several options, each with advantages and disadvantages.

Ma2T

3:34 pm on Sep 18, 2006 (gmt 0)

10+ Year Member



Thanks for the advice people.

I have removed the restriction in the robots.txt file, and will go with the NOINDEX meta tag on the tag pages.

Hopefully this will cause me no problems, as I am very worried about a duplicate content penalty. I was hit by the last Google update a few days ago and lost about 90% of my traffic; I am praying for the positions to return in the next update.

Thanks

Ma2T

9:40 pm on Sep 23, 2006 (gmt 0)

10+ Year Member



I'm sorry to bring this up again, but I'm wondering if you could help me out once more.

Unfortunately, for some unknown reason I am now unable to add the meta tag to these pages. I use a system to semi-automate things, and I can't add the meta tag to just the required pages and not the others :/

I now have about 150 or so tag pages in Google, as I feared. As I can't use the meta tags, I added the following to my robots.txt file:

User-agent: Googlebot
Disallow: /*?tag=*

I want to remove this tagging system completely, so as to stop any more pages being created and added. This will leave 404s (all links to those pages no longer exist). Would this be extremely bad? I could try redirecting these to the main URL, but unfortunately I can't redirect the tagged pages to the non-tagged URL, as they are not 100% the same.

Thanks again for your help

Ma2T

11:20 pm on Sep 23, 2006 (gmt 0)

10+ Year Member



I'm a little slow when it comes to this RedirectMatch business, and I'm getting a headache trying to work it out.

Does anyone know the code to redirect any URL containing "/?tag=" to "/"?

Many thanks, guys.

ascensions

4:23 am on Sep 24, 2006 (gmt 0)

10+ Year Member



# RedirectMatch cannot test the query string, so mod_rewrite is used.
RewriteCond %{QUERY_STRING} (^|&)tag=
# The trailing "?" drops the query string from the redirect target.
RewriteRule ^index\.php$ http://www.yourdomain.com/? [R=301,L]

walkman

4:45 am on Sep 24, 2006 (gmt 0)



I have this for domain.com?tag1

RewriteEngine on
RewriteCond %{QUERY_STRING} ^tag1$ [OR]
RewriteCond %{QUERY_STRING} ^tag2$
RewriteRule ^$ http://www.domain.com/? [R=301,L]

and it works perfectly... Does anyone know how I make it so that anything after /? goes to domain.com/? Also, is there a safe way to redirect index.php to the root / without creating a loop?

thanks,

g1smd

1:13 pm on Sep 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would take those questions to the Apache forum here at WebmasterWorld, where a larger number of expert eyeballs are more likely to notice them. :-)
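
For reference, a rough, untested mod_rewrite sketch of those two redirects (www.domain.com is a placeholder; verify in the Apache forum before relying on it):

RewriteEngine on

# Send any root request carrying a query string to the bare root;
# the trailing "?" on the target strips the query string.
RewriteCond %{QUERY_STRING} .
RewriteRule ^$ http://www.domain.com/? [R=301,L]

# Redirect /index.php to the root without a loop: THE_REQUEST matches
# only what the client originally asked for, not internal rewrites.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php
RewriteRule ^index\.php$ http://www.domain.com/? [R=301,L]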