Using robots.txt for dupes - not recommended by google?

arthur22

5:24 pm on Jan 30, 2012 (gmt 0)

10+ Year Member



I have some pages on my site that are currently indexed as URLs only in google. I've been blocking them in the robots.txt file, so that's why they're listed as URL-only.

They're mainly pages of date-based archives created by my blog software; in other words, duplicate pages of stuff that can already be found in the index under the individual blog entry urls themselves. I'd thought it was therefore best to robots.txt them.
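
To give an idea, the blocking rule looks roughly like this (the archive paths are just examples; the real ones depend on the blog software):

    User-agent: *
    Disallow: /2011/
    Disallow: /2012/
    Disallow: /archives/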

However, I've just read this in google's webmaster guidelines and I've been puzzling over it:

"Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages."

So, google wants to see everything on your site. Fair enough. But if you do block the URLs in robots.txt, I don't understand the "treating them as separate, unique pages" bit. Surely if they're blocked URLs, google doesn't see any of the content of the page anyway? What does it imply, then, when it says it has to "treat them as separate, unique pages"?

Any help gratefully received.

tedster

5:38 pm on Jan 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's one of the places that information is published: Duplicate content [support.google.com]

I've noticed that advice in some of JohnMu's comments lately, as well as those from Pierre Far. I'm personally not convinced they've got it right at the moment, and it seems like a potential risk, at least for some sites. However, I would say that if there is a URL-only listing, it's certainly worth requesting a removal.

g1smd

5:44 pm on Jan 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages."

Note to Google: If you can't crawl them, then treat them as if they don't exist. KThx.

arthur22

7:05 pm on Jan 30, 2012 (gmt 0)

10+ Year Member



Thanks Tedster - so, can URL-only listings hurt your site? I'm looking into a plugin that would noindex pages with a meta tag, which should remove the URL-onlys. I have around 50% of my site as URL-only listings, simply because of the blog software and the use of robots.txt. Is that doing any harm?
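
From what I can tell, the plugin would just add a tag like this to the head of each archive page (shown for illustration; the exact attributes depend on the plugin):

    <meta name="robots" content="noindex, follow">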

tedster

7:41 pm on Jan 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you want Google to see a noindex meta tag, then you must remove the robots.txt rule so they can crawl the page.
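
In other words, the change is roughly this (same illustrative paths as above): remove the Disallow line that covers the archives from robots.txt,

    User-agent: *
    # Disallow: /archives/   <- remove this rule

and make sure each archive page outputs the noindex meta tag from the plugin. Once Googlebot can recrawl those pages and see the tag, the URL-only listings should drop out.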