I have some pages on my site that are currently indexed as URL-only listings in Google. I've been blocking them in the robots.txt file, which is why they're listed as URL-only.
They're mainly date-based archive pages generated by my blog software; in other words, duplicates of content that can already be found in the index under the individual blog entry URLs themselves. I'd therefore thought it best to block them in robots.txt, along the lines of the example below.
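To be concrete, the rules look something like this (the /2008/ and /2009/ paths are just illustrative stand-ins for whatever date-based archive URLs your blog software actually generates):

    User-agent: *
    # Block the date-based archive pages (duplicates of the individual entries)
    Disallow: /2008/
    Disallow: /2009/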
However, I've just read this in Google's Webmaster Guidelines and I've been puzzling over it:
"Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages."
So, Google wants to see everything on your site. Fair enough. But if you do block the URLs in robots.txt, I don't understand the "treat them as separate, unique pages" bit. Surely if the URLs are blocked, Google can't see any of the page content anyway? So what does it mean to say it has to "treat them as separate, unique pages"?
Any help gratefully received.