|RSS News Feeds: Can They Be Duplicate Content?|
They couldn't be, surely...?
Our sites have many hundreds of RSS news feeds (we cover news from many manufacturers and on many niche topics, and provide a feed for each). Google - unfortunately - insists on indexing these feeds, and often presents them quite high in the SERPs. Unfortunately I don't have any obvious way of de-indexing them, as there's no way of putting a NOINDEX tag in an RSS news feed XML file, and I can't move them into their own directory from which I could exclude access in robots.txt.
My question is this, however: could they be considered duplicate content, especially as we have standard HTML pages covering the same headline listings?
Certainly if the text is identical, one or the other version may get filtered from the search results. Google doesn't want to show the same thing twice in a SERP. You mentioned that your feed is ranking well -- is the html page anywhere to be seen?
Because Googlebot supports wildcards, you could use a line like the one below in your robots.txt to block all .rss files. The * matches any sequence of characters, and the $ character anchors the match to the end of the URL. So to block all URLs that end with .rss:
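A minimal sketch of such a rule, assuming every feed URL on the site ends in .rss. The Googlebot user-agent line restricts it to Google's crawler, since other crawlers may not honor wildcards:

```
# Applies only to Google's crawler
User-agent: Googlebot
# * matches any sequence of characters, $ anchors the end of the URL
Disallow: /*.rss$
```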
Google reference [google.com]
If you do that, you can hope your HTML page will appear in those results instead of the RSS feed. But it is just a hope -- there is no guarantee that the HTML page will rank as well as the feed did. Google is better at handling the RSS/HTML issue these days, but stuff still happens -- and maybe there are other factors with the HTML page that are depressing the rank you want to see.
|So to block all URLs that end with .rss: |
The problem with this approach is that not every RSS feed URL ends with .rss. It could just as well end with .xml, rss.php or /feed/, or some combination of these, so there is no one-size-fits-all rule for blocking feeds. You have to check what your own feed URLs look like before you can block them.
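To illustrate, here is a sketch of wildcard rules for those other URL shapes. The patterns are assumptions for the sake of example and would need adjusting to the site's real feed URLs:

```
User-agent: Googlebot
# Feeds served as .xml files (this also matches an XML sitemap, so narrow it if needed)
Disallow: /*.xml$
# Feeds generated by a script named rss.php, with or without a query string
Disallow: /*rss.php
# Feeds at a path ending in /feed/ below some directory, e.g. /news/feed/
Disallow: /*/feed/$
# A feed at the site root needs its own rule
Disallow: /feed/$
```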
I think that the problem of RSS should be treated differently though. The whole point of having an RSS feed is to deliver your content via more means than just HTML. In that case it just does not make sense to block it after all the trouble of setting it up. Just be smart about your RSS feed: put only short snippets of your content in it, not the whole thing, because otherwise why would readers ever visit your site?
If the URL is exactly the same, then there is no problem. If the URL in the RSS feed is different in some way, then you can probably block it using robots.txt. That's exactly what I do.
And yes, I do believe you can generate duplicate content with RSS feeds in addition to things like PDF file creation.
|I think that the problem of RSS should be treated differently though. |
A start would be for Google to label feeds as such. I'm getting a bit tired of clicking on feed results when I'm searching for pages.
We know G can differentiate file types, index 'em, label 'em, then ignore 'em as far as any dupe filters or penalties.
|You mentioned that your feed is ranking well - is the html page anywhere to be seen? |
It's often right next to it as an indented result, but the HTML page is usually the second one (grrr...).
|Because Googlebot supports wildcards, you could use a line like the one below in your robots.txt to block all .rss files. |
Tedster, you are my hero. I've always just gone back to basics and looked up original standards, and robots.txt has been no exception. I had no idea that Googlebot supported extensions to the standard, like wildcard matching, until now. D'oh! What an idiot. This is why we love Webmaster World so. Thank you.