Forum Moderators: Robert Charlton & goodroi
My question is this, however: could they be considered duplicate content, especially as we have standard HTML pages covering the same headline listings?
------------
Because Googlebot supports wildcards, you could use lines like the ones below in your robots.txt to block all .rss files. The * matches any sequence of characters whatsoever, and the $ character anchors the pattern to the end of the URL. So to block all URLs that end with .rss:
User-Agent: Googlebot
Disallow: /*.rss$
Google reference [google.com]
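To make the matching behavior concrete, here is a minimal sketch of how a pattern like the one above could be interpreted. This is an illustration only, not Google's actual implementation -- real crawlers apply additional rules such as longest-match precedence between Allow and Disallow lines, and the function name `matches_disallow` is made up for this example:

```python
import re

def matches_disallow(pattern: str, path: str) -> bool:
    """Check a URL path against a Googlebot-style Disallow pattern,
    where '*' matches any run of characters and a trailing '$'
    anchors the match to the end of the URL. (Illustrative sketch.)"""
    # Escape regex metacharacters, then restore the robots.txt wildcards
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # trailing $ means "end of URL"
    return re.match(regex, path) is not None

# The rule from the post: Disallow: /*.rss$
print(matches_disallow("/*.rss$", "/news/headlines.rss"))   # True
print(matches_disallow("/*.rss$", "/news/headlines.html"))  # False
print(matches_disallow("/*.rss$", "/news.rss?page=2"))      # False -- query string means it doesn't *end* with .rss
```

Note the last case: a feed URL with a query string would slip past the $-anchored rule, which is worth keeping in mind when writing these patterns.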
If you do that, you can hope your html page will appear in those results instead of the rss. But it is just a hope -- no guarantee that the html page is going to rank as well as the rss feed did. Google is better at handling the rss/html issue these days, but stuff still happens -- and maybe there are other factors with the html page that are depressing the rank you want to see.
So to block all URLs that end with .rss:
User-Agent: Googlebot
Disallow: /*.rss$
The problem with this approach is that not everybody's RSS feed URL ends with .rss. It could just as well end in .xml or rss.php, or live under /feed/, or some combination of these, so there is no single rule that works everywhere. You have to look at what your own RSS URLs are before you can block them.
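As an illustration of the point above, a site whose feeds happened to use several of these URL shapes would need one Disallow line per pattern (the paths here are hypothetical, not a recommendation to block your feeds):

```
User-Agent: Googlebot
Disallow: /*.rss$
Disallow: /*.xml$
Disallow: /*rss.php$
Disallow: /*/feed/
```

Note the last line is a plain prefix match with no $, since /feed/ URLs typically appear mid-path rather than as a fixed file extension.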
I think that the problem of RSS should be treated differently though. The whole point of having an RSS feed is to deliver your content via more means than just HTML. In that case it just does not make sense to block it after all the trouble of setting it up. Just be smart about your RSS feed -- only put short snippets of your content there, not the whole thing, because otherwise why would readers ever visit your site anymore?
And yes, I do believe you can generate duplicate content with RSS feeds in addition to things like PDF file creation.
I think that the problem of RSS should be treated differently though.
A start would be for Google to label feeds as such. I'm getting a bit tired of clicking on feed results when I'm searching for pages.
We know G can differentiate file types, index 'em, label 'em, then ignore 'em as far as any dupe filters or penalties.
You mentioned that your feed is ranking well - is the html page anywhere to be seen?
It's often right next to it as an indented result, but the HTML page is usually the second one (grrr...).
Because googlebot supports wildcards, you could use a line like the one below in your robots.txt to block all .rss files.
Tedster, you are my hero. I've always just gone back to basics and looked up original standards, and robots.txt has been no exception. I had no idea that Googlebot supported extensions to the standard, like wildcard matching, until now. D'oh! What an idiot. This is why we love Webmaster World so. Thank you.