I want to be certain that I am not "overdoing" it in terms of RSS feed parsing.
Is there some rule of thumb or guideline for gauging how much of your content should come from external sources? 80/20, 70/30, etc.?
What tips Google off, and when do they start caring?
Any input on this topic will be greatly appreciated!
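For context, by "parsing" I just mean pulling the entries out of each source feed and republishing them on the blog. Here's a minimal sketch of that, assuming Python and the third-party feedparser library (the feed URL is only a placeholder):

    import feedparser  # third-party library: pip install feedparser

    # Hypothetical feed URL, just for illustration.
    FEED_URL = "https://example.com/feed.xml"

    def latest_entries(url, limit=5):
        # Parse the feed and pull out the fields a blog template would typically render.
        parsed = feedparser.parse(url)
        return [(entry.get("title", ""), entry.get("link", ""), entry.get("summary", ""))
                for entry in parsed.entries[:limit]]

    for title, link, summary in latest_entries(FEED_URL):
        print(title, "-", link)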
Rarely, Google will completely exclude a domain from their index because an extremely high amount of content is taken from other domains. This kind of penalty/ban takes more than parsing RSS feeds to get levied. It usually requires intentional scraping with the hopes of spamming the search results, and little to no original content.
Some webmasters casually use words like "duplicate content penalty," but legitimate sites almost never see a true penalty for duplicate content. There certainly can be ranking problems because of it, especially when you have duplicate URLs on your own domain for the same content.
For more details, check out the duplicate content section of the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.
On that same note, though: what method is actually used to decide whether a site is "legitimate" or not?
I hear that term, but what exactly does that mean?
Let's say, for the sake of argument, that I had a beautifully designed blog with no ads or revenue stream. If 100% of the content I post comes from multiple, highly relevant RSS feeds, would I be considered illegitimate by Google?
So (rephrased) what I'd like to know is:
If I am parsing RSS feeds on my blog as well as writing original content, what percentage of my total content could I create from the RSS feeds without drawing negative attention from Google?
(From your response above, it seems people can get away with quite a bit, but I am just curious as to where exactly the line is drawn.)
If a given feed is useful for your visitors, I'd have no concerns. If you are concerned, you could use a noindex,follow robots meta tag on the page to make it clear that you are not being tricky or deceptive in any way.
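For reference, that tag goes in the page's <head>; "noindex" asks Google not to index the page itself, while "follow" still lets the links on it be followed:

    <meta name="robots" content="noindex,follow">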
Now this opens the door to a new question: to what degree does a feed need to be altered to be viewed by Google as original content?
Go to shopping.com and cut and paste a paragraph of text into the Google search box. You will see hundreds of sites in the index with the exact same paragraph.
By having duplicate content you haven't actually done anything wrong so you don't deserve to be banned, but you don't deserve to outrank the original source either. You'll be buried in 300th place because people don't want results 1-10 to be the same thing.
As for your site, I would robots.txt your duplicate content UNLESS you have significant original content to go along with your dupe content ON THE SAME PAGE. If you have someone else's article on your site you should write a summary or some general thoughts on the article and have readers add comments.
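If the feed-built pages live under their own directory, a couple of lines in robots.txt takes care of it (the /feeds/ path here is only an example):

    User-agent: *
    Disallow: /feeds/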
There is no ratio of original to duplicate content that will cause you to remain in the index. Duplicating content is gray hat and is considered a tolerable nuisance by G. They can afford the hard drive space.
you don't deserve to be banned, but you don't deserve to outrank the original source either.
Actually, in some cases (not rarely, but not in most cases either) the duplicated content can outrank the original. Google takes many signals into account when ranking the very same content besides which copy was indexed first.
There is no ratio of original to duplicate content that will cause you to remain in the index.
I believe there is, but that information is surely not publicly available.
Does Google consider synonym substitution to be spam as well?
That can happen, especially if you insert synonyms through a program rather than write naturally. See this Google patent:
Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]
I'm not one of those who see Google as pushing the results in the direction of clicks on AdWords or on sites that run AdSense. I'm aware that others feel they do this, but 1) Google states publicly that they don't, and 2) doing that would undermine them in the long term. I see a lot of data all the time, and there's nothing in the big picture that points me to that conclusion, only small, local anomalies.