I want to be certain that I am not "overdoing" it in terms of RSS feed parsing.
Is there some rule of thumb or guideline for gauging how much of your content should come from external sources? 80/20, 70/30, etc.?
What tips Google off, and when do they start caring?
Any input on this topic will be greatly appreciated!
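For context, by "parsing" I just mean pulling the entries out of each source feed and republishing them on the blog. Here's a minimal sketch of that, assuming Python and the third-party feedparser library (the feed URL is only a placeholder):

    import feedparser  # third-party library: pip install feedparser

    # Hypothetical feed URL, just for illustration.
    FEED_URL = "https://example.com/feed.xml"

    def latest_entries(url, limit=5):
        # Parse the feed and pull out the fields a blog template would typically render.
        parsed = feedparser.parse(url)
        return [(entry.get("title", ""), entry.get("link", ""), entry.get("summary", ""))
                for entry in parsed.entries[:limit]]

    for title, link, summary in latest_entries(FEED_URL):
        print(title, "-", link)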
Rarely, Google will completely exclude a domain from their index because an extremely high amount of content is taken from other domains. This kind of penalty/ban takes more than parsing RSS feeds to get levied. It usually requires intentional scraping with the hopes of spamming the search results, and little to no original content.
Some webmasters casually use words like "duplicate content penalty," but legitimate sites almost never see a true penalty for duplicate content. There certainly can be ranking problems because of it, especially when you have duplicate URLs on your own domain for the same content.
For more details, check out the duplicate content section of the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page.
On that same note, though: what method is actually used to decide whether a site is "legitimate" or not?
I hear that term, but what exactly does that mean?
Let's say, for the sake of argument, that I had a beautifully designed blog with no ads or revenue stream. If 100% of the content I post comes from multiple, highly relevant RSS feeds, would I be considered illegitimate by Google?
So (rephrased) what I'd like to know is:
If I am parsing RSS feeds on my blog as well as writing original content, what percentage of my total content could I create from the RSS feeds without drawing negative attention from Google?
(From your response above, it seems people can get away with quite a bit, but I am just curious as to where exactly the line is drawn.)
If a given feed is useful for your visitors, I'd have no concerns. If you are concerned, you could use a noindex,follow robots meta tag on the page to make it clear that you are not being tricky or deceptive in any way.
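For reference, that tag goes in the page's <head>; "noindex" asks Google not to index the page itself, while "follow" still lets the links on it be followed:

    <meta name="robots" content="noindex,follow">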
Now this opens the door to a new question: to what degree does a feed need to be altered to be viewed by Google as original content?
Go to shopping.com and cut and paste a paragraph of text into the Google search box. You will see hundreds of sites in the index with the exact same paragraph.
By having duplicate content you haven't actually done anything wrong so you don't deserve to be banned, but you don't deserve to outrank the original source either. You'll be buried in 300th place because people don't want results 1-10 to be the same thing.
As for your site, I would robots.txt your duplicate content UNLESS you have significant original content to go along with your dupe content ON THE SAME PAGE. If you have someone else's article on your site you should write a summary or some general thoughts on the article and have readers add comments.
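If the feed-built pages live under their own directory, a couple of lines in robots.txt takes care of it (the /feeds/ path here is only an example):

    User-agent: *
    Disallow: /feeds/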
There is no ratio of original to duplicate content that will cause you to remain in the index. Duplicating content is gray hat and is considered a tolerable nuisance by G. They can afford the hard drive space.
you don't deserve to be banned, but you don't deserve to outrank the original source either.
Actually, in some cases (not rarely, but not in most cases either) the duplicated content can outrank the original. Google takes many signals into account when ranking the very same content besides which copy was indexed first.
There is no ratio of original to duplicate content that will cause you to remain in the index.
I believe there is, but that information is surely not publicly available.
Does Google consider synonym substitution to be spam as well?
That can happen, especially if you insert synonyms through a program rather than write naturally. See this Google patent:
Detecting spam documents in a phrase based information retrieval system [appft1.uspto.gov]
I'm not one of those who see Google as pushing the results in the direction of clicks on AdWords or on sites that run AdSense. I'm aware that others feel they do this, but 1) Google states publicly that they don't, and 2) doing that would undermine them in the long term. I see a lot of data all the time, and there's nothing in the big picture that points me to that conclusion, only small, local anomalies.