I think there must be some sort of cut-off beyond which a filter is applied. A few paragraphs of duplicate text on a large site will probably be OK; a site where a large percentage of the text is duplicated will probably be filtered.
The question is at what level google applies its filter - the answer is - I don't know!
I suppose scrapers would like to know exactly where the limits are drawn.
Then they could stop just short, and avoid being penalized.
The victims of these practices should be glad if this is a grey area.
Personally, I don't mind a copied sentence or two, as long as there is a valid link back.
Problems arise when scrapers take half a page of text from a hundred different sites.
Then they can claim that only a small percentage is taken from any given site,
even if they write virtually nothing original.
So far, the engines appear to have let this practice slide, and that's a shame. -Larry
The problem is scrapers can easily handle this. I noticed a few very clever scrapers today. They take a line or two from a site, make sure the domain name isn't there. No links and boom, unique content. I happened to catch them using unique words from my site.
The reason I'm asking is that my site has different versions of the same content; it just comes with the tools I use. The printer-friendly pages (which double as search-engine-friendly pages) are rarely used, and the actively used content is not that optimized for Google.
I've heard on a thread here, and always surmised, that Google will use toolbar and user data to determine what pages are being visited and give them preference. So with that thinking in mind, I would use robots.txt to disable the SE-friendly pages. But that would be a shame for obvious reasons.
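If I did go that route, the robots.txt rule itself is simple; a sketch, assuming the printer-friendly pages live under a /print/ path (substitute whatever path or parameter your tools actually generate):

```
User-agent: *
Disallow: /print/
```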
At the end of the day it just seems to me that Google is handing out penalties in an unfair way. Maybe they need to, I don't know, but it seems like they could just rank the answers as best they can and apply the penalty to a page within the site rather than penalizing the whole site. If that makes sense.
>Has anyone done some serious analysis into the extent of the Duplicate content filter?<
Clark - are you only referring to dup. content from other sites (scrapers, etc.) or from your own site? Google seems to think I have dup. content on my own site, but I don't think I do. (non-www 301s in place and working)
I'm talking about dupe content from your own site only.
"make sure the domain name isn't there. No links and boom, unique content."
Interesting idea there. Do you have any evidence that Google recognizes dupe content that links to the original? Seems like lots of blogs quote and link to original blog posts with no problems.
The duplicate content whitepapers that have come from the Google camp over the last few years seem to indicate that it has a fairly high threshold of similarity before tripping a filter (i.e. 90% or more non-unique content), and that URLs/directory structures play a large role in this.
but of course the issue becomes the processing power needed to find and identify all the duplicates and the originals.
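Those papers generally describe shingle-based comparison. Here's a toy Python sketch of the general idea; this is just an illustration of word-shingling with a Jaccard measure, not Google's actual (unpublished) implementation, and the 0.9 threshold is a guess based on the "90% or more" figure above:

```python
def shingles(text, k=8):
    """Break text into overlapping k-word shingles (word n-grams)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=8):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# A hypothetical filter that only trips at a high similarity level.
DUPLICATE_THRESHOLD = 0.9  # illustrative guess, not a known Google value

def looks_duplicate(a, b):
    return similarity(a, b) >= DUPLICATE_THRESHOLD
```

Even this naive version hints at the cost problem: comparing every page against every other page is quadratic, which is why real systems use tricks like minhashing to cut the work down.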
I am looking at the following problem.
- We have a content site of about 20,000 articles.
- We have approximately 700 index pages which act as a site map to the articles. There are approximately 30 articles per page. The index pages consist of snippets (the first few lines of each article) and a link to the articles.
- The snippet used on the index page is also the snippet used for the description meta tag in the article. Therefore, the snippet (the first few lines of the article) is used three times:
* 1. On the index pages.
* 2. In the description meta tag.
* 3. In the article itself.
(PM me and I will send sample URLs showing how it is setup.)
Could this cause our site a duplicate content penalty?
We have been dead in the water since Feb. 2nd. I also spoke with a webmaster yesterday who has a similar structured site and is having problems.
After seeing your site, on preliminary analysis it doesn't look like it's a dupe content issue, because you are not in the supplemental results.
The domain is over 5 years old, and it has a lot of links. Did a lot of these links just get added recently?
How are you faring in Yahoo and MSN?
Thanks for the help. Actually, our site used to have a ton more links. During the last link update we went from 4300 down to 1500. These are all natural links. We have never purchased any links. We have tons of people who link/make reference to the articles. With all the historical information we have, people use it to support their research and claims.
The funny thing I have noticed is that we used to have many more of the articles in supplemental results. Over the last few weeks, things have been moving out. Is this a positive sign?
In Yahoo, we are doing great. This is the main reason I cannot go and make wild changes to the site. In Ask and MSN, we get a fair amount of traffic. Ask has picked up in the past few weeks.
Yes, that's a positive sign.
And you should remember, Google's link: command is for entertainment purposes only; use Yahoo's links: or linkdomain: instead.
I am having a similar problem.
My site is gone since August 02.
Now I see several supplemental results from my competitors but not for my site. I still don't rank, but if I use the &filter=0 parameter, my site shows up on the 1st page.
My question is: if I have no supplemental results on my pages, but my site only ranks if I use &filter=0, am I being penalized, or is Google still cleaning up the SERPs so that my site will show up later?
Could someone please comment?
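(In case it helps anyone reading along: &filter=0 just gets appended to the Google results URL to switch off the duplicate/similar-results filtering, e.g.

```
http://www.google.com/search?q=your+keywords&filter=0
```

where the query shown is only a placeholder.)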
|"make sure the domain name isn't there. No links and boom, unique content." |
interesting idea there.
Um, I wasn't trying to give scrapers any ideas. Not that it's rocket science. They can also put a dictionary of words together and create random words on a page for unlimited content.
|Do you have any evidence that Google recognizes dupe content that links to the original? Seems like lots of blogs quote and link to original blog posts with no problems. |
Evidence for a courtroom, no. I just used to notice anecdotally lots of scrapers with the same pattern: a title with a link to the original, although often through a redirect in order not to pass PageRank. And lately I've noticed Google caught onto the pattern and stopped those sites. But now I've seen several where there was no link, and Google did NOT catch that pattern.
Keeping this on topic...
How about a vote:
If your pages are listed as "supplemental" - is this a good trigger that you are under some type of duplicate content penalty/issue?
I recently heard that identical meta descriptions across an entire site will be considered dup content.
Since I had an identical mission statement for every meta description and there was no quick way to rewrite them all, I added the page title to every meta description, making each one unique. I'll have to do a better job later, but with Google banning with a slash-and-burn mentality, quick and dirty solutions are a necessary expedient. Which will probably be called "gaming the system." You can't win; you can only make marginal gains this way.
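For what it's worth, that quick-and-dirty pass is easy to script. A minimal sketch, assuming the shared mission statement lives in one constant; the names SITE_MISSION, meta_description and meta_tag are mine, not from any tool:

```python
from html import escape

SITE_MISSION = "Your historic research resource."  # the shared boilerplate

def meta_description(page_title, mission=SITE_MISSION, max_len=160):
    """Prefix the page title so every page's description is unique."""
    return f"{page_title} | {mission}"[:max_len]

def meta_tag(page_title):
    """Render the meta tag with the description HTML-escaped."""
    return f'<meta name="description" content="{escape(meta_description(page_title))}">'
```

Run over every page, this guarantees no two descriptions are identical as long as no two titles are, which is exactly the stopgap described above.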
This is the first time I've heard that G uses meta descriptions but I noticed that a huge batch of very different pages were lumped together as "Supplemental Results," probably because of this incidental duplication.
I don't know about others but I do know Google is leaving me much less time to create original content because of all its secret rules about dup content.
"I recently heard that identical meta descriptions across an entire site will be considered dup content."
i doubt this, but where did you hear it?
|i doubt this, but where did you hear it? |
Several forum posts returned on searching "Supplemental Results." At least one was a WebmasterWorld post, though I doubt I could find it again.
I usually take things I read on forums with a grain of salt and require some other confirmation--which I seemed to get in this case.
On a search for site:mydomain.com I got three results and the message "In order to show you the most relevant results, we have omitted some entries very similar..."
Those three were one with my standard description and two others where I had departed from my standard and used a unique meta description.
I reason that if all the descriptions were unique, or at least began uniquely, there would be more results shown. Well, I made that change, but it's not indexed yet, so time will tell.
|I recently heard that identical meta descriptions across an entire site will be considered dup content. |
When I first started my site, I did this: since the site is about "xyz", I didn't see the problem with the global meta description being "xyz". I started the site in July, ranked well in March, and got busted during the Bourbon update on May 20th. Although I did lose all traffic from Google, I was never in any supplemental results. As a precaution, I did delete those meta tag descriptions. However, I still seem to be affected by Bourbon. Somehow I think it's over for that site.
Hmmm, that observation about meta tags is interesting...
Let me look at a few things and see if I can corroborate it.
Yeah, I can't see anything like that.
If someone wants to sticky me an example, that would be cool.
I honestly don't think meta tag descriptions have anything to do with it, particularly because I rank well in Yahoo and MSN.
But I just gave you an example of how it may not affect a site, because I certainly am not in the supplemental results.
I've only seen one example of a meta tag dupe penalty,
but it is pretty convincing.
I think Google smacked my site with a duplicate content penalty for matching the meta descriptions. When I built my site, it was generally accepted that the SEs didn't pay attention to the meta tags for keywords and description because they were so easily manipulated. So, I created a catchy phrase to describe my site in general, just so I'd have something there.
In December 2004 my site got hit hard by Google. At first a 90% drop in traffic, then it eased somewhat to just a 75% drop. *SIGH*
People kept telling me to check for duplicate content, and that was all I could find. So, I deleted most of the description tags, and the ones I left I wrote new ones specifically describing that page. The result? My site has come back somewhat, traffic is still about 50% of what it was, but I changed nothing else.
This may have had nothing to do with it, but the SERPs were bad in Google until I made the change. Yahoo/MSN were unaffected (so far).
Around December 16th?
I'm hearing a lot of people talk about getting hit with dupe content penalties around then,
and again with Bourbon, around May 20th.
This dupe content business is a real problem for some of us who have old sites. It used to be fine to give someone permission to reprint an article from your site. It was a good way to get more people interested in your site through the link back to you. I recently dumped a whole section of one of my sites as some of the articles appeared here and there around the net. Some of these sites are no longer kept up so there was no way to contact the webmaster to remove it. What a pain.
Also how sad we can't set up printer friendly pages for people who like to just print out the article.
As far as a penalty for anything as small as a paragraph goes, that seems strange, as it is legal to quote a paragraph as long as you reference the source. Sure, I get tired of scammers doing it, but I appreciate it when an academic site does it.
The only time I've been hit by a dupe content was during Bourbon and that was a small site of mine that was 301 hijacked. In that case the whole site was brought down but is fine now. Is this always the case or sometimes are individual pages penalized without the penalty affecting the whole site?
Quotes are obviously fine; articles aren't.
I think that it's actually a common sense issue, and a lot of people, while they are worrying about it, could be getting on with writing original content. Not a criticism, just a suggestion.
Why would anyone want to use unoriginal content? Do what hacks do: summarise someone else's work, source it, then add your own comments. Bingo, a well-sourced original page!
|Why would anyone want to use unoriginal content? Do what hacks do: summarise someone else's work, source it, then add your own comments. Bingo, a well-sourced original page! |
Content, content, content... if your site is a bread recipe site, how many original ways are there to bake bread?
Even if I have a totally new way to bake bread, and I start my site with:
"Welcome to my bread baking site, we have some great and original ways to bake bread we're going to share with you"
How many sites may start out that way? Unless I search and read thousands of sites, I have no way of knowing if I have duplicated or nearly duplicated someone else's opening paragraph... should I be penalized for that?
As I mentioned in another post, my site is 4 years old and 100% original (to the best of my knowledge), and my Google position, which was 1 through 5 for my best keywords/phrases, is now non-existent. Unless I add &filter=0, which places me where I had been for the past 2+ years.
I don't know that the dup content filter is as bad as it's made out to be. One of the leading sites in my sector has scraped the complete content of a Canadian government site's safety pages on this particular product, pics, text and all, and they seem bulletproof: in the top five for most of the search terms and #1 for some.
Of course, they also use a hidden link setup for their supposed link out for their competitors and hidden keyword text. I've reported them for spamming twice. No action in 4 months. Even used the "gilligan keyword". go figure. I don't report anything anymore. Google doesn't really care. They have their favorite target algos.
|Also how sad we can't set up printer friendly pages for people who like to just print out the article. |
Use rel=nofollow to blind Google to the "duplicate content" on "made for printer" pages.
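Something like this on the link, say (the /print/article-42.html path is just an example). A robots meta tag on the print page itself, if your setup lets you edit that template, is an even more direct way to keep the copy out of the index:

```html
<!-- on the main article page: discourage crawling of the print copy -->
<a href="/print/article-42.html" rel="nofollow">Printer-friendly version</a>

<!-- on the print page itself: keep it out of the index entirely -->
<meta name="robots" content="noindex">
```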