Home / Forums Index / Google / Google News Archive

Google News Archive Forum

This 44 message thread spans 2 pages.
Google raised their content duplication detection bar on this update
Something new in this update that may help you if your site is not showing up yet...

 7:39 am on May 13, 2003 (gmt 0)

I do foreign-language SEO, and over the years more and more of my clients have been paying attention to Google and to getting their English websites spidered by this powerful SE. I have been studying Google for some time now, and I have noticed something new in this update...

Quite a few of my clients have disappeared from the -sj and www3 sites... Although the update has not finished yet, I strongly believe their sites will not be included in it. So I dug in to find what is causing it. After hours of exploration, I found one similarity among these troubled clients' sites... They all seem to have "copied" some content from elsewhere... although some of them have the permission of the original authors of the articles (mostly new scientific research articles). Over the past months I have told their web editors to make sure they do not use content taken directly from other websites, and they have modified some of the content and layout quite nicely. And they had no problem with Google. Until this mysterious update... IT SEEMS THEY ARE CAUGHT BY A NEW FILTER! So my guess is that Google has spent quite some time updating their algo, and the new algo contains a much more powerful duplicate content detector...

For those of you who still can't find your sites on www3 and -sj, you may want to make sure you do not have duplicated content, even if it is copyright-"legit" and had no problem before...

Please do not flame me; this is only my theory, and it may or may not be true.



 8:31 am on May 13, 2003 (gmt 0)

Hi sohu8976

An interesting point. Can you quantify "copy"? Are we talking a number of key phrases, 2-3 paragraphs...?




 6:20 pm on May 13, 2003 (gmt 0)

Hi Rich:

If you search this board, you will find hundreds of discussions about the amount of duplication that may get you banned, and no one can provide an exact ratio. Because of the "ban the site, not just the pages" spam policy, the price is just too high to test things like this. GoogleGuy obviously will not reveal this inside secret to outsiders either.

However, in my experience Google has been quite generous regarding this issue UNTIL this update... It seems they are tightening it up. For example, before, if 10 pages out of a total of 100 on your site had content duplicated from elsewhere (that's 10%), you were safe. Now, however, you may get kicked by the new algo. I still can't tell you how much they have tightened the policy, but I am almost sure they have done something to enforce a more restrictive duplicate content policy. :)

Maybe we can wait till this update finishes and then do a poll to check whether this is true... So far I have heard an increasing number of webmasters mention that their sites have "vanished" from www3... I think this may be one of the causes...


 7:19 pm on May 13, 2003 (gmt 0)

Sohu, interesting discussion :)

Isn't there one fundamental flaw with your theory? How could Google possibly tell which site has the original content (the one the other sites have copied)?

If Google penalised the sites that copied, then surely there would be a very good chance the original site would get penalised too?

If this were the case, it would be oh-so easy to take out your competition: simply set up a website and copy all of their content. Sure, this site will get penalised, but so will all your competition. If your theory is true, how would Google get around this?

Does this make sense?



 7:28 pm on May 13, 2003 (gmt 0)

Hi Chris:

Thanks for joining this topic. Well, one thing for sure is that Google and many other SEs must have "dup content filters" installed in order to show the audience clean results. Otherwise everyone could just copy/duplicate the #1 positioned site for a particular keyword. Now, how does Google differentiate which is the original? They obviously have their own way, but a simple way to do it is to see which page has a higher PR or which site has a higher rank (they may keep a database of credits for sites on the SE, like the credit records of our bank accounts, etc.)...
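For what it's worth, the classic textbook way to build such a filter is w-shingling with Jaccard similarity (Broder's method). Whether Google does anything like this is pure speculation; the sample sentences and the 4-word window below are only illustrative:

```python
# Textbook near-duplicate detection via w-shingling; NOT a claim about
# Google's actual filter. The example sentences and window size w=4
# are made up for illustration.

def shingles(text, w=4):
    """Return the set of w-word shingles (word n-grams) in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original  = "the quick brown fox jumps over the lazy dog near the river bank"
copied    = "the quick brown fox jumps over the lazy dog near the river bank"
rewritten = "a speedy brown fox leapt over a sleepy dog by the river bank"

print(jaccard(shingles(original), shingles(copied)))     # 1.0 -- exact copy
print(jaccard(shingles(original), shingles(rewritten)))  # 0.0 -- no shared 4-word shingle
```

A filter like this flags exact copies immediately, while a genuine rewrite falls well below any reasonable similarity threshold.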

:) SoHu


 7:46 pm on May 13, 2003 (gmt 0)

Hi Sohu, thanks for the reply. I'm not saying your theory is right or wrong, I'm just interested! I'm sure you're right that there are some sort of "dup content filters"; the question is to what extent these filters function.

If they are penalising, it seems to be somewhat random, unless the filters haven't fully kicked in. For example, I pasted some random content from Amazon into a Google search.


I could give you endless amounts of duplicated content that Google is serving up from many different sites.

So where does that leave us? (I'm scratching my head! ;)


 8:00 pm on May 13, 2003 (gmt 0)

Hi Chris:

Thanks for the URL, it's a very good example. My guess is that Google's dup filter must have some detailed rules, such as: all content on Amazon can be duplicated? :D Book names, song names, people's names, descriptions, etc. can all be duplicated? Hehe, would any experienced SEO on this board like to share their opinions?

Once again, let's pick GOOGLE's brain... :p


 8:11 pm on May 13, 2003 (gmt 0)

hi sohu8976

I think your theory is right; I have already seen it happen prior to this update. Google seems to recognize 'snippets' and removes from its index all URLs that contain the copied content. Only the original source (URL) remains in the index. I'm afraid none of those URLs have returned (so far at least). My URLs were gone about 4 weeks ago and haven't returned yet, nor on www2 or -sj.


 8:50 pm on May 13, 2003 (gmt 0)

I'm not at all sure how extensive the dup content filters are. I use a page with a single variable, and another site links to these pages based on the value of the variable. If my page calls it "template.asp?var=Widget", the other site tends to call it "template.asp?var=WIDGET".

The pages are identical at any moment each is called from my server, yet they are both in the Google database. It has actually driven me quite mad.

So please tell me how this isn't caught by the dup content filter?
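One plausible (unconfirmed) explanation: only the scheme and host of a URL are case-insensitive, so a crawler's URL canonicalizer cannot safely merge "var=Widget" with "var=WIDGET"; they stay two distinct addresses unless a content-level filter merges them later. A minimal sketch, with hypothetical example URLs:

```python
# Why "var=Widget" and "var=WIDGET" count as two URLs: a canonicalizer
# may lowercase the scheme and host, but not the path or query string.
# (The URLs below are hypothetical examples.)
from urllib.parse import urlsplit

def canonicalize(url):
    """Lowercase only the parts of a URL that are case-insensitive."""
    p = urlsplit(url)
    return p._replace(scheme=p.scheme.lower(), netloc=p.netloc.lower()).geturl()

a = canonicalize("HTTP://Example.com/template.asp?var=Widget")
b = canonicalize("http://example.com/template.asp?var=WIDGET")
print(a)       # http://example.com/template.asp?var=Widget
print(b)       # http://example.com/template.asp?var=WIDGET
print(a == b)  # False -- the query strings still differ
```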


 9:06 pm on May 13, 2003 (gmt 0)

Thanks WebMeester for the confirmation. :)

taxpod: I think this has something to do with dynamic content... I once duplicated an entire site for testing purposes using PHP; it was copying content from all over the place, and since it was for testing, I was doing all kinds of illegal stuff on it... Believe it or not, since it was PHP, Google actually did not mind, not even the hidden links. :p

(I am not suggesting people go ahead and do these things, and I am sure Google has improved since then, so do not test your site like this. :))


 9:17 pm on May 13, 2003 (gmt 0)

I just checked a few sites I know are duplicates and some that are almost duplicates and they are all still in both -sj and -fi.


 9:21 pm on May 13, 2003 (gmt 0)

Hrm... This is interesting... Then my guess is that Google is "targeting" a group of "duplicators" by geographical location, content category (music, movies, etc.), or something similar... I am sure they can do things like this...


 9:29 pm on May 13, 2003 (gmt 0)

Now, how does Google differentiate which is the original? They obviously have their own way, but a simple way to do it is to see which page has a higher PR or which site has a higher rank

I don't think so, brother. They had better do it by a date/cache system or they will have loads of complaints and maybe a few lawsuits.

For example, I've got copyrighted articles and marketing copy floating around that are included on our main site as well as on higher-PR ones that picked them up later.

So, you think they have some right to give the placement for MY article from MY site to someone else just because their PR is higher? Again, I don't think so. Legally, I know so, as copyright law is quite specific. Google had better think about the ramifications, too...


 9:40 pm on May 13, 2003 (gmt 0)

Hi Shoestring:

It was just a guess of mine; obviously Google must have some more sophisticated algos to do this kind of task. Like I said in other posts, they may take date, time, and geographical location into account (cache does not work, because when the content of a page changes, the cache gets updated...), and they may also have some kind of credit ranking going on (something we may not know about), etc... This is only a guess. :)


 9:49 pm on May 13, 2003 (gmt 0)

I also agree that Google keeps the oldest one they have indexed, regardless of the PR value.

That said, you can also always file a DMCA complaint through Google if your content shows up on someone else's site in Google's search results. [google.com...]


 10:05 pm on May 13, 2003 (gmt 0)

Hi there. Can someone please tell me: if a site is banned for duplicating content and the owner just gives up the domain and starts a fresh new one, can he/she still use the clean (non-duplicated) content from the old site? Does he/she need to shut down the old domain (or delete that clean content from it)?

Since the old domain is not spiderable (or not of interest to Google) anymore, it should be safe to move some "clean" content out of the messed-up one, right?

Anyone? Admins? GG?


 4:29 am on May 14, 2003 (gmt 0)

Nice article, SoHu. I somewhat see your point and agree with you. Unfortunately, I can't answer your question; maybe someone here is experienced enough to answer it. :)


 4:36 am on May 14, 2003 (gmt 0)


To answer your question: dup content will not cause a site to be banned. However, the page will be removed from the index. If the content changes, it will reappear. I know this to be the case because I accidentally placed one of my sites' index pages in the wrong folder on my server, effectively duplicating the two sites. As a result, Google dropped one of the pages (the one with the lower PR). I noticed my mistake, corrected it, and the next month it was back in the index.


 4:42 am on May 14, 2003 (gmt 0)

Hi Alan:

I don't think that will be the case if Google considers the duplication a violation of copyright...


 4:44 am on May 14, 2003 (gmt 0)

I see a few problems with this:

sites that present news from wire sources
sites that have press releases
sites that subscribe to the same content provider
sites that have any sort of affiliate program that lets them use pre-made reviews, etc.
sites that just plain have permission.


 4:52 am on May 14, 2003 (gmt 0)


Could you please clarify your points? Are you saying duplicate content from the sites you listed will be a problem?


 4:54 am on May 14, 2003 (gmt 0)

sohu... interesting theory. It seems it should be easy to find a bunch of "directory type" sites that just spit out Open Directory pages... which would be almost entirely duplicated...


 4:55 am on May 14, 2003 (gmt 0)

had to make another post to hit 100!


 5:32 am on May 14, 2003 (gmt 0)

jbauder: your last msg should be considered "spamming"... :)


 5:45 am on May 14, 2003 (gmt 0)

dididudu ...

if my message was spam then your username is duplicate content ;-)

Maybe we should send googleguy spam reports on each other


 7:15 am on May 14, 2003 (gmt 0)

I can confirm the theory: I had a site removed completely from the index right after I added loads of new pages that were copied from another website. The problem, however, is that the "other" website was mine as well; I am now being punished although I am the copyright owner. If Google had just filtered out the duplicate content it would have been OK, but now a complete site is banned which held other original content as well. I'm utterly annoyed; it's not spamming in my opinion, but it is being treated this way :-(


 9:27 am on May 14, 2003 (gmt 0)

After reading this thread about the duplicate content filter, I made some tests regarding detection with the "repeat the search with the omitted results included" link at the end of the keyword SERPs (I explained it in [webmasterworld.com...]).

I also tested the results for a key phrase where I got very spammy results in the past.

These tests have shown that Google omits results with duplicate snippets (it seems to be a little stricter than in the past). This means the pages are still in the index; being omitted is not a permanent penalty, and it could be a big improvement for the Google SERPs.

But if there is only a minimal difference in the text, for example a different order of keywords or one additional word, then the pages are in the SERPs: very hard SE spam, created with doorway pages :-(
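The behaviour described here (exact-duplicate snippets get omitted, while a one-word change slips through) is what you would expect if the filter compares exact fingerprints rather than a similarity score. A toy contrast, with made-up sentences and an arbitrary 0.8 threshold:

```python
# Toy contrast between exact snippet fingerprinting and a similarity
# score. The sentences and the 0.8 threshold are made up; how Google's
# real filter works is speculation in this thread.
import hashlib

def snippet_hash(text):
    """Exact fingerprint: any one-character change alters the hash."""
    return hashlib.md5(text.lower().encode()).hexdigest()

def word_overlap(a, b):
    """Crude similarity: shared distinct words over total distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

page    = "cheap widget store with the best widget prices online today"
doorway = "cheap widget shop with the best widget prices online today"

print(snippet_hash(page) == snippet_hash(doorway))  # False -- one word evades the fingerprint
print(word_overlap(page, doorway) >= 0.8)           # True  -- caught by a similarity threshold
```

A fingerprint match is cheap but trivially evaded; a similarity threshold would catch the doorway-page trick, at a much higher computational cost.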


 3:11 pm on May 14, 2003 (gmt 0)

if there is only a minimal difference in the text, for example a different order of keywords or one additional word, then the pages are in the SERPs

I expect this will be fixed in the near future. If Google is strictly removing dupe-content sites, they will tweak it.


 3:18 pm on May 14, 2003 (gmt 0)

What about duplication within a site, e.g. template pages with small variations between them? Would this be considered duplication?


 4:48 pm on May 14, 2003 (gmt 0)

:( Still no one answers my question... This is supposed to be the biggest SE board on the net...

Question restated:
"If a site is banned for duplicating content, and the owner just gives up the domain and starts a fresh new one, can he/she still use the clean (non-duplicated) content from the old site? Does he/she need to shut down the old domain (or delete that clean content from it)?"

But it's good to see many of you support my theory. :) I'm sure Google's new filter can help searchers more (which is what they care about anyway), but the spammy results on -sj just don't cut it... I hope GG is aware of this.

