But Google can't tell a close dupe from a similar page. I have pages that are 99.9% similar which it likes and indexes, and pages that are 80% similar which it refuses to index separately.
How the heck it is actually doing this is anybody's guess.
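One common way to estimate how similar two pages are (this is a guess at the general technique, not necessarily what Google actually does) is word shingling with Jaccard overlap: break each page into overlapping runs of words and measure how many runs the two pages share. A minimal sketch, assuming plain-text page bodies and a made-up shingle size of 4:

```python
def shingles(text, k=4):
    """Break text into the set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(a, b, k=4):
    """Jaccard overlap of two pages' shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two pages that differ by a single word still share most shingles.
page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy cat near the river bank"
print(round(jaccard_similarity(page_a, page_b), 2))
```

Note how a one-word change knocks out several shingles at once, which is why even "99.9% similar" pages can score quite differently depending on where the differences fall.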
My theory is why worry? Throw them all into the cooking pot and most will come out tasting good :)
Also, Google would not entangle itself much in this when there are thousands of spam techniques that need to be addressed first on its priority list.
People are smart enough not to simply copy the content as-is; instead they are:
1) mixing copied content with some original text
2) changing/tweaking the language here and there
3) partial copying, etc.
As far as the duplicate content penalty is concerned, Google may impose it in cases where the duplication happens within a site or on the same IP, assuming both belong to the same webmaster.
I was also tracking a competitor's site that created 300 pages with the same content, except for the title tag. At first it ranked very well for all 300 pages, but after a month the whole site was out of the index.
I can share it if somebody stickies me.
Edited - added the last lines :-)
I submitted SPAM and copyright complaints to Google. I finally got the site offline, by going to the web hosting company... but Google still has all of these copied pages in the index.
If you follow the instructions on submitting a DMCA complaint (has to be by fax or post), Google will definitely remove the copied content from the index.
One client had a .co.uk and a .com site that were aliased at the server. One morning they woke to find that the .com site was out of the index but the .co.uk site was fine, being a UK company. They deleted the .co.uk site and found that they were completely unreachable although the .uk pages were still in the index. It took an email to Google to get that sorted and now they only use .com.
So today I faxed off a DMCA complaint to Google. They want you to list every search phrase that shows the offending pages in the results, and they want you to list every page that has copied content. I listed a couple, then told Google to do a Google search for "site:mysite.com widget", which shows the over 430 pages from my site that were copied in their entirety (including a press release announcing my site starting up!).
Man... what a pain. This is unreal. Due to the extensive Google spamming that this guy does, he has apparently given Google the impression that his pages are the original, and mine are duplicates. My PR dropped from 6 to 3, then came back to 4. His copied versions of my pages appear above my pages for every possible search phrase.
It certainly has me thinking more about what I can do to protect my content. I guess I need to implement some filters to block automated web bots like wget, or perhaps serve my pages dynamically so they cannot be so easily copied (does this help?)
Any tips on preventing this kind of wholesale web copying would be greatly appreciated.
Expect less from Google . . . less in the way of duplicate results!
Thanks to some engineering wizardry, we've dramatically reduced those pesky duplicate entries. This means better results returned with each search query.
Another improvement you may notice is a reduction in the number of returns from a single site. This means even if there are thousands of relevant pages on a single computer, you'll only get the first two, plus a link to "more results from host.com". In the old days you might have waded through multiple pages of results from one machine before getting to the next entry. Try searching on "java" and you'll see why this is so important.
The <filter> parameter causes Google to filter out some of the results for a given search. This is done to enhance the user experience on Google.com, but for your application, you may prefer to turn filtering off in order to get the full set of search results. When enabled, filtering takes the following actions:
Near-Duplicate Content Filter = If multiple search results contain identical titles and snippets, then only one of the documents is returned.
Host Crowding = If multiple results come from the same Web host, then only the first two are returned.
The above two entries prove that Google has the ability to detect duplicate page content. But they also show that duplicate pages are still indexed and ranked.
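The two filtering behaviours described above can be sketched in a few lines. This is an illustrative reimplementation, not Google's actual code; the result-tuple layout, the 2-per-host limit, and the example hosts are all assumptions:

```python
def filter_results(results, max_per_host=2):
    """Apply near-duplicate filtering and host crowding to a ranked result list.

    Each result is a (host, title, snippet) tuple, already in rank order.
    """
    seen_signatures = set()   # for the near-duplicate filter
    per_host_count = {}       # for host crowding
    filtered = []
    for host, title, snippet in results:
        signature = (title, snippet)
        if signature in seen_signatures:
            continue  # identical title + snippet: keep only the first copy
        if per_host_count.get(host, 0) >= max_per_host:
            continue  # host already used its two slots ("more results from ...")
        seen_signatures.add(signature)
        per_host_count[host] = per_host_count.get(host, 0) + 1
        filtered.append((host, title, snippet))
    return filtered

results = [
    ("host.com", "Widgets", "All about widgets"),
    ("host.com", "Gadgets", "All about gadgets"),
    ("host.com", "Gizmos", "All about gizmos"),     # dropped: host crowding
    ("other.com", "Widgets", "All about widgets"),  # dropped: duplicate title/snippet
]
print(filter_results(results))
```

Note that the near-duplicate filter keeps whichever copy ranks first, which is exactly the complaint earlier in this thread: if the copier outranks you, your original is the one that gets filtered out.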