Forum Moderators: Robert Charlton & goodroi
1. What percentage of shared content (images, text, HTML tags) is usually considered enough to trigger the duplicate content penalty by Google?
100% of Page A = page B
90% of page A = page B
80% of page A = page B
2. If [site1.tld...] and a misspelled version
[site1.tld...] get indexed with the same content, would it be considered duplicate content?
The same question goes for dynamic variables in the URL string.
Will this also trigger dup content, and what is the fastest way to get rid of it on the site:
return a 301, use the URL Removal Console with a 410, or create a sitemap pointing to the duplicates (with the sitemap either on the site or on a separate domain)? Also, suppose a competitor decides to do a doodoo and places a link on some foreign site to a page that reads [dom.tld...], adding some nonsense to the end of the string like &category2=dupecontent. If that gets spidered (and it will, and it will be in the index) as [dom.tld...], will both pages get a duplicate content attribute assigned to them by Google?
I am very surprised that these days it's not just websites we build but castles, where heavy armor is a must. The reason I am asking is that I found a few URLs pointing to my site that I would never have asked for, in some guestbook scripts and blogs where someone left trail URLs for no one knows what purpose.
Thanks for your input.
Blend27
Try to picture the scene in a Google conference room. A dozen of the best and brightest PhD mathematicians are seated at a table. One says, "Gentlemen, we're really getting beat up over these spammy duplicate pages. What can we do about it?"
Beavis says, "let's compare every two pages on the web. That's only 100 quadrillion comparisons."
BH says, "heh-heh, that's a lot"
Beavis says, "any pages more than 85% identical will get whacked."
BH says, "heh-heh. not enough."
Beavis says, "all right, 87%"
BH says, "heh-heh, OK."
Or, an alternative proposal:
"Gentlemen, this is Dr. Zarkon, who wrote a book on digital fingerprint analysis for Mung the Morbit, father of Ming the Merciless. He's come back to the 21st century for political asylum. He has some ideas about using Eigenvector transforms to calculate a discriminator. And this is Dr. Who, from ... (where?) ... anyway, just off a project building DNA distance spaces for all known mammalian species, to discuss the application of Gauss-Hilton spaces to long-string matching. We're hoping to find a way of combining both algorithms to get the advantages of both...."
OK, the truth is probably somewhere in between. But never assume your enemy is as ignorant and unimaginative as you are. And ... from all evidence, the Googletechs are closer to Dr. Zarkon than to Beavis.
Another case: if you take content from many websites and mix it together on one page, what then? Will Google penalize the website?
What you are describing is a scraper site and it's definitely SPAM. Almost all scraper sites I come across have AdSense ads on them. Google should ban those sites and have their AdSense account terminated.
Neither one of them corresponds to how I generally find duplicate content by hand, and (as I've already suggested) a single correlation factor between 0 and 1 MAY be the final result, but even then it would likely not be recognizable as a "percent".
1. Weed out the simpletons: Make a hash value of page title, url (directory and filename w/o domain) etc. Easily done during spidering / indexing. No matter if from the same or a different domain the same hash value(s) is a very good indication for a duplicate and would warrant further inspection. That would also be an easy way to catch all those datafed "Make-Your-Own-Amazon-Store" scripts. (That and the distinctive mod_rewritish URL format).
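A minimal sketch of that first-pass hash, just to make the idea concrete. The function name `page_fingerprint` and the choice of MD5 are my own assumptions, and nothing here is claimed to be what Google actually does:

```python
import hashlib
from urllib.parse import urlsplit

def page_fingerprint(title, url):
    """Hash the page title plus the URL path (directory and filename, no domain)."""
    # Drop scheme, host and query string; keep only the path.
    path = urlsplit(url).path or "/"
    # Collapse whitespace and case so trivial title differences do not matter.
    normalized_title = " ".join(title.lower().split())
    key = normalized_title + "\n" + path
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Two datafed stores on different domains with the same title and path collide.
same = page_fingerprint("Blue Widget", "http://a.example/shop/widget.html") == \
       page_fingerprint("blue  widget", "http://b.example/shop/widget.html")
```

A matching fingerprint would only flag the pair for further inspection; it says nothing yet about the actual page content.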
2. The actual content of a page (in this context the "pure" text without navigation, formatting etc.) would have to be treated much the same way: Build hash values of sentences or even paragraphs, same hash value might be a duplicate. (Obviously an easy way to combat this would be to randomly insert words into the textual content of the page or perhaps to introduce slight variations in spelling.)
Now what actually classifies a page as the duplicate of another page in terms of percentage of identical content is debatable. With regard to the textual content only I would set the threshold at about 80-90%.
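To show how such a threshold could be applied, here is a rough sketch that hashes sentences and measures what fraction of page A's sentences also show up on page B. The naive sentence splitter, the function names and the 0.85 default are all assumptions for illustration, not a known Google method:

```python
import hashlib
import re

def sentence_hashes(text):
    """Return the set of MD5 hashes of the normalized sentences in text."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hashes = set()
    for sentence in sentences:
        normalized = " ".join(sentence.lower().split())
        if normalized:
            hashes.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return hashes

def overlap(page_a, page_b):
    """Fraction of page A's distinct sentences that also appear on page B."""
    a = sentence_hashes(page_a)
    b = sentence_hashes(page_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

def is_duplicate(page_a, page_b, threshold=0.85):
    """True if at least `threshold` of A's sentences are found on B."""
    return overlap(page_a, page_b) >= threshold
```

For example, if three of the four sentences on page A also appear on page B, `overlap` gives 0.75 and the pair stays under an 85% threshold. Note that the measure is not symmetric: a short page copied into a long one scores high in one direction only.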
<What you are describing is a scraper site and it's definitely SPAM. Almost all scraper sites I come across have AdSense ads on them. Google should ban those sites and have their AdSense account terminated.>
And why should Google do that?
Google is interested in generating advertising revenues, not playing the policeman of the web.
Allow me to explain more....
If a surfer lands on a page whose content doesn't meet his/her expectations, then there is a high possibility that the same surfer will click on the relevant AdSense spot(s) available on that page.
Given that, G should well be able to find duplicates.
The question is what they do about them. -Larry
"Given that, G should well be able to find duplicates. The question is what they do about them."
I would suggest this approach:
(snip)
I was going to suggest in detail what I would do, but, never mind. It's not in my interest to make suggestions to G anymore.
There must be some fast efficient algorithms to determine duplicate content.
I'm confused - are you saying that Google doesn't already use duplicate content detection? I strongly disagree. Surely, the debate is about _how_ they determine duplicate content, e.g. word occurrences, substring matching, HTML or text only, Bayesian methods etc.
--------------------------------------------------------
2. If [site1.tld...] and a misspelled version
[site1.tld...] get indexed with the same content, would it be considered duplicate content?
Does the same question go for dynamic variables in the URL string?
--------------------------------------------------------
Take a page and strip all html, javascript and other data that
isn't directly viewable by a person browsing the page.
Group the remaining text into logical chunks (paragraphs,
cells of tables, etc.)
Store hashes of these logical chunks associated with the url
of the page. (Use something like MD5, but for these
purposes, you could use an even smaller hash)
When you have hashes for several urls from a site, you can
discount hashes that appear on more than x% of the pages,
or on a set number of pages. This would probably be a
fairly low number, as any of these hashes would likely
be navigational text, copyright notice or other template
driven text.
Now you can search all urls for duplicate hashes. If a page
doesn't have any unique hashes, it would be a duplicate
page.
If a page has duplicate hashes from multiple sites, it is
a scraper.
An algorithm like this would be easy to write, not very
computationally intensive, and fairly accurate, so there
is no reason Google isn't doing something like this. There
is also no reason that Google can't weed out the scraper
sites.
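The steps above can be sketched roughly like this. It assumes the pages have already been stripped to visible text with chunks separated by blank lines; `classify`, `template_fraction` and `min_pages` are names I made up for the sketch, and it is a literal reading of the rules, not anything Google is known to run:

```python
import hashlib
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def chunk_hashes(text):
    """Hash each blank-line-separated chunk of already-stripped page text."""
    hashes = set()
    for chunk in text.split("\n\n"):
        normalized = " ".join(chunk.split())
        if normalized:
            hashes.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return hashes

def classify(pages, template_fraction=0.5, min_pages=3):
    """pages maps url -> visible text; returns url -> 'unique', 'duplicate' or 'scraper'."""
    page_hashes = {url: chunk_hashes(text) for url, text in pages.items()}
    site_of = {url: urlsplit(url).netloc for url in pages}

    # Discount hashes seen on more than template_fraction of a site's pages
    # (navigation, copyright notices and other template text).
    by_site = defaultdict(list)
    for url in pages:
        by_site[site_of[url]].append(url)
    for urls in by_site.values():
        if len(urls) < min_pages:
            continue  # too few pages to tell template text from content
        counts = Counter(h for u in urls for h in page_hashes[u])
        template = {h for h, n in counts.items() if n / len(urls) > template_fraction}
        for u in urls:
            page_hashes[u] -= template

    # Index every remaining hash by the urls it appears on.
    owners = defaultdict(set)
    for url, hashes in page_hashes.items():
        for h in hashes:
            owners[h].add(url)

    result = {}
    for url, hashes in page_hashes.items():
        # Other sites that share at least one chunk with this page.
        other_sites = {site_of[o] for h in hashes for o in owners[h]} - {site_of[url]}
        has_unique = any(len(owners[h]) == 1 for h in hashes)
        if len(other_sites) >= 2:
            result[url] = "scraper"
        elif hashes and not has_unique:
            result[url] = "duplicate"
        else:
            result[url] = "unique"
    return result
```

Read this literally, the sketch cannot tell an original from its copy: both pages of a duplicate pair are flagged, and an article that legitimately appears on three sites gets all three pages marked as scrapers.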
<If a page has duplicate hashes from multiple sites, it is a scraper.>
No, please don't do that :(
I have permissions of several marketers and SEO specialists to host some of their articles on pages on my site. Some of these kind authors have the same articles published on pages on their sites too.
Does that make my site a scraper?
Your algo assumes that the scraper grabs a unit (paragraph block etc.) that exactly matches the one used to make the hash and that no changes are made. Even putting the word "Extract:" in front of the paragraph would defeat it. Trimming to N words would defeat it etc.
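A quick illustration of that objection, assuming MD5 over a whitespace-normalized paragraph as the chunk hash (the sample paragraph and the name `chunk_hash` are made up for the demo):

```python
import hashlib

def chunk_hash(paragraph):
    """MD5 of a paragraph after collapsing whitespace."""
    normalized = " ".join(paragraph.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

original = "Duplicate content detection relies on comparing hashes of paragraphs."
prefixed = "Extract: " + original            # scraper adds a label in front
trimmed = " ".join(original.split()[:6])     # scraper keeps only the first 6 words

matches = {
    "identical copy": chunk_hash(original) == chunk_hash(original),
    "prefixed copy": chunk_hash(prefixed) == chunk_hash(original),
    "trimmed copy": chunk_hash(trimmed) == chunk_hash(original),
}
```

Only the byte-for-byte copy matches; both the prefixed and the trimmed versions hash to something else, so an exact-chunk scheme would miss them.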
Interesting, but probably quite limited in what it would detect.
Does that make my site a scraper?
I meant if you have content from multiple sites on the same page. If my page on site A has multiple pieces of content, one of which is from site B and another from site C, I would classify it as a scraper. If site A, site B and site C all had the same content (as in your situation), I would call it duplicate content. In both cases you need to know when the content was first seen in the wild, to avoid penalising the site where the content originated.
Our business authoritative site hosts about 20,000 pages of "duplicate" content (corporate press releases) but every time you search for a pr title, you will find it #1 on Google (above the real company-author of it).
I think that overall site quality is the most important factor for G before it considers penalizing any site.
-- quality is the most important factor --
Well, I have about 1200 pages of content that I added in the past 2 years, and almost every one of my competitors would die to have a site like mine. There are only 2 other websites that have something close, and one of them is what inspired me to redesign my site last year, which was better than a lot of other sites anyway. That makes me wonder why I am in the sandbox: Google traffic = 0, while the Google spider visits the home page and all the major pages every day, sometimes twice a day, and the entire site gets recrawled bi-weekly. All product pages are TPR3 or 4, the rest are 2.
this thread so far:
1. threshold at about 80-90%
2. URIs are case sensitive - if so, is it dup. content?
Would anyone else care to jump in?