Duplicate Content Penalty Questions - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Duplicate Content Penalty Questions

For Google Search Engine

blend27

7:19 am on May 18, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

1. Let's say I have a page A and a page B, both with dynamic content based on the products.

What percentage of content, including images, text, and HTML tags, is usually considered enough to trigger a duplicate content penalty by Google?

100% of Page A = page B
90% of page A = page B
80% of page A = page B

2. If [site1.tld...] and a misspelled version,
[site1.tld...], get indexed with the same content, would that be considered duplicate content?

The same question goes for dynamic variables in the URL string:

[dom.tld...] and
[dom.tld...]

Will this also trigger duplicate content, and what is the fastest way to get it off the site:
return a 301, use the URL Removal Console with a 410, or create a sitemap pointing to the duplicates (with the sitemap either on the site or on a separate domain)? Also, suppose a competitor decides to play dirty and places a link from some foreign site page to a page that reads [dom.tld...], adding some nonsense to the end of the string like &category2=dupecontent, so that it gets spidered (and it will, and it will be in the index) like this [dom.tld...]. Will both pages get a duplicate content attribute assigned to them by Google?

I am very surprised that these days it's not just websites we build but castles, where heavy armor is a must. The reason I am asking is that I found a few URLs pointing to my site that I would never have asked for, in some guestbook scripts and blogs where someone left trail URLs for no one knows what purpose.

Thanks for your input.

Blend27

blend27

12:21 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Thanks

alexo

12:37 am on May 19, 2005 (gmt 0)

10+ Year Member

hello

It's really an important question.

What percentage of content, including images, text, and HTML tags, is usually considered enough to trigger a duplicate content penalty by Google?

1 paragraph
1 sentence
1 whole article
?

hutcheson

3:40 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'd bet any amount you care to name that it has nothing to do with percentage.

Try to picture the scene in a Google conference room. A dozen of the best and brightest PhD mathematicians are seated at a table. One says, "Gentlemen, we're really getting beat up over these spammy duplicate pages. What can we do about it?"

Beavis says, "let's compare every two pages on the web. That's only 100 quadrillion comparisons."

BH says, "heh-heh, that's a lot"

Beavis says, "any pages more than 85% identical will get whacked."

BH says, "heh-heh. not enough."

Beavis says, "all right, 87%"

BH says, "heh-heh, OK."

Or, an alternative proposal:

"Gentlemen, this is Dr. Zarkon, who wrote a book on digital fingerprint analysis for Mung the Morbit, father of Ming the Merciless. He's come back to the 21st century for political asylum. He has some ideas about using Eigenvector transforms to calculate a discriminator. And this is Dr. Who, from ... (where?) ... anyway, just off a project building DNA distance spaces for all known mammalian species, to discuss the application of Gauss-Hilton spaces to long-string matching. We're hoping to find a way of combining both algorithms to get the advantages of both...."

OK, the truth is probably somewhere in between. But never assume your enemy is as ignorant and unimaginative as you are. And ... from all evidence, the Googletechs are closer to Dr. Zarkon than to Beavis.

Nuttakorn

4:46 am on May 19, 2005 (gmt 0)

10+ Year Member

What if the text content is the same but differs in position and amount? Site A might be full of content while Site B has only a little, but the content itself is the same. The two sites also differ in structure and design.

Another case: if you take content from many websites and mix it together on one page, what then? Will Google penalize the website?

arran

6:45 am on May 19, 2005 (gmt 0)

10+ Year Member

Another case: if you take content from many websites and mix it together on one page, what then? Will Google penalize the website?

Hopefully

msja

8:07 am on May 19, 2005 (gmt 0)

10+ Year Member

Another case: if you take content from many websites and mix it together on one page, what then? Will Google penalize the website?

What you are describing is a scraper site and it's definitely SPAM. Almost all scraper sites I come across have AdSense ads on them. Google should ban those sites and have their AdSense account terminated.

hutcheson

8:18 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There are two approaches I can think of (and I'm NOT a Googletech, and might have overlooked something). One of them would detect cases like that, and the other one wouldn't.

Neither one of them corresponds to how I generally find duplicate content by hand, and (as I've already suggested) a single correlation factor between 0 and 1 MAY be the final result, but even then it would likely not be recognizable as a "percent".

midlifecrisis

10:08 am on May 19, 2005 (gmt 0)

10+ Year Member

Obviously G can't directly compare each page in the index with each other. Given that duplicate content often goes hand in hand with datafed, affiliate-type sites my take on this would be a two step solution (if I were G):

1. Weed out the simpletons: Make a hash value of page title, url (directory and filename w/o domain) etc. Easily done during spidering / indexing. No matter if from the same or a different domain the same hash value(s) is a very good indication for a duplicate and would warrant further inspection. That would also be an easy way to catch all those datafed "Make-Your-Own-Amazon-Store" scripts. (That and the distinctive mod_rewritish URL format).

2. The actual content of a page (in this context the "pure" text without navigation, formatting etc.) would have to be treated much the same way: Build hash values of sentences or even paragraphs, same hash value might be a duplicate. (Obviously an easy way to combat this would be to randomly insert words into the textual content of the page or perhaps to introduce slight variations in spelling.)

Now what actually classifies a page as the duplicate of another page in terms of percentage of identical content is debatable. With regard to the textual content only I would set the threshold at about 80-90%.
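
A minimal Python sketch of the two steps above, for illustration only. The function names, the choice of MD5, and the naive sentence splitter are all assumptions made for this example; nothing here is known to be what G actually does.

```python
import hashlib
import re
from urllib.parse import urlsplit


def page_key(title, url):
    """Step 1: hash the page title plus the URL path, ignoring the domain."""
    path = urlsplit(url).path
    raw = f"{title.strip().lower()}|{path}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def sentence_hashes(text):
    """Step 2: hash each sentence of the 'pure' text (naive split on . ! ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return {
        hashlib.md5(" ".join(s.lower().split()).encode("utf-8")).hexdigest()
        for s in sentences
        if s
    }


def shared_fraction(text_a, text_b):
    """Fraction of page A's sentences that also appear on page B."""
    a, b = sentence_hashes(text_a), sentence_hashes(text_b)
    return len(a & b) / len(a) if a else 0.0
```

Two datafed pages with the same title and path on different domains get the same page_key, which would flag them for closer inspection. shared_fraction then gives the figure that would be compared against a threshold such as the 80-90% suggested above.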

reseller

10:40 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

msja

<What you are describing is a scraper site and it's definitely SPAM. Almost all scraper sites I come across have AdSense ads on them. Google should ban those sites and have their AdSense account terminated.>

And why should Google do that?

Google is interested in generating advertising revenues, not playing the policeman of the web.

Allow me to explain more....

If a surfer lands on a page whose content doesn't meet his/her expectations, there is a high probability that the same surfer will click on the relevant AdSense spot(s) available on that page.

blend27

10:52 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

reseller, I am actually not interested in scrapers or AdWords.

ncgimaker

11:03 am on May 19, 2005 (gmt 0)

10+ Year Member

Bayesian, or CRM114 with insufficient human interaction.

larryhatch

11:04 am on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There must be some fast efficient algorithms to determine duplicate content.
How does Copyscape work? They come back with scraped content pretty darned fast I think.
Copyscape doesn't have the resources of Google, one of which is CACHED copies of most of the web!

Given that, G should well be able to find duplicates.
The question is what they do about them. -Larry

ncgimaker

12:39 pm on May 19, 2005 (gmt 0)

10+ Year Member

I think Copyscape just uses search engines at the back end.

"Given that, G should well be able to find duplicates. The question is what they do about them."

I would suggest this approach:

(snip)

I was going to suggest in detail what I would do, but never mind. It's not in my interests to make suggestions to G anymore.

arran

12:55 pm on May 19, 2005 (gmt 0)

10+ Year Member

There must be some fast efficient algorithms to determine duplicate content.

I'm confused - are you saying that Google doesn't already use duplicate content detection? I strongly disagree. Surely the debate is about _how_ they determine duplicate content, e.g. word occurrences, substring matching, HTML or text only, Bayesian methods etc.

blend27

1:04 pm on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

So, Any Ideas on

--------------------------------------------------------
2. If [site1.tld...] and a misspelled version,
[site1.tld...], get indexed with the same content, would that be considered duplicate content?

Does the same go for dynamic variables in the URL string?

[dom.tld...] and
[dom.tld...]

--------------------------------------------------------

arran

1:12 pm on May 19, 2005 (gmt 0)

10+ Year Member

The http protocol is case insensitive so this would make no difference.

blend27

1:25 pm on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

-The http protocol is case insensitive so this would make no difference.-

I just found cached links in Google to pages called

www.dom.tld/Privacy.html
and
www.dom.tld/privacy.html

Do these pages count as duplicate content?

mrMister

2:04 pm on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The http protocol is case insensitive so this would make no difference.

The HTTP protocol is irrelevant. We're talking about URIs.

URIs are case sensitive.
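
To make the distinction concrete, here is a small sketch using Python's standard urllib.parse. Per RFC 3986 the scheme and host compare case-insensitively, but the path is case-sensitive, so /Privacy.html and /privacy.html are different URIs even if a particular server happens to return the same page for both. The normalize helper is a hypothetical name made up for this example, and it assumes the URL carries no user:password part.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase only the case-insensitive parts of a URL: scheme and host.

    The path, query and fragment are left untouched because they are
    case-sensitive as far as the URI itself is concerned.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )


upper = normalize("HTTP://WWW.Dom.tld/Privacy.html")
lower = normalize("http://www.dom.tld/privacy.html")
print(upper)            # host is lowercased, path keeps its capital P
print(upper == lower)   # False: two distinct URIs
```

So a crawler that treats URIs by the book sees two addresses, and if both return the same content it has two copies of one page to reconcile.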

sja65

2:07 pm on May 19, 2005 (gmt 0)

10+ Year Member

If I were writing a search engine, this is how I would code a
duplicate content checker.

Take a page and strip all HTML, JavaScript and other data that
isn't directly viewable by a person browsing the page.

Group the remaining text into logical chunks (paragraphs,
cells of tables, etc.)

Store hashes of these logical chunks associated with the url
of the page. (Use something like MD5, but for these
purposes, you could use an even smaller hash)

When you have hashes for several urls from a site, you can
discount hashes that appear on more than x% of the pages,
or on a set number of pages. This would probably be a
fairly low number, as any of these hashes would likely
be navigational text, copyright notice or other template
driven text.

Now you can search all urls for duplicate hashes. If a page
doesn't have any unique hashes, it would be a duplicate
page.

If a page has duplicate hashes from multiple sites, it is
a scraper.

An algorithm like this would be easy to write, not very
computationally intensive, and fairly accurate, so there
is no reason Google isn't doing something like this. There
is also no reason that Google can't weed out the scraper
sites.
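
A rough Python sketch of the checker described above, under several
assumptions: the text has already been stripped of markup, chunks are
paragraphs separated by blank lines, and the names classify and
template_share (standing in for the "x%" above) are invented for this
example. It is an illustration of the idea, not a claim about what
Google runs.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def chunk_hashes(text):
    """Hash each blank-line-separated chunk of already-stripped text."""
    chunks = (" ".join(c.split()).lower() for c in text.split("\n\n"))
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks if c}


def classify(pages, template_share=0.5):
    """pages maps url -> visible text; returns url -> label."""
    site_of = {url: urlsplit(url).netloc.lower() for url in pages}
    hashes = {url: chunk_hashes(text) for url, text in pages.items()}

    # Count pages per site and, per site, how many pages carry each hash.
    pages_per_site = defaultdict(int)
    hash_count = defaultdict(int)
    for url, hs in hashes.items():
        pages_per_site[site_of[url]] += 1
        for h in hs:
            hash_count[(site_of[url], h)] += 1

    def is_template(site, h):
        # Navigation, copyright notices etc. repeat across a site's pages.
        n = hash_count[(site, h)]
        return n > 1 and n / pages_per_site[site] > template_share

    content = {
        url: {h for h in hs if not is_template(site_of[url], h)}
        for url, hs in hashes.items()
    }

    # Index which URLs carry each remaining content hash.
    owners = defaultdict(set)
    for url, hs in content.items():
        for h in hs:
            owners[h].add(url)

    result = {}
    for url, hs in content.items():
        shared = {h for h in hs if len(owners[h]) > 1}
        other_sites = {site_of[u] for h in shared for u in owners[h]}
        other_sites -= {site_of[url]}
        if len(other_sites) >= 2:
            result[url] = "scraper"      # duplicate hashes from multiple sites
        elif hs and shared == hs:
            result[url] = "duplicate"    # no unique hashes left
        else:
            result[url] = "unique"
    return result
```

Two limitations of this literal reading: it labels the original and
the copy alike as "duplicate" because it has no notion of which was
seen first, and it calls an article that appears on three sites a
"scraper" on all three.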

reseller

2:16 pm on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

sja65

<If a page has duplicate hashes from multiple sites, it is a scraper.>

No, please don't do that :(

I have permission from several marketers and SEO specialists to host some of their articles on pages on my site. Some of these kind authors have the same articles published on pages on their sites too.

Does that make my site a scraper?

ncgimaker

2:28 pm on May 19, 2005 (gmt 0)

10+ Year Member

sja65,

Your algo assumes that the scraper grabs a unit (paragraph block etc.) that exactly matches the one used to make the hash and that no changes are made. Even putting the word "Extract:" in front of the paragraph would defeat it. Trimming to N words would defeat it etc.

Interesting, but probably quite limited in what it would detect.
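
A quick Python demonstration of the objection, plus one well-known alternative that tolerates small edits: comparing overlapping word shingles with Jaccard similarity. The shingle approach is swapped in here purely for illustration and is not something anyone in this thread, or Google, is known to use; the function names and the shingle size of 3 are assumptions.

```python
import hashlib


def paragraph_hash(text):
    """Exact-match fingerprint of one whitespace-normalized paragraph."""
    return hashlib.md5(" ".join(text.lower().split()).encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping runs of `size` consecutive words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps over the lazy dog"
scraped = "Extract: " + original

print(paragraph_hash(original) == paragraph_hash(scraped))  # False
print(jaccard(original, scraped))                           # 0.875
```

The one-word prefix changes the paragraph hash completely, while 7 of the 8 distinct shingles still match.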

sja65

2:32 pm on May 19, 2005 (gmt 0)

10+ Year Member

reseller wrote
I have permission from several marketers and SEO specialists to host some of their articles on pages on my site. Some of these kind authors have the same articles published on pages on their sites too.

Does that make my site a scraper?

I meant if you have content from multiple sites on the same page. If my page on siteA has multiple pieces of content, one of which is from siteB and another from siteC, I would classify it as a scraper. If siteA, siteB and siteC all had the same content (as in your situation), I would call it duplicate content. In both cases you need to be aware of the time the content was first seen in the wild to avoid penalising the site where the content originated.

midlifecrisis

2:37 pm on May 19, 2005 (gmt 0)

10+ Year Member

Would sja65's first post (#20) qualify as a dupe of my post (#9)? Basically same idea but different wording.

aris1970

4:49 pm on May 19, 2005 (gmt 0)

10+ Year Member

I think that you have forgotten to mention the quality of the site containing duplicate content (quality considered by G).

Our business authoritative site hosts about 20,000 pages of "duplicate" content (corporate press releases) but every time you search for a pr title, you will find it #1 on Google (above the real company-author of it).

I think that overall site quality is the most important factor for G before it considers penalizing any site.

blend27

5:16 pm on May 19, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

this is msg #:26

-- quality is the most important factor --

Well, I have about 1200 pages of content that I added in the past 2 years, and almost every one of my competitors would die to have a site like mine. There are only 2 other websites that have something close. One of them is what inspired me to redesign my site last year, which was better than a lot of other sites anyway. That makes me wonder why I am in the sandbox (Google traffic = 0; Google spider traffic = home page and all the major pages are visited every day, sometimes twice a day, and the entire site gets recrawled bi-weekly; all product pages are TPR3 or 4, the rest are 2).

This thread so far:

1. Threshold at about 80-90%
2. URIs are case sensitive - if so, is it dup. content?

Would anyone else care to jump in?

aris1970

5:52 pm on May 19, 2005 (gmt 0)

10+ Year Member

...almost everyone of my competitors would die to have a site like mine

Dear blend27, maybe unfortunately, maybe not, Google does not use the above criterion to measure the quality of each site.

Lovejoy

6:05 pm on May 19, 2005 (gmt 0)

10+ Year Member

I just had the weirdest thing happen. I originally had a free page, which through SEO and old age made it to the top of Google, Yahoo etc. When bandwidth exceeded what I was allowed on the free page, I purchased a domain and hosted it using the same index page, but left the old site up with just the main page because it was #1. My new site was picked up rather quickly and went up to #6, then got dumped for duplicate content, while the old page remained #1.
I changed the main page on my new site completely, changed the text, links and images, etc... but nothing seemed to get my new site past page 300 :~) on Google. This past weekend I changed the page title on my new site by one letter (that's right, one!) and it's #4.