Duplicate Pages

the timeless question, but i didn't find much in site search


2_much

1:20 am on Oct 23, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So with the latest Google News I imagine Google has some pretty advanced duplicate detection technology, or else they'd be serving the exact same story many many times.

I was wondering how different a page has to be so that it's not seen as a duplicate.

1. Does the TEXT have to be different? If so, approx what percentage?

2. Does the CODE/TEMPLATE have to be different? If so, how much?

Are both of these read in conjunction, or does Google check for both?

I guess my question is: if I have a site whose content is relevant for another site, can I take the same text and put it on the other template? Should I change the text a bit?

And if I copy a template and write new text for it, is that enough to make it appear unique, or should I also change the template?

Sorry for all the questions, but this issue has been bothering me for days and days. Any suggestions would be appreciated.

mack

8:21 am on Oct 23, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Just wanted to give this one a boost back up because I think this is a very interesting point.

I have always assumed that we were overreacting a bit on duplicate content issues, but as 2_much said, Google has certainly made some ground with Google News in that there are very few pages with the same or even similar content.

Is this something that is going to migrate into the web search algo?

2_much

12:18 am on Oct 24, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey thanks Mack but your bump didn't work either.

Anyone? Help? Please...Pretty please.

Sasquatch

12:33 am on Oct 24, 2002 (gmt 0)



I think it is 3 different pieces.

1. recognise the template and remove it. I do not expect them to totally remove the template in web indexing, but I expect they would devalue it.

2. Find all the same news pages (AP, Reuters, etc.) and clump them together. That is the easy part.

3. Use statistical analysis of the words used in the articles to find similarities. Certain word combinations will only apply to specific stories.

All three of these would be much easier to do on news sites than on random web pages.
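To make step 3 concrete, here is a toy sketch of one simple flavor of word-statistics comparison (my own illustration, not anything Google has published): score two articles by the Jaccard overlap of their word sets.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two articles (case-insensitive).

    1.0 means identical vocabularies, 0.0 means no words in common.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa or wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)
```

Two copies of the same wire story would score near 1.0, while unrelated stories would score near 0.0; a real system would weight rare words more heavily, since (as the post says) certain word combinations only apply to specific stories.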

WebGuerrilla

4:34 am on Oct 24, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




You don't need to worry about templates. You only need to worry about content that gets parsed and stored.

If you change up your file names and linking structures a bit, your copy can be pretty close without causing problems.

espeed

4:39 am on Oct 24, 2002 (gmt 0)

10+ Year Member



There are several methods for detecting similar content, but I believe Google is using a method of counting the number of chunks (sentences or paragraphs) common in pages. You can do this by converting pages to plain text, stripping out the HTML tags, and then chunking the page into sentences or paragraphs. Then each chunk is hashed down to a 32-bit fingerprint. If two pages share more than some threshold of chunks with identical fingerprints, the pages are identified as similar.
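The chunk-fingerprint idea above can be sketched in a few lines (this is just my own toy illustration of the method espeed describes, using CRC32 as the 32-bit hash and a made-up 30% sharing threshold):

```python
import zlib

def fingerprints(text: str) -> set:
    """Chunk plain text into sentences and hash each to a 32-bit fingerprint."""
    chunks = [c.strip() for c in text.replace("\n", " ").split(".") if c.strip()]
    return {zlib.crc32(c.lower().encode()) for c in chunks}

def similar(a: str, b: str, threshold: float = 0.3) -> bool:
    """Flag two pages as similar if they share more than `threshold` of
    their chunk fingerprints (relative to the shorter page)."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return False
    return len(fa & fb) / min(len(fa), len(fb)) > threshold
```

The HTML-stripping step is omitted here; in practice you would strip tags first, then chunk, so that template markup never enters the fingerprints.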

2_much

4:49 am on Oct 24, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can you guys refer me to some reading about this technology?

I'm technologically impaired and none of this makes sense ;)

Sasquatch

5:21 am on Oct 24, 2002 (gmt 0)



Can you guys refer me to some reading about this technology?

You ask a dangerous question.

For recognizing the template parts, it would just be simple parsing and comparisons. HTML makes this easy since there are well-defined sections in each file. Navigation elements tend to be fairly repetitive, and verbose.
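A crude sketch of that parse-and-compare idea (my own toy illustration, under the simplifying assumption that template lines repeat verbatim on every page of a site):

```python
def template_lines(pages: list) -> set:
    """Lines appearing on every page are likely navigation/template boilerplate."""
    line_sets = [set(p.splitlines()) for p in pages]
    return set.intersection(*line_sets) if line_sets else set()

def strip_template(page: str, template: set) -> str:
    """Keep only the lines unique to this page (the 'content')."""
    return "\n".join(l for l in page.splitlines() if l not in template)
```

A real parser would work on the HTML tree rather than raw lines, and (as suggested above) would probably devalue template text rather than discard it outright.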

I would recommend reading the parsing section of any good text on compiler design.

For finding duplicate content, you would then look for similar chunks of text. I think a lot of people would make this more difficult than it would have to be. I'm guessing that they only go after pages that are almost exact duplicates.

For finding similar news stories, and such things as theming, you might want to do some reading on linguistics and 'Bayesian analysis'.
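For a taste of what "Bayesian analysis" means here, a minimal naive Bayes theme classifier can be sketched as follows (purely my own toy example with made-up themes; real systems are far more sophisticated):

```python
import math
from collections import Counter

def train(docs_by_theme: dict):
    """Count word frequencies per theme; docs_by_theme maps theme -> list of docs."""
    counts, totals = {}, {}
    for theme, docs in docs_by_theme.items():
        c = Counter(w for d in docs for w in d.lower().split())
        counts[theme], totals[theme] = c, sum(c.values())
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def classify(text: str, counts, totals, vocab) -> str:
    """Pick the theme with the highest log-likelihood (add-one smoothing)."""
    best, best_score = None, float("-inf")
    for theme in counts:
        score = sum(
            math.log((counts[theme][w] + 1) / (totals[theme] + len(vocab)))
            for w in text.lower().split()
        )
        if score > best_score:
            best, best_score = theme, score
    return best
```

The same machinery that guesses a story's theme can also flag two stories as probably-the-same when their word statistics are improbably close.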