What determines duplicate content?


petehall

12:03 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm interested in people's opinions about this.

We run sites that are driven by data that people enter.

All of the data is unique; however, 30 of the pages share similar titles, with one or two keywords changing depending on the subject of the content.

Is this considered duplicate content? i.e. similar titles but completely different content?

It's presenting a bit of a problem really, as there is no other way to title the pages... unless we remove the main keywords! (below they would be "Buy" and "online")

e.g. the titles are along the lines of

Buy Brand1 online
Buy Brand2 online
.......
Buy Brand30 online

So should "Buy" and "online" be removed, leaving only the brand?

Many Thanks...

diamondgrl

1:50 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Duplicate titles do not mean duplicate pages. You're fine.

Lord Majestic

1:53 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd like to join the search for an answer: let's say some website A is adding a datestamp to each otherwise identical page. So technically speaking the checksums of the two pages will be different, but fundamentally the pages will be the same. Is there a definitive view on whether search engines are able to detect these mostly identical pages?
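The checksum half of the question is easy to demonstrate. Below is a minimal Python sketch (the function names and example text are my own, and word-shingling with Jaccard similarity is just one standard near-duplicate technique, not necessarily what any engine actually uses): adding a datestamp changes an exact checksum completely, while the shingle overlap still shows the two pages are nearly identical.

```python
import hashlib

def shingles(text, k=5):
    # Break the text into overlapping k-word "shingles".
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Fraction of shingles the two documents share (0.0 .. 1.0).
    s1, s2 = shingles(a), shingles(b)
    return len(s1 & s2) / len(s1 | s2)

body = ("we sell a wide range of branded goods online at competitive "
        "prices with fast delivery and a full money back guarantee")
page1 = "Generated 2004-08-11 " + body
page2 = "Generated 2004-08-12 " + body

# The checksums are completely different...
same_checksum = (hashlib.md5(page1.encode()).hexdigest()
                 == hashlib.md5(page2.encode()).hexdigest())
print(same_checksum)                     # False

# ...but the shingle overlap shows the pages are almost identical.
print(round(jaccard(page1, page2), 2))   # 0.81
```

Only the two shingles that touch the datestamp differ, so the similarity stays high no matter how drastically the checksum changes.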

nalin

2:30 pm on Aug 11, 2004 (gmt 0)

10+ Year Member



...adding a datestamp to each otherwise identical page. So technically speaking the checksums of the two pages will be different...

There is a Unix command, "diff", that reports the lines that differ between two files and gets at the meat of what has changed. It would be trivial to detect similarities quickly and easily using such a tool with line-count or percentage thresholds, and it would allow you to find, for instance, the number of words in the longest repeated paragraph. You may be thinking, "That's all well, but what if I combine lines or something similar?" The problem is that it would be too easy to write a script that encompasses any workaround one could come up with*.
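As a rough sketch of that idea, Python's standard difflib module does diff-style comparison and reports a match ratio directly, which maps neatly onto a percentage threshold (the threshold value and example pages here are arbitrary illustrations, not any engine's known settings):

```python
import difflib

def line_similarity(a, b):
    # Compare the two documents line-by-line, like diff, and return
    # the fraction of matching lines (0.0 .. 1.0).
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()

page_a = "Buy Brand1 online\nGreat prices\nFast delivery\nOrder today"
page_b = "Buy Brand2 online\nGreat prices\nFast delivery\nOrder today"

ratio = line_similarity(page_a, page_b)
print(round(ratio, 2))   # 0.75 -- three of four lines match
print(ratio > 0.7)       # True: flag as a likely duplicate at this threshold
```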

The larger unknown is how you choose which pages to test against each other (I would assume you group the results of common searches, but even this is a very large problem). Another question is how you catch duplicate content in (significantly) altered file formats: HTML vs. PDF, or better yet HTML vs. a binary format (a large JPEG screenshot of the page, for example).

*I use something similar to this to auto-sort mail into the appropriate IMAP subfolder at work: people create the folders and organize their content as it suits them, then a cron job looks at similarities in the headers and creates filters for procmail. I get around something similar to the combined-lines problem by replacing whitespace with line breaks, which ensures the analysis runs over successive words and thus ignores the length and composition of lines.
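That whitespace-to-line-break trick can be sketched in a few lines of Python (the example text is mine): after normalizing, the comparison operates on successive words, so reflowing the lines no longer hides the similarity.

```python
import difflib

def normalize(text):
    # Treat every run of whitespace as a line break, so the diff
    # operates on successive words rather than on line layout.
    return text.split()

a = "Buy Brand1 online today\nat great prices"
b = "Buy Brand1\nonline today at\ngreat prices"   # same words, reflowed

ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
print(ratio)   # 1.0 -- the changed line breaks are invisible to the comparison
```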

Lord Majestic

2:39 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is a Unix command, "diff", that reports the lines that differ between two files and gets at the meat of what has changed

If only all web authors added line breaks where appropriate! But point taken: line breaks can be forced at <br> and <p> tags, etc.
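For example (a sketch, not any engine's actual preprocessing): forcing a break at those tags before stripping the remaining markup turns an HTML page into lines you can feed to a diff-style comparison.

```python
import re

def html_to_lines(html):
    # Force a line break at <br> and <p> tags (with or without "/")...
    html = re.sub(r"(?i)<\s*(?:br|p)\s*/?\s*>", "\n", html)
    # ...then strip any remaining tags and drop empty lines.
    text = re.sub(r"<[^>]+>", " ", html)
    return [line.strip() for line in text.splitlines() if line.strip()]

print(html_to_lines("<p>Buy Brand1 online<br/>Great prices</p>"))
# ['Buy Brand1 online', 'Great prices']
```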

shri

3:43 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Funny you should ask this question... :)

[webmasterworld.com...]
[webmasterworld.com...] and
[webmasterworld.com...]

petehall

3:47 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, so it's not the page titles that are the problem... which confuses me after losing 30+ number 1-5 positions for "optimised" phrases.

These have been strong for years.

The home page still ranks number 1 on one major two-word keyphrase, however the other two have slipped from 1-4.

There are some very odd results where ours used to be.

The traffic has not suffered much at all, as the site is very well established across all search engines; however, advertisers will soon notice the loss of positions...

shri

3:54 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One of the reasons could well be that Google seems to have devalued internal links / anchor text.

We've lost rankings on several internal pages which relied heavily on links from across the home page or global navigation.

petehall

4:08 pm on Aug 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Funny you should mention that - are you suggesting the links to the problem pages should be made accessible from every other page?

I was going to do this originally; however, you end up with a lot of links on every page, which is a bit of a shame!