Forum Library, Charter, Moderators: phranque

SEM Research Topics Forum

    
Duplicate Content? How is it determined?
Need your opinion or expertise.
wariental




msg:817966
 6:37 am on Mar 8, 2006 (gmt 0)

OK, so I know duplicate content is penalized. Let's say I have site A and site B on two different URLs, two different hosts, and two different IP addresses. I have an article on site A, and on site B I have the same article with a few extra words or a few words substituted.

Does a search engine bot pick up on this? Or does duplicate content only mean a word-for-word match?

 

slawski




msg:817967
 6:58 am on Mar 8, 2006 (gmt 0)

On one level, sites that are mirrors of each other may be identified when sites are crawled, and a choice may be made not to index some mirror sites.

When results are served, snippets and titles may be compared using fingerprints of the text surrounding the query terms, and some results may be filtered out of the search results.

There are a couple of patents from Google on duplicate content that describe approaches that Google is likely using. Chances are good that similar methods are employed in other search engines.
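
To make the query-time filtering idea concrete, here is a rough sketch. It is only an illustration under my own assumptions, not the method from the patents: the function names, the `radius` window size, and the use of SHA-1 as the fingerprint are all made up for the example. It fingerprints only the words around the query term in each result and drops results whose fingerprint has already been seen.

```python
import hashlib
import re


def query_window(text, query, radius=3):
    """Return the words within `radius` positions of any query term (hypothetical helper)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - radius), min(len(words), i + radius + 1)))
    return " ".join(words[i] for i in sorted(keep))


def fingerprint(text):
    """Hash the text; SHA-1 is an arbitrary choice for this sketch."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def filter_results(docs, query, radius=3):
    """Keep the first result for each distinct query-window fingerprint."""
    seen = set()
    kept = []
    for url, text in docs:
        fp = fingerprint(query_window(text, query, radius))
        if fp not in seen:
            seen.add(fp)
            kept.append(url)
    return kept


docs = [
    ("a", "Header A. Duplicate content is filtered by search engines today. Footer one"),
    ("b", "Other nav. Duplicate content is filtered by search engines today. Different footer"),
    ("c", "Water is filtered through sand and gravel beds"),
]
print(filter_results(docs, "filtered"))
```

Pages "a" and "b" differ in their headers and footers, but the text around the query word is the same, so only the first one is kept. An exact hash like this only catches identical windows; catching reworded text would need a similarity measure instead.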

wariental




msg:817968
 7:39 am on Mar 8, 2006 (gmt 0)

Can you link some?

slawski




msg:817969
 7:51 am on Mar 8, 2006 (gmt 0)

Here are the two Google patents on duplicate content:

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Detecting query-specific duplicate documents [patft.uspto.gov]

These were granted over two years ago, and filed with the USPTO more than five years ago, so even if Google did use them, it may have moved on since then. Regardless, both describe methods that look like they would be effective.

Marcia




msg:817970
 7:54 am on Mar 8, 2006 (gmt 0)

Google patents:

Detecting duplicate and near-duplicate files [patft.uspto.gov]

Detecting query-specific duplicate documents [patft.uspto.gov]

Stanford paper, authors N. Shivakumar and Hector Garcia-Molina:

Finding near-replica documents on the web [dbpubs.stanford.edu]

wariental




msg:817971
 8:01 am on Mar 8, 2006 (gmt 0)

Thank you.

I wonder to what extent a "fingerprint" of a document is taken. For example:

A website has the opening sentence:

Welcome to my site.

Another:

Welcome to my website

Another:

Welcome to this website

etc. Could those all basically have the same fingerprint?
Would that all be considered duplicate content or near-duplicate content?

I guess no one really knows how far this kind of system goes.
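
One commonly described way to compare texts that differ by a few words is to break each into overlapping word groups ("shingles") and measure how many they share. The sketch below applies that to the three sentences above. It is an illustration under my own assumptions (two-word shingles, Jaccard overlap, made-up function names), not a claim about what any search engine actually does.

```python
import re


def shingles(text, k=2):
    """Return the set of overlapping k-word groups in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


s1 = shingles("Welcome to my site.")
s2 = shingles("Welcome to my website")
s3 = shingles("Welcome to this website")

print(jaccard(s1, s2))  # 0.5
print(jaccard(s1, s3))  # 0.2
print(jaccard(s2, s3))  # 0.2
```

On a four-word sentence, one changed word moves the score a lot, so a single opening line says very little by itself. The same measure over a whole article with a few substituted words would come out close to 1.0, which is the case the original question was about.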

Dave_A




msg:817972
 2:18 am on Apr 7, 2006 (gmt 0)

Hi. My search engine flags duplicate content as it indexes a website; the results show where the content is duplicated and how close the match is.
The spider does this as the results come back to the search engine.
We often get pages of duplicate results. Some are cases where a site has similar content (a 75% content match), but some are just duplicate content hosted on a related website (the owner's name or details are the same). Quite often the domain names are similar, or are versions of the same domain, like a dot net and a dot com site.
One webmaster, who shall remain nameless, registered ten domains with similar names, all hosted on the same server. The content was the same down to the file sizes and dates; the only difference was the links pages (they all pointed to themselves). YAWN!
The thing is, if we were to index them all, they would still score the same and show up in the same results. So does he expect the visitor to go to all of his sites, or just some of them?
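
The post does not say how the "content match" percentage is calculated. As one possible illustration only, the sketch below scores two pages by the share of words they have in common, in order, using Python's standard `difflib`. The function name `content_match` and the word-level comparison are assumptions for the example.

```python
from difflib import SequenceMatcher


def content_match(page_a, page_b):
    """Return a 0-100 similarity score based on matching word runs (illustrative only)."""
    matcher = SequenceMatcher(None, page_a.split(), page_b.split())
    return 100.0 * matcher.ratio()


print(content_match("one two three four", "one two three five"))  # 75.0
print(content_match("same page text", "same page text"))          # 100.0
```

A crawler could compare the score against a cutoff to decide whether to flag a page as a near duplicate; where to set that cutoff is a judgment call.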

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved