Duplicates and the challenges search engines face - (deprecated) SEM Research Topics forum at WebmasterWorld

Duplicates and the challenges search engines face

Starting point for understanding how duplication is detected.

msgraph

4:50 pm on May 23, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There have been loads of threads on how search engines, specifically Google, try to detect duplicates or near duplicates.

Basically, there is no clean answer. If you want to understand some of the technology or theories out there, then you need to start reading. Sure, you can take the easy way out and ask for examples and theories from site owners, but what better way to learn than from those who study these problems for the search engines themselves?

A good starting point for the challenges is here:

Section 4. Duplicate Hosts

Algorithmic Challenges in Web Search Engines [internetmathematics.org]

Published in Internet Mathematics, Volume 1.1, by Monika R. Henzinger (Research Director, Google, Inc.), 2003

Follow and read every reference listed in that section and you will get a good idea of how duplicate detection works and the challenges involved.

Note: This does not imply that Google currently employs any or all of these methods although I'm sure they use a large part of them.

The bottom line is that straight or very-near duplication, similar site structures, and similar sites hosted on the same server can be detected easily. When you start to get into paragraph/article duplication, things get fuzzy and detection is very very difficult, with the "determined" authority beating out the rest.
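To make the "straight vs. fuzzy" distinction concrete, here is a minimal sketch of word shingling, the resemblance measure described in Broder's near-duplicate work that the paper's references cover. The function names, the shingle width of 4, and the sample texts are my own assumptions for illustration; nothing here is claimed to be what any search engine actually runs.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word runs) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts: a page, an exact copy, and a page that
# only borrows part of a sentence.
original = "press up up down down left right left right to unlock every level in the game"
mirror = "press up up down down left right left right to unlock every level in the game"
excerpt = "our review says press up up down down left right to skip the first level"

exact_score = resemblance(original, mirror)     # straight duplication
partial_score = resemblance(original, excerpt)  # partial reuse scores much lower
```

A straight copy scores 1.0, while the page that reuses only a fragment scores far lower, which is roughly why paragraph-level duplication gets fuzzy: where to put the cutoff is a judgment call.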

Liane

1:57 am on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Great paper, but they lost me at "eigenvectors of the corresponding matrices" ... blah, blah, blah.

I'm a yacht broker captain ... not a miracle worker! (Going to have to go back to school to undertake that paper a second time!)

Think I'll stick to reading WebmasterWorld. It's more than enough for me to digest and attempt to understand! :)

scorpion

3:10 pm on May 24, 2003 (gmt 0)

10+ Year Member

I think the question is, what IS duplicate content? Is it just that lots of sites are using the same words? For example, sites using affiliate programs tend to result in lots of web pages using the words of the affiliate. Amazon's revenue sharing program, for instance, tends to have lots of people using the word 'amazon' on their sites. It's a real grey area, I think.

msgraph

3:47 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>>but they lost me at "eigenvectors of the corresponding matrices"

But just ignore all that :) I'm worried that many see the math and just close the documents when, in fact, there is usually loads of info in easy-to-read descriptions. Well, if you are familiar with how sites are hosted and work.

The best tidbits of information to digest off of these papers are what can be achieved now and what cannot be achieved.

The duplicate host detection problem is easier than mirror detection since the URLs between duphosts differ only in the hostname component.

Of course, the hard questions are, what sketch to choose and how to avoid comparing all pairs of hosts. Since there are millions of different hosts, comparing all pairs is simply infeasible.

See? These types of papers point out that some limits have to be set on calculating things. Some of these comparisons take a huge load of resources and that's what many people do not understand. Google, and many other search engines, are not the all-seeing-eye that can detect every little item that breaks their rules. They might have the capability to do so, but then we would be having updates every 6 months to a year or so.
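As a rough illustration of "what sketch to choose and how to avoid comparing all pairs", here is a minimal min-hash sketch with bucketing. Hosts only get compared if part of their sketch lands in the same bucket. The host names, path lists, and parameters (8 hashes, bands of 4) are made-up assumptions, and this is a generic technique, not the specific method from the paper.

```python
import hashlib
from collections import defaultdict


def h(seed, item):
    """Deterministic 64-bit hash of item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items, num_hashes=8):
    """Min-hash sketch: the smallest hash value per seed (items must be non-empty)."""
    return tuple(min(h(seed, it) for it in items) for seed in range(num_hashes))


def candidate_pairs(host_items, band_size=4, num_hashes=8):
    """Bucket hosts by slices of their sketch; only hosts sharing a bucket are paired."""
    buckets = defaultdict(set)
    for host, items in host_items.items():
        sk = sketch(items, num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sk[start:start + band_size])].add(host)
    pairs = set()
    for hosts in buckets.values():
        ordered = sorted(hosts)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Hypothetical crawl data: the paths seen on each host.
host_items = {
    "cheats-a.example": ["/codes/game1.html", "/codes/game2.html", "/hints/index.html"],
    "cheats-b.example": ["/codes/game1.html", "/codes/game2.html", "/hints/index.html"],
    "yachts.example": ["/boats/list.html", "/charter/rates.html", "/contact.html"],
}

pairs = candidate_pairs(host_items)
```

Here only the two cheat-code hosts end up as a candidate pair; the yacht site is never compared against either of them. That is the whole point of sketching: the expensive full comparison only runs on the few pairs that share a bucket.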

A perfect example is cheat codes for video games. If you type in the right search terms you'll get perhaps 200 results of almost the same cheat codes and hints with the only difference being the navigation structure of the sites. Why do these sites never get pulled? Because they are all on different hosts around the world and they have different linking structures. Now if you had ten sites with the same cheat codes sitting on the same server or host, a search engine could detect them in no time and penalize them as such.

Go60Guy

4:43 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

msgraph - I looked over the paper, and, aside from my eyes glazing over, I was unable to tell exactly what they were referring to in using the term "host". Do they mean that dups are easy to detect in a virtual hosting environment? What about the same host and different IPs for each domain?

Also, I would agree that this is a very grey area, and is confusing at best. For instance, many press releases, articles, etc. are widely distributed and/or syndicated. And the idea that we all should be Pulitzer Prize candidates creating "unique, important, compelling" (you name it) content is clearly unachievable.

I think, at one time, Brett commented that when it comes to boilerplate product descriptions and the like, differing navigation structures and layouts are sufficient to avoid a duplication penalty.

Who really knows? I'm not even sure Googleguy knows.

skibum

5:59 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

It seems like one area where Google in particular could improve duplication detection is when it's time to go look for a deal on something.

Searches for numerous products show the same content in many of the top results in some product categories on a number of different syndication partners (e.g. shoppingengine.portal.com). You have to scroll at least halfway down the first page to find something that is not duplicate content.

That category is big enough that a manual filter could probably knock out a lot of that duplication from some of the major sites if there is not a programming workaround.

msgraph

6:44 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>I was unable to tell exactly what they were referring to in using the term "host"

Yeah, host is like "spam", one of those words that has several meanings. Heh, sometimes these papers get really confusing because they'll just label something "host" in different parts of the paper when they mean different types of hosts.

What they are referring to here is host as in site, yet like I said above, they'll also use host as in a web server hosting multiple sites. Looking again at my cheat codes example above, I could have worded it better.

The duplicate host detection problem is easier than mirror detection since the URLs between duphosts differ only in the hostname component.

She could have worded this better, because mirror detection as defined in (Bharat and Broder 99) Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content [www8.org] is almost the same thing as what she describes as easier to detect in the duphosts problem:

Hence, we define two hosts to be mirrors if:
i. A high percentage of paths (that is, the portions of the URL after the hostname) are valid on both web sites...

So basically, from what I gather, she is talking about something along the lines of dmoz clones, where sites are almost exactly the same except for different color schemes, logos, etc. Plus the URLs are all the same, like:

www.dmoz.org/Reference/Museums/History/
www.dmozclone.com/Reference/Museums/History/
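A minimal sketch of the "paths valid on both web sites" idea from the definition quoted above: group crawled URLs by hostname and measure how many paths two hosts share. The extra URLs and the 0.7 cutoff are made-up assumptions for illustration, and this only compares path names. The papers also look at page content, linkage, and IP addresses.

```python
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group URL paths by hostname."""
    result = {}
    for url in urls:
        # Accept scheme-less URLs like the two examples above.
        parts = urlsplit(url if "://" in url else "http://" + url)
        result.setdefault(parts.hostname, set()).add(parts.path or "/")
    return result


def path_overlap(paths_a, paths_b):
    """Fraction of all known paths that appear on both hosts."""
    if not paths_a and not paths_b:
        return 0.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


# Hypothetical crawl: the first two URLs are from the example above,
# the rest are invented to fill out the comparison.
crawled = [
    "www.dmoz.org/Reference/Museums/History/",
    "www.dmozclone.com/Reference/Museums/History/",
    "www.dmoz.org/Arts/Music/",
    "www.dmozclone.com/Arts/Music/",
    "www.dmoz.org/Science/Math/",
    "www.dmozclone.com/Science/Math/",
    "www.dmoz.org/Computers/Internet/",
    "www.unrelated.example/about.html",
]

hosts = paths_by_host(crawled)
clone_score = path_overlap(hosts["www.dmoz.org"], hosts["www.dmozclone.com"])
other_score = path_overlap(hosts["www.dmoz.org"], hosts["www.unrelated.example"])

MIRROR_THRESHOLD = 0.7  # assumed cutoff, not taken from either paper
likely_mirror = clone_score >= MIRROR_THRESHOLD
```

The clone shares three of the four known paths, so it clears the assumed cutoff, while the unrelated host shares none.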

Dmoz is probably the best example to look at when seeing how they treat duplication. Many of the sites and pages are still in the results; they just don't have any serious weight to them where they can rank well.

The paper I listed above in this post and this one, A Comparison of Techniques to Find Mirrored Hosts on the WWW (1999) [citeseer.nj.nec.com], explain fairly well what counts as duphosts and mirrors. You'll also find information on sites hosted on the same server, external linkage, and so on. Pretty good reading, and they lay out great examples.

Skibum:

That category is big enough that a manual filter could probably knock out a lot of that duplication from some of the major sites if there is not a programming workaround.

And I think that is all they can do with most of this duplication filtering; manually tweak the algo to treat special areas that have gotten out of hand.

Silicon

6:41 pm on May 25, 2003 (gmt 0)

10+ Year Member

if you search for "google and duplicate content" on G...the results are all duplicate content. =P

Mikkel Svendsen

6:07 am on May 26, 2003 (gmt 0)

10+ Year Member

The question I often ask is: Do search engines really want to filter out duplicate content, or are they too focused on being biggest?

I think we all have seen examples (not just in Google) of duplicate content that would be easy to filter out but is not. The dMoz clones were mentioned as a good example: easy to filter out - so why are they not? Also, you'll often find the same pages indexed both with and without the www in the hostname - why? On top of that comes the indexing of CSS files, JS files, 404 pages, robots.txt files, search engines' own search results, etc. All in all, a lot of pages that there is absolutely no reason to keep in the index - except for being biggest ...

I am sorry, but I do not think the search engines really want to filter out all the garbage in their indexes. Sure, the engineers want that, but the management doesn't. As long as top management (and the press) think index size is a good metric for quality, I don't think we will see any major reduction of duplicate content.
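For the www vs. non-www case mentioned above, a minimal sketch of how indexed URLs could be grouped under one normalized form. The normalization rules and the example.com URLs are my own assumptions. The two hostnames are not guaranteed to serve the same content, so matches should be treated as candidates to verify, not proven duplicates.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical(url):
    """Normalize a URL: lowercase scheme and host, strip a leading 'www.', drop port and fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


# Hypothetical index entries.
indexed = [
    "http://www.example.com/widgets.html",
    "http://example.com/widgets.html",
    "http://example.com/gadgets.html",
]

groups = defaultdict(list)
for url in indexed:
    groups[canonical(url)].append(url)

# Canonical forms that more than one indexed URL maps to.
duplicates = {key: urls for key, urls in groups.items() if len(urls) > 1}
```

Only the widgets page shows up as a candidate, since it is indexed under both hostnames.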

scorpion

3:08 pm on May 26, 2003 (gmt 0)

10+ Year Member

Another problem with duplicate content is this: Which is the original? Which one duplicated the other? Is it a question of which has the oldest WHOIS record?

aravindgp

3:59 pm on May 26, 2003 (gmt 0)

10+ Year Member

I am interested in learning about the paper but couldn't get access to it. Can somebody post a different URL?

Algorithm challenges by Monika R.

Marcia

6:39 am on Sep 3, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The PDF can't be accessed even from Google search, but here's the search to get to the HTML version, which is still there:

[google.com...]

This is of particular interest now, as I'm having to start dealing with blatant duplication, innocent as the original intentions were. Two sites identical in layout, with the same products and the same page naming, all linked into one shopping cart on one of the sites.

Spogum

3:36 pm on Sep 20, 2003 (gmt 0)

10+ Year Member

Here's what may be a "beginner's question" on the duplication issue. Two of my clients have, for whatever reason, purchased second or third domain names -- each pointing to the SAME website (thus, not duplicate or "mirror" sites, simply multiple names for the same site.) The question is whether this would have any impact, positive or negative, on Google rank -- or anything else.

[edited by: msgraph at 10:05 pm (utc) on Sep. 20, 2003]