Forum Moderators: open


Dupe Content Detection


milez

12:35 pm on Dec 28, 2003 (gmt 0)

10+ Year Member



Hello WW.

Does anyone have a clue about *how* Google decides which pages are similar to others?

Does it go all the way and checksum the text itself, or just stay at the page level and compare file names/checksums/sizes of the page as a whole?

Pavel.
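
A minimal sketch of the two approaches the question contrasts (in Python; Google's actual method is unpublished, so this is purely illustrative): a whole-page checksum only catches byte-for-byte copies, while word "shingling" can score near-duplicates.

import hashlib

def page_checksum(text):
    # Whole-page fingerprint: changes completely if even one byte differs.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=5):
    # Set of overlapping k-word sequences ("shingles") from the page text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    # Jaccard similarity of the two pages' shingle sets:
    # 0.0 = no overlap, 1.0 = identical.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

original = "widgets for sale in london and the south east of england"
tweaked  = "widgets for sale in london and the south west of england"

print(page_checksum(original) == page_checksum(tweaked))  # False: checksum misses the near-dupe
print(similarity(original, tweaked))                      # 0.4: substantial shingle overlap survives the tweak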

percentages

12:02 pm on Dec 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google can spot an identical page...no doubt about that!

But, Google can't tell a close dupe from a similar page. I have pages which are 99.9% similar that it likes and indexes, and pages that are 80% similar that it refuses to index separately.

How the heck it actually does this is anybody's guess.

My theory is why worry? Throw them all into the cooking pot and most will come out tasting good :)

ciml

1:08 pm on Dec 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> How come the Duplicate Penalty is so much talked about then?

That is something I have been asking myself for at least a couple of years, wanna_learn. I don't think that the words "duplicate", "content" and "penalty" belong together when considering Google.

wanna_learn

9:05 pm on Dec 31, 2003 (gmt 0)

10+ Year Member



IMO, Google is far, far away from identifying duplicate content. EXACTLY IDENTICAL content can be caught to some extent, and even then mostly on the same website/IP.

Also, Google would not entangle itself much in this when there are thousands of spam techniques that need to be addressed first on the priority list.

People are smart enough not to simply copy content as-is; rather they are:
1) mixing copied text with some original text
2) changing/tweaking the language here and there
3) partially copying, etc.

As far as a duplicate content penalty is concerned, Google may impose one in cases where this happens within a site or on the same IP, assuming both belong to the same webmaster.

I was also tracking a competitor's site that created 300 pages with identical content, everything the same except the title tag. At first it ranked very well for all 300 pages, but after a month the whole site was out of the index.

I can name the site if somebody stickies me.

Edited - added the last lines :-)

anallawalla

3:17 am on Jan 1, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



DVDburning,
I submitted spam and copyright complaints to Google. I finally got the site taken offline by going to the web hosting company... but Google still has all of the copied pages in its index.

If you follow the instructions on submitting a DMCA complaint (has to be by fax or post), Google will definitely remove the copied content from the index.

One client had a .co.uk and a .com site that were aliased at the server. One morning they woke to find that the .com site was out of the index, while the .co.uk site (they are a UK company) was fine. They deleted the .co.uk site and found that they were completely unreachable, although the .co.uk pages were still in the index. It took an email to Google to get that sorted, and now they only use .com.
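
For reference, the usual fix for server-aliased hosts like this is to consolidate them with a permanent redirect, so a crawler only ever sees one copy of each page. A minimal .htaccess sketch (Apache with mod_rewrite assumed; the domain names are placeholders):

# Send every .co.uk request to the .com equivalent with a 301,
# so only one host serves each page as far as crawlers are concerned.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?example\.co\.uk$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]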

DVDBurning

10:57 pm on Jan 3, 2004 (gmt 0)

10+ Year Member



Anallawalla,
Thanks. I thought this might take longer, so I went after the web host first. That worked twice, but now the site is back online, on a web host in China. I'm pretty sure the offending webmaster lives in China, judging by some of the other sites that cross-link to this one, and other WhoIs info. He probably thinks he is out of reach and can just get away with it. I wrote to the web host, but I doubt they will respond.

So today I faxed off a DMCA complaint to Google. They want you to list every search phrase that shows the offending pages in the results, and every page that has copied content. I listed a couple, then told Google to do a search for "site:mysite.com widget", which turns up the 430+ pages from my site that were copied in their entirety (including a press release announcing my site's launch!).

Man... what a pain. This is unreal. Due to the extensive Google spamming this guy does, he has apparently given Google the impression that his pages are the originals and mine are the duplicates. My PR dropped from 6 to 3, then came back to 4. His copied versions of my pages appear above mine for every possible search phrase.

It certainly has me thinking more about what I can do to protect my content. I guess I need to implement some filters to block automated web bots like wget, or perhaps serve my pages dynamically so they cannot be so easily copied (does this help?)

Any tips on preventing this kind of wholesale web copying would be greatly appreciated.

johnlim

8:17 am on Feb 1, 2004 (gmt 0)

10+ Year Member



Hi,

How do you block wget?

Thanks

DVDBurning

6:33 pm on Feb 1, 2004 (gmt 0)

10+ Year Member



An .htaccess file is one way.
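
A minimal sketch of that approach (Apache with mod_rewrite assumed): return 403 Forbidden to any request whose User-Agent header contains "Wget". Note the header is trivially spoofed (wget -U "Mozilla/..."), so this only deters the laziest copiers.

# Deny requests identifying themselves as wget.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Wget [NC]
RewriteRule .* - [F,L]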

There was a great thread in WW some time ago that dealt with recommended settings for blocking potentially bad bots... can anyone find it and resurrect it?

JudgeJeffries

8:25 pm on Feb 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't believe that Google has an effective filter. One of my competitors has two sites with 50 or 60 almost identical pages that often come up #1 and #2. I've snitched on them a couple of times to no avail, and this has been going on for over a year. I feel particularly badly done by because my informative site is now dust whilst their dupes ride high on the hog. Come on Google, get your act together.

europeforvisitors

8:38 pm on Feb 1, 2004 (gmt 0)



A post in another thread mentioned that Google was granted a patent for a method of detecting duplicate and near-duplicate content in December 2003. So it's reasonable to assume that Google is actively working on the problem.

JudgeJeffries

8:46 pm on Feb 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I failed to mention was that these particular sites stole 40 of my pages, and for a while all three sites were at #1, #2 and #3. What total nonsense. None of them were hit. Does it really take 50 PhDs forever to come up with an answer?

bluenile

8:57 pm on Feb 1, 2004 (gmt 0)

10+ Year Member



A few days ago there was a thread in this forum stating that the age of a page has a bearing on its ranking; the theory said the older the page, the higher its ranking. Going through this thread, I get the impression that Google has no mechanism to check the age of a page. If it had such an ability, it would have no difficulty judging which page is older and thus the original.

bluenile

9:28 pm on Feb 1, 2004 (gmt 0)

10+ Year Member



[google.com...] Just Tell You Once: No More Duplicate Results
--------------------------------------------------------------------------------

Expect less from Google . . . less in the way of duplicate results!

Thanks to some engineering wizardry, we've dramatically reduced those pesky duplicate entries. This means better results returned with each search query.

Another improvement you may notice is a reduction in the number of returns from a single site. This means even if there are thousands of relevant pages on a single computer, you'll only get the first two, plus a link to "more results from host.com". In the old days you might have waded through multiple pages of results from one machine before getting to the next entry. Try searching on "java" and you'll see why this is so important.

europeforvisitors

9:32 pm on Feb 1, 2004 (gmt 0)



Weird. The Google-friends newsletter is hosted on Yahoo. I wonder if that will change?

bluenile

9:36 pm on Feb 1, 2004 (gmt 0)

10+ Year Member



[google.com...]
The <filter> parameter causes Google to filter out some of the results for a given search. This is done to enhance the user experience on Google.com, but for your application, you may prefer to turn filtering off in order to get the full set of search results.

When enabled, filtering takes the following actions:

Near-Duplicate Content Filter = If multiple search results contain identical titles and snippets, then only one of the documents is returned.
Host Crowding = If multiple results come from the same Web host, then only the first two are returned

The above two entries prove that Google has the ability to detect duplicate page content. But they also show that duplicate pages are still indexed and ranked; they are merely filtered out of the displayed results.
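
The two filters are simple to picture. Here is a minimal Python sketch of what the quoted description implies, as one might reproduce it client-side over a result list (each result a dict with "title", "snippet" and "url"; this follows the published description, not Google's internal code):

from urllib.parse import urlparse

def filter_results(results):
    seen = set()      # (title, snippet) pairs already returned
    per_host = {}     # results returned so far from each host
    kept = []
    for r in results:
        key = (r["title"], r["snippet"])
        if key in seen:
            continue  # near-duplicate filter: identical title + snippet
        host = urlparse(r["url"]).netloc
        if per_host.get(host, 0) >= 2:
            continue  # host crowding: at most two results per host
        seen.add(key)
        per_host[host] = per_host.get(host, 0) + 1
        kept.append(r)
    return kept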

This 44-message thread spans 2 pages.