[webmasterworld.com...]
However, something is shifting in recent days, and some report that the duplicate checking at Google is going a bit wild. Of course, scraper sites have also gone wild, so Google does have an issue to deal with. In the case of Matt's blog, I think some people set out to PROVE to him that there is an issue.
"What does Google do if it detects duplicate content?
Penalizes the second one found (with caveats). (As with almost every Google penalty, there are exceptions we will get to in a minute.)
What generally happens is the first page found is considered to be the original prime page. The second page will get buried deep in the results.
The exception (as always) - we believe - is high PageRank. It is generally believed by some that mid-PR7 is considered the "white list" where penalties are dropped on a page - quite possibly an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/whitelist exception takes the arguments and washes them."
But in the case of "the joy of bacon polenta", the "first" source would have been Matt Cutts' blog. And the scraper sites aren't "mid-PR7".
Issue, what issue?
I vote for problem instead of issue.
Now you can take issue with my vote, or you can issue your own screed; however, to call this duplicate content thingy an "issue" when many more descriptive words are available would be a travesty, or a moronoxy ;-).
So far we have:
nph-proxy.pl
nph-proxy.cgi
go.php
(Some of these have been hidden via rewrite rulesets.)
and I expect there is a whole host of others out there, both single-page IP delivery scripts and site delivery scripts (think tracker2.php).
I'm currently delivering crud to three sites that are doing it.
I've seen close to 15 others that have hit many members of this forum.
This is in addition to versions of those scripts that are harmless. So remember to aim before acting.
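For what it's worth, here is a minimal PHP sketch (untested, and every hostname and pattern in it is a made-up placeholder) of the kind of check I mean. It only serves a notice to referers already confirmed as abusive, and merely logs anything that just looks like a proxy script - that's the "aim before acting" part.

<?php
// Sketch only: serve a notice instead of the real page when the request
// arrives via a proxy/scraper script already confirmed as abusive.
// The hostnames below are placeholders - build your own list from raw logs.

$confirmedProxyHosts = array(
    'proxy-offender-one.example',
    'proxy-offender-two.example',
);

$proxyScriptPattern = '/nph-proxy\.(pl|cgi)|go\.php|tracker2\.php/i';

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$parts   = $referer !== '' ? parse_url($referer) : array();
$refHost = isset($parts['host']) ? strtolower($parts['host']) : '';

if ($refHost !== '' && in_array($refHost, $confirmedProxyHosts)) {
    // Confirmed offender: let it fetch and cache this notice instead of the page.
    echo 'This content was fetched through an unauthorized proxy script. ';
    echo 'Please visit the original site directly.';
    exit;
} elseif ($referer !== '' && preg_match($proxyScriptPattern, $referer)) {
    // Looks like a proxy script but is not confirmed yet: log it for review only.
    error_log('Possible proxy-script fetch, referer: ' . $referer);
}

// Otherwise fall through and serve the normal page.
?>

Include it at the top of the pages you care about (or via auto_prepend_file) and review the log for a while before promoting anything to the confirmed list.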
Matt Cutts' blog does not rank at all for "joy of bacon polenta" until you click 'repeat the search with the omitted results included' at the end of the SERPs.
This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent.
This particular case could be technically different from the Sept. 22 filter, but it demonstrates that other sites can remove you from Google search, intentionally or not.
R
Here is some food for thought.
There is a very particular combination of words which exists on Matt's blog, in the bacon polenta article, and which I used just for fun in one or two of my posts:
Mmmm. Bacon-y goodness
If you run the query
Mmmm. Bacon-y goodness
you will see the thread containing my post with the words "Mmmm. Bacon-y goodness" at the top of the SERP:
[google.com...]
Matt's blog is nowhere!
Sorry Matt. Better luck next time ;-)
"The exception (as always) - we believe - is high Page Rank. It is generally believe by some that mid-PR7 is considered the "white list" where penalties are dropped on a page - quite possibly - an entire site. This is why it is confusing to SEO's when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/Whitelist exception takes the arguments and washes them."
Do you mean that sites with a PR of 7 or higher are NOT penalized for duplicate content, or that their sites can get wiped out entirely?
I am sure he does not need the traffic. However, I wonder if he ever thought his site would be used to demonstrate how outside factors can affect a site. The old statement - "no one can hurt a site's rankings" - seems to be faltering. Maybe these personal demonstrations will help them fix some of the issues.
"This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent. This particular case could be technically different from the Sept. 22 filter, but it demonstrates that other sites can remove you from Google search, intentionally or not."
I search for original text quite often to track down scrapers. I'm wondering if scrapers have a new method of getting your content captured onto their site by using base href to set up a cache of your site. See the discussion here:
[webmasterworld.com...]
>>so the best bet we have so far as to what criteria google uses to determine "original source" is which page was found first by the spider? <<
If I recall correctly, GoogleGuy used the term "best page", not "original page", when choosing among originals and duplicates.
The question is: how do GoogleGuy and the folks at the 'plex define a "best page"?
But it does relate to this thread so here goes...
nph-proxy pages in G
How to safely block access
A very greasy hacker-type site has over 20,000 pages in G, most of them duplicates of legitimate pages on other sites, reached via the hacker site's nph-proxy.cgi tool.
So the URL listed in G is along the lines of: (I've added the * to replace the domain name)
When you click the link in G you go straight to the Legitimate Site. G would automatically impose a duplicate penalty.
Questions:
1) Why does the hacker site do this? What do they get from the exercise, apart from harming other sites?
2) What do I say to G? Is it valid to send them a DMCA notice? I'm cautious about contacting G in case they ham-fistedly remove the Legitimate pages. Has anybody else reported this to G with success?
3) Is it safe to block access to our site for people visiting via nph-proxy.cgi tools? The only people I'd welcome using it would be punters using nph to get around a government ban, as in China. Any other valid reasons for using nph-proxy?
4) What is the best way to block nph-proxy (and variant) users from accessing our site?
Sample mod-rewrite code please.
Ta!
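Not the mod_rewrite ruleset asked for in (4), but here is a rough PHP sketch of the same idea, assuming - and it is only an assumption, so check your own access logs - that a fetch made through an nph-proxy style script leaves the script name in the Referer or in the requested path. The same test could be expressed as a RewriteCond against %{HTTP_REFERER}.

<?php
// Rough sketch for question 4: refuse requests that appear to come via an
// nph-proxy style script. The detection pattern is an assumption - confirm
// in your logs that such fetches really carry this signature before deploying.

$suspectPattern = '/nph-proxy\.(cgi|pl)/i';

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$request = isset($_SERVER['REQUEST_URI'])  ? $_SERVER['REQUEST_URI']  : '';

if (preg_match($suspectPattern, $referer) || preg_match($suspectPattern, $request)) {
    header('HTTP/1.1 403 Forbidden');
    // A short explanation is friendlier than a blank page for the rare
    // legitimate visitor (question 3) who is proxying around a national block.
    echo 'Sorry - this site is not available through anonymous proxy scripts. ';
    echo 'Please visit it directly.';
    exit;
}

// Otherwise continue with the normal page.
?>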
We have the same problem as you mention: "This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent."
What exactly are the Sept. 22 filters?
Looks even better in bold, red 18 pt. :)
These people have screwed up my Google PR, cost me a lot of money, and now made me spend time implementing a PHP blocking script to deliver this nifty little message instead of one of my pages, so it's pretty small satisfaction.