[webmasterworld.com...]
However, something is shifting in recent days, and some report that the duplicate checking at Google is going a bit wild. Of course, scraper sites have also gone wild, so Google does have an issue to deal with. In the case of Matt's blog, I think some people set out to PROVE to him that there is an issue.
"What does Google do if it detects duplicate content?
Penalizes the second one found (with caveats). (As with almost every Google penalty, there are exceptions we will get to in a minute.)
What generally happens is the first page found is considered to be the original prime page. The second page will get buried deep in the results.
The exception (as always) - we believe - is high PageRank. It is generally believed by some that mid-PR7 is considered the "white list" where penalties are dropped on a page - quite possibly an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/whitelist exception takes the arguments and washes them."
But in the case of "the joy of bacon polenta", the "first" source would have been Matt Cutts' blog. And the scraper sites aren't "mid-PR7".
Issue, what issue?
I vote for problem instead of issue.
Now you can take issue with my vote, or you can issue your own screed; however, to call this duplicate content thingy an "issue" when many more descriptive words are available would be a travesty, or a moronoxy ;-).
So far we have:
nph-proxy.pl
nph-proxy.cgi
go.php
(Some of these have been hidden via rewrite rulesets.)
and I expect there is a whole host of others out there, both single-page IP delivery scripts and site delivery scripts (think tracker2.php).
I'm currently delivering crud to three sites that are doing it.
I've seen close to 15 others that have hit many members of this forum.
This is in addition to versions of those scripts that are harmless. So remember to aim before acting.
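For what it's worth, here is a minimal PHP sketch (untested, and every hostname and pattern in it is a made-up placeholder) of the kind of check I mean. It only serves a notice to referers already confirmed as abusive, and merely logs anything that just looks like a proxy script - that's the "aim before acting" part.

<?php
// Sketch only: serve a notice instead of the real page when the request
// arrives via a proxy/scraper script already confirmed as abusive.
// The hostnames below are placeholders - build your own list from raw logs.

$confirmedProxyHosts = array(
    'proxy-offender-one.example',
    'proxy-offender-two.example',
);

$proxyScriptPattern = '/nph-proxy\.(pl|cgi)|go\.php|tracker2\.php/i';

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$parts   = $referer !== '' ? parse_url($referer) : array();
$refHost = isset($parts['host']) ? strtolower($parts['host']) : '';

if ($refHost !== '' && in_array($refHost, $confirmedProxyHosts)) {
    // Confirmed offender: let it fetch and cache this notice instead of the page.
    echo 'This content was fetched through an unauthorized proxy script. ';
    echo 'Please visit the original site directly.';
    exit;
} elseif ($referer !== '' && preg_match($proxyScriptPattern, $referer)) {
    // Looks like a proxy script but is not confirmed yet: log it for review only.
    error_log('Possible proxy-script fetch, referer: ' . $referer);
}

// Otherwise fall through and serve the normal page.
?>

Include it at the top of the pages you care about (or via auto_prepend_file) and review the log for a while before promoting anything to the confirmed list.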
Matt Cutts' blog does not rank at all for "joy of bacon polenta" until you click 'repeat the search with the omitted results included' at the end of the SERPs.
This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent.
This particular case could be technically different from the Sept. 22 filter, but it demonstrates that other sites can remove you from Google search, intentionally or not.
R
Here is some food for thought.
There is a very particular combination of words which exists on Matt's blog, in the bacon polenta article, and which I used just for fun in one or two of my posts:
Mmmm. Bacon-y goodness
If you run the query
Mmmm. Bacon-y goodness
you will see the thread containing my post with the words "Mmmm. Bacon-y goodness" at the top of the SERP:
[google.com...]
Matt's blog is nowhere!
Sorry Matt. Better luck next time ;-)
"The exception (as always) - we believe - is high Page Rank. It is generally believe by some that mid-PR7 is considered the "white list" where penalties are dropped on a page - quite possibly - an entire site. This is why it is confusing to SEO's when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/Whitelist exception takes the arguments and washes them."
Do you mean that sites with a PR of 7 or higher are NOT penalized for duplicate content, or that their sites can get wiped out entirely?
I am sure he does not need the traffic. However, I wonder if he ever thought his site would be used to demonstrate how outside factors can affect a site. The old statement - "no one can hurt a site's rankings" - seems to be faltering. Maybe these personal demonstrations will help them fix some of the issues.
"This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent. This particular case could be technically different from the Sept. 22 filter, but it demonstrates that other sites can remove you from Google search, intentionally or not."
I search for original text quite often to track down scrapers. I'm wondering if scrapers have a new method of getting your content captured onto their site by using base href to set up a cache of your site. See the discussion here:
[webmasterworld.com...]
>>so the best bet we have so far as to what criteria google uses to determine "original source" is which page was found first by the spider? <<
If I recall correctly, GoogleGuy used the term "best page", not "original page", when choosing among originals and duplicates.
The question is: how do GoogleGuy and the folks at the 'plex define a "best page"?
But it does relate to this thread so here goes...
nph-proxy pages in G
How to safely block access
A very greasy hacker-type site has over 20,000 pages in G, most of them duplicates of legitimate pages on other sites, reached via the hacker site's nph-proxy.cgi tool.
So the URL listed in G is along the lines of: (I've added the * to replace the domain name)
When you click the link in G you go straight to the Legitimate Site. G would automatically impose a duplicate penalty.
Questions:
1) Why does the hacker site do this? What do they get from the exercise, apart from harming other sites?
2) What do I say to G? Is it valid to send them a DMCA notice? I'm cautious about contacting G in case they ham-fistedly remove the Legitimate pages. Has anybody else reported this to G with success?
3) Is it safe to block access to our site for people visiting via nph-proxy.cgi tools? The only people I'd welcome using it would be punters using nph to get around a government ban, as in China. Any other valid reasons for using nph-proxy?
4) What is the best way to block nph-proxy (and variant) users from accessing our site?
Sample mod-rewrite code please.
Ta!
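Not the mod_rewrite ruleset asked for in (4), but here is a rough PHP sketch of the same idea, assuming - and it is only an assumption, so check your own access logs - that a fetch made through an nph-proxy style script leaves the script name in the Referer or in the requested path. The same test could be expressed as a RewriteCond against %{HTTP_REFERER}.

<?php
// Rough sketch for question 4: refuse requests that appear to come via an
// nph-proxy style script. The detection pattern is an assumption - confirm
// in your logs that such fetches really carry this signature before deploying.

$suspectPattern = '/nph-proxy\.(cgi|pl)/i';

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$request = isset($_SERVER['REQUEST_URI'])  ? $_SERVER['REQUEST_URI']  : '';

if (preg_match($suspectPattern, $referer) || preg_match($suspectPattern, $request)) {
    header('HTTP/1.1 403 Forbidden');
    // A short explanation is friendlier than a blank page for the rare
    // legitimate visitor (question 3) who is proxying around a national block.
    echo 'Sorry - this site is not available through anonymous proxy scripts. ';
    echo 'Please visit it directly.';
    exit;
}

// Otherwise continue with the normal page.
?>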
We have the same problem as you mention: "This is similar to what happened to many WebmasterWorld members on Sept 22nd. Take any snippet of original text from anywhere on an affected site, put it in quotes, and search G: hundreds and thousands of scrapers copying the original, with the original domain not seen until 'search omitted results'. This in itself is no big deal - who will search for your unique snippets of text anyway? The problem is that the Sept. 22 filter downgrades the whole domain for nearly all searches, making Google traffic virtually nonexistent."
What exactly are the Sept. 22 filters?
Looks even better in bold, red 18 pt. :)
These people have screwed up my Google PR, cost me a lot of money, and now made me spend time implementing a PHP blocking script to deliver this nifty little message instead of one of my pages, so it's pretty small satisfaction.