Duplicate Content Exploit in Google

Forum Moderators: Robert Charlton & goodroi


martinibuster

2:13 pm on Nov 6, 2012 (gmt 0)

A disturbing but, I think, important article on the Dejanseo.com.au blog, How I Hijacked Rand Fishkin's Blog [dejanseo.com.au], details a test run last month in which pages from three blogs were volunteered as test subjects to check for a possible bug in Google's algorithm that would allow someone to duplicate a web page and then replace the original in the SERPs.

The results were mixed with some interesting discoveries.

[edited by: Robert_Charlton at 9:39 pm (utc) on Nov 7, 2012]
[edit reason] fixed typo at poster request [/edit]

goodroi

8:55 pm on Nov 6, 2012 (gmt 0)

Great read, time to make sure all my sites have canonical tags.

Planet13

9:20 pm on Nov 6, 2012 (gmt 0)

Thanks for sharing this!

It frustrates the heck out of me when eBay and Amazon pages use my text.

zarathustra2011

9:44 pm on Nov 6, 2012 (gmt 0)

Since the algo change, all sorts of pitiful low-quality sites rank way above me for original text written years ago by my own hand. What bugs me is that the majority of these scraper sites are not at all popular or well rated, whereas my site was.

The majority of sites that have used parts of my text (without permission) have linked back to my original pages, yet it seems I'm no longer attributed as the author, and there's nothing I can do about it.

I have the canonical tags set on every page.

I sent a DMCA to Amazon just last week about a page that had used publisher text details copied and pasted from my homepage. To be fair to Amazon, they not only removed the page and the product itself, but totally shut down the seller's entire account.
Problem is, if I want to start DMCA'ing everybody, I haven't the hours in the day to do it.

viral

11:07 pm on Nov 6, 2012 (gmt 0)

So what this says is that Google doesn't give a #%^% about who originally made the content. Whenever you see Matt Cutts talking about this, he says they work very hard at making sure that the originator of the article gets the reward. Now we know this is BS. Awesome!

So basically, buy yourself a big PR domain for $5k and start stealing other people's content? Good to know!

MrSavage

1:15 am on Nov 7, 2012 (gmt 0)

I won't get into specifics, but I'm dealing with the most vile and damaging beast I've ever seen. Various traditional attempts have failed to get any results.

It's what this article suggests: your content is used by somebody else, and Google assigns the authority to them and treats your site as nonexistent. It is so bad that if you copy part of your own text and search for it in quotes, Google fetches the "other guy's" site, and to see yours you need to click the "see omitted search results" link (or whatever that paragraph says) that shows the "duplicate" pages in the Google results.

This is so nasty and frustrating what can I say. It took months and months to actually realize what happened to my site. It has been eaten alive. It's a massive issue and frankly if I did this to your site or anyone's site, you would likely get cut out of owning your own content. It's far greater an issue than a scrape job. This one eats up your existence. All the tools that Google provides have proven completely futile in my situation. Again, I can't mention who is doing this but I see it with other sites. I've seen the other victims. I hope Google figures this one out but I frankly don't have a phone number to call. This is a biggie.

tedster

3:28 am on Nov 7, 2012 (gmt 0)

The way I read this test, it shows that "sometimes" Google gets original attribution right and other times it gets it wrong. I also see a lot of odd mixed results for very long tail searches - the kind you might do if you already know what content is available, but not the kind you would do if you are REALLY searching for something.

Google has already stated that they do not think always returning the original source is the best thing to do - that sometimes people want a more recognizable source even if it's only quoting the original. I'm not sure how I feel about that because I really can see both sides. Getting the message out there (the content itself) could be paramount in some situations, but certainly not in others. In others, getting the website the traffic is the key thing. It's a very complex issue, as I see it.

viral

3:41 am on Nov 7, 2012 (gmt 0)

@tedster

I can see what you are trying to say, but really, if they keep attributing content to the high-PR "scraper sites" (and I include in this the sites that rewrite articles so they barely pass as original), eventually there will be no sites left to make original content.

OK, I am taking this to the very extreme end, but you see what I am trying to say? In a lot of these cases the genius behind the content is killed off. So Google just trying to get the content out there is not good enough!

tedster

3:57 am on Nov 7, 2012 (gmt 0)

However Google isn't doing any one thing consistently, especially given the long tail nature of the queries. So somehow the algorithm itself is trying to adjust for different situations. And as I recall, this is not anywhere NEAR a new situation. If anything, I think it's more limited than it was a year ago.

Another factor here is that the case studies are relatively new, and freshness also can tilt results toward a "copy" for a while.

At any rate, we're here in this forum to discuss the SEO aspects, not to somehow pass judgment on "right or wrong." We've got a situation that we work within, like it or not, and the better we understand it the better we can do within it. Google is not waiting for site owners and SEOs to vote on what they do, after all ;)

These case studies are very interesting for the light they shed on various corners of the algorithm and SEO. There are a lot of factors being reported on or at least hinted at here. Well worth a close look.

viral

4:10 am on Nov 7, 2012 (gmt 0)

@tedster

At any rate, we're here in this forum to discuss the SEO aspects, not to somehow pass judgment on "right or wrong."

True. I can't help feeling my agitation levels rise when I see things like this, but you are right: we aren't here to pass judgment but to analyze.

Sgt_Kickaxe

6:22 am on Nov 7, 2012 (gmt 0)

There is lots to wrap your head around in this article. Since we've decided not to pass judgment on right and wrong in this thread I have to say that I'm wanting to see if I can replicate what they did. Not for malicious reasons but to better understand the intentionally secretive ranking methods @Google. I'm not going to because Google has very specific guidelines against it, but wow.

I've always felt that duplicate content is wrong, wrong, wrong and that using it was poison, but Google didn't have Rand's back on this one. It leaves a mere mortal webmaster feeling quite vulnerable.

viral

6:29 am on Nov 7, 2012 (gmt 0)

@Sgt_Kickaxe

I think the fact that they could do it to Rand is what freaked me out too! If you are in the USA, try searching rand fishkin blog, no exact quotes needed!

JennaIshley

9:17 am on Nov 7, 2012 (gmt 0)

i hate hacking. guys this is unethical activity. Maybe for some people this is just entertainment, but some other guys do serious damage.

lucy24

10:39 am on Nov 7, 2012 (gmt 0)

Google has already stated that they do not think always returning the original source is the best thing to do - that sometimes people want a more recognizable source even if it's only quoting the original.

Quoting or plagiarizing? Ethics aside, they're functionally different things. They look different and they "read" different. Which one is google talking about?

martinibuster

1:33 pm on Nov 7, 2012 (gmt 0)

i hate hacking. guys this is unethical activity

This particular project was not hacking, nor was it unethical. Please read the article. Rand Fishkin participated in this experiment and volunteered a specific web page to be used for the experiment.

onebuyone

4:34 pm on Nov 7, 2012 (gmt 0)

Same test should be run on Bing.

diberry

5:08 pm on Nov 7, 2012 (gmt 0)

So it looks like rel="canonical" is probably significantly more helpful than authorship, but there's really no sure defense here.

Any thoughts on what we can do, then? Should we spend lots of time going through Copyscape and sending out DMCAs? That's all I got.
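
For anyone weighing whether a suspected copy is worth a DMCA, a rough overlap score can help with triage. The following is a minimal sketch only, assuming word-level shingles and Jaccard similarity; it is not how Copyscape or Google measures duplication, and the function names are made up for illustration.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole thing as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard_similarity(a, b, size=5):
    """Share of shingles common to both texts, from 0.0 (none) to 1.0 (all)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests a near-verbatim copy, while a low score suggests a partial lift or a rewrite. Where to draw the line is a judgment call.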

thedonald123

5:29 pm on Nov 7, 2012 (gmt 0)

From the article:

When there are two identical documents on the web, Google will pick the one with higher PageRank and use it in results. It will also forward any links from any perceived "duplicate" towards the selected "main" document.

Does this mean that as long as my website is ranking higher than all the sites stealing my content, they are now passing their PageRank /link juice to me?

And what about the inverse? If my Panda/Penguin/EMD penalized websites are ranking below the sites which stole my content does that mean that Google is giving them my PageRank /link juice?

The implications would be pretty amazing, actually mind-boggling.
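
The rule quoted from the article can be written down as a toy model, which makes both questions above easier to reason about. This is only a sketch of the article's claim as worded, not Google's actual algorithm; the Page class, its fields, and pick_main_document are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    pagerank: float
    inbound_links: list = field(default_factory=list)

def pick_main_document(duplicates):
    """Pick the highest-PageRank copy and credit it with every copy's links.

    Toy model of the article's description only (an assumption, not a
    documented Google behaviour).
    """
    main = max(duplicates, key=lambda page: page.pagerank)
    forwarded = [link for page in duplicates for link in page.inbound_links]
    return main, forwarded
```

Under this model the answer to both questions is yes: whichever copy has the higher PageRank becomes the "main" document and collects the links pointing at all the copies, whether or not it is the original.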

martinibuster

6:48 pm on Nov 7, 2012 (gmt 0)

>>>that Google is giving them my PageRank /link juice?

It appears that it's giving it to that particular page, yes. Anyone have an opinion/theory/guess how this intersects with the disavowal tool and scraped content that links?...

chalkywhite

8:56 pm on Nov 7, 2012 (gmt 0)

So what that's saying is basically I can copy most rival sites' pages, titles and metas, and I will rank higher than the original if my PR is higher? Scary.

diberry

9:21 pm on Nov 7, 2012 (gmt 0)

I wonder if over time Google detects that the higher PR domain is just scraping and lowers its PR?

I can understand in the case of, say, a press release where loads of sites have republished something word for word with permission, Google would rank them in PR order. But for content that's just informational, I would think Google would intend to rank the original above any scrapers.

If so, that could mean that the problem is not so much that Google can't identify the original as that originality is not as important as other aspects of the algorithm. If Google is, say, equally concerned about originality and pagerank, then just making sure they know who's the original might not cut it.

lucy24

10:49 pm on Nov 7, 2012 (gmt 0)

It will also forward any links from any perceived "duplicate" towards the selected "main" document.

! So that's where they get the "via this intermediate link" stuff. It's not google's fevered imagination, it's a deliberate and calculated act.

:: wandering off in search of correct wording for "unauthorized mirror" meta tag ::

viral

12:17 am on Nov 8, 2012 (gmt 0)

All I know is that I am now in the market for some high PR domains! I got a couple anyway, but I want some really big ones. Like PR 8! Not because I want to steal other people's content and outrank them, but because this has shown that PR is more important than I thought! For a long time now Matt and Danny and anyone else who had anything to say on the matter were saying that PR is less important than it was and losing importance by the day. I think this shows that PR is just as important as it ever was, and maybe more so in this post-Penguin and Panda world.

indyank

4:44 pm on Dec 9, 2012 (gmt 0)

I got a couple anyway, but I want some really big ones. Like PR 8! Not because I want to steal other people's content and outrank them, but because this has shown that PR is more important than I thought! For a long time now Matt and Danny and anyone else who had anything to say on the matter were saying that PR is less important than it was and losing importance by the day.

I think you are missing the crux. read Mr. martinibuster again...

@JennaIshley Hope you haven't signed up here to reply on this thread :)

Tami

2:46 am on Dec 13, 2012 (gmt 0)

zarathustra2011 and goodroi and all,

I have the canonical tags set on every page.

Great read, time to make sure all my sites have canonical tags.

I am confused. Are you using the rel="canonical" tag on all your site's pages to protect your site from content thieves and scrapers?

I thought the rel="canonical" tag was for duplicate pages within the same site, and that you put the rel="canonical" tag in the head of the pages that are duplicates and point the tag's link to the URL of the preferred, non-duplicated page.

When you say "have the canonical tags set on every page", did you put the rel="canonical" tag with a link to that page's own URL? For example, for example.com/index.html the canonical tag in its head would be: <link rel="canonical" href="http://example.com/index.html" />

Is this correct? If so should we be doing this?
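
For what it's worth, a self-referencing tag like the one in the example above is easy to generate from a template. Here is a minimal sketch, assuming each page knows its own preferred URL; canonical_link_tag is a hypothetical helper name, not part of any CMS.

```python
from html import escape

def canonical_link_tag(page_url):
    """Build a <link rel="canonical"> element pointing at page_url.

    The URL is HTML-escaped so characters such as & and " cannot
    break out of the href attribute.
    """
    return '<link rel="canonical" href="{}" />'.format(escape(page_url, quote=True))

print(canonical_link_tag("http://example.com/index.html"))
```

The result goes inside the page's head element. Whether it helps against copies on other domains is a separate question from whether the markup is valid.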

viral

3:03 am on Dec 13, 2012 (gmt 0)

I personally think canonical will do nothing. As Tami said, it is supposed to be a device that helps internal site structure. Secondly, it is not hard for a scraper site to strip that tag out of the HTML.
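
Whether a given copy kept or stripped the tag is at least easy to check. The following is a rough sketch using only the Python standard library; it assumes you already have the copied page's HTML as a string, and CanonicalFinder and find_canonical are illustrative names.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except <link> tags, and stop after the first hit.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")

def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical
```

If find_canonical returns None for a copy of your page, the tag was removed or never carried over. If it returns your own URL, the copy still points back at the original.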