

Report a Scraper Outranking You - Matt Cutts tweet

     
9:19 pm on Feb 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Matt Cutts tweeted out that they're collecting information on scrapers that outrank you for your own content. No answer yet if they're actually going to *do* anything about it, or just use the data in the algorithm.

Original tweet:
https://twitter.com/mattcutts/status/439122708157435904 [twitter.com]

If you see a scraper URL outranking the original source of content in Google, please tell us about it: http://bit.ly/scraperspamreport

Scraper report link:
https://docs.google.com/forms/d/1Pw1KVOVRyr4a7ezj_6SHghnX1Y6bp1SOVmy60QjkF0Y/viewform [docs.google.com]

So when is WebmasterWorld gonna process secure links? *My* calendar says it's 2014.

[edited by: Robert_Charlton at 10:05 pm (utc) on Feb 27, 2014]
[edit reason] added quote and sorta cheated to fix https links [/edit]

10:09 pm on Feb 27, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11317
votes: 169


No answer yet if they're actually going to *do* anything about it, or just use the data in the algorithm.

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.
10:51 pm on Feb 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member lame_wolf is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Dec 30, 2006
posts:3224
votes: 9


Will that go for Pinterest too?
11:04 pm on Feb 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Probably not.
11:18 pm on Feb 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member lame_wolf is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Dec 30, 2006
posts:3224
votes: 9


That's a shame, as it is one of the largest file-stealing sites out there.
3:51 pm on Feb 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Oct 14, 2013
posts:2116
votes: 157


I've given it a go, though I'm sure I've seen this before, a couple of years ago.

I'll be amazed if they do anything about it. In any case, with Google being the biggest spammer and hotlinker now, I need more than a few hundred pages a day to get back to where I was before their image grab.
5:52 pm on Feb 28, 2014 (gmt 0)

Administrator

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 2, 2000
posts: 9685
votes: 0


Google seems to be pretty good at de-ranking some scrapers. I had a site that had a bunch of articles copied by an apparently legit site - real business, real people. I couldn't find the copied content even with aggressive searching.

Maybe they needed a better scraping technique.
5:52 pm on Feb 28, 2014 (gmt 0)

Full Member

10+ Year Member

joined:June 4, 2005
posts: 240
votes: 31


Despite (reasonable) reservations, it sounds good. As always, the proof of the pudding is in the eating. The future will show.
6:12 pm on Feb 28, 2014 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:22318
votes: 240


There are two things here:

1. How will the data submitted be used?

2. How will it affect the original site?

If it's a thin-content scraper, surely it'll be a good thing to nuke.
6:38 pm on Feb 28, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2415
votes: 24


In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Now that's an absolutely terrifying idea because it could be true. :) Surely it would be easy enough for Google to identify scrapers?

Regards...jmcc
8:06 pm on Feb 28, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


So then this happened:

[searchengineland.com...]

ork ork
8:33 pm on Feb 28, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


I had that joke pinged to me... To be fair, Wikipedia is downloadable via a tarball, so it's not really scraping. Wikipedia is also a 'part' of the Freebase collection, which Google owns.

Still, I'm sure there are 10^100 other examples so the point is made ;o)
11:59 am on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:June 19, 2013
posts:388
votes: 0


I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

If that is true, this is taking a much darker turn of events!
12:39 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2415
votes: 24


Before or after the tweet? It is a bit unsettling though.

Regards...jmcc
1:09 pm on Mar 1, 2014 (gmt 0)

Preferred Member

joined:Oct 15, 2011
posts:429
votes: 0


I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

Better him than us! These days I do my best to avoid Google at all costs, as I'm sure many other people here do too.

Let me tell you how one of our clients handled scraper sites that stole his content and were outranking him. He reported all the scraper sites in Webmaster Tools. He expected something to happen from that, but after 90 days he contracted an SEO company that handled reputation management. According to the client, the SEO company charged a good buck but got all those scrapers removed in about a week. How did they do it? The client said the SEO company email-spammed the scraper sites, and the hosts took the sites offline for TOS violations. A shrewd method, but for him it worked.

I can see these types of "vigilante justice" attacks increasing as search engines like Google have given such a low priority to content theft. Will this latest action by Google do any good? Who knows, and who knows how it will be used outside of its publicly stated purpose. Anyway, just the fact that Google is soliciting this information should confirm that their algorithm is flawed and incapable of determining the source of information without links.
2:26 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:June 19, 2013
posts:388
votes: 0


Jacor, in the guy's Twitter feed he says, a couple of tweets later, that his sites just had penalties after the original tweet... Scary stuff and very sad if true; he probably has kids to take care of :/
2:37 pm on Mar 1, 2014 (gmt 0)

Senior Member from LK 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2419
votes: 17


The problem with that tweet is that Google is taking content that they are allowed to take, either by normal copyright law (which allows copying small excerpts) or by the terms of its license, showing it on their site, and providing a link and attribution.

A scraper plagiarises (by not attributing or linking) and is copying in breach of copyright.

That is a huge difference.
2:40 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 29, 2001
posts:1081
votes: 16


That's a shame, as it is one of the largest file-stealing sites out there.


Ever heard of Scribd?
4:16 pm on Mar 1, 2014 (gmt 0)

Preferred Member from US 

Top Contributors Of The Month

joined:Oct 5, 2012
posts:644
votes: 34


The problem with that tweet is that Google is taking content that they are allowed to take, either by normal copyright law (which allows copying small excerpts) or by the terms of its license, showing it on their site, and providing a link and attribution.


I'm going to disagree here. A scraper outranking the original content producer in a search engine is not, on the surface, a copyright issue. It's a "why is the search engine promoting the scraper instead of the originator" issue. Just because you can legally take information from Wikipedia, why should you outrank it for the exact same information that it originated? That's the issue here, not copyright.

Google, or any website, taking content, legal or not, and then outranking the originator for that same content is an issue in and of itself.
4:41 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 5, 2009
posts:1233
votes: 137


Wasn't this news about 8 months ago? A scraper reporting tool? I'm just wondering what makes this different now than months and months ago.
5:17 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 16, 2003
posts:992
votes: 0


But Wikipedia itself is not a source. It's crowdsourced and manually written, but most of its content comes from other places.
5:35 pm on Mar 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6160
votes: 284


You would think that, with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and the content thereof... and henceforth show THAT site as the originator of that content. Thus any site duplicating that content is NOT the original source.

But that, of course, would be the Perfect World.
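
(A rough sketch of that "first indexed wins" idea, purely hypothetical and not a claim about how Google actually works: the crawler keeps a fingerprint of each piece of content the first time it sees it, and flags any later URL carrying the same fingerprint as a duplicate. All names and URLs below are made up for illustration.)

```python
# Hypothetical sketch: remember who published a piece of content first.
import hashlib
import time

first_seen = {}  # content fingerprint -> (url, timestamp of first crawl)

def record_crawl(url, content):
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in first_seen:
        first_seen[key] = (url, time.time())
        return "original"
    original_url, _ = first_seen[key]
    return "duplicate of " + original_url

# Site A is crawled first, so the later copy on site B is marked a duplicate.
print(record_crawl("http://site-a.example/page", "the article text"))
print(record_crawl("http://site-b.example/copy", "the article text"))
```

As the next reply points out, this only works if the original really is crawled first.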
8:26 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:June 28, 2013
posts:2387
votes: 219


You would think that, with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and the content thereof... and henceforth show THAT site as the originator of that content.


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

Site A can try to protect itself by submitting its new pages immediately, but unless it makes that effort (as many or even most sites probably don't), how is Google to know that site A published the page before site B did?

Fortunately, Google has other signals at its disposal to determine who should be ranked for what. If site B is a typical worthless scraper site and site A has any value at all, site A's version of the page should be able to rank higher simply because site A has more authority, more trust, better inbound links, etc.
9:15 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


Is it just me, or does Google just seem so lame these days? I mean, "report a scraper"? Excuse my French, but WTF? Are we really in 2014, or did I just dream the last 15 years and we're actually still in 1999?
10:09 pm on Mar 1, 2014 (gmt 0)

Preferred Member

joined:Oct 15, 2011
posts:429
votes: 0


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

The argument still holds. We've had a number of WordPress-based client sites scraped. These sites are all configured to ping the search engines immediately after a new page goes live, and Google will crawl the page within minutes. Even years later, someone can scrape a page and it may outrank the original. Coincidentally, most of these scrapers are using a Google-owned property (Blogspot) to outrank the originals.
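
(For anyone wondering what that ping amounts to, here is a minimal sketch of the kind of search-engine notification a WordPress ping/sitemap plugin sends when a new page goes live. The sitemap URL is a placeholder; the endpoint shown is Google's sitemap ping service as it existed around the time of this thread.)

```python
# Minimal sketch: tell a search engine the sitemap has a fresh URL,
# roughly what a WordPress ping/sitemap plugin does right after publishing.
from urllib.parse import quote
from urllib.request import urlopen

def ping_google(sitemap_url):
    # Google's sitemap ping endpoint (circa 2014; it has since been retired).
    ping_url = "http://www.google.com/ping?sitemap=" + quote(sitemap_url, safe="")
    with urlopen(ping_url) as response:
        return response.status  # 200 means the ping was accepted

if __name__ == "__main__":
    print(ping_google("http://example.com/sitemap.xml"))
```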
10:48 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


Might I point out the obvious: Google still can't tell the difference between site A and site B when neither site is obviously mega-authoritative. And so we have Matt Cutts saying "hey guys, report to us when your page gets scraped and we'll see what we can do". So basically we're stuck in the '90s with duplicate content and the question of who the original author is. Progress isn't slow. It's slower than that.
11:29 pm on Mar 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Google can't win here - if they don't ask for examples, people howl and deride, and when they do ask for examples, people howl and deride. In many cases, the same people.

<shrug>

Life goes on.
11:33 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


if they don't ask for examples, people howl and deride


I didn't howl and deride when they didn't ask for examples (who did?). Seriously - who wants Google's index to be shaped by sporadic, manual reports of scrapers? If you're right, we might as well manually curate the entire web with a million human beings. Who said "Google, shape your algo around manual reports of scrapers!"? How can that realistically work given the scale of the problem? The problem's perfect for algorithm-based solutions. If not, then we're never going to beat this.

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that. spammy.biz, registered 3 months ago, scraping and outranking authoritydomain.com, registered in 1998, is a typical situation. Not just highlighting age here, but many other factors that Google strangely overlooks and ignores (or, more bizarrely, can't recognise). Clearly, I assumed too much of Google's algorithm - more my problem than theirs :) - so back to the manual reports, I guess...
12:02 am on Mar 2, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2511
votes: 142


I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that.
(Emphasis mine)

And then there will be complaints that Google ranks brands... As netmeg said, Google cannot win here.

spammy.biz, registered 3 months ago, scraping and outranking authoritydomain.com, registered in 1998, is a typical situation.

Hopefully the dataset of reported scraper sites will confirm this and the algo will learn.
12:26 am on Mar 2, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Yeah, can't say there's an obvious answer due to Google not knowing which came first. Even then, the first publisher may not be the true owner of the content.

Tedster had mentioned PubSubHubbub last time I saw this topic covered. I like the idea of pinging a checksum and one or two words of a content block, as it seems to be low overhead. A search engine adopting something like that would have to evaluate their snapshot of the web first, though.
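
(A minimal sketch of that low-overhead fingerprint ping: for each content block, the publisher pushes a checksum plus one or two words of the block. The payload format and the idea of POSTing it to a hub are entirely hypothetical, just to illustrate how little data would need to travel.)

```python
# Hypothetical sketch of the checksum-plus-a-couple-of-words ping described above.
import hashlib
import json
import time

def block_fingerprint(block):
    words = block.split()
    return {
        "sha256": hashlib.sha256(block.encode("utf-8")).hexdigest(),
        "sample": " ".join(words[:2]),   # one or two words of the block
        "published": int(time.time()),   # claimed publication time
    }

# Payload a publisher might POST to a (hypothetical) hub when a page goes live.
payload = json.dumps([block_fingerprint("First paragraph of the freshly published page.")])
print(payload)
```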