Report a Scraper Outranking You - Matt Cutts tweet
netmeg

Msg#: 4649831 posted 9:19 pm on Feb 27, 2014 (gmt 0)

Matt Cutts tweeted that they're collecting information on scrapers that outrank you for your own content. No answer yet as to whether they're actually going to *do* anything about it, or just use the data in the algorithm.

Original tweet:
https://twitter.com/mattcutts/status/439122708157435904 [twitter.com]

If you see a scraper URL outranking the original source of content in Google, please tell us about it: http://bit.ly/scraperspamreport

Scraper report link:
https://docs.google.com/forms/d/1Pw1KVOVRyr4a7ezj_6SHghnX1Y6bp1SOVmy60QjkF0Y/viewform [docs.google.com]

So when is WebmasterWorld gonna process secure links? *My* calendar says it's 2014.

[edited by: Robert_Charlton at 10:05 pm (utc) on Feb 27, 2014]
[edit reason] added quote and sorta cheated to fix https links [/edit]

 

Robert Charlton

Msg#: 4649831 posted 10:09 pm on Feb 27, 2014 (gmt 0)

No answer yet as to whether they're actually going to *do* anything about it, or just use the data in the algorithm.

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Lame_Wolf

Msg#: 4649831 posted 10:51 pm on Feb 27, 2014 (gmt 0)

Will that go for Pinterest too?

netmeg

Msg#: 4649831 posted 11:04 pm on Feb 27, 2014 (gmt 0)

Probably not.

Lame_Wolf

Msg#: 4649831 posted 11:18 pm on Feb 27, 2014 (gmt 0)

That's a shame as it is one of the largest file stealing sites out there.

RedBar

Msg#: 4649831 posted 3:51 pm on Feb 28, 2014 (gmt 0)

I've given it a go, though I'm sure I've seen this before, a couple of years ago.

I'll be amazed if they do anything about it. In any case, with Google now being the biggest spammer and hotlinker, I need more than a few hundred pages a day to get me back to where I was before their image grab.

rogerd

Msg#: 4649831 posted 5:52 pm on Feb 28, 2014 (gmt 0)

Google seems to be pretty good at de-ranking some scrapers. I had a site that had a bunch of articles copied by an apparently legit site - real business, real people. I couldn't find the copied content even with aggressive searching.

Maybe they needed a better scraping technique.

heisje

Msg#: 4649831 posted 5:52 pm on Feb 28, 2014 (gmt 0)

Despite (reasonable) reservations, sounds good. As always, the proof of the pudding is in the eating. Time will tell.

engine

Msg#: 4649831 posted 6:12 pm on Feb 28, 2014 (gmt 0)

There are two things here:

1. How will the data submitted be used?

2. How will it affect the original site?

If it's a thin-content scraper, surely it'll be a good thing to nuke.

jmccormac

Msg#: 4649831 posted 6:38 pm on Feb 28, 2014 (gmt 0)

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Now that's an absolutely terrifying idea because it could be true. :) Surely it would be easy enough for Google to identify scrapers?

Regards...jmcc

netmeg

Msg#: 4649831 posted 8:06 pm on Feb 28, 2014 (gmt 0)

So then this happened:

[searchengineland.com...]

ork ork

brotherhood of LAN

Msg#: 4649831 posted 8:33 pm on Feb 28, 2014 (gmt 0)

I had that joke pinged to me... To be fair, Wikipedia is downloadable as a tarball, so it's not really scraping. Wikipedia is also 'part' of the Freebase collection, which Google owns.

Still, I'm sure there are 10^100 other examples, so the point is made ;o)

CaptainSalad2



 
Msg#: 4649831 posted 11:59 am on Mar 1, 2014 (gmt 0)

I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted about it).

If that is true, this is taking a much darker turn!

jmccormac

Msg#: 4649831 posted 12:39 pm on Mar 1, 2014 (gmt 0)

Before or after the tweet? It is a bit unsettling though.

Regards...jmcc

turbocharged



 
Msg#: 4649831 posted 1:09 pm on Mar 1, 2014 (gmt 0)

I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted about it).

Better him than us! These days I do my best to avoid Google at all costs, as I'm sure many other people here do too.

Let me tell you how one of our clients handled scraper sites that stole his content and were outranking him. He reported all the scraper sites in Webmaster Tools. He expected something to happen from that, but after 90 days he contracted an SEO company that handled reputation management. According to the client, the SEO company charged a good buck but got all those scrapers removed in about a week.

How did they do it? The client said the SEO company email-spammed the scraper sites, and the hosts took the sites offline for TOS violations. A shrewd method, but for him it worked. I can see these types of "vigilante justice" attacks increasing as long as search engines like Google give such a low priority to content theft.

Will this latest action by Google do any good? Who knows, and who knows how the data will be used outside of its publicly stated purpose. In any case, just the fact that Google is soliciting this information should confirm that their algorithm is flawed and incapable of determining the source of information without links.

CaptainSalad2



 
Msg#: 4649831 posted 2:26 pm on Mar 1, 2014 (gmt 0)

Jacor, in the guy's Twitter feed he says, a couple of tweets later, that his sites got penalties after the original tweet... Scary stuff, and very sad if true; he probably has kids to take care of :/

graeme_p

Msg#: 4649831 posted 2:37 pm on Mar 1, 2014 (gmt 0)

The problem with that tweet is that Google is taking content that it is allowed to take, either under normal copyright law (which allows copying small excerpts) or under the terms of its license, displaying it on their site, and providing a link and attribution.

A scraper plagiarises (by not attributing or linking) and is copying in breach of copyright.

That is a huge difference.

Edge

Msg#: 4649831 posted 2:40 pm on Mar 1, 2014 (gmt 0)

That's a shame as it is one of the largest file stealing sites out there.


Ever heard of Scribd?

Shepherd



 
Msg#: 4649831 posted 4:16 pm on Mar 1, 2014 (gmt 0)

The problem with that tweet is that Google is taking content that it is allowed to take, either under normal copyright law (which allows copying small excerpts) or under the terms of its license, displaying it on their site, and providing a link and attribution.


I'm going to disagree here. A scraper outranking the original content producer in a search engine is not, on the surface, a copyright issue. It's a "why is the search engine promoting the scraper instead of the originator" issue. Just because you can legally take information from Wikipedia, why should you outrank them for the exact same information that they originated? That's the issue here, not copyright.

Google, or any website, taking content, legal or not, and then outranking the originator for that same content is an issue in and of itself.

MrSavage

Msg#: 4649831 posted 4:41 pm on Mar 1, 2014 (gmt 0)

Wasn't this news about 8 months ago? A scraper reporting tool? I'm just wondering what makes this different now from months and months ago.

Rosalind

Msg#: 4649831 posted 5:17 pm on Mar 1, 2014 (gmt 0)

But Wikipedia itself is not a source. It's crowdsourced and manually written, but most of its content comes from other places.

tangor

Msg#: 4649831 posted 5:35 pm on Mar 1, 2014 (gmt 0)

You would think that with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and its content too... and henceforth show THAT site as the originator of that content. Thus any site duplicating that content is NOT the original source.

But that, of course, would be the Perfect World.

EditorialGuy

Msg#: 4649831 posted 8:26 pm on Mar 1, 2014 (gmt 0)

You would think that with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and its content too... and henceforth show THAT site as the originator of that content.


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

Site A can try to protect itself by submitting its new pages immediately, but unless it makes that effort (as many or even most sites probably don't), how is Google to know that site A published the page before site B did?

Fortunately, Google has other signals at its disposal to determine who should be ranked for what. If site B is a typical worthless scraper site and site A has any value at all, site A's version of the page should be able to rank higher simply because site A has more authority, more trust, better inbound links, etc.
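
To make the crawl-order problem concrete, here is a toy sketch of "first crawled wins" bookkeeping, with the failure EditorialGuy describes built right in. All names are invented, and this is emphatically not how Google attributes origin:

```python
import hashlib

# Toy model: remember the earliest URL seen for each content
# fingerprint, so whoever gets crawled first "owns" the content.

first_seen = {}  # fingerprint -> (url, crawl_time)

def fingerprint(text):
    """Hash a normalized copy so trivial whitespace/case edits still match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def record_crawl(url, text, crawl_time):
    """Register a crawled page, keeping the earliest sighting per fingerprint."""
    fp = fingerprint(text)
    if fp not in first_seen or crawl_time < first_seen[fp][1]:
        first_seen[fp] = (url, crawl_time)

def presumed_origin(text):
    """Return the URL first seen with this content, if any."""
    entry = first_seen.get(fingerprint(text))
    return entry[0] if entry else None

# Site B (the scraper) happens to be crawled before site A:
record_crawl("http://site-b.example/copy", "the article text", crawl_time=1)
record_crawl("http://site-a.example/original", "the article text", crawl_time=2)
print(presumed_origin("the article text"))  # the scraper "wins"
```

Which is why crawl order alone can't settle ownership, and other signals have to carry the decision.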

ColourOfSpring



 
Msg#: 4649831 posted 9:15 pm on Mar 1, 2014 (gmt 0)

Is it just me, or does Google seem so lame these days? I mean, "report a scraper"? Excuse my French, but WTF? Are we really in 2014, or did I just dream the last 15 years and we're actually still in 1999?

turbocharged

Msg#: 4649831 posted 10:09 pm on Mar 1, 2014 (gmt 0)

That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

The argument still holds. We've had a number of client sites scraped that are based on WordPress. These sites are all configured to ping the search engines immediately after a new page goes live, and Google will crawl the page within minutes. Even years later, someone can scrape a page and it may outrank the original. Coincidentally, most of these scrapers are using a Google-owned property (Blogspot) to outrank the originals.
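
For reference, the ping those WordPress installs fire on publish is just a tiny XML-RPC call (the weblogUpdates.ping method). A minimal sketch in Python; the blog details are placeholders, with rpc.pingomatic.com being the update service WordPress ships with by default:

```python
import xmlrpc.client

# Minimal sketch of the XML-RPC ping WordPress sends when a post is
# published (the weblogUpdates.ping method). Blog details below
# are placeholders.

def ping_update_service(endpoint, blog_name, blog_url):
    server = xmlrpc.client.ServerProxy(endpoint)
    # weblogUpdates.ping returns a struct like
    # {'flerror': False, 'message': 'Thanks for the ping.'}
    result = server.weblogUpdates.ping(blog_name, blog_url)
    if result.get("flerror"):
        raise RuntimeError("Ping rejected: " + str(result.get("message")))
    return result.get("message")

# Hypothetical usage:
# print(ping_update_service("http://rpc.pingomatic.com/",
#                           "Example Blog", "http://www.example.com/"))
```

How much weight search engines actually give these pings is exactly what's in question in this thread.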

ColourOfSpring



 
Msg#: 4649831 posted 10:48 pm on Mar 1, 2014 (gmt 0)

Might I point out the obvious: Google still can't tell the difference between site A and site B when neither site is obviously mega-authoritative. And so we have Matt Cutts saying "hey guys, report to us when your page gets scraped and we'll see what we can do". So basically we're stuck in the 90s with duplicated content and the question of who the original author is. Progress isn't slow. It's slower than that.

netmeg

WebmasterWorld Senior Member netmeg us a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4649831 posted 11:29 pm on Mar 1, 2014 (gmt 0)

Google can't win here - if they don't ask for examples, people howl and deride, and when they do ask for examples, people howl and deride. In many cases, the same people.

<shrug>

Life goes on.

ColourOfSpring



 
Msg#: 4649831 posted 11:33 pm on Mar 1, 2014 (gmt 0)

if they don't ask for examples, people howl and deride


I didn't howl and deride when they didn't ask for examples (who did?). Seriously, who wants Google's index to be shaped by sporadic, manual reports of scrapers? If you're right, we might as well manually curate the entire web with a million human beings. Who said "Google, shape your algo around manual reports of scrapers!"? How can that realistically work, given the scale of the problem? The problem is perfect for algorithm-based solutions. If not, then we're never going to beat this.

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that. spammy.biz registered 3 months ago scraping and outranking authoritydomain.com registered in 1998 is a typical situation. I'm not just highlighting age here; there are many other factors that Google strangely overlooks and ignores (or, more bizarrely, can't recognise). Clearly, I assumed too much of Google's algorithm - more my problem than theirs :) - so back to the manual reports, I guess...
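
To put numbers on that "benefit of the doubt" idea, here is a purely hypothetical tie-break between two pages carrying identical content. Every signal, weight, and cap below is invented for illustration; nobody outside Google knows the real recipe:

```python
# Hypothetical tie-break for identical content: authority signals
# outweigh indexing order. All weights and caps are invented.

def tiebreak_score(domain_age_years, inbound_links, indexed_first):
    """Higher score = more plausible original for identical content."""
    age_signal = min(domain_age_years, 15) / 15.0       # cap age influence
    link_signal = min(inbound_links, 10000) / 10000.0   # cap link influence
    first_bonus = 0.1 if indexed_first else 0.0         # crawl order matters least
    return 0.5 * age_signal + 0.4 * link_signal + first_bonus

# spammy.biz: 3 months old, a dozen links, but indexed first.
scraper = tiebreak_score(0.25, 12, indexed_first=True)
# authoritydomain.com: registered in 1998, thousands of links, indexed later.
original = tiebreak_score(16, 5000, indexed_first=False)

print(round(scraper, 2), round(original, 2))  # roughly 0.11 vs 0.7
print(original > scraper)                     # True: authority wins the tie
```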

aakk9999

WebmasterWorld Administrator 5+ Year Member



 
Msg#: 4649831 posted 12:02 am on Mar 2, 2014 (gmt 0)

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that.
(Emphasis mine)

And then there will be complaints that Google ranks brands.... as netmeg said, Google cannot win here.

spammy.biz registered 3 months ago scraping and outranking authoritydomain.com registered in 1998 is a typical situation.

Hopefully the dataset of reported scraper sites will confirm this and the algo will learn.

brotherhood of LAN

Msg#: 4649831 posted 12:26 am on Mar 2, 2014 (gmt 0)

Yeah, I can't say there's an obvious answer, due to Google not knowing which came first. Even then, the first publisher may not be the true owner of the content.

Tedster had mentioned PubSubHubbub the last time I saw this topic covered. I like the idea of pinging a checksum and one or two words of a content block, as it seems to be low overhead. A search engine adopting something like that would have to evaluate its snapshot of the web first, though.
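
A ping along those lines really could be tiny. A rough sketch of the per-block record; the format is invented purely to show how little data it would take:

```python
import zlib

# One checksum plus a word or two per content block, so a search
# engine could later match ownership claims against its own snapshot
# of the web. The record format is invented for illustration.

def block_records(blocks):
    """Build one lightweight record per content block."""
    records = []
    for block in blocks:
        words = block.split()
        records.append({
            "crc32": format(zlib.crc32(block.encode("utf-8")), "08x"),
            "sample": " ".join(words[:2]),  # first couple of words only
            "length": len(block),
        })
    return records

page = [
    "First paragraph of a freshly published article...",
    "Second paragraph, and so on.",
]
for rec in block_records(page):
    print(rec)  # e.g. {'crc32': '...', 'sample': 'First paragraph', 'length': 49}
```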
