

Report a Scraper Outranking You - Matt Cutts tweet

     
9:19 pm on Feb 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Matt Cutts tweeted out that they're collecting information on scrapers that outrank you for your own content. No answer yet if they're actually going to *do* anything about it, or just use the data in the algorithm.

Original tweet:
https://twitter.com/mattcutts/status/439122708157435904 [twitter.com]

If you see a scraper URL outranking the original source of content in Google, please tell us about it: http://bit.ly/scraperspamreport

Scraper report link:
https://docs.google.com/forms/d/1Pw1KVOVRyr4a7ezj_6SHghnX1Y6bp1SOVmy60QjkF0Y/viewform [docs.google.com]

So when is WebmasterWorld gonna process secure links? *My* calendar says it's 2014.

[edited by: Robert_Charlton at 10:05 pm (utc) on Feb 27, 2014]
[edit reason] added quote and sorta cheated to fix https links [/edit]

10:09 pm on Feb 27, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11317
votes: 169


No answer yet if they're actually going to *do* anything about it, or just use the data in the algorithm.

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.
10:51 pm on Feb 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member lame_wolf is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Dec 30, 2006
posts:3224
votes: 9


Will that go for Pinterest too?
11:04 pm on Feb 27, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Probably not.
11:18 pm on Feb 27, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member lame_wolf is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Dec 30, 2006
posts:3224
votes: 9


That's a shame, as it is one of the largest file-stealing sites out there.
3:51 pm on Feb 28, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Oct 14, 2013
posts:2116
votes: 157


I've given it a go, though I'm sure I've seen this before, a couple of years ago.

I'll be amazed if they do anything about it. In any case, with Google being the biggest spammer and hotlinker now, I need more than a few hundred pages a day to get back to where I was before their image grab.
5:52 pm on Feb 28, 2014 (gmt 0)

Administrator

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 2, 2000
posts: 9685
votes: 0


Google seems to be pretty good at de-ranking some scrapers. I had a site that had a bunch of articles copied by an apparently legit site - real business, real people. I couldn't find the copied content even with aggressive searching.

Maybe they needed a better scraping technique.
5:52 pm on Feb 28, 2014 (gmt 0)

Full Member

10+ Year Member

joined:June 4, 2005
posts: 240
votes: 31


Despite (reasonable) reservations, it sounds good. As always, the proof of the pudding is in the eating. The future will show.
6:12 pm on Feb 28, 2014 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:22318
votes: 240


There are two things here:

1. How will the data submitted be used?

2. How will it affect the original site?

If it's a thin-content scraper, surely it'll be a good thing to nuke.
6:38 pm on Feb 28, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2415
votes: 24


In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Now that's an absolutely terrifying idea because it could be true. :) Surely it would be easy enough for Google to identify scrapers?

Regards...jmcc
8:06 pm on Feb 28, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


So then this happened:

[searchengineland.com...]

ork ork
8:33 pm on Feb 28, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


I had that joke pinged to me... To be fair, Wikipedia is downloadable via a tarball, so it's not really scraping. Wikipedia is also a 'part' of the Freebase collection, which Google owns.

Still, I'm sure there are 10^100 other examples so the point is made ;o)
11:59 am on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:June 19, 2013
posts:388
votes: 0


I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

If that is true, this is taking a much darker turn of events!
12:39 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2415
votes: 24


Before or after the tweet? It is a bit unsettling though.

Regards...jmcc
1:09 pm on Mar 1, 2014 (gmt 0)

Preferred Member

joined:Oct 15, 2011
posts:429
votes: 0


I read that after the guy made the tweet (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

Better him than us! These days I do my best to avoid Google at all costs, as I'm sure many other people here do too.

Let me tell you how one of our clients handled scraper sites that stole his content and were outranking him. He reported all the scraper sites in Webmaster Tools. He expected something to happen from that, but after 90 days he contracted an SEO company that handled reputation management. According to the client, the SEO company charged a good buck but got all those scrapers removed in about a week. How did they do it? The client said the SEO company email-spammed the scraper sites, and the hosts took the sites offline for TOS violations. A shrewd method, but for him it worked.

I can see these types of "vigilante justice" attacks increasing as search engines like Google have given such a low priority to content theft. Will this latest action by Google do any good? Who knows, and who knows how it will be used outside of its publicly stated purpose. Anyway, just the fact that Google is soliciting this information should confirm that their algorithm is flawed and incapable of determining the source of information without links.
2:26 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:June 19, 2013
posts:388
votes: 0


Jacor, in the guy's Twitter feed he says, a couple of tweets later, that his sites just had penalties after the original tweet... Scary stuff and very sad if true; he probably has kids to take care of :/
2:37 pm on Mar 1, 2014 (gmt 0)

Senior Member from LK 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2419
votes: 17


The problem with that tweet is that Google is taking content that they are allowed to take, either by normal copyright law (which allows copying small excerpts) or by the terms of its license, showing it on their site, and providing a link and attribution.

A scraper plagiarises (by not attributing or linking) and is copying in breach of copyright.

That is a huge difference.
2:40 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 29, 2001
posts:1081
votes: 16


That's a shame, as it is one of the largest file-stealing sites out there.


Ever heard of Scribd?
4:16 pm on Mar 1, 2014 (gmt 0)

Preferred Member from US 

Top Contributors Of The Month

joined:Oct 5, 2012
posts:644
votes: 34


The problem with that tweet is that Google is taking content that they are allowed to take, either by normal copyright law (which allows copying small excerpts) or by the terms of its license, showing it on their site, and providing a link and attribution.


I'm going to disagree here. A scraper outranking the original content producer in a search engine is not, on the surface, a copyright issue. It's a "why is the search engine promoting the scraper instead of the originator" issue. Just because you can legally take information from Wikipedia, why should you outrank it for the exact same information that it originated? That's the issue here, not copyright.

Google, or any website, taking content, legal or not, and then outranking the originator for that same content is an issue in and of itself.
4:41 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Aug 5, 2009
posts:1233
votes: 137


Wasn't this news about 8 months ago? A scraper reporting tool? I'm just wondering what makes this different now than months and months ago.
5:17 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 16, 2003
posts:992
votes: 0


But Wikipedia itself is not a source. It's crowdsourced and manually written, but most of its content comes from other places.
5:35 pm on Mar 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6160
votes: 284


You would think that, with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and the content thereof... and henceforth show THAT site as the originator of that content. Thus any site duplicating that content is NOT the original source.

But that, of course, would be the Perfect World.
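
(A rough sketch of that "first indexed wins" idea, purely hypothetical and not a claim about how Google actually works: the crawler keeps a fingerprint of each piece of content the first time it sees it, and flags any later URL carrying the same fingerprint as a duplicate. All names and URLs below are made up for illustration.)

```python
# Hypothetical sketch: remember who published a piece of content first.
import hashlib
import time

first_seen = {}  # content fingerprint -> (url, timestamp of first crawl)

def record_crawl(url, content):
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in first_seen:
        first_seen[key] = (url, time.time())
        return "original"
    original_url, _ = first_seen[key]
    return "duplicate of " + original_url

# Site A is crawled first, so the later copy on site B is marked a duplicate.
print(record_crawl("http://site-a.example/page", "the article text"))
print(record_crawl("http://site-b.example/copy", "the article text"))
```

As the next reply points out, this only works if the original really is crawled first.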
8:26 pm on Mar 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:June 28, 2013
posts:2387
votes: 219


You would think that, with all the indexing power G has, they could and SHOULD note when a new site comes into their index... and the content thereof... and henceforth show THAT site as the originator of that content.


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

Site A can try to protect itself by submitting its new pages immediately, but unless it makes that effort (as many or even most sites probably don't), how is Google to know that site A published the page before site B did?

Fortunately, Google has other signals at its disposal to determine who should be ranked for what. If site B is a typical worthless scraper site and site A has any value at all, site A's version of the page should be able to rank higher simply because site A has more authority, more trust, better inbound links, etc.
9:15 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


Is it just me, or does Google just seem so lame these days? I mean, "report a scraper"? Excuse my French, but WTF? Are we really in 2014, or did I just dream the last 15 years and we're actually still in 1999?
10:09 pm on Mar 1, 2014 (gmt 0)

Preferred Member

joined:Oct 15, 2011
posts:429
votes: 0


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

The argument still holds. We've had a number of WordPress-based client sites scraped. These sites are all configured to ping the search engines immediately after a new page goes live, and Google will crawl the page within minutes. Even years later, someone can scrape a page and it may outrank the original. Coincidentally, most of these scrapers are using a Google-owned property (Blogspot) to outrank the originals.
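
(For anyone wondering what that ping amounts to, here is a minimal sketch of the kind of search-engine notification a WordPress ping/sitemap plugin sends when a new page goes live. The sitemap URL is a placeholder; the endpoint shown is Google's sitemap ping service as it existed around the time of this thread.)

```python
# Minimal sketch: tell a search engine the sitemap has a fresh URL,
# roughly what a WordPress ping/sitemap plugin does right after publishing.
from urllib.parse import quote
from urllib.request import urlopen

def ping_google(sitemap_url):
    # Google's sitemap ping endpoint (circa 2014; it has since been retired).
    ping_url = "http://www.google.com/ping?sitemap=" + quote(sitemap_url, safe="")
    with urlopen(ping_url) as response:
        return response.status  # 200 means the ping was accepted

if __name__ == "__main__":
    print(ping_google("http://example.com/sitemap.xml"))
```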
10:48 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


Might I point out the obvious: Google still can't tell the difference between site A and site B when neither site is obviously mega-authoritative. And so we have Matt Cutts saying "hey guys, report to us when your page gets scraped and we'll see what we can do". So basically we're stuck in the '90s with duplicate content and the question of who the original author is. Progress isn't slow. It's slower than that.
11:29 pm on Mar 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Google can't win here - if they don't ask for examples, people howl and deride, and when they do ask for examples, people howl and deride. In many cases, the same people.

<shrug>

Life goes on.
11:33 pm on Mar 1, 2014 (gmt 0)

Preferred Member

Top Contributors Of The Month

joined:Mar 12, 2013
posts:500
votes: 0


if they don't ask for examples, people howl and deride


I didn't howl and deride when they didn't ask for examples (who did?). Seriously - who wants Google's index to be shaped by sporadic, manual reports of scrapers? If you're right, we might as well manually curate the entire web with a million human beings. Who said "Google, shape your algo around manual reports of scrapers!"? How can that realistically work given the scale of the problem? The problem's perfect for algorithm-based solutions. If not, then we're never going to beat this.

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that. spammy.biz, registered 3 months ago, scraping and outranking authoritydomain.com, registered in 1998, is a typical situation. Not just highlighting age here, but many other factors that Google strangely overlooks and ignores (or, more bizarrely, can't recognise). Clearly, I assumed too much of Google's algorithm - more my problem than theirs :) - so back to the manual reports, I guess...
12:02 am on Mar 2, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2511
votes: 142


I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that.
(Emphasis mine)

And then there will be complaints that Google ranks brands... As netmeg said, Google cannot win here.

spammy.biz, registered 3 months ago, scraping and outranking authoritydomain.com, registered in 1998, is a typical situation.

Hopefully the dataset of reported scraper sites will confirm this and the algo will learn.
12:26 am on Mar 2, 2014 (gmt 0)

Moderator This Forum from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Yeah, can't say there's an obvious answer due to Google not knowing which came first. Even then, the first publisher may not be the true owner of the content.

Tedster had mentioned PubSubHubbub last time I saw this topic covered. I like the idea of pinging a checksum and one or two words of a content block, as it seems to be low overhead. A search engine adopting something like that would have to evaluate their snapshot of the web first, though.
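
(A minimal sketch of that low-overhead fingerprint ping: for each content block, the publisher pushes a checksum plus one or two words of the block. The payload format and the idea of POSTing it to a hub are entirely hypothetical, just to illustrate how little data would need to travel.)

```python
# Hypothetical sketch of the checksum-plus-a-couple-of-words ping described above.
import hashlib
import json
import time

def block_fingerprint(block):
    words = block.split()
    return {
        "sha256": hashlib.sha256(block.encode("utf-8")).hexdigest(),
        "sample": " ".join(words[:2]),   # one or two words of the block
        "published": int(time.time()),   # claimed publication time
    }

# Payload a publisher might POST to a (hypothetical) hub when a page goes live.
payload = json.dumps([block_fingerprint("First paragraph of the freshly published page.")])
print(payload)
```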