
Google SEO News and Discussion Forum

Report a Scraper Outranking You - Matt Cutts tweet
netmeg




msg:4649833
 9:19 pm on Feb 27, 2014 (gmt 0)

Matt Cutts tweeted that they're collecting information on scrapers that outrank you for your own content. No answer yet on whether they're actually going to *do* anything about it, or just use the data in the algorithm.

Original tweet:
https://twitter.com/mattcutts/status/439122708157435904 [twitter.com]

If you see a scraper URL outranking
the original source of content in Google, please tell us about it:
http://bit.ly/scraperspamreport

Scraper report link:
https://docs.google.com/forms/d/1Pw1KVOVRyr4a7ezj_6SHghnX1Y6bp1SOVmy60QjkF0Y/viewform [docs.google.com]

So when is WebmasterWorld gonna process secure links? *My* calendar says it's 2014.

[edited by: Robert_Charlton at 10:05 pm (utc) on Feb 27, 2014]
[edit reason] added quote and sorta cheated to fix https links [/edit]

 

Robert Charlton




msg:4649847
 10:09 pm on Feb 27, 2014 (gmt 0)

No answer yet on whether they're actually going to *do* anything about it, or just use the data in the algorithm.

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Lame_Wolf




msg:4649869
 10:51 pm on Feb 27, 2014 (gmt 0)

Will that go for Pinterest too?

netmeg




msg:4649871
 11:04 pm on Feb 27, 2014 (gmt 0)

Probably not.

Lame_Wolf




msg:4649875
 11:18 pm on Feb 27, 2014 (gmt 0)

That's a shame, as it is one of the largest file-stealing sites out there.

RedBar




msg:4650073
 3:51 pm on Feb 28, 2014 (gmt 0)

I've given it a go, though I'm sure I've seen this before, a couple of years ago.

I'll be amazed if they do anything about it. In any case, with Google itself now being the biggest spammer and hotlinker, I need more than a few hundred pages a day to get me back to where I was before their image grab.

rogerd




msg:4650110
 5:52 pm on Feb 28, 2014 (gmt 0)

Google seems to be pretty good at de-ranking some scrapers. I had a site that had a bunch of articles copied by an apparently legit site - real business, real people. I couldn't find the copied content even with aggressive searching.

Maybe they needed a better scraping technique.

heisje




msg:4650111
 5:52 pm on Feb 28, 2014 (gmt 0)

Despite (reasonable) reservations, this sounds good. As always, the proof of the pudding is in the eating. The future will show.

engine




msg:4650113
 6:12 pm on Feb 28, 2014 (gmt 0)

There are two things here:

1. How will the data submitted be used?

2. How will it affect the original site?

If it's a thin content scraper, surely, it'll be a good thing to nuke.

jmccormac




msg:4650118
 6:38 pm on Feb 28, 2014 (gmt 0)

In my experience, applying the data to the algorithm is generally how Google does "*do*" things.

Now that's an absolutely terrifying idea because it could be true. :) Surely it would be easy enough for Google to identify scrapers?

Regards...jmcc

netmeg




msg:4650133
 8:06 pm on Feb 28, 2014 (gmt 0)

So then this happened:

[searchengineland.com...]

ork ork

brotherhood of LAN




msg:4650140
 8:33 pm on Feb 28, 2014 (gmt 0)

I had that joke pinged to me... To be fair, Wikipedia is downloadable via a tarball, so it's not really scraping. Wikipedia is also a 'part' of the Freebase collection, which Google owns.

Still, I'm sure there are 10^100 other examples, so the point is made ;o)

CaptainSalad2




msg:4650310
 11:59 am on Mar 1, 2014 (gmt 0)

I read that after the guy tweeted (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

If that is true, this is taking a much darker turn of events!

jmccormac




msg:4650326
 12:39 pm on Mar 1, 2014 (gmt 0)

Before or after the tweet? It is a bit unsettling though.

Regards...jmcc

turbocharged




msg:4650332
 1:09 pm on Mar 1, 2014 (gmt 0)

I read that after the guy tweeted (about Google scraping sites), all his sites received a manual penalty... (he tweeted this himself).

Better him than us! These days I do my best to avoid Google at all costs, as I'm sure many other people here do too.

Let me tell you how one of our clients handled scraper sites that stole his content and were outranking him. He reported all the scraper sites in Webmaster Tools and expected something to happen, but after 90 days he contracted an SEO company that handled reputation management. According to the client, the SEO company charged a good buck but got all those scrapers removed in about a week. How did they do it? The client said the SEO company email-spammed the scraper sites, and the hosts took the sites offline for TOS violations. A shrewd method, but for him it worked.

I can see these types of "vigilante justice" attacks increasing, as search engines like Google have given such a low priority to content theft. Will this latest action by Google do any good? Who knows, and who knows how the data will be used outside of its publicly stated purpose. Anyway, just the fact that Google is soliciting this information should confirm that their algorithm is flawed and incapable of determining the source of information without links.

CaptainSalad2




msg:4650338
 2:26 pm on Mar 1, 2014 (gmt 0)

Jacor, in the guy's Twitter feed he says, a couple of tweets after the original one, that his sites just had penalties... Scary stuff, and very sad if true; he probably has kids to take care of : /

graeme_p




msg:4650339
 2:37 pm on Mar 1, 2014 (gmt 0)

The problem with that tweet is that Google is taking content that it is allowed to take, either under normal copyright law (which allows copying small excerpts) or under the terms of its license, showing it on their site, and providing a link and attribution.

A scraper plagiarises (by not attributing or linking) and is copying in breach of copyright.

That is a huge difference.

Edge




msg:4650340
 2:40 pm on Mar 1, 2014 (gmt 0)

That's a shame as it is one of the largest file stealing sites out there.


Ever heard of Scribd?

Shepherd




msg:4650348
 4:16 pm on Mar 1, 2014 (gmt 0)

The problem with that tweet is that Google is taking content that it is allowed to take, either under normal copyright law (which allows copying small excerpts) or under the terms of its license, showing it on their site, and providing a link and attribution.


I'm going to disagree here. A scraper outranking the original content producer in a search engine is not, on the surface, a copyright issue. It's a "why is the search engine promoting the scraper instead of the originator" issue. Just because you can legally take information from Wikipedia, why should you outrank them for the exact same information that they originated? That's the issue here, not copyright.

Google, or any website, taking content, legal or not, and then outranking the originator for that same content is an issue in and of itself.

MrSavage




msg:4650357
 4:41 pm on Mar 1, 2014 (gmt 0)

Wasn't this news about 8 months ago? A scraper reporting tool? I'm just wondering what makes this different now than months and months ago.

Rosalind




msg:4650362
 5:17 pm on Mar 1, 2014 (gmt 0)

But Wikipedia itself is not a source. It's crowdsourced and manually written, but most of its content comes from other places.

tangor




msg:4650364
 5:35 pm on Mar 1, 2014 (gmt 0)

You would think that with all the indexing power G has that they could and SHOULD note when a new site comes into their index... and that content thereto... and henceforth show THAT site as the originator of that content. Thus any site duplicating that content is NOT the original source.

But that, of course, would be the Perfect World.
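
For illustration, here's a minimal sketch of the "first seen wins" registry tangor describes, assuming a simple content hash as the fingerprint (all names here are hypothetical, not any real search-engine internals):

```python
import hashlib
import time

# Hypothetical "first seen wins" registry: the first URL to present a piece
# of content is recorded as its originator; later copies are duplicates.
first_seen = {}  # fingerprint -> (url, unix timestamp)

def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial edits don't change the hash."""
    return " ".join(text.lower().split())

def record(url: str, text: str) -> str:
    """Return 'original' for content new to the index, else 'duplicate'."""
    fp = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    if fp not in first_seen:
        first_seen[fp] = (url, time.time())
        return "original"
    return "duplicate"

print(record("https://site-a.example/post", "Some article text."))  # original
print(record("https://site-b.example/copy", "Some article text."))  # duplicate
```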

EditorialGuy




msg:4650394
 8:26 pm on Mar 1, 2014 (gmt 0)

You would think that with all the indexing power G has that they could and SHOULD note when a new site comes into their index... and that content thereto... and henceforth show THAT site as the originator of that content.


That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

Site A can try to protect itself by submitting its new pages immediately, but unless it makes that effort (as many or even most sites probably don't), how is Google to know that site A published the page before site B did?

Fortunately, Google has other signals at its disposal to determine who should be ranked for what. If site B is a typical worthless scraper site and site A has any value at all, site A's version of the page should be able to rank higher simply because site A has more authority, more trust, better inbound links, etc.
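
A toy sketch of the tie-break described above: when crawl order can't establish authorship, choose among duplicate pages by site-level signals instead. The fields and weights below are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    authority: float    # hypothetical site-authority score, 0..1
    trust: float        # hypothetical trust score, 0..1
    inbound_links: int

def duplicate_tiebreak(pages: list[Page]) -> Page:
    """Pick among duplicate pages by site-level signals, not index date."""
    return max(
        pages,
        key=lambda p: 0.5 * p.authority
                      + 0.3 * p.trust
                      + 0.2 * min(p.inbound_links / 1000, 1.0),
    )

original = Page("https://site-a.example/article", 0.8, 0.9, 500)
scraper = Page("https://site-b.example/copy", 0.1, 0.2, 3)
print(duplicate_tiebreak([original, scraper]).url)  # site A wins
```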

ColourOfSpring




msg:4650418
 9:15 pm on Mar 1, 2014 (gmt 0)

Is it just me, or does Google just seem so lame these days? I mean, "report a scraper"? Excuse my French, but WTF? Are we really in 2014, or did I just dream the last 15 years and we're actually still in 1999?

turbocharged




msg:4650438
 10:09 pm on Mar 1, 2014 (gmt 0)

That argument might work if every new page or site were indexed at the same time. But let's say that site A creates a page on Monday, site B scrapes it on Tuesday, and Google happens to crawl and index the site B page before it gets to the site A page.

The argument still holds. We've had a number of WordPress-based client sites scraped. These sites are all configured to ping the search engines immediately after a new page goes live, and Google will crawl the page within minutes. Even years later, someone can scrape a page and it may outrank the original. Coincidentally, most of these scrapers are using a Google-owned property (Blogspot) to outrank the originals.
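
For reference, a minimal sketch of the kind of XML-RPC update ping WordPress sends on publish, using the classic weblogUpdates.ping call; the Ping-O-Matic endpoint is just an example, and a live call needs network access:

```python
import xmlrpc.client

# Sketch of the update ping a WordPress install sends when a post goes live.
# The endpoint below is Ping-O-Matic's public XML-RPC service; a real install
# pings whatever services are listed in its update-services settings.
def ping_update(site_name: str, site_url: str,
                endpoint: str = "http://rpc.pingomatic.com/") -> str:
    server = xmlrpc.client.ServerProxy(endpoint)
    # weblogUpdates.ping(siteName, siteURL) is the classic ping signature
    result = server.weblogUpdates.ping(site_name, site_url)
    return result.get("message", "")

# Usage (hypothetical site):
# print(ping_update("Example Blog", "https://example.com/"))
```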

ColourOfSpring




msg:4650444
 10:48 pm on Mar 1, 2014 (gmt 0)

Might I point out the obvious: Google still can't tell the difference between site A and site B when neither site is obviously mega-authoritative. And so we have Matt Cutts saying "hey guys, report to us when your page gets scraped and we'll see what we can do". So basically we're stuck in the 90s with duplicated content and questions of who the original author is. Progress isn't slow. It's slower than that.

netmeg




msg:4650448
 11:29 pm on Mar 1, 2014 (gmt 0)

Google can't win here - if they don't ask for examples, people howl and deride, and when they do ask for examples, people howl and deride. In many cases, the same people.

<shrug>

Life goes on.

ColourOfSpring




msg:4650449
 11:33 pm on Mar 1, 2014 (gmt 0)

if they don't ask for examples, people howl and deride


I didn't howl and deride when they didn't ask for examples (who did?). Seriously - who wants Google's index to be shaped by sporadic, manual reports of scrapers? By that logic, we might as well manually curate the entire web with a million human beings. Who ever said "Google, shape your algo around manual reports of scrapers!"? How can that realistically work, given the scale of the problem? The problem is perfect for algorithmic solutions; if it isn't, then we're never going to beat this.

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that. spammy.biz registered 3 months ago scraping and outranking authoritydomain.com registered in 1998 is a typical situation. I'm not just highlighting age here, but many other factors that Google strangely overlooks and ignores (or, more bizarrely, can't recognise). Clearly I assumed too much of Google's algorithm - more my problem than theirs :) - so back to the manual reports, I guess...

aakk9999




msg:4650451
 12:02 am on Mar 2, 2014 (gmt 0)

I expected them to weigh up site A's page and site B's page, and give site A's page the benefit of the doubt even if site B's page was indexed first - based on pertinent factors. Authority and all that.
(Emphasis mine)

And then there will be complaints that Google ranks brands.... as netmeg said, Google cannot win here.

spammy.biz registered 3 months ago scraping and outranking authoritydomain.com registered in 1998 is a typical situation.

Hopefully the dataset of reported scraper sites will confirm this and the algo will learn.

brotherhood of LAN




msg:4650457
 12:26 am on Mar 2, 2014 (gmt 0)

Yeah, can't say there's an obvious answer due to Google not knowing which came first. Even then, the first publisher may not be the true owner of the content.

Tedster had mentioned PubSubHubbub the last time I saw this topic covered. I like the idea of pinging a checksum and one or two words of a content block, as it seems low-overhead. A search engine adopting something like that would have to evaluate its snapshot of the web first, though.
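
A rough sketch of that low-overhead ping, assuming a hypothetical wire format of a SHA-256 checksum plus a couple of sample words per content block:

```python
import hashlib

def content_fingerprint(block: str) -> dict:
    """Build a low-overhead ping payload: checksum plus a couple of words."""
    words = block.split()
    return {
        "sha256": hashlib.sha256(block.encode("utf-8")).hexdigest(),
        "sample": words[:2],   # first two words as a cheap spot check
        "length": len(block),
    }

payload = content_fingerprint("The quick brown fox jumps over the lazy dog.")
print(payload)
# A search engine that already holds this fingerprint for an earlier URL could
# treat that URL as the first publisher without transferring the full text.
```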
