
Google SEO News and Discussion Forum

Matt Cutts Announces Scraper Site Reporting Tool
viralvideowall




msg:4355700
 10:29 pm on Aug 26, 2011 (gmt 0)

Looks like Google may FINALLY be getting a clue that scraper sites have become a problem. I'm sure the idiots there will come up with a new algorithm that punishes the wrong people like Panda did.

[twitter.com...]

@mattcutts Matt Cutts
Scrapers getting you down? Tell us about blog scrapers you see: [goo.gl...] We need datapoints for testing.


https://docs.google.com/spreadsheet/viewform?formkey=dGM4TXhIOFd3c1hZR2NHUDN1NmllU0E6MQ&ndplr=1

Google is testing algorithmic changes for scraper sites (especially blog scrapers). We are asking for examples, and may use data you submit to test and improve our algorithms.

This form does not perform a spam report or notice of copyright infringement. Use https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1 to report spam or [google.com...] to report copyright complaints.

Exact query that shows a scraping problem, such as a scraper outranking the original page:

[edited by: Brett_Tabke at 1:11 pm (utc) on Aug 27, 2011]

 

tedster




msg:4355707
 10:51 pm on Aug 26, 2011 (gmt 0)

From the new report form

Exact query that shows a scraping problem, such as a scraper outranking the original page: (Required) *
Copy the query from the Google search box

Wow - I like it.

Leosghost




msg:4355720
 11:21 pm on Aug 26, 2011 (gmt 0)

I posted the same on the other thread ..it is just as relevant here..

URL of specific scraper page: (Required)*

Anyone ( apart from Google ) have a list to hand of about half the URLs of the pages in Blogger..

They could just run their "algo" over their own properties ..and those running adsense ..to begin with ..and wipe out 90% of the scraper problem, without any form filling needed.

No one is going to go to the trouble of scraping content ..and then paying to host the scraper site ..unless they can monetise it ..and with adsense and blogger ..Google have enabled the easy monetisation of scrapers..and they host and/or enable the largest percentage of the scrapers..and have knowingly done so for years.

viralvideowall




msg:4355725
 12:01 am on Aug 27, 2011 (gmt 0)

They need to just nuke Blogspot and that will fix most of the headaches. It may take them about 9 years to figure that out.

outland88




msg:4355787
 3:15 am on Aug 27, 2011 (gmt 0)

I actually think they just began to run something in the past week. The problem is they aren't nuking any of the offenders, just giving credit to the original. Since it could paint a wide path, there might be some interesting changes in the results. Maybe Google is just looking for some feedback.

CovertSEO




msg:4355813
 5:59 am on Aug 27, 2011 (gmt 0)

Uh, so what happens in instances where the scraper reports the original producer as the scraper?

outland88




msg:4355818
 6:20 am on Aug 27, 2011 (gmt 0)

That's why they're asking for feedback or reports.

tedster




msg:4355824
 6:59 am on Aug 27, 2011 (gmt 0)

Remember that right before Panda, Google tried to do a "Scraper Update" [webmasterworld.com] - but any real effectiveness went completely south when Panda was rolled out.

chrisv1963




msg:4355825
 7:02 am on Aug 27, 2011 (gmt 0)

I think they just let out a big secret, this is part of what Panda is all about.


This makes sense. My most scraped pages and websites were hit badly by Panda.

The bad news is that Google needs our help because their great algo doesn't see the difference between the original and scrapers.

tristanperry




msg:4355831
 7:53 am on Aug 27, 2011 (gmt 0)

Interesting :) So by scraper I assume they mean sites where content is copied almost entirely, and not sites (even bigger ones) that take an article and re-write it sentence by sentence?

Anywhoo, I genuinely wonder whether Google are aware that the (in my experience) majority of scraper cases (that is, content stolen word for word) are hosted on Google's servers, via Blogger/Blogspot?

Either way, it's good to see them improve this. I wonder whether this new algo update will see a refresh of Panda's data.

tedster




msg:4355832
 8:03 am on Aug 27, 2011 (gmt 0)

I don't assume that. Scraping an article and then "spinning" it by rewriting each sentence is a practice that Google already tries to catch. There have been several Google engineers posting about it on their Webmaster Forums. So if you see such pages outranking the original, and you feel like reporting them, I'm pretty sure they would take in the information.

chrisv1963




msg:4355847
 9:42 am on Aug 27, 2011 (gmt 0)

Should we report eHow too?
I have several samples of eHow articles, copied and re-written by their "contributors".

shazam




msg:4355855
 11:35 am on Aug 27, 2011 (gmt 0)

The reporting tool may appear to be helpful and in many cases I think it could be, but it's also a tool that could easily be used by the less ethical among us. I seriously doubt that they are going to invest the time required to thoroughly investigate the true source of any given content.

Then we are stuck with a final verdict and absolutely no way of contesting or proving the truth. Good luck contacting g and actually getting a real live common sense human to respond to you.

Even after spending huge amounts of money on AdWords for many years, you cannot get a response. I stopped using AdWords for the most part because of this. All of the other major ad and media buying networks will give you access to an account rep after you reach a fair amount of volume.

Good luck contacting g when they determine that your content IS the scraper and bury your sites.

Brett_Tabke




msg:4355861
 1:06 pm on Aug 27, 2011 (gmt 0)

I think it is very welcome that Google is looking at this as a serious issue. It can only help website owners.

note: I nuked about 10 frivolous comments here. (eg: knock it off with the flaming - even when it is google. you can have strong opinions, but simple flaming for the sake of flaming isn't welcome)

chrisv1963




msg:4355862
 1:15 pm on Aug 27, 2011 (gmt 0)

Maybe Google should use the DMCAs submitted to them as data for the algo. DMCAs are reviewed manually by Google people and should be trustworthy data.

tristanperry




msg:4355872
 2:33 pm on Aug 27, 2011 (gmt 0)

Thanks for your thoughts tedster, it's good to know. If possible, could you post up (or PM me) some examples of Google engineers talking about scraping etc. on the Google forums? (I should frequent their forums more :)) - I would find them useful.

aristotle




msg:4355883
 4:05 pm on Aug 27, 2011 (gmt 0)

Tedster said:
I don't assume that. Scraping an article and then "spinning" it by rewriting each sentence is a practice that Google already tries to catch. There have been several Google engineers posting about it on their Webmaster Forums. So if you see such pages outranking the original, and you feel like reporting them, I'm pretty sure they would take in the information.



I wonder if this applies to Wikipedia. A couple of years ago someone created a new Wikipedia page and copied virtually all of its content from a much older page on one of my sites. Most of the information was copied almost verbatim, with only a few minor changes in the wording. Nothing new has ever been added to the page since then. My page is still number 1 in Google for its main term, but the Wikipedia page is number 2, and will probably move ahead of mine eventually.

jmccormac




msg:4355891
 5:32 pm on Aug 27, 2011 (gmt 0)

This is quite worrying. Google doesn't have the mindpower to deal effectively with scraping, so now it is, in effect, socialising the problem by getting the public and users to submit the details of scrapers. This is what led to the decline of many web directories when they had to rely on user submissions. It is a positive development in that it will solve a percentage of the problem; however, until Google manages to automate the process of detection, analysis and removal, it is still going to have a massive problem. At best, this is a start. At worst, it is the beginning of the end of Google as a search engine with a relatively clean index.

Regards...jmcc

tedster




msg:4355897
 6:08 pm on Aug 27, 2011 (gmt 0)

some examples of Google engineers talking about scraping etc


Here's one thread from Google's Webmaster Forums about "rewritten" or spun content. Google employee Wysz actually goes into one page sentence by sentence and shows the webmaster why Google penalized his entire network of 30+ sites with a -50 penalty:

Wysz: The images are hosted on buzznet.com, which has an article about the same concert...

If you read both articles, while the wording may not be exactly duplicate, there are very strong similarities.

First sentence from your site: "The Prince serenaded Leighton Meester during his concert at New York City's Madison Square Garden on Tuesday night (Jan. 18)."

First sentence from Just Jared: "Leighton Meester gets serenaded by the legendary Prince during his sold-out concert at New York City’s Madison Square Garden on Tuesday night (January 18)."

Note these phrases: "serenaded," "New York City's Madison Square Garden," "Tuesday night (Jan[uary] 18)"

Beyond the first sentence, note the similar order and structure:
1. Leighton was sitting in the front row.
2. Prince invited her to the stage.
3. "I Don't Trust You Anymore" was playing
4. She was smiling and laughing/giggling
5. There were other celebrities there.
6. She’s wearing a cute sweater.

[google.com...]


This guy should feel completely outed by Wysz. His answer essentially said "you think you can change a few words around and call that 'original writing'? We don't want that kind of thing on the first page of our results. And your sites did it enough that we penalized the whole batch."
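
For anyone curious how mechanical this kind of comparison can be, here is a rough sketch in plain Python - emphatically not Google's actual algorithm, just an illustration - of how much word-shingle overlap survives that kind of rewording, using the two first sentences quoted above:

import re

def shingles(text, n=3):
    """Lowercase word n-grams ("shingles") of a piece of text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: 0.0 = disjoint, 1.0 = identical."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = ("Leighton Meester gets serenaded by the legendary Prince during his "
            "sold-out concert at New York City's Madison Square Garden on "
            "Tuesday night (January 18).")
spun = ("The Prince serenaded Leighton Meester during his concert at New York "
        "City's Madison Square Garden on Tuesday night (Jan. 18).")

for n in (2, 3):
    score = jaccard(shingles(original, n), shingles(spun, n))
    print(f"{n}-word shingle overlap: {score:.2f}")

Two genuinely independent write-ups of the same event would normally share far fewer multi-word shingles than two versions of the same paragraph, which is the intuition behind flagging light "spinning".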

tristanperry




msg:4355901
 6:19 pm on Aug 27, 2011 (gmt 0)

Wow, that's pretty interesting. Thanks for the link/quote. It's good to know this sort of thing is definitely on Google's radar too (well, I know that's sort of obvious, but from all the stories from webmasters it sometimes feels like Google forgets about it).

chrism




msg:4355905
 6:34 pm on Aug 27, 2011 (gmt 0)

The interesting part for me there is the bit about the images, presumably used to determine the original author from the 'scraped and edited' copy.

jmccormac




msg:4355906
 6:35 pm on Aug 27, 2011 (gmt 0)

Actually, now that I think about it, there is method in Google's madness. A lot of the modern media depends on press releases for filler content, and you will see the same press release almost verbatim across a range of publications. If Google were to apply some kind of automated detection routine, it might end up nuking most of the news sites in its index.

Regards...jmcc

Planet13




msg:4355910
 6:53 pm on Aug 27, 2011 (gmt 0)

Ok, the paranoid huckster in me thinks this:

1) Set up a site that my competitors would want to steal content from.

2) Put a copyright notice saying that they are FREE TO USE THIS CONTENT On THEIR SITE if they are in a related business niche. (No need for a link back, even).

3) After they start copying the content, remove the copyright notice from my site and then report them to google as scrapers.

But seriously, folks: We have a few articles on our site that the original publisher said could be copied and distributed freely. Because they have appeared on other sites, I have noindexed them. But what happens if the original publisher changes their mind and decides to report me as a scraper?

tedster




msg:4355920
 7:49 pm on Aug 27, 2011 (gmt 0)

Well, the report form asks for the query terms where the scraped content outranks the original. If the legitimate copy is noindex, then it shouldn't rank at all.
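
As an aside, if you rely on noindex for legitimately reprinted articles, it is worth verifying that the directive is actually being served. Here is a minimal sketch using only the Python standard library (the URL is a placeholder and the meta-tag matching is deliberately crude):

import re
import urllib.request

def is_noindexed(url):
    """Return True if the page sends a noindex directive (header or meta tag)."""
    req = urllib.request.Request(url, headers={"User-Agent": "noindex-check/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag") or ""
        html = resp.read(200_000).decode("utf-8", errors="replace")
    if "noindex" in header.lower():
        return True
    # Crude but serviceable: find <meta name="robots" ...> tags and check them.
    meta_tags = re.findall(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I)
    return any("noindex" in tag.lower() for tag in meta_tags)

if __name__ == "__main__":
    # Placeholder URL - substitute the reprinted article you noindexed.
    print(is_noindexed("http://www.example.com/reprinted-article.html"))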

Leosghost




msg:4355922
 8:05 pm on Aug 27, 2011 (gmt 0)

@planet13
2) Put a copyright notice saying that they are FREE TO USE THIS CONTENT On THEIR SITE if they are in a related business niche. (No need for a link back, even).

There would be a "cache" at Google ..just because they say they don't offer a "cache" of a page.. and they say they respect "no cache" and "no index" on a page.. doesn't necessarily mean they do not keep one for themselves ..many things indicate that in fact they do keep "cached" versions of everything in separate indexes ..it just means that the "cache" is not available outside G.

Occasionally we have seen "glimpses" down the years of these G "internal use only" indexes and caches.

So they would have something (2) to compare ..with (3) ..and likewise if an original publisher changed their mind ..I assume that in the latter case G would consider it to be a "civil or commercial dispute" between yourself and the publisher, and thus would probably keep both versions in the SERPs until a court made a decision.

bw100




msg:4355932
 9:01 pm on Aug 27, 2011 (gmt 0)

I'm having some difficulty getting my head around the example (provided by Tedster, above), where Google employee Wysz critiques Benjy's website.

Google employee Wysz begins his critique, and criticism, with this:
First sentence from your site: “The Prince serenaded Leighton Meester during his concert at New York City’s Madison Square Garden on Tuesday night (Jan. 18).”

First sentence from Just Jared: “Leighton Meester gets serenaded by the legendary Prince during his sold-out concert at New York City’s Madison Square Garden on Tuesday night (January 18).”

Note these phrases: “serenaded,” “New York City’s Madison Square Garden,” “Tuesday night (Jan[uary] 18)”

Beyond the first sentence, note the similar order and structure:
1. Leighton was sitting in the front row.
2. Prince invited her to the stage.
3. “I Don’t Trust You Anymore” was playing
4. She was smiling and laughing/giggling
5. There were other celebrities there.
6. She’s wearing a cute sweater.
[google.com ]

That is fundamental formulaic reporting: the Five Ws: Who, What, When, Where, Why (and also usually How). Any reporter who covered this concert, and wrote about the moment, would have identified the same facts and probably utilized similar vocabulary and descriptive style.

Without regard to the merits of Benjy's specific website:
To suggest (as Wysz does) that the similarity of vocabulary and descriptive phrasing used by the journalists covering the event to describe Prince serenading Leighton Meester indicates scraping is at best a "stretch", definitely ludicrous and IMO patently misleading.
To apply this type of standard would mean that most of the thousands of journalists reporting the news are guilty of spam.
Hitchhiking on the comment by jmccormac above:
If Google were to apply some kind of automated detection routine, it might end up nuking most of the news sites in its index.

What's Google's new "rule": the first reporter/publication to break the story is the only original author/source, and all others are spammers?

nomis5




msg:4355939
 9:25 pm on Aug 27, 2011 (gmt 0)

Matt Cutts = Google PR man, so his posts don't = the obvious.
My take on this is that Google wants us to use the 'author' facility of rich snippets to prove ownership - this is not a one-off posting from Matt about the subject, and he will eventually get round to what Google will really use to prove ownership.
And the 'author' facility will eventually give Google two advantages. First, easy proof of ownership from their point of view, without the need to respond to your emails about violations. Second, if you participate in the ownership scheme, a whole pile more info about you and your site.
Think - Google exists on data, and if you don't participate then they will treat you as a second-rate site.
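
For reference, the 'author' facility here is the rel="author" markup Google had recently started supporting. A rough sketch (plain Python; the sample HTML below is made up) of pulling such links out of a page:

from html.parser import HTMLParser

class AuthorLinkFinder(HTMLParser):
    """Collects href values from <a> or <link> tags carrying rel="author"."""
    def __init__(self):
        super().__init__()
        self.author_links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link") and "author" in (a.get("rel") or "").lower():
            self.author_links.append(a.get("href"))

sample = '''
<html><head>
<link rel="author" href="https://plus.google.com/1234567890/">
</head><body>
<p>Article text... <a rel="author" href="/about/jane-doe">Jane Doe</a></p>
</body></html>
'''

finder = AuthorLinkFinder()
finder.feed(sample)
print(finder.author_links)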

miozio




msg:4355944
 9:59 pm on Aug 27, 2011 (gmt 0)

I will have to report over 100 sites per scraped page. Do I just write them in the field below, or do I need to invest a year of my time on this?

tedster




msg:4355945
 10:02 pm on Aug 27, 2011 (gmt 0)

Any reporter who covered this concert, and wrote about the moment, would have identified the same facts, and probably utilized similar vocabulary and descriptive style.

But not in the same essential order and structure, with the same side details and quirks. Seriously, look at the articles quoted and linked to from Google's forum. The guy was just spinning other people's content and his whole network got nailed.

Even though the final results of the Google algo may be out in left field for some queries at any particular time, it is wise NOT to think of Google as "stupid". These are some very savvy folks, and if they set their minds to handle an issue, sooner or later there will be penalties.

jmccormac




msg:4355951
 10:22 pm on Aug 27, 2011 (gmt 0)

But not in the same essential order and structure, with the same side details and quirks.
When it is a press release from a company promoting an event or a device, then that is exactly what would happen. Ever read a newspaper where you see AP or Reuters credited at the end of an article?

Even though the final results of the Google algo may be out in left field for some queries at any particular time, it is wise NOT to think of Google as "stupid".
I don't think of them as being "stupid". Some are smart, but that does not mean that people should have a fanboy attitude to Google's efforts to sidestep a problem because its employees just aren't quite smart enough to solve it efficiently. This is why I think that Google's attempted socialisation of the problem is a cop-out and tantamount to an admission of defeat.

These are some very savvy folks, and if they set their minds to handle an issue, sooner or later there will be penalties.
Really smart people would come up with a solution rather than just trying to apply penalties. Applying penalties does not solve the problem but then they probably have a patent where they claim it does.

Regards...jmcc
