| This 50 message thread spans 2 pages. |
|Should Google Tank the Crowd Sourced Content Scrapers?|
We know that Google looks unkindly on scrapers. This duplicated content competes against the individuals and groups who create it, and who often carry adverts from Google's own AdSense network. Sinking these sites to the bottom of the SERPs helps everyone: it helps Google offer credible results, helps people find the originators of the content, and keeps the parasites out of the game.
Should the newest parasites, crowd sourced content scrapers, be similarly halted in their tracks before it’s too late?
Crowdsourced content scrapers (Pinterest.com, Weheartit.com, Loveit.com, Ehow.com/spark) are experiencing a surge in popularity this year. Pinterest, in particular, is increasingly throwing its weight around in the SERPs.
The overwhelming majority of the content of these websites is an infringement on someone’s copyright; rare are the people posting original content on, say, Pinterest, where original content may not even amount to 1%, sitewide. For many, Pinterest results in the SERPs are a nuisance, a mere extra step to get to the source website; that is, if the source website is credited appropriately, and not mis-attributed to Tumblr, Yahoo Images, or Pinterest itself. Most people “googling” something want some text, not just pictures and a misleading link.
These crowdsourced content scrapers all use NOFOLLOW outbound links, except for Loveit.com, which might have to shut the door once spammers begin to exploit it. Nofollowed links do very little in the SERPs for the authors whose content is scraped.
Typically, content is scraped via a button that users install in their bookmark bar, making the scraping of third-party content a breezy, effortless, one-click affair.
Most of the scrapers create a page with the URL template contentscraper.com/source/yourwebsite.com, which often ranks quite highly for your keywords and for search queries that include your domain name. Some visitors may prefer to view your content on Pinterest rather than visit your link in the SERPs.
Early on, some webmasters were hyping miraculous referral traffic volume from these scrapers. Lately, there are reports indicating that rather than leaving the confines of the scraper to follow links to the source, scraper visitors tend to remain on the scraper. (http://adweek.com/news/technology/buzzfeed-report-publishing-partners-demonstrates-power-social-web-143194)
A minority of crowdsourced content scrapers offer unique, proprietary opt-out mechanisms.
<meta name="LoveIt" content="nolove">
<meta name="ehow" content="noclip" />
<meta name="pinterest" content="nopin" />
The proliferation of these tags forces content providers to watch constantly for new opt-out codes as they arise, and to keep updating their websites accordingly. Notably, these aren’t sitewide htaccess commands; they need to be added to every single web page. Not everyone has dynamic content!
There are, of course, tricks to block these crowdsourced scrapers with htaccess, or to substitute a copyright warning for the scraped image, but Ehow’s Spark grabs a screenshot of the browser display (stealing both images and text) and is the ultimate stealth scraper. The act of someone scraping your content with Ehow's bookmark tool is undetectable in web logs, and therefore unstoppable in htaccess.
DMCA take down notices, which were once practical against conventional scrapers, are obsolete against the army of crowdsourced content scrapers, whose users scrape content feverishly and round the clock.
Should Google level the playing field and severely penalize these crowdsourced, copyright infringement and duplicated content machines?
Or should Google allow them to rise into greater prominence, as they might under current algos?
Man, google loves to have such properties...they would love to buy them...and would probably consider tanking them only if they don't oblige...
Google and several of its properties are crowd sourced, you know...
[edited by: indyank at 4:59 pm (utc) on Sep 5, 2012]
Google should simply ban them. These businesses are based on violating copyrights and stealing other people's content. Everyone knows it and Google knows it. It's about time they take action. Maybe after Panda and Penguin it's time for a "Copycat" algo update?
Pinterest gained popularity without Google. If it's true that Pinterest ranks well then it is ranking well because of link equity and other metrics of popularity. Furthermore, Pinterest has a value-add. It's not just the scraped content, it's the community. Just to make sure it's clear: Pinterest has a value-add, several actually, but perhaps the most important value-add is community.
I'm not defending Pinterest or trying to justify its place in the SERPs. I'm simply pointing out the flaw in the premise underlying this discussion. Pinterest has a value-add, push-button scraper sites do not.
Think about movie trailer sites. Many if not most of the sites that rank for movie trailers have value-adds. The content, movie trailers, are pretty much exactly the same. The differentiator comes in the value-add.
Pinterest got to where it is through marketing and the popularity of its community. Scrapers do not offer community or any other value-add. Huge difference between the two.
Pinterest is not in the same class as pushbutton scraper sites. Anyone who chooses to view Pinterest in those terms is willfully closing their eyes not just to the truth/reality but to an opportunity of learning something about successfully marketing a website.
[edited by: martinibuster at 5:20 pm (utc) on Sep 5, 2012]
Would Pinterest ever have gained the links, popularity, and 'community' without the scraping aspect of their business?
Business and web business model of the 2010's: do whatever it takes to get big, whether barely skirting the rules or breaking them. Then enjoy the fruits of your 'labor'. I'm not even saying that's bad. It's just important to understand the playing field.
Google just wants to show popular websites, and not judge their business model. I don't blame them for that.
|and not judge their business model |
Theft is not a business model ...
|Theft is not a business model ... |
Set your personal feelings aside and view the truth. If what you said was true then Pinterest would have been DMCA'd out of Google's index and no web host would take them for fear of having their Safe Harbor protections removed. There is a thing called Fair Use [copyright.gov] that distinguishes fair use of an original work from theft.
Any insistence on viewing Pinterest's model as theft is closing one's eyes to the truth. Choosing to believe an untruth is called willful ignorance. Don't go there. Set your personal feelings aside and see things as they are. Review the link I posted.
|There is a thing called Fair Use |
I don't want to nitpick and I agree with your point of view, but using high resolution images without permission from the copyright holder is infringement.
Wikipedia's usage of copyrighted photos is within fair use (ie. 250x250px). Google's usage of thumbnails in image search results might also be fair use. An 800x600px photo does not fall under fair use.
When the originator of an image is no longer profiting [as much] from their works because somebody else has infringed upon his/her intellectual property, then it may be viewed as a form of theft.
Google IS a crowdsource gatherer of data, it's being built right into their search results pages now (a.k.a. "the knowledge graph"), so it would be extremely hypocritical of them, don't you think?
That being said of course they'll slam all but the biggest and most likely Wall st backed companies that do it. What do they care.
I don't think finer points of US copyright laws should even enter into this discussion. The content described by OP is by definition not original, sounds like a no-brainer that it should not be ranking [much]. We have to assume the content was acquired legally (else a site as prominent as Ehow would bleed dry through lawsuits) and yet 'legal' does not mean 'good quality' or even 'proper'. Should not rank ahead of the source, period.
It would be simple to deal with this manually, I think, but G is not willing to do it. And when you become as large as eHow or Pinterest, you probably overwhelm all the fine algo tweaking factors Google might throw at you. Perhaps there's a point where only manual intervention will do, and it's never going to happen - it took eHow all of 4-5 months to recover and come ahead from the first Panda hit in 2011 which was supposedly aimed squarely at them (by name!) - a content farm.
|We have to assume the content was acquired legally (else a site as prominent as Ehow would bleed dry through lawsuits) |
I also agree with you, 1script. But (IMO), sites like Pinterest were created outside of the spirit of the law. Precedent has been set with services like Kazaa, Napster, Megaupload, etc.: These sites were created specifically as a place to distribute copyrighted content. How is Pinterest different? Do they produce any real content themselves?
Services like YouTube, forums, image hosters, Facebook and other user-generated content sites were created with the intent of being a platform, not as a source of content (though we could debate the intentions of the aforementioned).
PirateBay was created as a place to distribute copies of music, videos, software they don't own. Pinterest was created as a place to upload and share content they do not own. Ehow has always been a place to distribute and profit off of content they didn't create (it's virtually all rewritten content -- precedent (against this practice) has been set on that topic several times in US law).
YouTube is a platform to share videos, but they have implemented sophisticated technology to disallow sharing of copyrighted media and have a very straightforward DMCA policy (though I digress). Forums are generally a place for people to discuss facts and opinion. Image hosters seem to be used for sharing screenshots and other random photos to share on forums. Facebook is a social networking platform that was created without the intention of distributing other people's intellectual property, and UGC sites are much like forums.
There is a huge difference between protection under DMCA and the intent of profiting off of the intellectual property of others. I hate to drag this on, but these are very valid points.
But to the point: I think Google profits from sites which have no regard for intellectual property rights. This can be seen by the constant appearance of torrent sites which are nothing more than pirates or those profiting from piracy. Most of them aren't even well-designed, encourage malware distribution, have tons of pop-up ads and are generally harmful to both users and copyright holders... but users love free stuff and thus they thrive in Google. Wasn't Google's motto "Don't be evil"? Again, I digress.
The "don't be evil" [unofficial] motto has all but disappeared. If free stuff and gimmicks make their search engine more popular, then Google will continue dealing with the devil and creativity will dwindle.
Google wasn't always like this.
[edited by: Andem at 11:06 pm (utc) on Sep 5, 2012]
|else a site as prominent as Ehow would bleed dry through lawsuits) |
The fact that such sites haven't been sued to pieces doesn't prove much. The cost of legal action is out of reach for many people whose original content has been, um, inappropriately borrowed.
This is more of a discussion against plagiarised content rather than stolen content.
I personally feel that when something is substantially plagiarised it should be removed and treated as copyright theft, but I do not feel confident enough to send a Copyright removal request in case it was challenged. Also what happens if two or more sources are used? When does this become legitimate research?
I have had many pages plagiarised and often these copied pages outrank me. I have stopped writing articles on my site now because there is just no return anymore - I'm fed up with people taking my knowledge and research and experience and making money off it when I can't even get fresh pages to appear in Google's SERPS!
Eventually the real content writers will be forced to give up like me, and the scrapers and plagiarisers will have no one left to copy.
I have not been hit by any of the recent updates so am trusted by Google yet new pages still do not rank or show up?
IMO - Google should boot all UGC sites from the first page of listings, they just don't belong there. Niche sites should always rank higher than the large corporate "we'll cover everything in the world" sites.
|I'm fed up with people taking my knowledge and research and experience and making money off it |
Today I received an email from a guy in an Asian country who was offering to sell links. He had "great packages" available: on topic links on his site to my site to make my website "healthier". Big surprise when I had a look at his site: at least 10% of it is built around images copied from my site. How stupid (and rude) can you get?
Of course the worst content thief of all is Wikipedia. They take everything in your article, make slight changes in the wording, then create a new Wikipedia page out of it. Within a short time this new Wikipedia page replaces your page at the top of the SERPs and starts taking most of the traffic.
|Niche sites should always rank higher than the large corporate "we'll cover everything in the world" sites. |
There's a post in Supporters from 2003 (Depth of Content versus Themed Content [webmasterworld.com]) where Mike Grehan* interviewed a Google engineer [e-marketing-news.co.uk]. The engineer indicated Google preferred sites that had depth in a topic. Not necessarily that the entire site was themed around the topic, though. The reason for that is Google might overlook quality content because the site had breadth of topic. Here's a quote from Mike Grehan's interview:
|Mike: For the last edition of my book, one of the things I wanted to dispel was the notion of themed web sites. By that, I mean the idea that people had about trying to develop your entire web site around a couple of key words. You know like, every page has to be about "blue widgets" and the domain should be "blue-widgets.com" yada, yada... I think it was nothing more than SEO propaganda the whole thing - what are your thoughts? |
Daniel: I think people sometimes mean different things by "themes." The statement above -- that somehow your blue widget site would be "weaker" if it contained a page about Tigers - is completely wrong.
No search engine would want to do that; having a page on Tigers doesn't affect your ability to be a resource for blue widgets. We'd miss good blue widget pages if we excluded the sites that also talk about Tigers.
However, there is a difference between "having a little bit of content about blue widgets" and "having in-depth content about blue widgets." Clearly we prefer in-depth (more useful) content. That's not so much a preference for themes as a preference for depth. "Utility" and "depth" really should be measured by a site's users.
*For those who may not know the name, Mike Grehan is an SEO pioneer who has been in the business pretty much since its inception and currently produces SES conferences.
Are crowdsourced scrapers providing deep content? It's all duplicated image content, but with "likes" and "repins" or "relove" or "reclipped" or whatever lingo - plus some followers admiring your good taste. There is room for comments from the human scraper volunteers, but when they do bother, it's extraordinarily shallow. "I like these shoes." OK.
Ehow's Spark is trying to distinguish themselves from that model by grabbing either text or image, or both at once. Because the content grab is silent in the website's log, I suspect that they are making an image from areas of the user's screen display. Solidifying this theory, the text that is grabbed is displayed as an image. Technically, while they are infringing on more content than the image-only crowdsourced scrapers, it may appear to search engines' image recognition algos that these are new and fresh images when they combine both text and image, or have text only.
I think Google/Bing should decline indexing Spark altogether ;-)
We are seeing scraper sites like <snip> take copyright art from POD <print on demand> sites and post them on their site with Pin It buttons. This is without the permission or knowledge of the copyright owners. When these images are pinned by someone else they lead back to the scraper site with no way to reach the original source.
When the artist files a DMCA, the scraper site institutes a redirection of future searches to block the copyright owner from finding their work.
[edited by: Robert_Charlton at 6:19 pm (utc) on Sep 8, 2012]
[edit reason] No specific domains per forum Charter [/edit]
Scraper sites and crowdsourced sites are the kudzu of the internet.
Hardly, kudzu has some beneficial ( medicinal ) effects..
Scraper and "crowd sourced" ( mass copyright abuse ) sites have none..
While kudzu may have some beneficial properties, I was referring to its growth rate. In the south we know if you stand more than a few minutes within 5 feet of kudzu it will cover you up! And you dare not sleep anywhere nearby.
[edited by: Robert_Charlton at 6:35 pm (utc) on Sep 8, 2012]
[edit reason] removed specifics [/edit]
|The proliferation of these tags forces content providers to watch constantly for new opt-out codes as they arise, and to keep updating their websites accordingly. Notably, these aren’t sitewide htaccess commands; they need to be added to every single web page. Not everyone has dynamic content! |
Standard bot blocking techniques and anti-hotlinking for image leeches, both of which have been around for many years, tend to be quite effective to thwart this crap before it starts.
Problem is that people are only REACTIVE to the issue of scraping, then waste their time with DMCA notices and all sorts of time and money wasting nonsense like Copyscape, etc. to locate repurposed content.
The best defense, which requires ZERO opt-out codes, is standard bot blocking which is a proactive method of content control, vs. the more costly and time consuming reactive alternatives.
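The anti-hotlinking idea mentioned above usually comes down to checking the Referer header on image requests. A minimal sketch of that check in Python, not a drop-in htaccess rule: the host names and the function name `is_hotlink` are hypothetical, and treating an empty Referer as allowed is an assumption you may want to reverse.

```python
from urllib.parse import urlparse

# Hypothetical hosts that are allowed to embed our images (assumption).
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_hotlink(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True when an image request appears to come from another site."""
    if not referer:
        # Many browsers and privacy tools omit the Referer header, so an
        # empty value is treated as allowed rather than as a hotlink.
        return False
    host = urlparse(referer).hostname
    if host is None:
        # Malformed Referer value: treat it as foreign.
        return True
    return host.lower() not in allowed_hosts
```

A check like this only catches requests that arrive with a foreign Referer. It does nothing against a tool that uploads its own copy of the image to another server.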
Besides, if you aren't already embedding your website URL in your images, you're missing free advertising opportunities with some of these sites. These busy little crowd source scrapers are freely spreading your content far and wide, and if you haven't properly tagged your content with watermarks, URLs, meta data, etc., then you're just losing out on spreading the message for free. What's more, while some of your content may be ranking above your own domain name, it's probably ranking above your COMPETITORS as well! Don't forget, just because you do a DMCA take down doesn't mean your content will rise in the index: the site with all the juice that ranks above you will most likely continue to rank above you, but WITHOUT your content. So put your URL in that image NOW!
Also, you don't need a dynamic website in order to insert tags site wide as you can easily add a PHP handler to all HTML pages in htaccess and then run a prepended script that loads the static HTML page, inserts the meta tags, and spits out the updated page. Quite trivial to implement really.
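The post describes a PHP handler that prepends a script at request time. The insertion step itself is simple enough to sketch. The version below is in Python rather than PHP and only shows the string rewrite, which could just as well be run once over the static files. The function name `add_opt_out_tags` is hypothetical, and the tag list is copied from the opt-out tags quoted earlier in the thread.

```python
import re

# Opt-out tags quoted earlier in the thread; extend the list as new ones appear.
OPT_OUT_TAGS = [
    '<meta name="LoveIt" content="nolove">',
    '<meta name="ehow" content="noclip" />',
    '<meta name="pinterest" content="nopin" />',
]


def add_opt_out_tags(html, tags=OPT_OUT_TAGS):
    """Insert any missing opt-out meta tags right after the opening <head> tag."""
    missing = [tag for tag in tags if tag not in html]
    if not missing:
        return html  # nothing to do, so the function is safe to run repeatedly
    block = "\n" + "\n".join(missing)
    # Match <head> or <head ...attributes>, case-insensitively, first match only.
    # A page without a <head> tag is returned unchanged.
    return re.sub(
        r"(<head\b[^>]*>)",
        lambda m: m.group(1) + block,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

Keeping the tags in one list means a newly discovered opt-out code is a one-line change per site rather than an edit to every page.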
Of course only the watermarking part of the above defeats screen grabs..the rest ( inserting URLs, meta data, meta tags etc etc ) get stripped with just a 1 pixel crop on the image by the scraper..although the anti-hotlinking should be mandatory..
You meant "white listing" Bill surely ? ( you usually do ), "bot blocking" requires adding them to a block list faster than the scum can invent the scraper bots...not really feasible..
Whether or not that "publicity" is good for me is something that is mine to assess, and I have assessed that it is NOT good publicity at all. It's good publicity for the scraper. I had lost about 10% of my traffic prior to DMCA take-downs (over 5000), and this was recovered after the content was removed. THIS MAY BE COINCIDENCE. But that's the data I got.
Crowdsourced scrapers aren't hot-linking, but uploading to their servers. They are not bots and aren't bound by bot conventions. Hotlink protection and robots.txt aren't going to help.
I am curious about your trivial prepended script. I'd very much like to know how this is done, as it would give me something that I only have to do once for each of my 20 websites every time I find a new crowdsourced scraper. It's a nuisance, but it's more within the reach of reasonableness.
|They are not bots and aren't bound by bot conventions. Hotlink protection and robots.txt aren't going to help. |
Who said anything about robots.txt? That's just a suggestion for good bots and many don't obey it anyway but that's another discussion for another day.
The humans are directing single purpose scraper bots/tools to make copies of stuff. Those tools don't always ID themselves as browsers, and even if they do, most often they can be stopped from copying things from sites without permission. You can't copy images off some of my servers unless you jump thru serious hoops; you REALLY have to want to steal it badly.
Not only that, you can build web pages that trick people and those tools. A prime example of my favorite trick is to make the image the BACKGROUND image and then use a transparent 1x1 size pixel which is resized to fit the image area (table cell, div, etc.). All they end up copying is blank transparent images. Not to mention all the tricks to disable the right mouse button and also disable view page source hot keys, etc.
|( inserting URLs, meta data, meta tags etc etc ) get stripped with just a 1 pixel crop on the image by the |
I'm not sure how stripping 1 pixel strips a URL typed across the image, 'splain it to me.
Must be confusing URLs on the image with meta data URLs in the image, totally different.
... and bot blocking encompasses both whitelisting and blacklisting.
I'm pretty sure that Ehow's Spark takes a screenshot of your browser display - if so, the image background substitution trick won't work.
Pinterest does identify itself in User_Agent when a pin is made, so it can be beaten in htaccess.
Loveit.com calls up an image from one of its servers, so it's identifiable, so it can also be beaten in htaccess.
Spark? I have no clue how to beat that one. I am stumped.
I do think search engines should ponder giving these websites a taste of the SERP gutters.
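The User_Agent approach mentioned for Pinterest above can be sketched as a simple substring test. This is an illustration in Python, not an htaccess rule. The token "pinterest" is an assumption about what the pinning request sends, so check your own logs for the real value. The name `should_block` is hypothetical.

```python
# Substrings assumed to appear in the User-Agent of pinning requests.
# "pinterest" is a guess at the token; verify it against your own logs.
BLOCKED_UA_TOKENS = ("pinterest",)


def should_block(user_agent, tokens=BLOCKED_UA_TOKENS):
    """Return True when the User-Agent contains any blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in tokens)
```

This is a block list, so it only stops tools that announce themselves. A screenshot-based tool that never requests anything from your server would not show up here at all.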
|I'm not sure how stripping 1 pixel strips a URL typed across the image, 'splain it to me. |
|Must be confusing URLs on the image with meta data URLs in the image, totally different. |
As you mentioned watermarks before URLs and meta data in your list of tactics
|watermarks, URLs, meta data, |
I naturally took the URLs and meta data you were referring to as being "in the file" not "on the file"..anything "in the file" is stripped by a one pixel crop..
On the file is "watermarking" whether the watermark is a URL or a phrase or a logo or a copyright symbol or any combination of them ..your website URL is the best one to put ( may get you traffic )..and the copyright symbol with the date ( gives notice to the cataclysmically dumb or thieving scum, that it's not public domain )..I have a script that prevents screen caps and all other ways of getting an image short of photographing the screen ..runs via jscript ( and jscript must be enabled to serve the images that it protects;)..plus the script itself is hidden and encrypted..heavily..run it with anti-hotlinking and my images are protected against everything bar taking a photo of the screen..
But I quit using it, many years back.. ( it has to take so much control of the browsers to nullify all possible copying actions that I actually was to all intents and purposes writing malware ( as it was indeed identified by a few AVs;)..plus it required the browsing machine to be rebooted before they got their cut and paste or right click back again after arriving on my page(s) ..not a good experience for the non larcenous visitors ;)
In the light of spark ( and doubtless the clones of it to come )..I may revisit it and tune it so as to not have the "reboot.. before you get most contextual menu commands and screen shot / print screen capability back" effect when it runs..
Basically this tells it that eHow's Spark was already loaded so when you click on the button in the toolbar it refuses to load it, it's dead, defunct.
This method could probably be used to stop any tool loaded in a toolbar without mucking up the meta tags. The only issue is it's not a supported method by the vendor so they can change that variable sparkLoaded to something else and it won't work, but it's an easy cat and mouse game to keep on top.
FWIW, that was much easier than my last frame buster script!
Kindred minds :)
Given that they will keep "spark" in the name in some form for a while ..it may be possible to spot the presence and even the name of the toolbar / clipper / whatever on the browser ( should work for other companies' "scrape" tools that run as browser add-ons too ) and either load / write scripts ( possibly on the fly ) to disable them ..or redirect the visitor who has them on their browser to "something else"..
"include" on every page..and like you say ..site wide that particular vector is stymied..
Something to while away some time..
A thought just struck me ..way to freak out the users of these scraper tools ..use one of the many "shake the page" ( including the text ) script commands, if they are detected using any of these things..:)
|Theft is not a business model ... |
The scrapers are always quick to point out it's not theft; the politically correct term is 'infringement', since no physical goods are actually stolen.
I call it a load of crap.
Don't tell Google that theft isn't a business model. They've made billions scraping all our sites without permission. I don't have a problem with the search engine itself but the cache pages and screen shots are when it went from being a harmless utility to infringement IMO. They also make a ton off of AdSense scraper sites. Plus there's that YouTube thing which was built on the back of 'infringed' content.
I could go on and on, but theft, er um, infringement has been a very viable business model to date.