Forum Moderators: Robert Charlton & goodroi


Website scrapers & noindexed content


silentneedle

11:26 am on Sep 23, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



So I have multiple websites with user-generated content (UGC), which I noindex when it isn't substantial enough, since Google suggested that to avoid the big bad Panda. Now I've found website scrapers out there that copy my whole site (of course they modify the canonical tag to point at their own URLs). They seem to rank quite well with the pages that are noindexed on my site, they've apparently been doing this for over a year now, and they seem to receive a good amount of traffic.
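For reference, the two directives involved look roughly like this (the URLs are placeholders, not real sites):

```html
<!-- On the original page: keep it out of Google's index -->
<meta name="robots" content="noindex">

<!-- What a scraper typically rewrites: the canonical tag on the copy
     now points at the scraper's own URL, claiming the page as theirs
     (scraper-site.example is a placeholder) -->
<link rel="canonical" href="https://scraper-site.example/copied-page">
```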

Do you guys experience similar issues? I'm a little bit lost: I don't want to index said content on my pages because I don't want to get Panda'ed, but it's frustrating that those thieves are making money with my users' content.

I already sent takedown requests to the hosting company (Chinese -.-) and to Google directly, but they don't seem to care at all.

keyplyr

1:42 am on Sep 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know whether you should release those noindexed pages to Googlebot for indexing. It's difficult to tell whether those pages would rank well on your site or drag it down. Maybe you could test a couple of pages and see how that goes?

Going forward, you should consider proactive steps to stop scrapers by using Blocking Methods [webmasterworld.com]
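As an illustration of the kind of blocking those threads describe, here is a minimal Apache `.htaccess` sketch that refuses requests by user agent. The agent names below are examples of well-known site copiers, not a definitive list; substitute whatever actually shows up in your own access logs:

```apache
# Minimal user-agent blocking sketch (Apache mod_rewrite).
# HTTrack / WebCopier / SiteSnagger are example copier agents only;
# replace with the scrapers identified in your own logs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|SiteSnagger) [NC]
RewriteRule .* - [F,L]
```

Determined scrapers spoof ordinary browser user agents, so this only catches the lazy ones; IP/ASN blocks and rate limiting cover more ground.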

Many of these malicious agents have been identified and discussed in the Search Engine Spider & User Agent ID Forum [webmasterworld.com]

silentneedle

12:07 pm on Sep 24, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



Any idea whether Google would see my noindexed content as a copy because the scraper's site got indexed first, even though Googlebot visited my noindexed pages first?

Selen

3:56 pm on Sep 24, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



That's the 800-pound gorilla in the room that nobody wants to discuss. The reality is: if you block Google/Bing via robots.txt or noindex pages, your site/pages will be stolen by others and they will enjoy your work for free. Googlebot/Bingbot will index the first site that doesn't block them, so your noindexing is not going to help your site (but it can hurt it, because you'd technically be hosting a duplicate of your own stolen content :)

silentneedle

4:11 pm on Sep 24, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



Great, so I have the option to index those low-quality pages and risk getting hit by Panda, or keep them noindexed and lose traffic to scrapers while also risking my users getting phished. What the actual #*$! is Google doing here?

JesterMagic

4:45 pm on Sep 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If your site is big enough, it is going to get scraped. I block what I can (automated and manual), but there are always going to be ones that fly under the radar until you do an exact-match search of your content. I used to go after the web hosts to take down the content, but with Cloudflare and the like it now gets really time-consuming.
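The exact-match search mentioned above can be semi-automated. A rough sketch (the function name and the Google search URL format are illustrative, not any official API): pick a few distinctive sentences from a page and turn each into a quoted search query you can check by hand.

```python
import random
import urllib.parse

def exact_match_queries(page_text, n=3, min_words=8, seed=42):
    """Pick a few distinctive sentences from page text and turn them
    into quoted exact-match search URLs for manual scraper checks."""
    rng = random.Random(seed)
    # Keep only sentences long enough to be unlikely to match by chance.
    sentences = [s.strip() for s in page_text.split(".")
                 if len(s.split()) >= min_words]
    picks = rng.sample(sentences, min(n, len(sentences)))
    # Wrap each sentence in quotes so the engine treats it as an exact phrase.
    return ["https://www.google.com/search?q=" + urllib.parse.quote(f'"{s}"')
            for s in picks]
```

Running a handful of these per page against a few search engines is usually enough to surface a large-scale scraper.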

Now I mostly just use the Google Content Removal Tool if I see some of my content in search results. I don't really actively look for it, as I'd rather spend my time updating my site.

The main problem with Google's Content Removal Tool is that you have to submit one URL at a time, and most scrapers copy hundreds of my pages. Why can't I do this by domain? It doesn't take a genius to see who has copied whom: compare the two sites and you'll find one that has been around for almost 20 years and another that is under a year old with the same content, but with piles of broken links and ads inserted all over the place.

silentneedle

9:35 am on Sep 25, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



My biggest site receives about 900k visitors daily, and the scrapers have about 800k pages indexed from it. It's sad to say, but my best option would be to hire someone to DDoS the scrapers' servers. There would be too much risk of getting hit by Panda if I removed the noindex from my sites, because there are a lot of single-phrase pages (Q&A-like). But nice to see that the scraper ranks well with them. Thanks, Google; this stuff gets frustrating.

iamlost

7:57 pm on Oct 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



JesterMagic: Why can't I do this by domain?
Why does Amazon co-mingle product?
Note: the foregoing is a rhetorical question.

Back to OP concern:
Regrettably it comes down to a business decision.

I actively block bots, and I actively send out DMCA and other legal requests. While the bot blocking is mostly automagical, the rest most definitely is not, involving my law-type-person and a significant monthly bill. However, as my sites/content are covered by registered copyright and enough scrapers are within legal reach, to date she usually rakes in more than she charges on an annual basis. :) YMMV.

Note: a legal judgement that is ignored may restrict travel/immigration plans - a club that can work wonders in some situations.

Many/most webdevs do not see the value in the extent of my efforts. In this as in much else.

However, the available remedies are really only, in order: ignore, block, request removal, legal action.

TorontoBoy

10:05 pm on Oct 11, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Letting bots run wild might have an adverse effect on your server resources and stats. I'd get right on killing them off.
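One rough way to gauge that impact before deciding what to block: count requests per user agent straight from the access log. A minimal sketch, assuming Apache/Nginx combined log format where the user agent is the last quoted field (the function name is mine, not from any tool):

```python
import re
from collections import Counter

# In combined log format the user agent is the final quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def top_user_agents(log_lines, n=10):
    """Count requests per user agent in combined-format access logs,
    to spot bots hammering the server."""
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)
```

Anything near the top that isn't a real browser or a search engine you care about is a candidate for blocking.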

As for complaining to the Chinese companies, well, good luck with that. I doubt they will do anything for you.

keyplyr

2:54 am on Oct 14, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If they can't get it, you don't need to get it removed.

Blocking Methods [webmasterworld.com]