
Best way to keep scrapers out?

     
8:40 am on July 4, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2000
posts:506
votes: 2


Hi, I have problems with scraper sites, and with my latest site, which has a database of close to a million entries, I am particularly worried about being scraped. I was thinking of setting up some honeypot links pointing to a PHP file that writes the visitor's IP to my htaccess file. The problem is that I am not sure this will keep Googlebot and the other good spiders out 100% of the time. Robots.txt and nofollow are not 100% reliable, are they?

"A robotted page can still be indexed if linked to from other sites
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely)"

Obviously password protection would also stop the scrapers from ever reaching the trap and getting written to htaccess, and noindex only takes effect after the page has already been crawled.
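For reference, the sort of trap I had in mind is roughly this. It is only a sketch: the .htaccess path is made up, and it still needs some way to keep the good spiders out, which is exactly the part I am unsure about:

<?php
// trap.php - the hidden honeypot links would point here
// (the .htaccess path below is made up; adjust for your own setup)

$ip = $_SERVER['REMOTE_ADDR'];

// only write something that actually looks like an IP address
if (filter_var($ip, FILTER_VALIDATE_IP)) {
    // Apache 2.2-style line; on 2.4 you would append "Require not ip ..."
    // inside a RequireAll block instead
    file_put_contents('/home/example/public_html/.htaccess',
        "Deny from $ip\n", FILE_APPEND | LOCK_EX);
}

// give the visitor nothing useful
header('HTTP/1.1 403 Forbidden');
exit;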

What do other people do?
8:49 am on July 4, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12334
votes: 805


Hi maccas,

You need to educate yourself about User Agents. What some sites see as bad bots or scrapers, others see as beneficial.

We discuss this daily in the Search Engine Spider & User Agent ID Forum [webmasterworld.com]

There are several Blocking Methods [webmasterworld.com] many webmasters use to stop the bad bots.

A large number of bad bots come from Amazon IP ranges [webmasterworld.com] and other Server Farm IP Ranges [webmasterworld.com]

- - -
11:33 am on July 4, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 35


From your raw access logs, available in cPanel, find out who is scraping you and then ban them in your htaccess. If you can ban the worst offenders, your server load will drop markedly. This needs to be done on a regular basis, as bots might be 60% of your traffic. The risk of not doing anything, beyond having your content stolen and reused by others, is that you exceed your server's capacity and your site starts returning a "busy" 503 or 504 instead of serving up your web pages.
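To give an idea, once the worst offenders are identified the bans themselves are simple. Something along these lines in your htaccess does it (the IPs are placeholders, and the syntax depends on whether you run Apache 2.2 or 2.4):

# Apache 2.4
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>

# Apache 2.2 (mod_access_compat)
order allow,deny
allow from all
deny from 192.0.2.10
deny from 198.51.100.0/24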

Project HoneyPot [projecthoneypot.org...] is a good place to go; I have used them and had good success.
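They also offer an http:BL DNS lookup you can query at request time. From memory it works something like this, but check their documentation for the exact response format; the access key is obviously a placeholder:

<?php
// rough http:BL sketch - the key and thresholds are placeholders; double-check
// the response codes against the Project Honey Pot docs
$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'youraccesskey';

// the lookup host is your key plus the IP with its octets reversed
$query  = $key . '.' . implode('.', array_reverse(explode('.', $ip))) . '.dnsbl.httpbl.org';
$answer = gethostbyname($query);

if ($answer !== $query) {              // gethostbyname returns the query unchanged when there is no match
    $octets = explode('.', $answer);   // 127.<days>.<threat score>.<visitor type>
    if (isset($octets[3]) && (int)$octets[3] > 0) {
        // non-zero type means suspicious / harvester / comment spammer
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}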

Bots are a fact of life. They are not going away. There are way more of them than you. There is significant work required to defend your area of the internet.
1:43 pm on July 4, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2000
posts:506
votes: 2


Thanks. The thing is, I would rather not play whack-a-mole, and unless I ban them instantly it will be too late: they will already have my data, or a fair chunk of it. That's why I would rather have a few traps set so that anyone who is scraping is automatically added to my htaccess. My real question is: what is the best way to keep the good spiders out of this trap?
1:55 pm on July 4, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3910
votes: 223


WebmasterWorld offered a robot trap [webmasterworld.com] for bots that don't follow robots.txt. It is pretty old (2004), but can help you automatically block and/or identify scrapers. I've used it and it works.
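The idea behind that kind of trap is simple: disallow the trap directory in robots.txt and link to it in a way a human visitor would never follow, so the only thing that ever requests it is a bot ignoring robots.txt. Roughly like this, with /trap/ only being an example path:

User-agent: *
Disallow: /trap/

and somewhere in your pages a link nobody will see or click:

<a href="/trap/" rel="nofollow" style="display:none">&nbsp;</a>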

If you are password protecting the folder, there is little opportunity for scrapers. If search results for the database are displayed on a template, you can add a noindex meta tag to the template. That way, even if an external link points to a results page, the resulting URL still carries noindex status.
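For example, something like this in the results template, or the equivalent X-Robots-Tag response header if you would rather set it in htaccess (the /results/ path is only an illustration):

<meta name="robots" content="noindex">

# htaccess alternative, Apache 2.4 with mod_headers
<If "%{REQUEST_URI} =~ m#^/results/#">
    Header set X-Robots-Tag "noindex"
</If>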

I would still keep an eye on access logs to keep tabs on IPs and UAs. If/when you see unwanted activity you can get help in the various forums that keyplyr linked to in his post.

7:26 pm on July 4, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12334
votes: 805


The robot trap script produces too many false positives. Many of us tried it back then. I kept rewriting it to reduce the errors, but eventually came to the conclusion that blocking with such a broad approach simply doesn't work well. You will block too many beneficial agents & humans.

IPs change. IPs get reassigned. IPs get temporarily compromised by a bad actor, then fixed. Company bots crawl from the same range as their employees' firewall. Almost all Server Farms also lease ranges to ISPs. Almost all SEs lease ranges to anyone for any purpose. Beneficial agents sometimes break the rules. The list goes on...

If you use the Bad Bot Script, consistent oversight & trimming is needed.
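One thing that does cut down the false positives: verify the major crawlers by reverse DNS before anything gets written to htaccess. Google and Bing both document this method for their bots. A rough sketch in PHP, since that is what maccas mentioned using (the function name is just mine):

<?php
// true if the IP verifies as Googlebot or Bingbot via reverse DNS
// plus a forward lookup that confirms the hostname maps back to the IP
function is_verified_crawler($ip) {
    $host = gethostbyaddr($ip);               // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                         // no usable PTR record
    }
    if (!preg_match('/\.(googlebot\.com|google\.com|search\.msn\.com)$/i', $host)) {
        return false;                         // not a Google/Bing hostname
    }
    return gethostbyname($host) === $ip;      // forward-confirm
}

// in the trap script, skip verified crawlers before writing the deny line:
// if (!is_verified_crawler($_SERVER['REMOTE_ADDR'])) { /* block */ }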