
Best way to keep scrapers out?

     
8:40 am on July 4, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2000
posts:506
votes: 2


Hi, I have problems with scraper sites, and with my latest site, which has a database of close to a million entries, I am particularly worried about being scraped. I was thinking of setting up some honeypot links pointing to a PHP file that writes the visitor's IP to my htaccess file. The problem is that I am not sure this will keep Googlebot and the other good spiders out 100% of the time. Robots.txt and nofollow are not 100% reliable, are they?

"A robotted page can still be indexed if linked to from other sites
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely)"

Obviously password protection would also stop the scrapers from ever reaching the trap and getting written to htaccess, and noindex only takes effect after the page has already been crawled.
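For reference, the sort of trap I had in mind is roughly this. It is only a sketch: the .htaccess path is made up, and it still needs some way to keep the good spiders out, which is exactly the part I am unsure about:

<?php
// trap.php - the hidden honeypot links would point here
// (the .htaccess path below is made up; adjust for your own setup)

$ip = $_SERVER['REMOTE_ADDR'];

// only write something that actually looks like an IP address
if (filter_var($ip, FILTER_VALIDATE_IP)) {
    // Apache 2.2-style line; on 2.4 you would append "Require not ip ..."
    // inside a RequireAll block instead
    file_put_contents('/home/example/public_html/.htaccess',
        "Deny from $ip\n", FILE_APPEND | LOCK_EX);
}

// give the visitor nothing useful
header('HTTP/1.1 403 Forbidden');
exit;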

What do other people do?
8:49 am on July 4, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12334
votes: 805


Hi maccas,

You need to educate yourself about User Agents. What some sites see as bad bots or scrapers, others see as beneficial.

We discuss this daily in the Search Engine Spider & User Agent ID Forum [webmasterworld.com]

There are several Blocking Methods [webmasterworld.com] many webmasters use to stop the bad bots.

A large number of bad bots come from Amazon IP ranges [webmasterworld.com] and other Server Farm IP Ranges [webmasterworld.com]

- - -
11:33 am on July 4, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 35


From your raw access logs, available in cPanel, find out who is scraping you and then ban them in your htaccess. If you can ban the worst offenders, your server load will drop markedly. This needs to be done on a regular basis, as bots might be 60% of your traffic. The risk of not doing anything, beyond having your content stolen and reused by others, is that you exceed your server's capacity and your site starts returning a "busy" 503 or 504 instead of serving up your web pages.
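To give an idea, once the worst offenders are identified the bans themselves are simple. Something along these lines in your htaccess does it (the IPs are placeholders, and the syntax depends on whether you run Apache 2.2 or 2.4):

# Apache 2.4
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>

# Apache 2.2 (mod_access_compat)
order allow,deny
allow from all
deny from 192.0.2.10
deny from 198.51.100.0/24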

Project HoneyPot [projecthoneypot.org...] is a good place to go; I have used them and had good success.
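They also offer an http:BL DNS lookup you can query at request time. From memory it works something like this, but check their documentation for the exact response format; the access key is obviously a placeholder:

<?php
// rough http:BL sketch - the key and thresholds are placeholders; double-check
// the response codes against the Project Honey Pot docs
$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'youraccesskey';

// the lookup host is your key plus the IP with its octets reversed
$query  = $key . '.' . implode('.', array_reverse(explode('.', $ip))) . '.dnsbl.httpbl.org';
$answer = gethostbyname($query);

if ($answer !== $query) {              // gethostbyname returns the query unchanged when there is no match
    $octets = explode('.', $answer);   // 127.<days>.<threat score>.<visitor type>
    if (isset($octets[3]) && (int)$octets[3] > 0) {
        // non-zero type means suspicious / harvester / comment spammer
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}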

Bots are a fact of life. They are not going away. There are way more of them than you. There is significant work required to defend your area of the internet.
1:43 pm on July 4, 2018 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2000
posts:506
votes: 2


Thanks. The thing is, I would rather not play whack-a-mole, and unless I ban them instantly it will be too late: they will already have my data, or a fair chunk of it. That's why I would rather have a few traps set so that anyone who is scraping is automatically added to my htaccess. My real question is: what is the best way to keep the good spiders out of this trap?
1:55 pm on July 4, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3910
votes: 223


WebmasterWorld offered a robot trap [webmasterworld.com] for bots that don't follow robots.txt. It is pretty old (2004), but can help you automatically block and/or identify scrapers. I've used it and it works.
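The idea behind that kind of trap is simple: disallow the trap directory in robots.txt and link to it in a way a human visitor would never follow, so the only thing that ever requests it is a bot ignoring robots.txt. Roughly like this, with /trap/ only being an example path:

User-agent: *
Disallow: /trap/

and somewhere in your pages a link nobody will see or click:

<a href="/trap/" rel="nofollow" style="display:none">&nbsp;</a>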

If you are password protecting the folder, there is little opportunity for scrapers. If search results for the database are displayed on a template, you can add a noindex meta tag to the template. That way, even if an external link points to a results page, the resulting URL still carries noindex status.
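For example, something like this in the results template, or the equivalent X-Robots-Tag response header if you would rather set it in htaccess (the /results/ path is only an illustration):

<meta name="robots" content="noindex">

# htaccess alternative, Apache 2.4 with mod_headers
<If "%{REQUEST_URI} =~ m#^/results/#">
    Header set X-Robots-Tag "noindex"
</If>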

I would still keep an eye on access logs to keep tabs on IPs and UAs. If/when you see unwanted activity you can get help in the various forums that keyplyr linked to in his post.

7:26 pm on July 4, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12334
votes: 805


The robot trap script produces too many false positives. Many of us tried it back then. I kept rewriting it to reduce the errors, but eventually came to the conclusion that blocking with such a broad approach simply doesn't work well. You will block too many beneficial agents & humans.

IPs change. IPs get reassigned. IPs get temporarily compromised by a bad actor, then fixed. Company bots crawl from the same range as their employees' firewall. Almost all Server Farms also lease ranges to ISPs. Almost all SEs lease ranges to anyone for any purpose. Beneficial agents sometimes break the rules. The list goes on...

If you use the Bad Bot Script, consistent oversight & trimming is needed.
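One thing that does cut down the false positives: verify the major crawlers by reverse DNS before anything gets written to htaccess. Google and Bing both document this method for their bots. A rough sketch in PHP, since that is what maccas mentioned using (the function name is just mine):

<?php
// true if the IP verifies as Googlebot or Bingbot via reverse DNS
// plus a forward lookup that confirms the hostname maps back to the IP
function is_verified_crawler($ip) {
    $host = gethostbyaddr($ip);               // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                         // no usable PTR record
    }
    if (!preg_match('/\.(googlebot\.com|google\.com|search\.msn\.com)$/i', $host)) {
        return false;                         // not a Google/Bing hostname
    }
    return gethostbyname($host) === $ip;      // forward-confirm
}

// in the trap script, skip verified crawlers before writing the deny line:
// if (!is_verified_crawler($_SERVER['REMOTE_ADDR'])) { /* block */ }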