In general, how do you distinguish between good bots and bad ones? What makes you decide to ban a certain IP?
I could think of:
- lack of respect for robots.txt,
- crawling speed and the number of pages requested at once,
- what else?
Or, for example, how do you know that a given bot is a content scraper?
I am totally new to this, which is where my questions are coming from. I've never really analyzed my web logs, but I want to start. I'm just still not sure which web log analytics software is good.
Basically if you go to this page, you get banned, period. Write the code for this, test it so you know it works, but then comment out the execution of the actual ban so it's not "live" just yet.
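Here's a minimal sketch of what such an auto-ban page might look like in PHP. Everything specific is an assumption -- the trap.php name, the flat-file storage, the notification address -- since the thread doesn't prescribe an implementation:

```php
<?php
// trap.php - hypothetical auto-ban page: any client that requests it gets banned.
// Storage here is a flat file of "ip timestamp" lines; a database works just as well.

$ip      = $_SERVER['REMOTE_ADDR'];
$banFile = __DIR__ . '/banned_ips.txt';   // assumed location; keep it out of the web root

// Record the ban with a timestamp so it can be expired later.
// While testing, comment this line out so the ban isn't "live" yet.
file_put_contents($banFile, $ip . ' ' . time() . "\n", FILE_APPEND | LOCK_EX);

// Tell yourself right away that the trap fired.
mail('you@example.com', 'Bot trap triggered',
     "Banned IP: $ip\nUA: " . ($_SERVER['HTTP_USER_AGENT'] ?? 'n/a'));

// Give nothing away: a bare 403 and stop (more on this below).
header('HTTP/1.1 403 Forbidden');
exit;
```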
The next step is to add this page to your robots.txt, marked as a page that should NOT BE SPIDERED. Wait a few weeks before you do anything else just to make sure that any legit search engines that spider your site have had time to revisit and grab a new robots.txt - sometimes they go off a cached version, so you don't want to be blocking legitimate search engines.
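In robots.txt that's a plain Disallow line -- here using the hypothetical /trap.php name from the sketch above:

```
User-agent: *
Disallow: /trap.php
```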
Once you're confident that all of the search engines have your newest robots.txt, turn on the code so the ban functions are live. Then the next step is to make a hidden link (using style="display:none;" is fine) to this page from every page on your site.
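The hidden link itself can live in a shared footer include, something along these lines (again assuming the /trap.php name; humans never see it, and well-behaved spiders skip it because of robots.txt):

```php
<?php /* footer.php - hypothetical include pulled into every page */ ?>
<a href="/trap.php" style="display:none;">archive</a>
```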
Still with me? Basically what you've done here is forced everyone to respect robots.txt - if a spider comes in that ignores it, it will attempt to crawl the URL to your auto-ban page as well, immediately locking itself out of your site. And since you coded the ban page to send you an email, you'll know right away if it happens.
On the back-end you can set expirations for the ban, etc., whatever you want to do. Usually 24 hours is enough, but you may want to go longer or go permanent. Whatever you do, do not TELL the banned visitor how long they are banned for, otherwise they'll just update their bots to revisit you one page at a time if they're really determined.
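Enforcing the ban with an expiry might look like the sketch below. It assumes the flat-file format from the trap sketch and would run before anything else on every page (via an include, or PHP's auto_prepend_file setting):

```php
<?php
// ban_check.php - hypothetical gatekeeper, included before any page renders.

$banFile  = __DIR__ . '/banned_ips.txt';
$banTtl   = 24 * 3600;   // 24-hour ban; raise it, or skip the check to make bans permanent
$clientIp = $_SERVER['REMOTE_ADDR'];

if (is_readable($banFile)) {
    foreach (file($banFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$ip, $bannedAt] = explode(' ', $line, 2);
        if ($ip === $clientIp && (time() - (int)$bannedAt) < $banTtl) {
            header('HTTP/1.1 403 Forbidden');
            exit;   // say nothing about why, or for how long
        }
    }
}
```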
If you want, you can take this a step further by making 20-25 different auto-ban pages and putting them all in robots.txt, but only have one of them actually executing a ban at a time. This way you can rotate which page is doing the actual banning, to thwart any scrapers who may be manually reviewing your pages to figure out how you banned them the first time.
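One way to do the rotation, assuming trap pages named trap1.php through trap25.php (all of them Disallowed in robots.txt) and a one-line text file naming the currently live one:

```php
<?php
// Shared prologue for every trapN.php page. Only the page named in
// active_trap.txt actually bans; the others just serve a dead end.

$active = trim((string)@file_get_contents(__DIR__ . '/active_trap.txt')); // e.g. "trap7.php"

if (basename($_SERVER['SCRIPT_NAME']) !== $active) {
    header('HTTP/1.1 404 Not Found');   // what the inactive traps return is your call
    exit;
}
// ...otherwise fall through to the ban logic from the earlier trap sketch.
```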
Return a 403-Forbidden status, but no additional information. Knowledge is power, so don't give any away.
If you return a 403-Forbidden response for any other reason --outside of the function described above-- then you may want to write the custom 403-Forbidden page in a neutral tone, and provide a link to another page for more information. This keeps your 403 response page short so as not to waste your bandwidth on bad-bots.
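The neutral page can be very short -- in Apache you'd point ErrorDocument 403 at something like this (wording and paths are placeholders):

```php
<?php
// forbidden.php - neutral custom 403 page; short, so bad bots cost you little bandwidth.
header('HTTP/1.1 403 Forbidden');
?>
<p>Sorry, there was a problem handling your request.</p>
<p><a href="/access-help.html">More information</a></p>
```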
Obviously, even banned visitors must be allowed to view both of these pages. On the linked page, write a semi-apologetic message saying that there has been a problem, and to contact you if they need help. Provide a text-only link to an e-mail form (also accessible to banned clients and Disallowed in robots.txt) that submits to a very-well-locked-down e-mail script. None of the bad guys will bother, while you may get messages from a few innocent visitors who got banned due to errors in your script (for example, there are client browsers which don't handle all CSS coding variations properly -- notably mobile browsers, so you may trap a few innocents).
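A "very-well-locked-down" e-mail script mostly means: hard-coded recipient and subject, nothing user-supplied ever reaching a mail header, and a size cap. A minimal sketch, with made-up field names:

```php
<?php
// contact.php - deliberately dumb mail handler, reachable even by banned clients.

$body = substr((string)($_POST['message'] ?? ''), 0, 2000);   // hard size cap
$from = str_replace(["\r", "\n"], '',
        substr((string)($_POST['email'] ?? ''), 0, 100));     // no header injection

if ($body !== '') {
    // The visitor's address goes in the body, not the headers, so it can't be abused.
    mail('you@example.com', 'Site contact form',
         "From (unverified): $from\n\n$body");
}
echo 'Thanks. If there was a problem with your access, we will look into it.';
```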
However, in all of this, do not give away any information that may aid the scrapers that you've trapped. Tell them there was a problem and access was denied. Don't say they're banned, or for how long. Don't say why. It's hard to write a custom 403 error page for innocent people *and* for hardened site abusers, but it can be done. :)
Jim
I bookmarked this thread.
I will be studying this and searching the web before I come back with additional questions (I am sure I'll have some).
It's amazing, everything that comes at you once you've published your site out there. I mean, so many people base their business on a web site (or several), and yet they don't know what web logs are. I won't comment any further.
I started diving into this part by learning how to return a 404. At the beginning, my custom 404 page was actually returning a 200. I fixed that with PHP.
Then I made my custom 404 page email me some basic data.
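For reference, the usual PHP fix is to send the status explicitly at the top of the custom error page -- otherwise many setups serve it as 200 OK. The mail part is a sketch of the kind of "basic data" meant here:

```php
<?php
// 404.php - custom error page. Without the header() call, this could go out as 200 OK.
header('HTTP/1.1 404 Not Found');

// Hypothetical "basic data" notification.
$report = sprintf("URL: %s\nReferer: %s\nIP: %s\nUA: %s\n",
    $_SERVER['REQUEST_URI']     ?? 'n/a',
    $_SERVER['HTTP_REFERER']    ?? 'n/a',
    $_SERVER['REMOTE_ADDR']     ?? 'n/a',
    $_SERVER['HTTP_USER_AGENT'] ?? 'n/a');
mail('you@example.com', '404 hit', $report);
?>
<p>Sorry, that page could not be found.</p>
```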
Once I saw everything that was coming to my site, I started thinking "infinite".
…and I am very happy about it, since I like this type of work. It came just in time, right when PPC started killing my soul.
Many thanks to WW and folks like you.