|Automatically detect crawlers / bots|
I've read the Wheels thread and spent a night reading incredibill's website. [webmasterworld.com...]
Then I started programming, but found out it was more difficult than I thought. I'm checking everyone who accesses robots.txt and have set up some hidden links (not yet blocked in robots.txt) to see who is crawling my website. After a couple of hours I've got more than 50 bots already, though for some of them I don't know if they are really bots (I can't think of a way to reach those hidden links unless you are a bot?).
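The hidden-link trap described above can be sketched in a few lines. This is only an illustration, assuming a made-up trap path and in-memory storage; the real script in the pinned threads works differently:

```python
# Minimal sketch of a hidden-link bot trap. The trap URL below is a
# made-up placeholder: it would be linked invisibly on every page, so
# only crawlers that follow hidden links should ever request it.
TRAP_PATHS = {"/trap/dont-follow.html"}

caught = {}  # IP -> number of trap hits

def check_trap(ip, path):
    """Record any client that fetches a trap URL; the caller can then
    log it, send a notification email, or block the IP."""
    if path in TRAP_PATHS:
        caught[ip] = caught.get(ip, 0) + 1
        return True
    return False
```

Each request handler would call `check_trap()` before serving the page, so trap hits are counted no matter which page the bot came from.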
Incredibill recommended using a whitelist to give access to bots, but how would you implement that? Do you give access only to the bots on your whitelist and block all other bots by default?
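One way to read that advice is default-deny for anything that looks like a crawler. A minimal sketch based on the user-agent string (the substrings and allowed names below are illustrative only, and UA strings are trivially forged, so a real whitelist should also verify the IP):

```python
# Default-deny whitelist sketch: any request that self-identifies as a
# crawler is refused unless it matches an allowed bot name.
# These lists are illustrative assumptions, not a complete set.
ALLOWED_BOTS = ("googlebot", "bingbot")
BOT_HINTS = ("bot", "crawl", "spider", "slurp")

def allow_request(user_agent):
    ua = user_agent.lower()
    if not any(hint in ua for hint in BOT_HINTS):
        return True  # looks like a normal browser; other checks decide
    return any(good in ua for good in ALLOWED_BOTS)
```

The point of the hint check is that ordinary browser traffic is never touched by the whitelist; only self-declared crawlers are filtered.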
And how would you block the unwanted bots without using too many resources? If I added all the IPs of unwanted bots to a database, the list would grow massively at this pace. The same goes for adding them to .htaccess.
The only way to do it effectively would be to detect "bot behaviour" and block based on that. But I've searched the internet for the "best way" and couldn't find anything satisfying to start with.
I would like to know an efficient way to detect how fast a crawler is accessing my website. Do you do this in a database, in sessions, or...? I did find some code snippets in other threads, but none of them were working "out of the box", so I couldn't test them and alter them for my website. ( [webmasterworld.com...] )
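As for measuring how fast a crawler is hitting the site, one alternative to a database or sessions is an in-memory sliding window of timestamps per IP. The window size and threshold below are guesses that would need tuning for a real site:

```python
import time
from collections import defaultdict, deque

# Sliding-window request-rate tracking kept in memory. The numbers are
# illustrative assumptions; tune them against real traffic.
WINDOW = 10.0   # seconds of history to keep per IP
MAX_HITS = 20   # requests allowed inside the window before flagging

hits = defaultdict(deque)

def is_fast_scraper(ip, now=None):
    """Record one request from ip; return True once it exceeds
    MAX_HITS requests inside the last WINDOW seconds."""
    now = time.time() if now is None else now
    window = hits[ip]
    window.append(now)
    while window and now - window[0] > WINDOW:
        window.popleft()  # drop timestamps older than the window
    return len(window) > MAX_HITS
```

Because old timestamps are discarded on every call, memory use stays bounded by the number of currently active IPs rather than growing forever like an IP blacklist would.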
Anyone who can help?
Blocking Badly Behaved Bots #3 [webmasterworld.com]
Alex_K's script, which wilderness directed you to, is a superb start and is the foundation of my bot blocking, although I've morphed it into my own variant over the years.
What I would say is that trying to block bots is very time intensive. However, you can do a lot, and save a lot of bandwidth and processing power on your server, with just the basics.
You can block an awful lot by making use of the two threads pinned to the top of this forum, getting an IP-to-country database (paid or free), and blocking the countries it makes sense for you to block (which could be different from other people's).
No wonder I couldn't get my version to work; it seems it was missing huge, important parts. So thanks for the link to the full version.
I guess it would be smart to add some kind of feedback by mail whenever a bot is caught?
And what kinds of bots do you allow? E.g. the social media monitoring sites that crawl for companies when you or a user mention their name? And which crawler is used for Google Alerts?
I would go further with the benefits:
If you prevent scrapers you can prevent a lot of the Google site-duplication problems (and possibly even pandalization, but that's just an opinion).
You can also maintain a more secure site by rejecting virus implanters. Even if your site/server is already secure, an extra string to your bow is always helpful.
Otherwise, yes. It is very time intensive to build and, in my experience, time intensive to maintain. These past couple of weeks I've seen a significant increase in "new" nasties (approx three-fold), mostly, it seems, from compromised servers.
In theory, if everyone ran virus-proof servers and broadband-connected computers, then botnets would not exist. I wonder if it's possible to pursue some of the major server farms through legal channels? A lot of it is under their control, after all. Probably not, though. :(
Globetrotter - if you have a linux-based web site (or IIS that implements .htaccess) your work-load is much easier than otherwise.
|These past couple of weeks I've seen a significant increase in "new" nasties (approx three-fold), mostly, it seems, from compromised servers. |
Perhaps a normal increase in holiday traffic that has just "slipped your mind"?
No. These are true nasties. Our sites are, for the most part, not seasonal. The number decreased today, to about twice normal.
Just as traffic ebbs and flows, exploits and botnets ebb and flow over hours, days, weeks, etc. So if nothing's changed on your server that would suddenly make it a haven, chalk things up to online life, note the worst of the worst, and start blocking by IP, Host, UA, URI, whatever's necessary.
I tend to block about 300 bots/day; most of them are the same 200 that keep knocking day after day. This is down from the heyday of my site, when nearly 1K things were blocked per day. It was insane!
I've added the script to my website in two sections: one at the beginning of the code and one at the end. To make sure I didn't block too many things, I've added an email notification to the code whenever a bot is blocked.
After uploading the files to the server I got so many reports that something must be wrong :) I've used the default values; I guessed those were safe to use, but are they?
It also seems Bingbot and a Google Feedfetcher were blocked by the script. A DNS lookup seemed to confirm the IPs were from Bing and Google? Any ideas?
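On verifying Bing and Google: a plain reverse DNS lookup can be spoofed, so both Google and Bing document forward-confirmed reverse DNS as the check to use: reverse-resolve the IP, check the hostname suffix, then resolve that hostname forward and confirm it maps back to the same IP. A sketch (the resolver functions are injectable here only so it can be exercised without the network):

```python
import socket

# Forward-confirmed reverse DNS check for major search crawlers.
# Suffixes per Google's and Bing's crawler-verification documentation.
VALID_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_real_search_bot(ip, reverse=None, forward=None):
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname
    try:
        host = reverse(ip)          # step 1: reverse lookup
    except OSError:
        return False
    if not host.endswith(VALID_SUFFIXES):
        return False                # step 2: hostname suffix check
    try:
        return forward(host) == ip  # step 3: forward-confirm
    except OSError:
        return False
```

A script that passes this check for a UA claiming to be Googlebot or Bingbot should whitelist the request rather than block it, which would cover the false positives described above.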
I think I've got the script working the way I want. I needed to change some of the code, implemented the whitelist, and built a notification option so I get an email each time a bot is blocked. I did receive some mail, but not as much as from the bot traps I've implemented.
So what's the next step? Do I need to change my robots.txt to let all the bots know I only accept Bing and Google? Or should I block every bot that tries to crawl the website, which might be a lot?
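For what it's worth, robots.txt is only advisory: it keeps polite crawlers away, but bad bots ignore it, so the blocking script still has to handle those. A whitelist-style robots.txt for the two engines named above could look like this (an empty Disallow means "allow everything"):

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```

Whether to also actively block everything else, or just let robots.txt turn away the polite ones, is the trade-off the thread is discussing.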
I also have a bot trap, just to detect dumb bots; I'm not sure if my settings are tight enough to catch those with the bot script as well. I'll have to compare the data after a couple of days. How would I block those bots with the same code when they hit my bot trap?
Hopefully you could share some tips so I do not have to reinvent the wheel.
I've been using the script for a while now, but I had to disable the slow-scraper detection because it blocked visitors. And now users are also complaining about the fast-scraper part.
Any idea how to tweak the settings so I can still block bots? And what is a good way to validate whether it's really a user or a bot?
|I had to disable the slow scraper detection because it blocked visitors |
Yep, in general permanent blocks may backfire, and what you're experiencing is not the worst part. You cannot reliably tell whether a request comes from a bot or a human.
The worst part is this: if your site generates revenue (selling online, etc.) and a competitor figures out you are using traps to ban IPs (not a remote possibility), he can place various HTML elements on his sites, pointing at your traps, to make sure all his visitors trigger them. There are also ways to redirect spiders into your trap once they know where it is located.
If you call this black hat SEO, take into account that what you're doing is called cloaking.