Forum Moderators: phranque
I'm hacking away at a problem right now and was wondering if anyone out there had insight into a reliable way to detect a robot (in general) from Apache rewrite rules.
In Perl I've been successful at determining (in general) if the visitor is a robot or not by using the following code:
use HTTP::BrowserDetect;
my $browser = HTTP::BrowserDetect->new($user_agent_string);
if ($browser->robot) {
    # Do whatever...
}
Does anyone know if there is a similar condition check I can do in Apache rewrite rules? I found ways to check the user-agent for specific agents, but I was hoping for something more general, like the Perl solution above, that lets me do a simple yes/no check on whether the visitor is a robot, regardless of what type of robot.
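For reference, the per-agent approach I've found looks something like the sketch below: a RewriteCond that matches common crawler substrings in the User-Agent header and sets an environment variable. The pattern list and the IS_ROBOT variable name are just illustrative, not an exhaustive or authoritative list.

```apache
# Sketch: flag requests whose User-Agent contains common crawler
# substrings. The token list here is illustrative only.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|slurp|archiver) [NC]
RewriteRule ^ - [E=IS_ROBOT:1]
```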
Any ideas?
Thanks everyone.
So, while you may be viewing the Perl solution as "simple and easy," the fact is that it is likely "over-simplified and unreliable."
Detecting robots is not at all a simple task, and there are hundreds of threads here in the Apache, Search Engine Spider Identification, Perl, and PHP forums dealing with the many ways of detecting them. Even if you implement whitelists, blacklists, IP-address-range filtering, user-agent-string validation, and both the Perl and PHP robot-detection scripts posted here at WebmasterWorld, you will still not have "a reliable way to detect robots" if your definition of "reliable" means anything like "100% effective" -- there are simply too many robots masquerading as browsers to achieve a 100% detection rate. :(
Check out the forums cited above, and look in their "Library" sections for the threads and scripts that I mentioned.
Jim
As I mentioned, I am trying to move that logic into my Apache config, so the solution would not need to be bulletproof. I don't even care to tell the difference between a good bot and a bad bot for this particular requirement, only whether it is a bot versus a browser. I realize some bots masquerade as browsers, and those I'm willing to let fly... I'm just looking for an 80/20 solution.
That said, do you think I need to build a white-list of browser user-agents to help determine whether something is a bot? Would that be the easier route to go?
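If it helps to make the white-list idea concrete, here is one possible sketch in mod_rewrite terms: treat anything whose User-Agent does not contain a token common to mainstream browsers as a bot. This is very much an assumption-laden 80/20 approach -- "Mozilla" alone is a weak signal, since many bots include it in their user-agent strings -- and the variable name is hypothetical.

```apache
# Hedged sketch of the white-list approach: assume a robot unless the
# User-Agent contains a token typical of mainstream browsers.
# Note: many bots also send "Mozilla", so this is 80/20 at best.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteRule ^ - [E=IS_ROBOT:1]
```

You could then branch on the environment variable in later rules or in your application.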
Thanks again for the feedback.