Forum Moderators: phranque
I'm hacking away at a problem right now and was wondering if anyone out there had insight into a reliable way to detect a robot (in general) from Apache rewrite rules.
In Perl I've been successful at determining (in general) if the visitor is a robot or not by using the following code:
use HTTP::BrowserDetect;
my $browser = HTTP::BrowserDetect->new($user_agent_string);
if ($browser->robot) {
    # Do whatever...
}
Does anyone know if there is a similar condition check I can do in Apache rewrite rules? I found ways to check the user-agent for specific agents, but I was hoping for something more general, like the Perl solution above, that lets me do a simple yes/no check on whether the visitor is a robot, regardless of what type of robot.
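For reference, the per-agent approach I've found looks something like the sketch below: a RewriteCond that matches common crawler substrings in the User-Agent header and sets an environment variable. The pattern list and the IS_ROBOT variable name are just illustrative, not an exhaustive or authoritative list.

```apache
# Sketch: flag requests whose User-Agent contains common crawler
# substrings. The token list here is illustrative only.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|slurp|archiver) [NC]
RewriteRule ^ - [E=IS_ROBOT:1]
```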
Any ideas?
Thanks everyone.
So, while you may be viewing the Perl solution as "simple and easy," the fact is that it is likely "over-simplified and unreliable."
Detecting robots is not at all a simple task, and there are hundreds of threads here in the Apache, Search Engine Spider Identification, Perl, and PHP forums dealing with the many ways of detecting them. Even if you implement whitelists, blacklists, IP-address-range filtering, user-agent-string validation, and both the Perl and PHP robot-detection scripts posted here at WebmasterWorld, you will still not have "a reliable way to detect robots" if your definition of "reliable" means anything like "100% effective" -- there are simply too many robots masquerading as browsers to achieve a 100% detection rate. :(
Check out the forums cited above, and look in their "Library" sections for the threads and scripts that I mentioned.
Jim
As I mentioned, I am trying to move that logic into my Apache config, so the solution would not need to be bulletproof. I don't even care to tell the difference between a good bot and a bad bot for this particular requirement, only whether it is a bot versus a browser. I realize some bots masquerade as browsers, and those I'm willing to let fly... I'm just looking for an 80/20 solution.
That said, do you think I need to build a white-list of browser user-agents to help determine whether something is a bot? Would that be the easier route to go?
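If it helps to make the white-list idea concrete, here is one possible sketch in mod_rewrite terms: treat anything whose User-Agent does not contain a token common to mainstream browsers as a bot. This is very much an assumption-laden 80/20 approach -- "Mozilla" alone is a weak signal, since many bots include it in their user-agent strings -- and the variable name is hypothetical.

```apache
# Hedged sketch of the white-list approach: assume a robot unless the
# User-Agent contains a token typical of mainstream browsers.
# Note: many bots also send "Mozilla", so this is 80/20 at best.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteRule ^ - [E=IS_ROBOT:1]
```

You could then branch on the environment variable in later rules or in your application.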
Thanks again for the feedback.