incrediBILL - 9:05 am on Mar 25, 2012 (gmt 0)
How do you distinguish between a browser and a robot? You can't
You can, to a degree.
Web security is built in layers and you peel away bots one layer at a time, like peeling an onion.
First, the robots.txt layer for the good guys that honor robots.txt. Make it a whitelist rather than a blacklist so you get the maximum bang for the buck: stop as many bots as possible at this layer so they don't keep burning more server resources.
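A bare-bones whitelist robots.txt looks something like this (Googlebot and Bingbot are just examples, list whichever crawlers you actually want to let in):

# Allow the crawlers you trust, shut everyone else out
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /

Of course only the polite bots ever read it, which is why you need the layers below.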
Second, the .htaccess or script layer that forcibly filters out the bad guys that ignore robots.txt. You can obviously block the bots that announce themselves in the user agent; the user agents that claim to be browsers you let through to the next level, which is the best you can do in layer 2.
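In .htaccess that can be a few lines of mod_rewrite, something like this sketch (the user agent strings are only placeholders, use whatever actually shows up in your logs):

RewriteEngine On
# Kick out user agents that announce themselves as scripts/scrapers (example strings only)
RewriteCond %{HTTP_USER_AGENT} (libwww-perl|curl|wget|python-requests) [NC,OR]
# Kick out requests with a blank user agent
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F,L]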
Third, filter by IP policy: firewall any "browsers" coming from commercial locations, like hosting farm IP ranges, or, in reverse, block robots coming from residential IP ranges.
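In Apache that can be as simple as deny rules, along these lines (those are documentation ranges, not real hosting farm blocks, substitute the data center ranges you actually see hitting your site):

# Block "browsers" coming from data center IP ranges (example ranges only)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24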
Fourth, filter by behavior: header content, whether it loads CSS, js, images, etc.
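If you script that part, the idea is roughly this (Python-ish sketch, the field names and thresholds are all made up for illustration):

# A "browser" that fetches pages but never touches CSS/JS/images is suspicious.
def looks_like_a_bot(session_requests):
    pages = sum(1 for r in session_requests if r["type"] == "html")
    assets = sum(1 for r in session_requests if r["type"] in ("css", "js", "image"))
    # Real browsers send the usual headers; many bots don't bother.
    odd_headers = any("Accept-Language" not in r["headers"] for r in session_requests)
    return (pages >= 3 and assets == 0) or odd_headers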
Fifth, filter by volume: how fast it requests pages and how many, and whether that strays from human behavior.
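Again as a rough script-level sketch (the window and page count are pure examples, tune them to your own traffic):

import time
from collections import defaultdict, deque

WINDOW = 60      # seconds of traffic to look at (example value)
MAX_PAGES = 30   # more pages than a human reads in that window (example value)
recent = defaultdict(deque)   # ip -> timestamps of recent page requests

def too_fast(ip, now=None):
    """True if this IP is pulling pages faster than any human would."""
    now = time.time() if now is None else now
    q = recent[ip]
    q.append(now)
    # Drop requests that fell out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) > MAX_PAGES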
So on and so forth...
Then there are other policies and rules in place that keep stripping out bot behavior vs. human behavior until it's as good as it gets. In the end bots still sneak in, but far fewer than before.