Now here's the part that gives your log-wrangling script a workout: How many of those robots went on to read robots.txt and abide by its instructions? Some robots have fooled me by picking it up faithfully on every visit-- and then merrily going wherever they want to go. Some are so entranced by robots.txt-- whose subject matter is not gripping-- that they never get around to picking up any real files. (The bingbot does this consistently. When it does get bored, it varies the menu by landing on the nearest 301. It is a mystery to me how it finds them or why it wants to, since it never follows the redirect. Maybe it's in secret communication with the googlebot.)
And then there was the robot I should by all rights have locked out on sight because it grabbed everything it could lay its hands on, never even pretended to look at robots.txt... but it faithfully obeyed all "nofollow" directives! Score a victory for the belt-and-suspenders principle.
lucy, try whitelisting robots. I use a CGI to let only the majors see the 'complete' version (which matches my sitemap.xml), then use mod_rewrite to let them hit only the pages indicated. Everyone else gets a generic full Disallow (and no access to sitemap.xml) unless and until I give them some leeway -- which I rarely do, because too many are totally untrustworthy.
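A minimal sketch of that kind of setup -- the bot names, paths, and CGI location are made up for illustration, and the syntax is Apache 2.2-era to match an older box:

```apache
# Hypothetical .htaccess sketch: route robots.txt through a CGI
# that tailors its answer per crawler, and keep sitemap.xml away
# from everyone but the whitelisted majors.
RewriteEngine On

# Serve robots.txt via the tailoring script
RewriteRule ^robots\.txt$ /cgi-bin/robots.cgi [L]

# Only the majors get sitemap.xml; everyone else gets a 403
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
RewriteRule ^sitemap\.xml$ - [F]
```

User-agent matching like this is trivially spoofable, of course, which is why serious whitelists also check the requesting IP against the engines' published ranges.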
Oh, also: Bingbot, and other majors, hit many, many times from many, many different servers. So while it may look like they're not retrieving anything beyond r.txt, they are -- just not from that server at that time.
Robots get classified as either "No skin off my nose" or "I don't like your face", which bypasses most objective standards ;) Analogously: hotlinkers annoy me on principle-- and they annoy their sites' visitors even more, what with all that download time-- so they're all blocked even though the server load is minuscule overall. But everyone including the grimiest Byelorussian robot is allowed to see the 403 page, because the server weeps if they're not allowed to.*
But if I wanted to be principled about it I'd say that if you whitelist the known robots from approved sources, and lock out everyone else, you've wiped out any chance of someone starting up a genuinely new and interesting search engine. They gotta practice on someone, and their results can't possibly be any stranger than g###'s.
* I don't understand how or why this works. In the error logs, all blocked requests for an interior page-- but not the front page-- are followed by a request for the 403 page. So let's make the error processor happy. And any passing Chinese human can at least see my color scheme ;)
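For what it's worth, the usual recipe for making a custom 403 page fetchable even by blocked visitors looks something like this (hypothetical path; Apache 2.2 Allow/Deny syntax). If the error document itself is caught by the block, Apache logs a secondary error instead of serving the page -- presumably the "weeping" in question:

```apache
# Hypothetical sketch: designate a custom 403 page and exempt it
# from the access rules so even blocked clients can retrieve it.
ErrorDocument 403 /errors/403.html

<Files "403.html">
    Order allow,deny
    Allow from all
</Files>
```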
Wow. You're a sympathetic soul! And/or independently wealthy:)
After decades (yikes) of spotting abusive bots, I'm happily hard-nosed about any person or company 'practicing on' -- a.k.a. using and impacting -- my work, my visitors, my server, my bandwidth, my wallet.
If some bot-runner is either too ignorant or imperious to code for reading-and-heeding robots.txt, they can simply stay the heck away.
(Re your asterisk: I don't understand what you don't understand, sorry. A rewrite prob?)
For me, site-specific access control on my old Apache box is via .htaccess, relying primarily on mod_rewrite (plus some Deny from directives).
robots.txt is a toothless tiger if it's neither retrieved nor heeded après retrieval. I use it, but only via a script that gives the 'full' list to my choice of major SEs, and a full Disallow to absolutely anyone else.
The script I use is a verrry heavily modified version of Lee Killough's officially outdated robots.txt-generating CGI [leekillough.com...]
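In the same spirit, here's a bare-bones sketch of a user-agent-gated robots.txt CGI. This is not Killough's script or the poster's modification of it; the whitelist, rules, and substrings are invented for illustration:

```python
#!/usr/bin/env python3
# Hypothetical sketch: serve a tailored robots.txt depending on who
# is asking. Trusted-crawler names and rule sets are illustrative.
import os

TRUSTED = ("Googlebot", "bingbot")  # the chosen "major SEs"

FULL_RULES = "User-agent: *\nDisallow: /private/\n"
LOCKOUT = "User-agent: *\nDisallow: /\n"


def robots_txt(user_agent: str) -> str:
    """Return the full rule set to trusted crawlers,
    and a blanket Disallow to everyone else."""
    if any(name in user_agent for name in TRUSTED):
        return FULL_RULES
    return LOCKOUT


if __name__ == "__main__":
    ua = os.environ.get("HTTP_USER_AGENT", "")
    print("Content-Type: text/plain\r\n")
    print(robots_txt(ua))
```

Again, anyone can claim to be Googlebot in a User-Agent header, so a production version would verify the requester (e.g. by reverse-then-forward DNS lookup) before handing over the full list.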