At least it asked for robots.txt
As I read this, I had a belated D’oh! moment and realized that if I added one more environment variable, I could let robots.txt default to Disallow everywhere. Otherwise, if it’s a brand-new robot, you can’t really fault it for not seeing its name in robots.txt and therefore assuming it’s OK to go in. (Unless it heads straight for a Disallowed directory, in which case I can proceed directly to bad_agent.)
No Referrer
Anyone who does send a referer for robots.txt gets the global Disallow, because I know they’re lying.
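For anyone who wants to see the two rules side by side (default to Disallow for robots that aren't named, and global Disallow for anyone sending a referer), here is a rough Python sketch of the decision. It is not my actual setup, which lives in the server config; the names `KNOWN_ROBOTS`, `DISALLOW_ALL` and `choose_robots_txt` are made up for illustration, and the robot names in the set are placeholders.

```python
# Illustrative sketch only: every name here is hypothetical.
# Robots that are named in the normal robots.txt (placeholder values).
KNOWN_ROBOTS = {"googlebot", "bingbot"}

# The "global Disallow" version of robots.txt.
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def choose_robots_txt(user_agent: str, referer: str, normal_body: str) -> str:
    """Pick which robots.txt body to serve for a single request."""
    # Logs show a missing referer as "-"; treat that the same as empty.
    if referer and referer != "-":
        # Nothing links to robots.txt, so a referer here is made up.
        return DISALLOW_ALL

    ua = user_agent.lower()
    if any(name in ua for name in KNOWN_ROBOTS):
        # A robot we already name gets the normal file with its own rules.
        return normal_body

    # Unknown or brand-new robot: default to Disallow everywhere.
    return DISALLOW_ALL
```

With this, an unnamed robot that asks politely and sends no referer still gets the Disallow-everything file, so it can no longer claim it didn't see its name.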
3.238.21.abc - - [05/Jun/2024:11:34:07 -0700] "GET /robots.txt HTTP/1.1" 200 3944 "-" "Clickagy Intelligence Bot v2"
Now, if they’re one of those obscure robots that’s just checking whether a site exists, or whether it has a robots.txt, then I guess that request is as good as any. Or, hm, you could put together a database of which named robots are most often Disallowed, which I don’t doubt would be useful for someone somewhere.