1.) Who, What, Where, When, How
Many bots crawl from non-obvious hosts/IPs using multiple user agents, some of them cloaked. So if an apparent newcomer already knows your file paths, there's a strong likelihood it has been to your site before.
Alternatively, if your file paths are 'visible' via robots.txt, and/or you haven't specifically denied caching in page-level HTML, in .htaccess, and/or via at least some of the majors' webmaster tools, then again everything's visible.
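To see what your own robots.txt gives away, here's a minimal Python sketch (mine, not part of any tool mentioned here) that lists every path named in Allow/Disallow lines. The function name and the sample paths are made up for illustration.

```python
def listed_paths(robots_txt):
    """Return every path named in Allow/Disallow lines of a robots.txt body."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty "Disallow:" names no path, so it is skipped.
        if field.strip().lower() in ("allow", "disallow") and value:
            if value not in paths:
                paths.append(value)
    return paths


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/reports/   # hypothetical path
Disallow: /cgi-bin/
Disallow:
"""

if __name__ == "__main__":
    for path in listed_paths(SAMPLE):
        print(path)
```

Anything this prints is a path you've published to every bot that bothers to read the file, well-behaved or not.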
2.) When 'No' Doesn't Mean No
Many bots couldn't care less about 403s; ditto many individual browser add-ons, link-checkers, file-downloaders, and users. With the bots, that's bad programming, imho. The rest can simply be clueless or, in too many cases, compromised.
My Solution (ymmv): When I'm hit by visitors of any kind who regularly ignore 403s, I rewrite them to 127.0.0.1. Then, if they're relentless or their hit rate's too rapid, I send a Cease-and-Desist (C&D) to the ISP. I'm often surprised how frequently the C&D works. (Alas, in the case of notorious ISPs like theplanet or amazonaws, don't hold your breath.)
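To find out who is regularly ignoring 403s in the first place, here's a rough Python sketch that counts 403 responses per client in an access log. It assumes Apache's Common/Combined Log Format; the function name, the threshold of 3, and the sample addresses (from the documentation-only ranges) are my own choices, not anything prescribed.

```python
import re
from collections import Counter

# Start of a Common/Combined Log Format line:
# client ident user [timestamp] "request" status ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')


def repeat_403_clients(lines, threshold=3):
    """Return {client: count} for clients with at least `threshold` 403s."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "403":
            hits[match.group(1)] += 1
    return {ip: n for ip, n in hits.items() if n >= threshold}


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [05/Nov/2005:10:00:00 +0000] "GET /a.html HTTP/1.1" 403 199',
    '203.0.113.7 - - [05/Nov/2005:10:00:01 +0000] "GET /b.html HTTP/1.1" 403 199',
    '198.51.100.2 - - [05/Nov/2005:10:00:02 +0000] "GET / HTTP/1.1" 200 5120',
    '203.0.113.7 - - [05/Nov/2005:10:00:03 +0000] "GET /c.html HTTP/1.1" 403 199',
    '198.51.100.2 - - [05/Nov/2005:10:00:04 +0000] "GET /x HTTP/1.1" 403 199',
]

if __name__ == "__main__":
    for ip, count in repeat_403_clients(SAMPLE_LOG).items():
        print(ip, count)
```

Lines that don't fit the pattern (custom log formats, requests containing a literal double quote) are silently skipped, so treat the counts as a floor, and check who the address belongs to before acting on it.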
In extreme cases, you can place a firewall rule against them if you have the means, or, depending on your server software, you can deny them from the get-go so denials waste the fewest resources and don't clog your site-level logs.
If the offending Host/IP/bot is notorious -- search Goo, projecthoneypot.org, this forum's posts, etc. -- don't sweat locking out the address. However, if it's someone clueless, sooner or later they may revisit and realize what havoc they've been wreaking. That's why I wait awhile between 403s (with my e-address in graphic form) and oblivion (127.0.0.1).
For example, with Safari's 'Top Sites' feature, the browser learns to revisit oft-visited pages on launch. (This supposedly cool code thing is a MAJOR headache on dynamic sites because indices get hit countless times/day for no real-time purpose whatsoever.) Anyway, before you kill a visitor for 403 abuse, make sure it's not just a regular visitor's browser doing its thing.
If the preceding is more geek than not, you'll find info and how-to details on most of these options in the appropriate forums here, e.g., Apache Web Server, and specifically their Library docs.