So if you have a bot disallowed and they ignore it and take files anyway, do you then remove the Deny line as irrelevant or leave it in?
If I'm considering allowing someone in, the first step is to deny them in robots.txt and see if they oblige. No hole-poking unless and until they've passed that test. I test for
at least a month-- longer if it's a rare visitor. So they might not respond to the first robots.txt change, but they get plenty of subsequent chances. In fact I just recently reviewed the first batch: robots.txt denials added in March, behavior checked at the end of April. This time, two holes were poked, and two other bots were explicitly flagged as "Continue Blocking". (The latter two remain denied in robots.txt on the off chance that they'll eventually mend their ways.)
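The month-long check boils down to scanning the access log for any post-denial requests to disallowed paths. A minimal sketch of that scan, assuming the log has already been parsed into (timestamp, user-agent, path) tuples -- the bot name, directory, and dates here are all hypothetical, not my actual setup:

```python
from datetime import datetime

# Hypothetical: directories disallowed for the bot under test.
DISALLOWED_PREFIXES = ("/ebooks/",)

def violations(log_entries, bot_ua, denial_date):
    """Return (timestamp, path) pairs where bot_ua hit a disallowed
    path after the Disallow line was added. An empty result over the
    test window means the bot passed and is a candidate for a hole."""
    return [
        (ts, path)
        for ts, ua, path in log_entries
        if bot_ua in ua
        and ts >= denial_date
        and path.startswith(DISALLOWED_PREFIXES)
    ]

# Two pre-parsed log entries for an invented "ExampleBot":
log = [
    (datetime(2024, 3, 20), "ExampleBot/1.0", "/ebooks/new-title.html"),
    (datetime(2024, 4, 2),  "ExampleBot/1.0", "/index.html"),
]
print(violations(log, "ExampleBot", datetime(2024, 3, 15)))
```

The first entry is a violation (disallowed path, after the denial date); the second is not, so this bot would stay blocked.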
Some visitors don't even get the first test. If they show up out of the blue, with no prior history, I darn well expect them to read robots.txt
before requesting anything else. So if their very first visit involves requests for files in roboted-out directories, they are SOL forever, more or less.
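For reference, the denial itself is just an ordinary robots.txt stanza -- bot name and directory are made up here:

```
User-agent: ExampleBot
Disallow: /ebooks/
```

A well-behaved robot that matches the User-agent line stays out of /ebooks/; the whole test above is simply watching whether it actually does.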
Disclaimer: Since I only changed my access-control system a month or two back, almost all of my baseline information is based on what happened in the old IP-based system. I haven't fully worked out what to do about brand-new robots, since my current default is to ignore any request for robots.txt that's immediately followed by a 403. (The assumption was that these are known non-compliant robots.) But each time I add an ebook to the directory, I check requests for
that specific page for a few days. In fact this is an especially handy test, because a compliant robot's behavior is to first check robots.txt and then, if permitted, request exactly one interior file. No use claiming to work from a cached copy of robots.txt if you've never fetched it in the first place.
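That per-ebook spot check has a crisp signature, which can be sketched as a predicate over one bot's ordered request paths (the helper name and path strings are mine, purely for illustration):

```python
def looks_compliant(paths):
    """paths: ordered list of paths one bot requested around a new
    page's release. Compliant pattern for this spot check:
    /robots.txt fetched first, then exactly one interior file.
    A bot that never requested robots.txt at all fails outright."""
    interior = [p for p in paths if p != "/robots.txt"]
    return (
        bool(paths)
        and paths[0] == "/robots.txt"
        and len(interior) == 1
    )

print(looks_compliant(["/robots.txt", "/ebooks/new-title.html"]))  # well-behaved
print(looks_compliant(["/ebooks/new-title.html"]))  # grabbed the file, never read robots.txt
```

The second case is the "cached version" excuse: no robots.txt request anywhere in the logs, so the claim fails on its face.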
Some even post bot info pages with rhetoric about how to stop their bot and then continue to ignore it.
Yes, I like the ones that politely explain exactly how to deny their UA in robots.txt-- when careful study of the logs reveals that neither their stated UA nor any other visitor from the same IP has ever looked at robots.txt.