Msg#: 4311585 posted 10:03 pm on May 12, 2011 (gmt 0)
#1 The Good
A while back I had a visit from an exquisitely well-behaved robot. Before anything else, it picked up robots.txt and assimilated its contents. It then went around reading all pages and following all "<a href" links, carefully omitting anything inside a Disallowed directory. It preceded each GET with a HEAD, and spaced its visits an average of 5 seconds apart. I was so gratified that I didn't even investigate its credentials. They could be the most evil people in the world; they've got a polite robot and it's welcome any time.
(Aside: It did mystify me by trying to find a batch of nonexistent index pages, but this turned out to be my fault. I'd recently added links in one directory and, er, goofed in the addresses. Thanks, robot!)
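For anyone wondering what the Good Robot's routine looks like from the robot's side, here is a minimal Python sketch of the behaviour described above: read robots.txt first, follow only the "<a href" links it permits, send a HEAD before each GET, and pause about 5 seconds between visits. The agent name ExamplePoliteBot, the helper names, and the Content-Type check after the HEAD are my own assumptions for illustration; I have no idea how the real robot is written.

```python
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExamplePoliteBot"  # hypothetical robot name
DELAY_SECONDS = 5                # average spacing mentioned in the post


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def load_rules(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def allowed_links(rules, base_url, html):
    """Return the absolute link targets that robots.txt permits."""
    collector = LinkCollector()
    collector.feed(html)
    urls = [urljoin(base_url, href) for href in collector.hrefs]
    return [url for url in urls if rules.can_fetch(USER_AGENT, url)]


def polite_get(url):
    """Send a HEAD, then a GET if the page is HTML, then pause."""
    headers = {"User-Agent": USER_AGENT}
    head = urllib.request.Request(url, method="HEAD", headers=headers)
    with urllib.request.urlopen(head) as response:
        content_type = response.headers.get("Content-Type", "")
    body = b""
    # Assumption: the HEAD is used to skip anything that is not HTML.
    if content_type.startswith("text/html"):
        get = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(get) as response:
            body = response.read()
    time.sleep(DELAY_SECONDS)
    return body
```

The filtering step works without touching the network: feed load_rules() the text of a robots.txt, pass a page's HTML to allowed_links(), and only the permitted URLs come back. polite_get() is the only part that makes requests.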
Timing wasn't ideal, though, because just a few days later I changed my mind about one directory, disallowed it in robots.txt and added "nofollow" to all its links. (Belt and suspenders principle.)
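To show why the belt and the suspenders are two separate protections, here is a small Python sketch that checks each one on its own. The directory name /olddir/, the agent name ExampleBot, and the helper names are hypothetical stand-ins, not my real setup.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one disallowed directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /olddir/
"""


class FollowableLinks(HTMLParser):
    """Collect hrefs from <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if attr.get("href") and "nofollow" not in rel_values:
            self.hrefs.append(attr["href"])


def may_visit(url, agent="ExampleBot"):
    """Return True if the robots.txt above lets this agent fetch the URL."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return rules.can_fetch(agent, url)
```

A robot that honours either check stays out: the nofollow link never makes it into FollowableLinks.hrefs, and may_visit() returns False for anything under /olddir/ even if the robot got the address some other way.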
#2 The Bad
Within hours, an unrelated robot drifted by and picked up a few random pages from the now-disallowed directory. Later still, it picked up the revised robots.txt. Ten hours later it came by again and picked up nothing but four pages in the disallowed directory.
My .htaccess file now contains this line (obfuscation done manually because my text editor's rotate-13 is broken):
(The syntax of the rule looks wrong to me, but it's the only thing I could find that works.)
#3 The Above-the-Law
Had it been any other robot, it would have gone straight into the "Deny from" list. Evidently they are outsourcing their robots.txt handling, rather than processing it on the spot like the Good Robot above. Grr.