
Sitemaps, Meta Data, and robots.txt Forum

    
the good, the bad and the above the law?
lucy24




msg:4311587
 10:03 pm on May 12, 2011 (gmt 0)

#1 The Good
A while back I had a visit from an exquisitely well-behaved robot. Before anything else, it picked up robots.txt and assimilated its contents. It then went around reading all pages and following all "<a href" links, carefully omitting anything inside a Disallowed directory. It preceded each GET with a HEAD, and spaced its visits an average of 5 seconds apart. I was so gratified that I didn't even investigate its credentials. They could be the most evil people in the world; they've got a polite robot and it's welcome any time.

(Aside: It did mystify me by trying to find a batch of nonexistent index pages, but this turned out to be my fault. I'd recently added links in one directory and, er, goofed in the addresses. Thanks, robot!)
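For anyone wondering what that kind of politeness looks like in code, here is a minimal sketch using Python's standard `urllib.robotparser`. It only builds a crawl plan from robots.txt text that has already been downloaded; it makes no network requests. The robot name `PoliteBot`, the function names, and the sample robots.txt are all made up for illustration, and the assumption that the disallowed directory sits at the site root is mine.

```python
import urllib.robotparser

def allowed_paths(robots_lines, user_agent, paths):
    """Return the paths that robots.txt permits user_agent to fetch."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    return [p for p in paths if parser.can_fetch(user_agent, p)]

def crawl_plan(robots_lines, user_agent, paths, delay_seconds=5):
    """Plan a HEAD, then a GET, then a pause for every allowed path."""
    plan = []
    for path in allowed_paths(robots_lines, user_agent, paths):
        plan.append(("HEAD", path))
        plan.append(("GET", path))
        plan.append(("WAIT", delay_seconds))
    return plan

# Hypothetical robots.txt content and page list, for illustration only.
ROBOTS = ["User-agent: *", "Disallow: /silence/"]
PAGES = ["/index.html", "/silence/page.html"]

if __name__ == "__main__":
    for step in crawl_plan(ROBOTS, "PoliteBot", PAGES):
        print(step)
```

The point of reading robots.txt before planning anything is that a disallowed page never enters the queue in the first place.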

Timing wasn't ideal, though, because just a few days later I changed my mind about one directory, disallowed it in robots.txt and added "nofollow" to all its links. (Belt and suspenders principle.)

#2 The Bad
Within hours, an unrelated robot drifted by and picked up a few random pages from the now-disallowed directory. Later still, it picked up the revised robots.txt. Ten hours later it came by again and picked up nothing but four pages in the disallowed directory.
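A rough way to spot this pattern in a log is sketched below, assuming the log has already been reduced to (time, user agent, path) tuples. The function name, the `BadBot` agent, and the sample entries are invented for the example; they are not taken from my real logs.

```python
from datetime import datetime

def violations(requests, disallowed_prefix):
    """Return (agent, path) pairs fetched under disallowed_prefix
    after the same agent had already fetched /robots.txt."""
    seen_robots = set()  # agents that have requested robots.txt so far
    found = []
    for when, agent, path in sorted(requests):  # oldest request first
        if path == "/robots.txt":
            seen_robots.add(agent)
        elif path.startswith(disallowed_prefix) and agent in seen_robots:
            found.append((agent, path))
    return found

# Made-up sample data: one fetch before robots.txt, one long after it.
LOG = [
    (datetime(2011, 1, 1, 1, 0), "BadBot", "/silence/a.html"),
    (datetime(2011, 1, 1, 2, 0), "BadBot", "/robots.txt"),
    (datetime(2011, 1, 1, 12, 0), "BadBot", "/silence/b.html"),
]

if __name__ == "__main__":
    print(violations(LOG, "/silence/"))
```

Only the fetch made after robots.txt was picked up is counted, since the earlier one could have been planned from an older copy of the file.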

My .htaccess file now contains these lines (obfuscation done manually because my text editor's rotate-13 is broken):

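# Forbid the directory to one user agent or one address block
RewriteCond %{HTTP_USER_AGENT} Tbbtyrobg [OR]
# Anchored at the start so that 172.14.x.x is not caught as well
RewriteCond %{REMOTE_ADDR} ^72\.14\.
RewriteRule silence/ - [F]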

(The syntax of the rule looks wrong to me, but it's the only thing I could find that works.)
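For what it's worth, the rule is valid mod_rewrite syntax: a "-" substitution means "leave the URL alone", and [F] returns 403 Forbidden. mod_rewrite tests the rule's pattern first and only then evaluates the conditions, which [OR] joins as alternatives. The sketch below mimics that logic in Python as a way of reasoning about it; it is not how Apache implements it. The function name is hypothetical, the user-agent string is left obfuscated as in the post, and anchoring the address pattern is an assumption about the intent.

```python
import re

PATH_PATTERN = re.compile(r"silence/")              # the RewriteRule pattern
UA_PATTERN = re.compile(r"Tbbtyrobg")               # obfuscated, as in the post
ADDR_PATTERN = re.compile(r"^72\.14\.\d+\.\d+$")    # anchored: an assumption

def status_for(user_agent, remote_addr, path):
    """Return 403 if the path matches and either condition holds, else 200."""
    if not PATH_PATTERN.search(path):
        return 200  # rule pattern does not match, conditions are never checked
    if UA_PATTERN.search(user_agent) or ADDR_PATTERN.search(remote_addr):
        return 403  # equivalent of "- [F]"
    return 200

if __name__ == "__main__":
    print(status_for("Tbbtyrobg/2.1", "10.0.0.1", "silence/page.html"))
    print(status_for("SomeBrowser", "10.0.0.1", "silence/page.html"))
```

Without the anchor, an address such as 172.14.1.1 would also match the pattern, which is probably not what anyone wants.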

#3 The Above-the-Law
Had it been any other robot, it would have gone straight into the "Deny from" list. Evidently they are outsourcing their robots.txt handling, rather than processing it on the spot like the Good Robot above. Grr.

 
