On their web page: "Yes. While we do crawl the home page of your site, we do not crawl beyond that if your robots.txt file prohibits it." How does it do that without actually fetching the robots.txt file? Unless it is using a totally different UA and IP address (which is bad practice).
blend27
10:48 pm on Jul 24, 2010 (gmt 0)
In a similar fashion, 20 stories UP, 173.203.71.246 has been causing some mess on one of my sites for the past couple of days, requesting everything from phpinfo.php and /cgi-bin/cgihelper.cgi to sending binary data in the request body. Pesky little fellow.
Dijkgraaf
12:44 am on Aug 3, 2010 (gmt 0)
Well, this one isn't that pesky, as it has only asked for the root page so far. It just visited me again with the new UA as mentioned above, and still made no request for robots.txt.
devitnow
7:46 pm on Oct 22, 2010 (gmt 0)
Hello All,
This is my bot and by reading this thread it looks like I need to get it to check the robots.txt file. I wasn't sure if I really had to since I'm only visiting the homepage.
Thanks and see you all at Pubcon! Jeff
Pfui
12:45 am on Oct 23, 2010 (gmt 0)
The standard Disallow --
User-agent: *
Disallow: /
-- means everything is off-limits, including home pages, so thanks in advance for coding your bot to read and heed robots.txt, ditto robots META tags.
FWIW:
The Robots Exclusion Standard, a.k.a. the Robots Exclusion Protocol, dates back to the mid 1990s. [en.wikipedia.org...] See also: "The Web Robots Pages" [robotstxt.org...]
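For anyone coding a bot, here is a minimal sketch of how that blanket Disallow plays out, using Python's standard urllib.robotparser module. The bot name "ExampleBot", the example.com URLs, and the is_allowed helper are all made up for illustration; this is not how any particular bot does it, just one way to check a URL against robots.txt rules before requesting it.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted above, with each directive on its own line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for illustration; a real bot would first download
    robots.txt from the target site and pass its contents in here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and example.com are placeholders, not a real bot or site.
home_allowed = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/")
deep_allowed = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/page.html")

print("home page allowed:", home_allowed)
print("inner page allowed:", deep_allowed)
```

Under this rule both checks come back as not allowed, home page included, so "only visiting the homepage" doesn't get a bot off the hook.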
Pfui
1:05 am on Dec 11, 2010 (gmt 0)
@devitnow, I'm eagerly awaiting your bot finally reading/heeding robots.txt, etc. You're up to v1.9 now, and sporting another name change --
mail.flightdeckreports.com FlightDeckReports Bot 1.9 beta (http://www.flightdeckreports.com/bot) robots.txt? NO
-- so here's hoping your very next update will respect the standard (& sites that do). Thank you!