I just don't get it. Why read robots.txt if you're not going to obey it afterwards?
Weird things do happen. It is possible the bot never received robots.txt in full, even though your log shows that it was requested. Maybe (even likely) there was a bug in its robots.txt parsing. Maybe you even made a mistake yourself by assuming that because Google obeys your robots.txt, the file is actually standards-compliant: experienced bot writers typically include detection for a number of common webmaster errors, so that they can infer the meaning of robots.txt directives rather than following the standard to the letter.
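To illustrate what "inferring meaning" can look like, here is a minimal sketch of a forgiving robots.txt reader. It is not taken from any real crawler: the function names `parse_robots_leniently` and `is_allowed`, the accepted misspellings, and the substring-based user-agent matching are all assumptions made for the example. It tolerates stray whitespace around the colon and odd capitalisation, which a strict parser might reject.

```python
def parse_robots_leniently(text):
    """Return {user_agent_token: [disallowed path prefixes]}.

    Hypothetical helper: tolerates spaces around ':', mixed case,
    comments, and a couple of common field-name misspellings.
    """
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue  # not a "field: value" line, ignore it
        field, value = line.split(":", 1)
        field = field.strip().lower()
        value = value.strip()
        if field in ("user-agent", "useragent", "user agent"):
            if not value:
                continue
            if not last_was_agent:
                current_agents = []  # a new group of rules starts here
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
            last_was_agent = True
        elif field == "disallow":
            last_was_agent = False
            if value:  # an empty Disallow means "nothing is disallowed"
                for agent in current_agents:
                    rules[agent].append(value)
        else:
            last_was_agent = False
    return rules


def is_allowed(rules, user_agent, path):
    """Check a path against the matching group, falling back to '*'."""
    ua = user_agent.lower()
    group = None
    for token, prefixes in rules.items():
        if token != "*" and token in ua:
            group = prefixes
            break
    if group is None:
        group = rules.get("*", [])
    return not any(path.startswith(prefix) for prefix in group)


rules = parse_robots_leniently("User-agent : *\nDisallow: /trap\n")
print(is_allowed(rules, "WORIO", "/trap/page.html"))
print(is_allowed(rules, "WORIO", "/index.html"))
```

A bot that parses this strictly and one that parses it leniently can disagree about the same file, which is one way a "well behaved" crawler ends up somewhere it was told not to go.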
The WWW is not a perfect system where everything functions with 100% reliability. If a bot writer made an attempt to read robots.txt, then at least they are trying. Perhaps they have a bug on their end, perhaps the problem is on yours (no one is infallible in this world). According to the link in the user-agent, this bot was made by the "University of British Columbia's Laboratory for Computational Intelligence", hardly an evil organisation. Remember that Google started in a university too, so give the good chaps a break: contact them with your problem rather than just posting somewhere they probably won't even look. The problem might actually lie in the "heritrix" codebase used by the Internet Archive, which they open sourced.
I have no relation to the bot, just felt it was worthwhile answering the question based on my fairly extensive experience in crawling the Weird World Web :)
[edited by: Lord_Majestic at 2:28 pm (utc) on June 8, 2007]
maybe
---it was a bug in robots.txt parsing
maybe
-- perhaps they got a bug on their end
maybe
-- contact them with your problem
I don't have one
---give good chaps a break---
I do, but not on a site that clearly states
User-agent : *
Disallow: /trap
When something gets into /trap, I whack the good and the bad guys on the same level, whether it's Google, Yahoo (in fact within the last 5 minutes), WORIO or any other yo-yo.
The WORIO spider showed typical scraper behaviour, according to my script that tracks them, and got blocked.
BTW, I've seen it doing so on more than one site, just simply stating the fact.
Don't get me wrong, I love all of them.
:)
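For anyone wondering what a trap check like the one described above might look like, here is a rough sketch. It is not the actual tracking script from this thread: the function name `find_trap_hits`, the `(ip, user_agent, path)` tuple layout, and the sample addresses are all made up for illustration. The idea is only that any client requesting a path under the disallowed prefix gets flagged, no matter who it claims to be.

```python
TRAP_PREFIX = "/trap"  # the path disallowed in robots.txt


def find_trap_hits(requests, trap_prefix=TRAP_PREFIX):
    """Return the set of client IPs that requested a path under trap_prefix.

    `requests` is an iterable of (ip, user_agent, path) tuples, an assumed
    format that you would fill from your own access log.
    """
    blocked = set()
    for ip, user_agent, path in requests:
        # The user agent is deliberately ignored: good and bad guys
        # are treated the same once they enter the trap.
        if path.startswith(trap_prefix):
            blocked.add(ip)
    return blocked


# Made-up sample entries using documentation-only IP ranges.
sample = [
    ("192.0.2.10", "WORIO", "/robots.txt"),
    ("192.0.2.10", "WORIO", "/trap/page1.html"),
    ("198.51.100.7", "SomeBrowser", "/index.html"),
]
print(sorted(find_trap_hits(sample)))
```

What happens to the flagged addresses afterwards (firewall rule, deny list, and so on) is left out on purpose, since that depends entirely on the server setup.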