126.96.36.199 - - [06/Jun/2006:15:12:51 -0500] "GET / HTTP/1.1" 403 229 "-" "Mozilla/4.0"
I wish they'd stop trying to be so sneaky.
You may be seeing something that isn't a spider.
Considering it hit over 15 sites of mine with the same IP range and invalid UA in the same day... I'd say it's a spider :)
... or a very bored Yahoo employee.
While these User-agents are robots, they are not crawlers or spiders; they are just link-checkers. So they may consider fetching and checking robots.txt to be a waste of time and of bandwidth, both theirs and yours.
As I stated, I'd prefer it if all automated User-agents would check and obey robots.txt, but I'm a pragmatist and a realist; for a link-checker, I'll concede that it's a waste of time. Ditto WAP proxies and "page accelerators" -- they are not crawlers/spiders or robots and are not human-independent agents, so they shouldn't be expected to check robots.txt.
My purpose here is simply to inject a little clarity into the subject, since it's common to use the terms robots, crawlers and spiders interchangeably. I appreciate that others may have different opinions on whether these things should fetch and obey robots.txt, but I'd just like to clarify the terminology, even though the result is that I end up saying that not all robots should be expected to check robots.txt -- an apparent contradiction in terms -- and that only those in the Web crawler/spider class of robots really must do so.
However, I do think that all User-agents should identify themselves clearly, and that automated User-agents of any type should provide a link to an info page so we can find out what they are, and what they are doing on our sites, just out of general courtesy.
On the self-identification aspect, see also [webmasterworld.com...]
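To make the two courtesies above concrete (a self-identifying User-agent with a link to an info page, and a robots.txt check before fetching), here is a minimal Python sketch using the standard library's urllib.robotparser. The bot name, the info-page URL, and the sample robots.txt are all hypothetical and are not taken from any real agent discussed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and info page (assumptions, not a real agent).
# The "+URL" part is where a webmaster can read what the bot is and does.
USER_AGENT = "ExampleLinkBot/1.0 (+http://www.example.com/bot-info.html)"

# Hypothetical robots.txt content, inlined so the sketch needs no network.
SAMPLE_ROBOTS = """\
User-agent: ExampleLinkBot
Disallow: /private/

User-agent: *
Disallow:
"""

def allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/page.html"):
        print(url, allowed(SAMPLE_ROBOTS, url))
```

A real agent would download /robots.txt from the target host and send USER_AGENT in its request headers; both steps are left out here to keep the sketch self-contained.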
Your informational and eloquent post is always appreciated.
For a link-checker, I'll concede that it's a waste of time.
Linksmanager tries to crawl my whole site every now and then. Nothing wrong with that, is there? ;)
I quite disagree, as I have many link checkers running against my site, which has IBLs to thousands of pages, and I blocked them all. Enough was enough. If they want to know whether the pages exist, I can accommodate them with a single file that tells them everything they need to know, but they aren't interested.
I'm not even sure they are just link checkers; some are for sure, some might not be. But I get hit daily by a bunch of these: Linksmanager, LinkWalker, LinksManager Details Fetcher, Link Validity Check, FindLinks, VERI-LINK, W3C-checklink, and on and on.
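For anyone who wants to spot the agents named above in their logs, here is a rough Python sketch that does a case-insensitive substring match on the User-agent string. It assumes each agent's real User-agent header contains the name as written in the post, which is not verified here; the function name is made up for illustration.

```python
# Names taken from the list in the post above, lowercased for matching.
# Assumption: the real User-agent strings contain these names verbatim.
LINK_CHECKER_NAMES = (
    "linksmanager",        # also covers "LinksManager Details Fetcher"
    "linkwalker",
    "link validity check",
    "findlinks",
    "veri-link",
    "w3c-checklink",
)

def is_listed_link_checker(user_agent: str) -> bool:
    """Return True if user_agent contains one of the listed names."""
    ua = user_agent.lower()
    return any(name in ua for name in LINK_CHECKER_NAMES)

if __name__ == "__main__":
    for ua in ("W3C-checklink/4.2", "LinkWalker", "Mozilla/4.0"):
        print(ua, is_listed_link_checker(ua))
```

Note that a bare "Mozilla/4.0", like the one in the log line at the top of the thread, does not match; catching agents that hide their identity needs a different rule than name matching.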
That was the point of my post... If it's simply checking a list of links, and has no "discovery" crawling phase, then it's a robot, but not a crawler or spider.
Taking your definition, I'd disagree with myself as well, but I was trying to clarify an accurate definition of "robots" as having two sub-classes: robots/crawlers, which have a discovery function, versus link checkers, which do not. We use a lot of too-loose terminology, and doing so often takes discussions off on non-productive tangents.
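The distinction drawn above can be sketched in a few lines of Python: a link checker visits only the URLs it was handed, while a crawler also follows the links it finds in each page (the "discovery" phase). This is only an illustration under stated assumptions. The function names and the in-memory SITE dictionary are hypothetical, and SITE stands in for real HTTP fetches.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(urls, fetch):
    """Link checker: test only the given URLs; no discovery."""
    return {url: fetch(url) is not None for url in urls}

def crawl(start_url, fetch, limit=100):
    """Crawler/spider: follow links found in fetched pages (discovery)."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        if body is None:
            continue
        extractor = LinkExtractor()
        extractor.feed(body)
        # Resolve relative hrefs against the page they were found on.
        queue.extend(urljoin(url, href) for href in extractor.links)
    return seen

# Hypothetical in-memory "site"; fetch is just a dictionary lookup.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": "no links here",
}

if __name__ == "__main__":
    print(check_links(["http://example.com/"], SITE.get))
    print(sorted(crawl("http://example.com/", SITE.get)))
```

Given the same starting URL, check_links touches one page, while crawl ends up visiting all three. That difference in reach is the practical reason the thread expects crawlers, more than link checkers, to consult robots.txt.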