Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Yahoo? Overture?

 1:38 am on Jun 6, 2006 (gmt 0)
UA: mozilla/4.0

Belongs to Overture Services.
Range: -

No robots.txt. No images. No CSS.
Visiting sites that don't advertise on Overture/Yahoo.



 8:12 pm on Jun 6, 2006 (gmt 0)

Yep, seeing it here too. Since it's not using a correct UA, all it's getting is 403s:

- - [06/Jun/2006:15:12:51 -0500] "GET / HTTP/1.1" 403 229 "-" "Mozilla/4.0"
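For illustration, here is a minimal sketch of the kind of check that could produce those 403s. It assumes the rule is simply "reject a User-Agent that is nothing but a Mozilla version token". The function name and the regular expression are hypothetical and are not the poster's actual server configuration.

```python
import re

# Hypothetical rule: a UA that consists only of "Mozilla/<version>", with no
# parenthesised platform or browser comment, is treated as invalid.
BARE_MOZILLA = re.compile(r"^mozilla/\d+\.\d+$", re.IGNORECASE)


def status_for_user_agent(user_agent: str) -> int:
    """Return 403 for an empty, "-", or bare-Mozilla UA; otherwise 200."""
    ua = user_agent.strip()
    if not ua or ua == "-" or BARE_MOZILLA.match(ua):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for_user_agent("Mozilla/4.0"))
    print(status_for_user_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

A real server would more likely express this as a rewrite or access rule; the Python form is only meant to make the logic explicit.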

I wish they'd stop trying to be so sneaky.


 12:23 am on Jun 8, 2006 (gmt 0)

I wish they'd stop trying to be so sneaky.

Not sure they're being sneaky exactly, as they offer many services and you may be seeing something that isn't a spider. It would be nice, though, if Yahoo and Google would at least append to the UA whatever service was issuing the request.


 1:10 am on Jun 8, 2006 (gmt 0)

you may be seeing something that isn't a spider.

Considering it hit over 15 sites of mine with the same IP range and invalid UA in the same day... I'd say it's a spider :)

... or a very bored Yahoo employee.


 1:13 am on Jun 8, 2006 (gmt 0)

Yeah, half a dozen sites, 1200 requests, one 24 hour period. Walks like a duck...
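The "walks like a duck" test above amounts to counting requests per IP per day for one User-Agent. A rough sketch follows, assuming access logs in Combined Log Format; the function names and the threshold parameter are illustrative assumptions, not anything the posters described using.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [day:time zone] "request" status bytes
# "referer" "user-agent". Only ip, day and user-agent are captured here.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def requests_per_day(lines, user_agent):
    """Count requests per (ip, day) for one exact User-Agent string."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("ua") == user_agent:
            counts[(m.group("ip"), m.group("day"))] += 1
    return counts


def heavy_hitters(lines, user_agent, threshold):
    """Return the (ip, day) pairs whose request count reaches the threshold."""
    counts = requests_per_day(lines, user_agent)
    return {key: n for key, n in counts.items() if n >= threshold}
```

Feeding it the lines of one day's log with `user_agent="Mozilla/4.0"` and a threshold of a few hundred would flag the sort of volume described in this thread; where to set the threshold is a judgment call.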


 8:06 pm on Jun 9, 2006 (gmt 0)

Let's be clear on spiders/crawlers versus robots: While I personally would prefer it if all of them would check/respect robots.txt, the only ones that really need to do so are the spiders/crawlers. The robots that are working from a fixed pre-existing list of URLs may or may not feel the need to check robots.txt because, for example, they are checking a single URL that is listed in their directory, or that you have provided for a PPC landing page or some other purpose.

While these User-agents are robots, they are not crawlers or spiders; they are just link-checkers. So they may consider fetching/checking robots.txt to be a waste of time and their/your bandwidth.

As I stated, I'd prefer it if all automated User-agents would check and obey robots.txt, but I'm a pragmatist and a realist. For a link-checker, I'll concede that it's a waste of time. Ditto WAP proxies and "page accelerators" -- they are not crawlers/spiders or robots, and they are not human-independent agents, so they shouldn't be expected to check robots.txt.
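To make the distinction concrete, here is a small sketch using Python's standard `urllib.robotparser`. The `discovers_links` flag and both function names are hypothetical; the second function simply encodes the pragmatic view described above (only agents with a discovery phase are required to consult robots.txt) and is not a rule from any specification.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def may_fetch(robots_txt: str, agent: str, url: str, discovers_links: bool) -> bool:
    """Crawlers/spiders (discovery phase) obey robots.txt.

    An agent working from a fixed list of URLs (a plain link-checker) is
    let through without consulting robots.txt, per the view in this post.
    """
    if not discovers_links:
        return True
    return allowed_by_robots(robots_txt, agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(may_fetch(rules, "ExampleBot", "http://example.com/private/x.html", True))
    print(may_fetch(rules, "ExampleBot", "http://example.com/private/x.html", False))
```

`ExampleBot` and the example.com URLs are placeholders.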

My purpose here is simply to inject a little clarity into the subject, since it's common to use the terms robots, crawlers and spiders interchangeably. I appreciate that others may have different opinions on whether these things should fetch and obey robots.txt, but I'd just like to clarify the terminology. The result is that I end up saying that not all robots should be expected to check robots.txt -- an apparent contradiction in terms -- and that only those in the Web crawler/spider class of robots really must do so.

However, I do think that all User-agents should identify themselves clearly, and that automated User-agents of any type should provide a link to an info page so we can find out what they are, and what they are doing on our sites, just out of general courtesy.

On the self-identification aspect, see also [webmasterworld.com...]



 8:17 pm on Jun 9, 2006 (gmt 0)


Your informative and eloquent posts are always appreciated.



 12:57 am on Jun 10, 2006 (gmt 0)

For a link-checker, I'll concede that it's a waste of time.

Linksmanager tries to crawl my whole site every now and then, nothing wrong with that is there? ;)

I quite disagree, as I have many link checkers running against my site, with IBLs to thousands of pages, and I blocked them all. Enough was enough. If they want to know whether the pages exist, I can accommodate them with a single file that tells them everything they need to know, but they aren't interested.

I'm not even sure they are all just link checkers; some are for sure, some might not be. But I get hit daily by a bunch of these: Linksmanager, LinkWalker, LinksManager Details Fetcher, Link Validity Check, FindLinks, VERI-LINK, W3C-checklink, and on and on.
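As a rough sketch, blocking the agents named above could come down to a case-insensitive substring match on the User-Agent header. The token list is taken from the names in this post, but the exact UA strings those tools send are not given here, so substring matching is an assumption, and the function name is hypothetical.

```python
# Lower-cased tokens derived from the names listed in the post.
# "linksmanager" also covers "LinksManager Details Fetcher".
LINK_CHECKER_TOKENS = (
    "linksmanager",
    "linkwalker",
    "link validity check",
    "findlinks",
    "veri-link",
    "w3c-checklink",
)


def is_listed_link_checker(user_agent: str) -> bool:
    """True if the UA contains any of the listed names, ignoring case."""
    ua = user_agent.lower()
    return any(token in ua for token in LINK_CHECKER_TOKENS)


if __name__ == "__main__":
    for ua in ("LinkWalker", "VERI-LINK", "Mozilla/4.0"):
        print(ua, is_listed_link_checker(ua))
```

Matching on names only catches agents that identify themselves honestly, which is the wider complaint in this thread.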


 1:18 am on Jun 10, 2006 (gmt 0)

If it crawls a site, it's not a simple link-checker, it's a crawler, regardless of what it calls itself. By "link-checker," I mean an agent that goes through a directory (or a Web page) and checks the links, examples being DMOZ's directory-checker and Xenu's Link Sleuth.

That was the point of my post... If it's simply checking a list of links, and has no "discovery" crawling phase, then it's a robot, but not a crawler or spider.

Taking your definition, I'd disagree with myself as well, but I was trying to clarify an accurate definition of "robots" as having two sub-classes: spiders/crawlers, which have a discovery function, versus link checkers, which do not. We use a lot of too-loose terminology, and doing so often takes discussions off on non-productive tangents.



 12:54 pm on Jun 11, 2006 (gmt 0)

Anybody have a clue when Overture began encrypting referrers?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved