Yahoo? Overture?

mozilla/4.0

fiestagirl

1:38 am on Jun 6, 2006 (gmt 0)

66.228.173.141-154
UA: mozilla/4.0
dp134.data.yahoo.com

Belongs to Overture Services.
Range: 66.228.160.0 - 66.228.191.255

No robots.txt. No images. No CSS.
Visiting sites that don't advertise on Overture/Yahoo.
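For anyone who wants to test their own log entries against the range quoted above, here is a minimal Python sketch. The helper name `in_overture_range` is made up for illustration, and the endpoints are simply the ones given in this post; whether they match the current allocation is an assumption.

```python
import ipaddress

# Range endpoints exactly as quoted in the post above.
RANGE_START = ipaddress.IPv4Address("66.228.160.0")
RANGE_END = ipaddress.IPv4Address("66.228.191.255")

# Collapse the start/end pair into one or more CIDR blocks.
OVERTURE_NETWORKS = list(
    ipaddress.summarize_address_range(RANGE_START, RANGE_END)
)


def in_overture_range(ip: str) -> bool:
    """Return True if ip falls inside the quoted range (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in OVERTURE_NETWORKS)
```

The quoted range collapses to the single block 66.228.160.0/19, so one CIDR rule should cover it in a firewall or server config as well.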

bobothecat

8:12 pm on Jun 6, 2006 (gmt 0)

Yep, seeing it here too. Since it's not using a correct UA, all it's getting is 403's.

66.228.173.147 - - [06/Jun/2006:15:12:51 -0500] "GET / HTTP/1.1" 403 229 "-" "Mozilla/4.0"

I wish they'd stop trying to be so sneaky.
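A rough sketch of how a log entry like the one above could be picked out automatically, assuming the Apache "combined" log format. The field names and the helper functions are illustrative only, not taken from any particular log tool.

```python
import re

# Apache "combined" log format; the group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# A UA consisting of nothing but the bare "Mozilla/x.y" product token.
BARE_UA = re.compile(r"^mozilla/\d\.\d$", re.IGNORECASE)


def parse_line(line: str):
    """Return the log fields as a dict, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


def is_bare_mozilla(ua: str) -> bool:
    """True when the UA is only 'Mozilla/4.0' or similar, with no details."""
    return bool(BARE_UA.match(ua.strip()))


entry = parse_line(
    '66.228.173.147 - - [06/Jun/2006:15:12:51 -0500] '
    '"GET / HTTP/1.1" 403 229 "-" "Mozilla/4.0"'
)
```

Real browsers send far more than the bare token, so a match on `is_bare_mozilla` is a reasonable (though not conclusive) hint that the request is automated.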

incrediBILL

12:23 am on Jun 8, 2006 (gmt 0)

I wish they'd stop trying to be so sneaky.

Not sure they're being sneaky exactly, as they offer many services and you may be seeing something that isn't a spider. It would be nice, though, if Yahoo and Google would at least append to the UA the name of whatever service was issuing the request.

bobothecat

1:10 am on Jun 8, 2006 (gmt 0)

you may be seeing something that isn't a spider.

Considering it hit over 15 sites of mine with the same IP range and invalid UA in the same day... I'd say it's a spider :)

... or a very bored Yahoo employee.

fiestagirl

1:13 am on Jun 8, 2006 (gmt 0)

Yeah, half a dozen sites, 1200 requests, one 24 hour period. Walks like a duck...

jdMorgan

8:06 pm on Jun 9, 2006 (gmt 0)

Let's be clear on spiders/crawlers versus robots: While I personally would prefer it if all of them would check/respect robots.txt, the only ones that really need to do so are the spiders/crawlers. The robots that are working from a fixed pre-existing list of URLs may or may not feel the need to check robots.txt because, for example, they are checking a single URL that is listed in their directory, or that you have provided for a PPC landing page or some other purpose.

While these User-agents are robots, they are not crawlers or spiders, they are just link-checkers. So they may consider fetching/checking robots.txt to be a waste of time and their/your bandwidth.

As I stated, I'd prefer it if all automated User-agents would check and obey robots.txt, but I'm a pragmatist and a realist. For a link-checker, I'll concede that it's a waste of time. Ditto WAP proxies and "page accelerators" -- they are not crawlers/spiders or robots and are not human-independent agents, so they shouldn't be expected to check robots.txt.

My purpose here is simply to inject a little clarity into the subject, since it's common to use the terms robots, crawlers and spiders interchangeably. I appreciate that others may have different opinions on whether these things should fetch and obey robots.txt, but I'd just like to clarify the terminology, despite the fact that the result is that I end up saying that all robots should not be expected to check robots.txt -- an apparent contradiction in terms -- and that only those in the Web crawler/spider class of robots really must do so.
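For the crawler/spider class that really must check robots.txt, the check itself is cheap. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` name and the `allowed` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether this agent may fetch this URL."""
    return parser.can_fetch(user_agent, url)
```

A well-behaved crawler would call something like `allowed("ExampleBot", url)` before each fetch; a link-checker working from a fixed list may reasonably skip it, per the argument above.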

However, I do think that all User-agents should identify themselves clearly, and that automated User-agents of any type should provide a link to an info page so we can find out what they are, and what they are doing on our sites, just out of general courtesy.

On the self-identification aspect, see also [webmasterworld.com...]
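On the receiving end, a quick way to see whether a User-agent bothers to provide an info link is to look for a URL in the UA string. A sketch only: the `ExampleBot` string in the usage note is hypothetical, and the "+http://..." form is just a common convention, not a standard.

```python
import re

# Matches an http(s) URL anywhere in the User-Agent string, with or
# without the leading "+" that many bots put in front of it.
INFO_URL = re.compile(r"\+?(https?://[^\s;)]+)", re.IGNORECASE)


def info_url(user_agent: str):
    """Return the first URL embedded in the UA string, or None."""
    m = INFO_URL.search(user_agent)
    return m.group(1) if m else None
```

With this, `info_url("ExampleBot/1.0 (+http://example.com/bot.html)")` yields the info page address, while the bare `Mozilla/4.0` from this thread yields `None`.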

Jim

bobothecat

8:17 pm on Jun 9, 2006 (gmt 0)

Jim,

Your informational and eloquent post is always appreciated.

Peter

incrediBILL

12:57 am on Jun 10, 2006 (gmt 0)

For a link-checker, I'll concede that it's a waste of time.

Linksmanager tries to crawl my whole site every now and then, nothing wrong with that is there? ;)

I quite disagree, as I have many link checkers running against my site with IBL links to thousands of pages, and I blocked them all. Enough was enough. If they want to know whether the pages exist, I can accommodate them with a single file that tells them everything they need to know, but they aren't interested.

I'm not even sure if they are just link checkers, some are for sure, some might not be, but I get hit daily by a bunch of these: Linksmanager, LinkWalker, LinksManager Details Fetcher, Link Validity Check, FindLinks, VERI-LINK, W3C-checklink, and on and on.
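Blocking a list like that usually comes down to a case-insensitive substring match on the User-agent. A minimal sketch, assuming the names above appear somewhere in the UA strings those tools actually send (that part is an assumption); `is_blocked` is a hypothetical helper.

```python
# Lowercased tokens taken from the names listed in the post above.
# "linksmanager" also covers "LinksManager Details Fetcher".
BLOCKED_TOKENS = (
    "linksmanager",
    "linkwalker",
    "link validity check",
    "findlinks",
    "veri-link",
    "w3c-checklink",
)


def is_blocked(user_agent: str) -> bool:
    """True if any blocked token appears in the UA, ignoring case."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)
```

The same token list translates directly into server-side rules; the Python version is just handy for testing it against old logs first.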

jdMorgan

1:18 am on Jun 10, 2006 (gmt 0)

If it crawls a site, it's not a simple link-checker, it's a crawler, regardless of what it calls itself. By "link-checker," I mean an agent that goes through a directory (or a Web page) and checks the links, examples being DMOZ's directory-checker and Xenu Link Sleuth.

That was the point of my post... If it's simply checking a list of links, and has no "discovery" crawling phase, then it's a robot, but not a crawler or spider.

Taking your definition, I'd disagree with myself as well, but I was trying to clarify an accurate definition of "robots" as having two sub-classes: crawlers/spiders, which have a discovery function, versus link checkers, which do not. We use a lot of too-loose terminology, and doing so often takes discussions off on non-productive tangents.
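The distinction can be shown in a few lines of Python. This is a toy sketch against a made-up in-memory "site", not a description of how any real checker or crawler is built; `SITE`, `fetch`, `check_links` and `crawl` are all hypothetical names.

```python
import re
from collections import deque

# Toy in-memory "site": URL -> HTML body. Entirely made up.
SITE = {
    "/": '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/c">c</a>',
    "/b": "",
    "/c": "",
}

HREF = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Return the page body, or None if the URL does not exist."""
    return SITE.get(url)


def check_links(urls):
    """Link-checker: visits only the URLs it was handed, nothing more."""
    return {url: fetch(url) is not None for url in urls}


def crawl(start):
    """Crawler: follows every link it discovers, breadth-first."""
    seen, queue = {start}, deque([start])
    while queue:
        body = fetch(queue.popleft())
        for link in HREF.findall(body or ""):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

`check_links` never requests a URL that was not on its input list, while `crawl("/")` ends up visiting pages it was never told about. That discovery phase is what makes something a crawler, whatever it calls itself.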

Jim

wilderness

12:54 pm on Jun 11, 2006 (gmt 0)

Anybody have a clue when Overture began encrypting referrers?
TIA
