Forum Moderators: DixonJones

Message Too Old, No Replies

Looksmart ignoring robots.txt

is a "Dead Link Checker" an excuse?

         

WebJoe

9:57 am on Jan 2, 2004 (gmt 0)

10+ Year Member



In the last couple of days, LookSmart's ZyBorg bot has fallen into my bad-bot trap.
The user agent given is

Mozilla/4.0 compatible ZyBorg/1.0 Dead Link Checker (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)

and the IP logged confirms that this is real. I can imagine someone arguing that, as a dead link checker, it has to follow all links. But there is a reason why I don't want bots in certain places!

Am I the only one noticing this, or just the only one it bugs?

What also confuses me is this: I'm serving the trap page correctly with a 200 and everything, but they keep coming back (almost every day for the last 10 days). How much more alive do they want this link? I haven't checked, but if they grab all my pages on that basis (daily), I'll have to think about banning them; I can't handle that much traffic just because LookSmart wants to look smart...
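For readers unfamiliar with the technique: a bad-bot trap like the one described typically pairs a robots.txt Disallow with a page that compliant crawlers will therefore never request. A minimal sketch, with a hypothetical path (not WebJoe's actual setup):

```text
# robots.txt - every robot honouring the Robots Exclusion standard
# must skip this directory entirely
User-agent: *
Disallow: /trap/
```

Any client that fetches robots.txt and then requests something under /trap/ anyway has, by definition, ignored the exclusion rule, which is exactly what the logs in this thread show.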

Opinions/comments welcome.

pendanticist

4:36 pm on Jan 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I got news for 'ya. DLC has been about as stupid this past year as it's ever been. ;)

With a high degree of regularity, I see DLC ask for the same dead link it asked for the last time. And, as with every other time (and there have been many), it asks for that same dead link again.

The thing about those dead links is that I put up a series of 301 redirects to resolve them (restructured site, pathways changed) well over a year ago. To date, LS is the only one NOT to follow its own logic.
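A restructuring redirect like the one described is commonly done in Apache with a Redirect directive or mod_rewrite; the paths below are hypothetical placeholders, not the actual site's:

```apache
# .htaccess - permanently (301) redirect old paths to the new structure,
# so compliant clients update their records and stop asking for the old URL
Redirect permanent /oldsection/page.html /newsection/page.html

# or, pattern-based, for a whole relocated directory:
RewriteEngine On
RewriteRule ^oldsection/(.*)$ /newsection/$1 [R=301,L]
```

A 301 explicitly tells the requester the move is permanent, which is why a client still requesting the old URL a year later is failing to follow its own logic.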

WebJoe

7:42 pm on Jan 3, 2004 (gmt 0)

10+ Year Member



Thanks, pendanticist, for that answer; I kinda expected you to be the first to reply to my post...

The thing is, though, that it is NOT a dead link, but a link that is off limits per robots.txt, and this is the first time in almost two years that it's been requested by a bot belonging to a known SE.

But from your statement I take it that this is not unusual for this bot, and given the time frame you mentioned, there is no hope that it will change to better, or at least acceptable, behaviour.

Thanks, and happy new year

pendanticist

4:16 am on Jan 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You're welcome, WebJoe. :)

By rights, it's important to communicate with the bot owner in the hope that they'll modify it. Sometimes they do, and sometimes they don't even honour you with a reply.

I know 'stupid' isn't the proper way to describe the bot, but if I start talking about the designers I'd get real upset and since bots are sorta inanimate...Stupid is as Stupid does. <g>

I can only tolerate this 'hostage' activity for a certain time and then the claws are gonna come out...

If I brought out the claws now, I'd risk losing the traffic. Therefore, I am a hostage who is forced to simply ignore it.

Visi

4:51 am on Jan 4, 2004 (gmt 0)

10+ Year Member



Is this the Grub client? From their website: the Grub client provides dead-link checking for WiseNut and LookSmart.

Might want to try banning this agent?
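For anyone who does decide to ban it: a user-agent block in Apache could look like the sketch below. The pattern is an assumption based on the UA string quoted earlier in this thread, and as noted later, banning also forfeits any traffic that engine might send, so test before deploying:

```apache
# .htaccess - return 403 Forbidden to any request whose
# User-Agent header contains "ZyBorg" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteRule .* - [F,L]
```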

Stefan

5:10 am on Jan 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's this sucker:

2004-01-02 07:31:42 216.88.158.142 GET /page.htm 200 0 174 www.site.org Mozilla/4.0+compatible+ZyBorg/1.0+Dead+Link+Checker+(wn.zyborg@looksmart.net;+http://www.WISEnutbot.com) -

It's been all over my site recently.

WebJoe

11:47 am on Jan 4, 2004 (gmt 0)

10+ Year Member



@pendanticist: I did send an email (Sat, Jan 3, 2004, 20:52 GMT+1) to the address provided; let's see if and what they reply... I'll post the results here.

@Visi: No, it's not Grub; I've had problems with that one too. At least I don't think so, judging by the UA string I posted and Stefan confirmed. If I banned it (through .htaccess or ISAPI_Rewrite, respectively), I'd run the risk pendanticist mentioned: noticeably losing traffic.

Stefan

3:20 pm on Jan 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



let's see if and what they reply...I'll post the results here

Good stuff, WebJoe. I'll be very interested in what they have to say. I really don't need 160 pages crawled every day to check for non-existent dead links (which is none of their business anyway).

WebJoe

9:22 pm on Jan 8, 2004 (gmt 0)

10+ Year Member



Got a reply from looksmart:
[...]
We do support the Robot Exclusion protocol, and aim to refresh our robot rule data for each host on a weekly basis. This dead link checker only runs against URLs that are in the WiseNut index, so if it is violating your /robots.txt file, the crawler that collects the information for indexing may be also.
[...]

I provided them with log extracts and a copy of my robots.txt to refute all of the above: in my case I wasn't able to find a single page of my entire site in WiseNut, and three days ago the bot fetched robots.txt and then, 4:45 hours later, went for the trap (in a disallowed directory).

Curious what the next reply will be.

jdMorgan

11:28 pm on Jan 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WebJoe,

What they are trying to tell you is that DLC is not a robot; it does not crawl. It is a list checker: it checks URLs they already have in their list to see whether each URL is still good.

Generally, this means that at the time they generated the list, the file was not excluded in your robots.txt, or that your robots.txt had syntax errors at that time, or that their crawler had a bug at that time.

They ought to fix this, of course, but the main problem is the one pendanticist refers to: you can feed them a 301, a 404, or a 410 for a long, long time before they finally take your word for it that the resource is moved, not found, or gone.

They need to respond to 301, 302, 403, 404, and 410 more quickly and correctly, and they need to quit doing GETs and start doing conditional GETs or HEADs. But they don't need to check robots.txt for what they are doing.
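To illustrate that point, here is a minimal sketch in Python (hypothetical names throughout; not LookSmart's actual code) of how a polite link checker could issue HEAD requests and act on the status codes listed above:

```python
from urllib.request import Request

def next_action(status):
    """Map an HTTP status code to a list-maintenance action."""
    if status == 301:
        return "update-url"        # permanent move: record the new URL
    if status in (404, 410):
        return "drop"              # not found / gone: remove from the list
    if status == 403:
        return "stop-requesting"   # forbidden: quit asking
    if status in (200, 304):
        return "keep"              # alive (304 = unchanged since last check)
    return "retry-later"           # 302 and anything else: check again later

def build_check(url, last_checked=None):
    """Build a HEAD request; a conditional header avoids full transfers."""
    headers = {"User-Agent": "ExampleLinkChecker/0.1"}  # hypothetical UA
    if last_checked:
        headers["If-Modified-Since"] = last_checked
    return Request(url, headers=headers, method="HEAD")
```

A HEAD (or a conditional GET answered with 304) transfers headers only, so the daily re-checks described in this thread would cost a few hundred bytes each instead of a full page fetch.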

My main point here is to clarify that the Dead Link Checker is not a robot as defined by the Standard for Robots Exclusion, and so should not be expected to fetch and obey robots.txt. This is not an excuse; it is a statement that, if necessary, you should take steps that do not rely on the user agent's voluntary cooperation, as robots.txt does.

Jim
<edit> speling </edit>

WebJoe

11:07 am on Jan 10, 2004 (gmt 0)

10+ Year Member



Thanks, jdMorgan, for that clarification. As I pointed out to my contact at LookSmart, the one page in question has always been disallowed by robots.txt, and my robots.txt validated at all times. So the list-generating bot either ignored robots.txt or had a bug interpreting it.
(Plus, I was not able to find any pages of the site in question in LookSmart or WiseNut.)

That doesn't matter anymore, as they (LookSmart) agreed that it is a bug in the software and promised to fix it and inform me when they have.

Problem solved - as soon as the patch is applied.

WebJoe

7:24 pm on Jan 14, 2004 (gmt 0)

10+ Year Member



Just got a note from LookSmart saying they fixed the bug, but it's gonna take some time until it filters through the system.