
ZyBorg/1.0 violates robots.txt

came from 216.88.158.142

         

jazzguy

8:50 pm on Jul 27, 2003 (gmt 0)

10+ Year Member



A bot claiming to be ZyBorg/1.0 disobeyed my robots.txt file, and got itself automatically banned (thanks to the Perl scripts posted in the forums). The disallowed file has been in the robots.txt file for over a month and ZyBorg has fetched robots.txt many times since then.
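For reference, the relevant part of my robots.txt looks roughly like this (the path is just a placeholder here, not my actual disallowed file):

User-agent: *
Disallow: /example-private-dir/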

Is 216.88.158.142 a valid IP for the Zyborg bot, or is somebody spoofing Zyborg's U-A? 216.88.158.142 is assigned to:

OrgName: SAVVIS Communications Corporation
OrgID: SAVV
Address: 1 SAVVIS Parkway
City: Town and Country
StateProv: MO
PostalCode: 63017
Country: US
NetRange: 216.88.0.0 - 216.91.255.255
CIDR: 216.88.0.0/14

There is no reverse DNS configured for 216.88.158.142. The complete U-A of the bot was:
"Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http: //www.WISEnutbot.com)" (I added the space in the URL to prevent linking).

If this is the real ZyBorg bot, I should lift the ban on the IP, right? Isn't Looksmart a desired search engine? If they violate robots.txt, it's going to be a real pain to put in mod_rewrite rules to keep them out of disallowed areas.

GaryK

4:30 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim, I know we're discussing two different user agents. However, at least in my case, the DLC ua is checking files that have always been disallowed in robots.txt. To me that suggests at one point the spider did crawl those files. There is obvious evidence in my logs that the spider has visited, and continues to visit, disallowed files.

jeremy goodrich

5:03 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is no reason for any bot - even a 'dead link checking bot' - to ignore a robots.txt file. Period.

Doesn't matter if they got the file before or not, they still need to check that it's OK.

After all, it's the traffic that matters to a site, and the content that matters to the search engine. If you don't get the traffic to justify the bandwidth expense of letting a spider in, then by all means, 403 it to someplace else.

warhol

6:20 pm on Sep 19, 2003 (gmt 0)

10+ Year Member



Just another hit'n run for me?

216.88.158.142 - - [18/Sep/2003:16:12:02 -0700] "GET /robots.txt HTTP/1.1" 200 67 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; ht tp://www.WISEnutbot.com)"

Could an XHTML 1.0 Transitional doctype & layout be causing problems with some spiders? It's W3C-valid.

jdMorgan

6:34 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GaryK,

Assuming that Zyborg DLC is a link checker whose job is to check links currently in the index, and that it therefore does not look at robots.txt, then the bug that started this thread is the likely cause of your problem.

The Zyborg robot that had the problem grabbed your disallowed pages, and they ended up in the index. Then, DLC comes along and tries to verify the index, so it accesses those disallowed pages again.

I think their implementation is a bit weak, in that DLC should either read and obey robots.txt, or it should do HEAD requests instead of GETs if they want to classify it as a link-checker only and not as a robot.

However, I still would not apply a blanket user-agent ban to a well-known company's robot. I'd use .htaccess or ISAPI filters or something similar to block the specific problem pages until they get the problem sorted out.

But I haven't had any problems with Zyborg. Maybe those here who have had a problem should write it up, attach a short log file sample, and e-mail it to Looksmart as a problem report. I had good luck recently with another company - I actually got a reply from someone 'famous' at director level thanking me for the info. So, some of them do listen. I wasn't very happy with what happened because of this bug, but decided to help them out in return for the clicks they'd sent me over the past years.

Jim
<edited for typo>

[edited by: jdMorgan at 8:01 pm (utc) on Sep. 19, 2003]

GaryK

6:50 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



However, I still would not apply a blanket user-agent ban to a well-known company's robot. I'd use .htaccess or ISAPI filters or something similar to block the specific problem pages until they get the problem sorted out.

In your opinion, how long should it take to resolve the problem of the spider accessing disallowed files/folders before it gets banned? This thread started on July 27; nearly two months have passed, the company is clearly aware of the problem, and yet it persists to this very day. IMO if they were serious about fixing this bug it could have been done overnight. But if you feel differently about it I'll consider rethinking my position on the subject, because you've earned my respect.

jazzguy

7:02 pm on Sep 19, 2003 (gmt 0)

10+ Year Member



Maybe those here who have had a problem should write it up, attach a short log file sample and e-mail it to Looksmart as a problem report.

I tried that way back around the time this thread started. Sent it to the email address they supplied in their user-agent string. They did not reply.

jdMorgan

7:57 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, we've been over this ground before, and it's been touched on again in a post above.

It's a fact that people make mistakes, and robots misbehave because of it. It's up to you to decide if you want to ban a robot because it is misbehaving, implement a work-around allowing time for them to fix it, or ignore the problem completely. As Jeremy noted above, it is up to the individual webmaster to make this decision, based on his or her site, the traffic it gets from the search engine in question, and the complexity of the problem.

My personal opinion - which should be taken with a grain of salt - is that in this case, a work-around is preferable. You just never know what will happen next month in this business, or who will be supplying search results to whom. Banning a 'brand-name' robot when more specific measures are possible is just not a good idea -- again, IMHO.

As to how long it takes to fix a problem, it depends. Maybe it's a simple coding problem, but then what about testing? A search engine's index is its product. I would not expect them to release a new robot version with the potential to destroy their index without a good long test and evaluation. Since it often takes months to get a site spidered and listed, I'd take that investment into consideration when deciding what your pre-ban time limit will be.

Perhaps a fix is on the way: claus has spotted two new Zyborg variants [webmasterworld.com], and I found a new one myself today. We think they're new, anyway.

When I have pages that I don't want listed in the SERPs, I add them to robots.txt. But I also add them to rules in my .htaccess file, as insurance against the kind of problem described here. If the robot obeys robots.txt, it never sees any effect from .htaccess. But if it has a problem, the .htaccess code will deliver a 403-Forbidden response on a per-resource basis. I do this for the well-known robots that send me traffic. Any robot I don't recognize does not get this selective treatment, though.
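As a rough sketch of that "insurance" (assuming Apache with mod_rewrite enabled; the directory name is only an example, matching a corresponding Disallow line in robots.txt), the .htaccess rules look something like:

# Return 403 for a specific disallowed area, but only to the named robot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteRule ^example-private-dir/ - [F]

A robot that honors robots.txt never requests that path, so the rule never fires for it; a misbehaving robot gets a 403 for those resources only, rather than a site-wide ban.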

My main point is that when dealing with search engines which send you traffic, a "shoot first and ask questions later" approach may hurt you more than it ever hurts them. You'll lose that traffic, someone else will move up a position in their results, and hardly anyone but you and that webmaster will notice.

Jim

GaryK

8:15 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I appreciate the points you've made, Jim. Perhaps I'll wait another week or two before I add ZyBorg to the "website strippers" category in my browscap.ini file. I wish IIS made it easier to implement something similar to a *nix .htaccess file.

chance

3:15 pm on Sep 20, 2003 (gmt 0)

10+ Year Member



I have been following this thread for over a month trying to figure out how to stop Zyborg. I run a small (60 visitors a day) non-commercial info/research site from my home using a w2k pro box and the freeware Abyss web server. After exceeding my monthly bandwidth a few months ago I found this forum and added a robots.txt file, which stopped many of the "bad bots" mentioned here. Zyborg and Grub are the only ones coming to my site now that do not obey robots.txt.

Three days ago I found Sygate Personal Firewall which has an IP Ban capability and is very easy to configure. This is the only method I have been able to come up with for a Windows/non-Apache set-up since I don't know Perl, PHP or any other language for that matter.

Zyborg may be a biggie in the robots world, but I don't want them on my site every 7 minutes. Checking the last two months' logs, Zyborg has not looked at robots.txt once.

I did e-mail Looksmart and received an inane reply which was very pleasant but said nothing.

Great forum you have here. As an amateur I really appreciate and use it, and have learned a WHOLE bunch.

Thnx
Chance

GaryK

8:33 pm on Sep 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I want to let everyone know that most of the problems I've been having with ZyBorg were due to moving my websites to a new server and not doing a good enough job of verifying that everything was working properly, especially my custom error handling.

Daniele at Looksmart has been a pleasure to work with and my opinion of the company has improved considerably.

I have no plans to add ZyBorg to the "website strippers" category of my browscap.ini file but I will continue to monitor my logs for any possible problems.

-gary.
