Forum Moderators: open


ZyBorg/1.0 violates robots.txt

came from 216.88.158.142


jazzguy

8:50 pm on Jul 27, 2003 (gmt 0)

10+ Year Member



A bot claiming to be ZyBorg/1.0 disobeyed my robots.txt file, and got itself automatically banned (thanks to the Perl scripts posted in the forums). The disallowed file has been in the robots.txt file for over a month and ZyBorg has fetched robots.txt many times since then.

Is 216.88.158.142 a valid IP for the ZyBorg bot, or is somebody spoofing ZyBorg's UA? 216.88.158.142 is assigned to:

OrgName: SAVVIS Communications Corporation
OrgID: SAVV
Address: 1 SAVVIS Parkway
City: Town and Country
StateProv: MO
PostalCode: 63017
Country: US
NetRange: 216.88.0.0 - 216.91.255.255
CIDR: 216.88.0.0/14

There is no reverse DNS configured for 216.88.158.142. The complete U-A of the bot was:
"Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http: //www.WISEnutbot.com)" (I added the space in the URL to prevent linking).

If this is the real ZyBorg bot, I should lift the ban on the IP, right? Isn't Looksmart a desired search engine? If they violate robots.txt, it's going to be a real pain to put in mod_rewrite rules to keep them out of disallowed areas.

Peeress

11:19 am on Aug 1, 2003 (gmt 0)

10+ Year Member



re:
"jdMorgan: In the long run, we need (well-behaved) robots to visit our sites in order to get traffic much more than any search engine needs to list our individual sites in their index. If they can't list a few sites because their 'bot is banned, it really won't affect their search results much; It will be unimportant... Except to the webmasters of those sites. IOW, we need the 'bots much more than they need our sites. "
----------------------------
That is the very reason I banned it. Search engines are not the only way to promote websites, and I think we place too much importance on them and have become too dependent on them.
Speaking for myself, I do not need the bots more than they need my site.
IMO, I am all too aware that there are millions of sites out there and that mine is just a small drop in the bucket, but, basically:
--- If it wasn't for our websites, and our 'allowing' bots to crawl through our sites, these search engines would not 'have' a search engine. ---
(but that's another topic)

I do agree, however, that in the case of a usually well-behaved, respected bot such as ZyBorg, it would have been more logical (less emotional, lol) to contact them about the problem first (as Frontpage mentioned).

Thanks

Linda

Peeress

1:27 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



and later on this morning:

Just checked my logs and
Agent: Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)

is trying to grab pages in areas that I've never seen search engines look at before lol.
I'm getting tons of 403's.

Oh well.

jazzguy

3:58 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



jdMorgan wrote:
Frontpage nailed it in post #17 - The first step is to contact the organization deploying the robot, and report problems. If they do nothing, or if they claim they will correct bad behaviour and then fail to do so, then ban 'em.

After first posting here to try to verify if it was a legit Looksmart bot, I contacted them using the email address given in their user-agent string. That was at the beginning of the week when this first happened. Still no response. They appear to have corrected the bug, but if they are going to offer an email address, it would be nice if they would respond to questions asked.

balam

4:24 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



Thus spoke jdMorgan:
[...]try to help these folks fix the problems with the 'bot and with the broken robots information link in the user-agent[...]

I said...

(* added to break the link to their non-existent bot page...)

...meaning I added an asterisk to break the link to www.wisenutbot.com, not that the link itself is actually broken - sorry for the confusion. Visiting wisenutbot.com presents the same page as visiting wisenut.com - the search interface. I have no clue when the change occurred, but previously there was a page on ZyBorg. (Perhaps it disappeared after Looksmart took over? I hadn't visited the page myself since then, so...?) I highly suspect that Wisenut is aware of this, which is why I saw no need to bring it to their attention, but I thought it worth mentioning to us. :)

I hope you all consider the possible ramifications of banning Looksmart[...]

I'm willing to bet that some webmasters have... ;) If you don't ban it outright, leave your ZyBorg entry in robots.txt as is, and then enforce robots.txt on ZyBorg, whether it likes it or not...

robots.txt

User-agent: ZyBorg

Disallow: /stay_away.html
Disallow: /bad_dir/

.htaccess

RewriteCond %{HTTP_USER_AGENT} ZyBorg
RewriteCond %{REQUEST_URI} ^/stay_away\.html [OR]
RewriteCond %{REQUEST_URI} ^/bad_dir
RewriteRule .* - [F]

ZyBorg continues to read our robots.txt files, but until the time comes that ZyBorg fully understands them, we do the work - that is, the banning - for it. Once we hear reports that ZyBorg is operating correctly, just remove the relevant section from your .htaccess file if you're trusting, or leave it if you're paranoid. (Or use this technique on all spiders you welcome, if you're really paranoid.)

jazzguy

6:06 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



balam wrote:
If you don't ban it outright, leave your ZyBorg entry in robots.txt as is, and then enforce robots.txt on ZyBorg, whether it likes it or not [...snip .htaccess...]

That's exactly the technique I use for good bots. They are, however, still susceptible to my rogue bot trap which is what ZyBorg triggered. I don't exempt good bots from the trap in order to prevent malicious bots from spoofing the UA to get around it.
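
The trap idea can be sketched in a few lines of Python (hypothetical names and paths - the scripts posted in the forums are Perl wired into .htaccess, but the logic is the same): robots.txt disallows a trap URL that no compliant bot should ever fetch, and any client that requests it anyway gets its IP added to a ban list.

```python
# Minimal sketch of a rogue-bot trap. TRAP_PATH is a hypothetical URL
# that is listed under "Disallow:" in robots.txt and linked nowhere a
# human would click; only a robots.txt-ignoring client ever requests it.

TRAP_PATH = "/stay_away.html"   # also disallowed in robots.txt

banned_ips = set()

def handle_request(ip, path):
    """Return an HTTP status code for the request, banning trap visitors."""
    if ip in banned_ips:
        return 403                 # already banned: refuse everything
    if path == TRAP_PATH:
        banned_ips.add(ip)         # fetched the disallowed trap -> ban
        return 403
    return 200                     # normal request

# A compliant bot never requests TRAP_PATH and is unaffected:
print(handle_request("10.0.0.1", "/index.html"))            # 200
# A bot that ignores robots.txt trips the trap and stays banned:
print(handle_request("216.88.158.142", "/stay_away.html"))  # 403
print(handle_request("216.88.158.142", "/index.html"))      # 403
```

Note the trade-off discussed here: because the trap keys on the requesting IP rather than the user-agent, a "good" UA string earns no exemption - which is exactly what keeps UA-spoofing bots from slipping through.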

davelms

7:20 pm on Aug 1, 2003 (gmt 0)

10+ Year Member



They appear to have corrected the bug

My logs from this morning still show ZyBorg/1.0 from 216.88.158.142 to be disobeying my robots.txt file, so the impression I have is that the bug has not been corrected.

relic

4:15 am on Aug 2, 2003 (gmt 0)



216.88.158.142 - - [01/Aug/2003:15:27:15 -0400] "GET /clients.html

Found this tonight - ZyBorg looking for an unknown file. Does anyone know of a significant reason for it to request a clients file?

balam

5:18 am on Aug 2, 2003 (gmt 0)

10+ Year Member



Hey there relic, welcome to Webmaster World!

While it's (somewhat) unusual for ZyBorg (or any other robot) to be requesting a non-existent file, you may be reading too much into the filename... I wonder how many professionals around here have a "clients.html", or similar, page describing, well, their clients. A fair number, I'll guess...

You haven't had a "clients.html" file before, have you?

cyberkat

12:55 pm on Aug 2, 2003 (gmt 0)

10+ Year Member



I've been getting hit with that robot for over a week now. It never obeys, so I banned it yesterday (8/1/2003). Now it eats the 403.

216.88.158.142 - - [02/Aug/2003:03:45:33 -0400] "GET /robots.txt HTTP/1.1" 403 - "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"

wilderness

7:40 pm on Aug 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



from an IAR newsletter:
1. LookSmart's Deep Listings Option For Small Businesses
No longer reserved for big businesses -- with big bucks. Is LookSmart right for you?
http ://www.clickz.com/search/opt/article.php/2244991

RonPK

8:20 pm on Aug 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I haven't seen any bad behaviour by Wisenut since August 2. Looks like the bug really has been fixed.

jomaxx

5:08 pm on Aug 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The out-of-control behaviour seems to have been fixed, but they continue to spider URLs in my prohibited directories. The last occasion was this morning; the UA was "Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)".

In my case, it looks like some prohibited URLs were added to the crawl list several months ago (in violation of the robots.txt instructions) and have never been re-checked against the robots.txt file since.

Key_Master

3:49 am on Aug 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been monitoring Wisenutbot since it first learned to crawl. The robots.txt problems began to occur in late March of this year and have continued to this date. The Wisenut info page, which had offered robots.txt instructions on preventing access by the Wisenutbot, has also been removed. Personally, I doubt Looksmart has any respect for the sites it indexes or any real motivation to prevent the violations from recurring. I vote ban 'em.

Friday

3:34 pm on Aug 29, 2003 (gmt 0)

10+ Year Member



I'm glad I did a search before posting. A lot of my questions have been answered in this thread.

This morning, my "new-IP-watcher" script sent me an e-mail:

"The spider "Mozilla/4.0 compatible ZyBorg/1.0 Beta.112 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"
has a NEW IP address: 216.88.158.142"

alerting me of a "new" (to me anyway) Looksmart spider IP.

It only read my robots.txt file (which doesn't disallow it), and then left. Perfectly well-behaved so far.

In this case, there is no "DLC" (dead-link checker?) designation.
So is this particular beta still ONLY a link checker?
Or is it doing some indexing?

I think I'll watch it for awhile.
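
A watcher like the one described above is simple to sketch in Python (the log format and all names here are assumptions, not Friday's actual script, which also sends the e-mail): scan combined-format log lines and flag any IP not previously seen for a given spider's user-agent substring.

```python
# Hypothetical "new-IP-watcher": report IPs that present a known spider's
# user-agent but haven't been seen on earlier runs.

known_ips = {"ZyBorg": {"216.39.48.101"}}   # IPs recorded on earlier runs

def watch(log_lines, ua_substring="ZyBorg"):
    """Return IPs that used the UA substring but aren't in known_ips yet."""
    new_ips = []
    for line in log_lines:
        if ua_substring not in line:
            continue
        ip = line.split()[0]        # combined log format: IP is the first field
        if ip not in known_ips.setdefault(ua_substring, set()):
            known_ips[ua_substring].add(ip)
            new_ips.append(ip)      # first sighting -> alert
    return new_ips

log = [
    '216.88.158.142 - - [29/Aug/2003:09:00:00 -0400] "GET /robots.txt HTTP/1.1" 200 25 "-" "Mozilla/4.0 compatible ZyBorg/1.0 Beta.112 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"',
    '66.196.65.34 - - [29/Aug/2003:09:01:00 -0400] "GET /robots.txt HTTP/1.0" 404 332 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com)"',
]
print(watch(log))   # ['216.88.158.142']
```

On a second pass over the same log the IP is already known, so nothing is reported - which is what makes the alerts low-noise.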

warhol

9:29 pm on Sep 16, 2003 (gmt 0)

10+ Year Member



Hey guys ;)

I have a newer site that's been visited 3 times lately by ZyBorg, from 2 different IPs. It checks for my non-existent robots.txt file & leaves? It's a simple site w/no private areas, so I didn't bother w/one.

Are anyone's logs showing any of these going any deeper?
I'd really like to get this whole site indexed by it, but it keeps kickin' me to the curb =)

[07/Sep/2003:10:18:51 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 209.216.190.193

[10/Sep/2003:12:33:38 -0700] "GET /robots.txt HTTP/1.1" 404 344 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 216.39.48.101

[11/Sep/2003:06:07:02 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0(wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 216.39.48.101

Thanx ;) Greg

fiestagirl

2:26 am on Sep 17, 2003 (gmt 0)

10+ Year Member



Warhol are you sure about those log entries?
216.39.48.101 belongs to AltaVista and the other to an ISP in Oregon. Maybe someone is spoofing that user-agent to see if you're cloaking?

warhol

3:02 am on Sep 17, 2003 (gmt 0)

10+ Year Member



Hey Fiesta Girl ;)
What can I say... I'm a noob at cruising these logs w/out an analyser =) Those entries are right - I grabbed the IPs off the end of the lines instead of the beginning.

Anyway, is calling a robots.txt file & then leaving normal behaviour for Zyborg?
Are some of these new engines "Requiring" a robots.txt file by chance?
From the sound of this thread, I should block it & it'd drill to the end LOL
No cloaking either, nothing even questionable on this one.

Thanx a bunch ;)

216.88.158.142 - - [07/Sep/2003:10:18:51 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"

216.88.158.142 - - [10/Sep/2003:12:33:38 -0700] "GET /robots.txt HTTP/1.1" 404 344 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"

216.88.158.142 - - [11/Sep/2003:06:07:02 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http://www.WISEnutbot.com)"

jdMorgan

3:44 am on Sep 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



warhol,

If nothing else, posting a blank robots.txt on your site would save you logging all those 404's.

Or just put the standard "allow all" robots.txt up:

User-agent: *
Disallow:


Leave a blank line at the end.

Is anyone still seeing problems w/WiseNut? I just got crawled today with no problems vis-a-vis robots.txt.

Jim

warhol

5:01 am on Sep 17, 2003 (gmt 0)

10+ Year Member



Hey jdMorgan,

I was wondering about doing that.
I have 2 hits from Inktomi's Slurp that did the exact same thing too - requested robots.txt & took off:
66.196.65.34 - - [12/Sep/2003:21:46:46 -0700] "GET /robots.txt HTTP/1.0" 404 332 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.34 - - [15/Sep/2003:09:02:26 -0700] "GET /robots.txt HTTP/1.0" 404 332 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]

That's why I was asking if this is normal behaviour when some bots first come to your site, or whether it's the lack of a robots.txt file.

Google & AltaVista are indexing the whole site now regularly. about 30 pages. AOL is caching some, just got listed in DMOZ, no search results yet though.

I've only been working on it for 5-6 weeks off & on a bit, so I can't complain really.
Wisenut & Inktomi are 2 that I'd really like to get into though ;)

Is that blank robots.txt file exactly as you posted? just a User-agent: * wildcard & a blank disallow: plus a blank line underneath?
I wasn't sure exactly how to make a blank one & was nervous being unsure. Except for Zyborg & Slurp, its been going great.

Thanx for any insight!
I know how to build optimized sites, kinda new at in-depth SEO though. ;)

jdMorgan

5:18 am on Sep 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



warhol,

> Is that blank robots.txt file exactly as you posted? just a User-agent: * wildcard & a blank disallow: plus a blank line underneath?
>I wasn't sure exactly how to make a blank one & was nervous being unsure. Except for Zyborg & Slurp, its been going great.

Well, a blank robots.txt is just a blank file named robots.txt. Most search engines will assume that you want the whole site spidered, and it will save you a bunch of 404 errors.

However, I'm a big fan of following the specifications, and as such, I recommend you put up the "correct" robots.txt file as shown above; User-agent: * directive, blank Disallow: directive, blank line. That's all there is to it... Now if you want to get fancy, have a look here [webmasterworld.com]. :)

Jim

warhol

6:04 am on Sep 17, 2003 (gmt 0)

10+ Year Member



Thanx again, good article, good forum too ;)

I went with this for now & just validated it...

User-agent: *
Disallow: /logs/

User-agent: WebZip
Disallow: /

I saw it mentioned that a blank Disallow: without a / might be a bit risky as of late - some bots may read it as "disallow all" - so I just blocked my logs.
I had a visit by someone & ol' WebZip a couple weeks ago too. I had forgotten about it until I saw it in your article. I had planned on banning it then. That thing's RclickZilla =)

I'll watch & see if ZyBorg & Slurp break their hit & run pattern now that there's a robots.txt file.

I'll be around. This SEO is fun for a change. It's a big cyber-chess game ;)

Cheers, Greg

Peeress

9:38 pm on Sep 17, 2003 (gmt 0)

10+ Year Member



Is anyone still seeing problems w/WiseNut? I just got crawled today with no problems vis-a-vis robots.txt.

Jim

For me, lately, Zyborg is behaving perfectly, looks at robots.txt and follows it to the letter.

killroy

11:02 pm on Sep 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What exactly is ZyBorg supposed to do? It must've spidered over 300 pages on my site at least once per week for over a month, and still only my homepage is listed.

SN

jdMorgan

11:17 pm on Sep 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



warhol,

You have those records reversed.

All robots will read the first record and then quit, because "*" matches all robots. Robot control is organized by user-agent name, not by file name. Therefore, a robot will read records only until it finds a record with its user-agent name in it or a wild-card "*", whichever it finds first. Assuming it's a good robot, it will read and obey only that one record, and no others will have any effect. Records are separated by a blank line.

So, you need to reverse them:

User-agent: WebZip
Disallow: /

User-agent: *
Disallow: /logs/
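
As a sanity check, the corrected ordering can be verified with Python's standard-library robots.txt parser (a sketch; urllib.robotparser applies the most specific matching record and falls back to the "*" record):

```python
# Verify that the corrected record order does what we intend:
# WebZip is shut out entirely, everything else only loses /logs/.
import urllib.robotparser

rules = """\
User-agent: WebZip
Disallow: /

User-agent: *
Disallow: /logs/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("WebZip", "/index.html"))    # False: WebZip banned outright
print(rp.can_fetch("ZyBorg", "/index.html"))    # True: "*" record applies
print(rp.can_fetch("ZyBorg", "/logs/access"))   # False: /logs/ is disallowed
```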


If you find a robot that does not properly handle Disallow: with a blank argument, report it. I prefer to code to spec and apply work-arounds only when needed.

Jim

jdMorgan

11:20 pm on Sep 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



killroy.

WiseNut only updates every 90 days or so. I normally wouldn't worry about it, but who knows which robots will be important next year, with all the mergers and acquisitions going on...

Jim

warhol

11:45 pm on Sep 17, 2003 (gmt 0)

10+ Year Member



Thanx jd ;) got it!

KonnyQ

12:05 pm on Sep 18, 2003 (gmt 0)

10+ Year Member



jim,

Like most days, ZyBorg violated my robots.txt:

216.88.158.142 - - [17/Sep/2003:11:41:45 +0200] "GET /directory/page-to-redirect.htm HTTP/1.1" 301 338 "-" "Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http*://www.WISEnutbot.com)"
216.88.158.142 - - [17/Sep/2003:11:47:10 +0200] "GET /error/redirected.htm HTTP/1.1" 200 1471 "-" "Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http*://www.WISEnutbot.com)"

In this case it grabbed a page that gets redirected using mod_rewrite (R=301); a few minutes later it grabbed the error page for the redirect. The error folder is excluded in robots.txt, of course.

That's enough, give it the 403 it deserves, as I don't want my error pages to appear in a search engine!

Konny

GaryK

10:23 pm on Sep 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't believe I missed this thread. Someone was kind enough to alert me to it.

When I checked my logs for the past week they were filled with robots.txt violations for this user agent including something similar to what KonnyQ mentioned.

IMO this problem should have been fixed by now so I don't think it's going to be fixed. Also, I'm embarrassed to have certain files, like my custom error handler, being indexed. So it's time to act.

I know that normally the choice to ban a user agent is up to each webmaster. But for those of you who use my browscap.ini file please note I will be moving this user agent from the "search engines" category to the "website strippers" category in the next release. If you use my script to ban "website strippers" this will ban ZyBorg.

jdMorgan

2:47 am on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Before y'all get too excited, please note that there are two different user-agents being discussed here in this thread, and it doesn't seem like anyone is making a point of it.

This one is a spider:
Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)

This one is a dead-link checker:
Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)

The dead-link checker does not spider; it checks links that it already has. Therefore, it does not consult robots.txt.

That's what I've observed, anyway. YMMV.

Jim

KonnyQ

7:39 am on Sep 19, 2003 (gmt 0)

10+ Year Member



Thanks Jim for pointing us (me) again to the two different UAs.

Analysing logs kind of sucks these days; there's just too much monkey business around - no wonder webmasters flip their lids once in a while. ;-)

After reading this thread a fifth time I've banned ZyBorg only from accessing the disallowed files as mentioned by balam in Post #34.

Other than that I'll keep watching it anyhow!

Happy day,
Konny

This 70-message thread spans 3 pages.