Forum Moderators: open
Is 216.88.158.142 a valid IP for the Zyborg bot, or is somebody spoofing Zyborg's U-A? 216.88.158.142 is assigned to:
OrgName: SAVVIS Communications Corporation
OrgID: SAVV
Address: 1 SAVVIS Parkway
City: Town and Country
StateProv: MO
PostalCode: 63017
Country: US
NetRange: 216.88.0.0 - 216.91.255.255
CIDR: 216.88.0.0/14
There is no reverse DNS configured for 216.88.158.142. The complete U-A of the bot was:
"Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http: //www.WISEnutbot.com)" (I added the space in the URL to prevent linking).
If this is the real ZyBorg bot, I should lift the ban on the IP, right? Isn't Looksmart a desired search engine? If they violate robots.txt, it's going to be a real pain to put in mod_rewrite rules to keep them out of disallowed areas.
I do agree, however, that in the case of a usually well-behaved, respected bot like ZyBorg, it would have been more logical (less emotional lol) to contact them about the problem first (as Frontpage mentioned).
Thanks
Linda
Just checked my logs and
Agent: Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)
is trying to grab pages in areas that I've never seen search engines look at before lol.
I'm getting tons of 403's.
Oh well.
Frontpage nailed it in post #17 - The first step is to contact the organization deploying the robot, and report problems. If they do nothing, or if they claim they will correct bad behaviour and then fail to do so, then ban 'em.
[...]try to help these folks fix the problems with the 'bot and with the broken robots information link in the user-agent[...]
I said...
(* added to break the link to their non-existent bot page...)
I hope you all consider the possible ramifications of banning Looksmart[...]
I'm willing to bet that some webmasters have... ;) If you don't ban it outright, leave your ZyBorg entry in robots.txt as is, and then enforce robots.txt on ZyBorg, whether it likes it or not...
robots.txt:

User-agent: ZyBorg
Disallow: /stay_away.html
Disallow: /bad_dir/

.htaccess:
RewriteCond %{HTTP_USER_AGENT} ZyBorg
RewriteCond %{REQUEST_URI} ^/stay_away\.html [OR]
RewriteCond %{REQUEST_URI} ^/bad_dir
RewriteRule .* - [F]
ZyBorg continues to read our robots.txt files, but until the time comes that ZyBorg fully understands it, we do the work - that is, the banning - for it. Once we hear reports that ZyBorg is operating correctly, just remove the relevant section from your .htaccess file if you're trusting, or leave it if you're paranoid. (Or use this technique on all spiders you welcome, if you're really paranoid.)
If you don't ban it outright, leave your ZyBorg entry in robots.txt as is, and then enforce robots.txt on ZyBorg, whether it likes it or not [...snip .htaccess...]
That's exactly the technique I use for good bots. They are, however, still susceptible to my rogue bot trap which is what ZyBorg triggered. I don't exempt good bots from the trap in order to prevent malicious bots from spoofing the UA to get around it.
Found this tonight: ZyBorg looking for an unknown file. Does anyone know of a significant reason for it viewing a clients file?
While it's (somewhat) unusual for ZyBorg (or any other robot) to be requesting a non-existent file, you may be reading too much into the filename... I wonder how many professionals around here have a "clients.html", or similar, page describing, well, their clients. A fair number, I'll guess...
You haven't had a "clients.html" file before, have you?
216.88.158.142 - - [02/Aug/2003:03:45:33 -0400] "GET /robots.txt HTTP/1.1" 403 - "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)"
In my case it looks like some prohibited URL's were added to the crawl list several months ago (in violation of the robots.txt instructions), and have never been re-checked against the robots.txt file since then.
This morning, my "new-IP-watcher" script sent me an e-mail:
"The spider "Mozilla/4.0 compatible ZyBorg/1.0 Beta.112 (wn.zyborg@looksmart.net; [WISEnutbot.com)"...]
Has a NEW IP address: 216.88.158.142"
alerting me of a "new" (to me anyway) Looksmart spider IP.
It only read my robots.txt file (which doesn't disallow it), and then left. Perfectly well-behaved so far.
In this case, there is no "DLC" (dead-link checker?) designation.
So is this particular beta still ONLY a link checker?
Or is it doing some indexing?
I think I'll watch it for awhile.
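For anyone curious, a "new-IP-watcher" like the one mentioned above can be quite simple: remember every (user-agent, IP) pair already seen, and flag anything new. The poster's actual script isn't shown, so this is only a hypothetical minimal sketch in Python; the combined log format and the in-memory `seen` set (which a real script would persist between runs) are assumptions.

```python
import re

# Matches the client IP and the final quoted field (the user-agent)
# in a combined-format access log line.
LOG_RE = re.compile(r'^(\S+) .* "([^"]*)"$')

def new_ips(log_lines, seen):
    """Yield (agent, ip) pairs not already in `seen`, updating `seen` as we go."""
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, agent = m.group(1), m.group(2)
        if (agent, ip) not in seen:
            seen.add((agent, ip))
            yield agent, ip

seen = set()  # a real watcher would persist this between runs
log = [
    '216.88.158.142 - - [02/Aug/2003:03:45:33 -0400] "GET /robots.txt HTTP/1.1" 200 123 "-" "ZyBorg/1.0"',
    '216.88.158.142 - - [03/Aug/2003:04:00:00 -0400] "GET / HTTP/1.1" 200 456 "-" "ZyBorg/1.0"',
]
alerts = list(new_ips(log, seen))
print(alerts)  # [('ZyBorg/1.0', '216.88.158.142')] - the second hit is not new
```

The second log line produces no alert because that agent/IP pair was already recorded, which is exactly the behaviour that makes the e-mail above fire only on a genuinely new address.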
I have a newer site that's been visited 3 times lately by ZyBorg, from 2 different IPs. It checks for my non-existent robots.txt file & leaves? It's a simple site w/no private areas, so I didn't bother w/one.
Are anyone's logs showing any of these going any deeper?
I'd really like to get this whole site indexed by it, it keeps kickin me to the curb though =)
[07/Sep/2003:10:18:51 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 209.216.190.193
[10/Sep/2003:12:33:38 -0700] "GET /robots.txt HTTP/1.1" 404 344 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 216.39.48.101
[11/Sep/2003:06:07:02 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)" 216.39.48.101
Thanx ;) Greg
Anyway, is calling a robots.txt file & then leaving normal behaviour for Zyborg?
Are some of these new engines "Requiring" a robots.txt file by chance?
From the sound of this thread, I should block it & it'd drill to the end LOL
No cloaking either, nothing even questionable on this one.
Thanx a bunch ;)
216.88.158.142 - - [07/Sep/2003:10:18:51 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)"
216.88.158.142 - - [10/Sep/2003:12:33:38 -0700] "GET /robots.txt HTTP/1.1" 404 344 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)"
216.88.158.142 - - [11/Sep/2003:06:07:02 -0700] "GET /robots.txt HTTP/1.1" 404 348 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)"
If nothing else, posting a blank robots.txt on your site would save you logging all those 404's.
Or just put the standard "allow all" robots.txt up:
User-agent: *
Disallow:
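If you want to double-check what a compliant parser makes of that "allow all" file, Python's standard-library robotparser gives a quick sanity check. A small sketch (note the `modified()` call, which marks the file as fetched so `can_fetch()` will trust a hand-fed parse):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.modified()  # mark the file as "read" so can_fetch() trusts the hand-fed parse
rp.parse([
    "User-agent: *",
    "Disallow:",
])

# An empty Disallow means "nothing is disallowed", so any agent may fetch any URL.
print(rp.can_fetch("ZyBorg", "http://example.com/any/page.html"))  # True
```

The same two lines can be used to test a real robots.txt before uploading it, by feeding the file's contents to `parse()`.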
Is anyone still seeing problems w/WiseNut? I just got crawled today with no problems vis-a-vis robots.txt.
Jim
I was wondering about doing that.
I have 2 hits from Inktomi's Slurp that did the exact same thing too: requested robots.txt & took off.
66.196.65.34 - - [12/Sep/2003:21:46:46 -0700] "GET /robots.txt HTTP/1.0" 404 332 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; inktomi.com...
66.196.65.34 - - [15/Sep/2003:09:02:26 -0700] "GET /robots.txt HTTP/1.0" 404 332 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; inktomi.com...
That's why I was asking if this is normal behaviour when some bots first come to your site, or if it's the lack of a robots.txt file.
Google & AltaVista are indexing the whole site now regularly. about 30 pages. AOL is caching some, just got listed in DMOZ, no search results yet though.
I've only been working on it for 5-6 weeks off & on a bit, so I can't complain really.
Wisenut & Inktomi are 2 that I'd really like to get into though ;)
Is that blank robots.txt file exactly as you posted? Just a User-agent: * wildcard & a blank Disallow: plus a blank line underneath?
I wasn't sure exactly how to make a blank one & was nervous being unsure. Except for ZyBorg & Slurp, it's been going great.
Thanx for any insight!
I know how to build optimized sites, kinda new at in-depth SEO though. ;)
> Is that blank robots.txt file exactly as you posted? Just a User-agent: * wildcard & a blank Disallow: plus a blank line underneath?
> I wasn't sure exactly how to make a blank one & was nervous being unsure. Except for ZyBorg & Slurp, it's been going great.
Well, a blank robots.txt is just a blank file named robots.txt. Most search engines will assume that you want the whole site spidered, and it will save you a bunch of 404 errors.
However, I'm a big fan of following the specifications, and as such, I recommend you put up the "correct" robots.txt file as shown above; User-agent: * directive, blank Disallow: directive, blank line. That's all there is to it... Now if you want to get fancy, have a look here [webmasterworld.com]. :)
Jim
I went with this for now & just validated it...
User-agent: *
Disallow: /logs/

User-agent: WebZip
Disallow: /
I saw it mentioned that a blank Disallow: without a / might be a bit risky as of late (some may read it as disallow all), so I just blocked my logs.
I had a visit by someone & ol' WebZip a couple weeks ago too. I had forgotten about it until I saw it in your article. I had planned on banning it then. That thing's RclickZilla =)
I'll watch & see if ZyBorg & Slurp break their hit & run pattern now that there's a robots.txt file.
I'll be around. This SEO is fun for a change. It's a big cyber-chess game ;)
Cheers, Greg
You have those records reversed.
All robots will read the first record and then quit, because "*" matches all robots. Robot control is organized by user-agent name, not by file name. A robot therefore reads records only until it finds one with its user-agent name in it or the wild-card "*", whichever it finds first. Assuming it's a good robot, it will read and obey only that one record, and no others will have any effect. Records are separated by a blank line.
So, you need to reverse them:
User-agent: WebZip
Disallow: /
User-agent: *
Disallow: /logs/
Jim
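The first-matching-record rule described above is easy to see in a toy parser. This is only a sketch of the selection rule in the original robots.txt spec, not any real crawler's code, and the substring match on agent names is a simplification:

```python
def pick_record(robots_txt, agent):
    """Return the Disallow paths from the first record that matches `agent`.

    Records are separated by blank lines; '*' matches any agent; a spec-
    compliant robot obeys only the first matching record and ignores the rest.
    """
    for chunk in robots_txt.strip().split("\n\n"):
        agents, paths = [], []
        for line in chunk.splitlines():
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                agents.append(value)
            elif field == "disallow":
                paths.append(value)
        if any(a == "*" or a.lower() in agent.lower() for a in agents):
            return paths  # first matching record wins; later ones are ignored
    return []

wrong_order = "User-agent: *\nDisallow: /logs/\n\nUser-agent: WebZip\nDisallow: /"
right_order = "User-agent: WebZip\nDisallow: /\n\nUser-agent: *\nDisallow: /logs/"

# With the records in the wrong order, WebZip matches '*' first and is
# only kept out of /logs/, not banned at all.
print(pick_record(wrong_order, "WebZip"))   # ['/logs/']
print(pick_record(right_order, "WebZip"))   # ['/']
print(pick_record(right_order, "ZyBorg"))   # ['/logs/']
```

With the corrected ordering, WebZip hits its own record first and sees `Disallow: /`, while everyone else falls through to the wildcard record.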
Like most days, ZyBorg violated my robots.txt:
216.88.158.142 - - [17/Sep/2003:11:41:45 +0200] "GET /directory/page-to-redirect.htm HTTP/1.1" 301 338 "-" "Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http*://www.WISEnutbot.com)"
216.88.158.142 - - [17/Sep/2003:11:47:10 +0200] "GET /error/redirected.htm HTTP/1.1" 200 1471 "-" "Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http*://www.WISEnutbot.com)"
In this case it grabbed a page that gets redirected using mod_rewrite (R=301); a few minutes later it grabbed the error page for the redirect. The error folder is excluded in robots.txt, of course.
That's enough, give it the 403 it deserves, as I don't want my error pages to appear in a search engine!
Konny
When I checked my logs for the past week they were filled with robots.txt violations for this user agent including something similar to what KonnyQ mentioned.
IMO this problem should have been fixed by now so I don't think it's going to be fixed. Also, I'm embarrassed to have certain files, like my custom error handler, being indexed. So it's time to act.
I know that normally the choice to ban a user agent is up to each webmaster. But for those of you who use my browscap.ini file please note I will be moving this user agent from the "search engines" category to the "website strippers" category in the next release. If you use my script to ban "website strippers" this will ban ZyBorg.
This one is a spider:
Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)
This one is a dead-link checker:
Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; http ://www.WISEnutbot.com)
The dead link checker does not spider, it checks links that it already has. Therefore, no robots.txt
That's what I've observed, anyway. YMMV.
Jim
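When filtering logs for this agent, the two variants are easy to tell apart, since the dead-link checker carries an extra "DLC" token. A small sketch; the token test is based only on the two user-agent strings quoted above, so treat it as an assumption rather than anything documented by Looksmart:

```python
def zyborg_kind(user_agent):
    """Classify a ZyBorg user-agent as 'spider', 'dead-link checker', or None."""
    if "ZyBorg" not in user_agent:
        return None
    # The DLC variant differs only by the " DLC " token after the version.
    return "dead-link checker" if " DLC " in user_agent else "spider"

print(zyborg_kind("Mozilla/4.0 compatible ZyBorg/1.0 (wn.zyborg@looksmart.net; ...)"))
# spider
print(zyborg_kind("Mozilla/4.0 compatible ZyBorg/1.0 DLC (wn.zyborg@looksmart.net; ...)"))
# dead-link checker
```

That distinction matters for log analysis: a robots.txt "violation" from the DLC variant may just be a re-check of an old link, per Jim's observation that the dead-link checker doesn't request robots.txt at all.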
Analysing logs kind of sucks these days; there's just too much monkey business around. No wonder webmasters flip their lids once in a while. ;-)
After reading this thread a fifth time, I've banned ZyBorg from accessing only the disallowed files, as mentioned by balam in Post #34.
Other than that I'll keep watching it anyhow!
Happy day,
Konny