Do you think this is a problem with my dirt-cheap hosting server, or a problem with the Yahoo name servers? Has anyone else ever seen Yahoo Slurp without a valid hostname?
Cheers,
Phred
The reason I ask is that I'm seeing requests whose rDNS resolves to llf320032.crawl.yahoo.net, but which use the Linux Firefox BonEcho user-agent string:
67.195.37.108 - - [24/Jul/2008:23:15:50 -0400] "GET /mainstyle.css HTTP/1.0" 200 2501 "http://www.example.com/mydir/mypage.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4"
Seems to take an interest in the CSS file for each page it fetches, so my money's on an anti-cloaking 'bot.
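For anyone who wants to run the same check, here's a minimal sketch (Python) of the forward-confirmed reverse DNS test: reverse-resolve the IP, make sure the name is under crawl.yahoo.net, then resolve that name forward and confirm it maps back to the same IP. The IP below is just the one from the log line above.

import socket

def is_verified_yahoo_crawler(ip):
    # Reverse lookup: does the IP have a hostname at all?
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False  # no rDNS -- like the hits asked about above
    # Only trust Yahoo's crawl space.
    if not host.endswith(".crawl.yahoo.net"):
        return False
    # Forward lookup must confirm the reverse result.
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_verified_yahoo_crawler("67.195.37.108"))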
Jim
7/25/2008 8:36:41 AM----67.195.37.160----/page_a.asp ----Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4----
7/25/2008 8:46:36 AM----67.195.37.160----/page_b.asp ----Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
It only asks for the page, no other files.
I haven't made up my mind yet how to treat it, but if it's an anti-cloaking bot, so much for them cloaking their normal UA -- as if their IP isn't a giveaway for who they are.
I'd rather stay off their "hand-check" list, because I don't entirely trust the "hands" that do the checking to make the right call. It depends too much on how smart/how well-trained they are to figure out that there is no intent to deceive anyone -- just a highly-customized "user experience." Having been booted by an over-zealous and evidently-untrained checker at MSN/Live, I'd rather keep my Yahoo traffic. :)
BonEcho is the code name for an old pre-release Firefox browser (the Firefox 2.0 development branch).
Jim
I haven't seen that. What was the user-agent string?
UA was:
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
From 74.6.17.168
robots.txt at 19:45:14 - no hostname - therefore ip blocked and a 403
webpage at 19:46:26 - no hostname - 403 because ip previously blocked
webpage at 19:47:52 - good hostname (llf..) - 403 because ip previously blocked
then from 67.195.37.95
robots.txt at 01:25:23 - good hostname
Now from today:
67.195.37.95 (all good hostnames)
robots.txt at 02:07:33 UA of: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
webpage at 02:07:34 UA of: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4
So in this case, the robots.txt fetch came with the "normal" Yahoo UA, and the page crawl came with the X11 BonEcho UA.
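A rough sketch (Python) of how you might flag that UA-switching pattern mechanically. The input format is an assumption -- (ip, path, user_agent) tuples parsed from your access log -- and the sample rows mirror the robots.txt/page pair logged above (the page path is a placeholder):

from collections import defaultdict

# Flag IPs that fetch robots.txt under one UA but crawl pages under another.
requests = [
    ("67.195.37.95", "/robots.txt",
     "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"),
    ("67.195.37.95", "/page.html",
     "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4"),
]

uas_by_ip = defaultdict(set)
robots_ua_by_ip = {}

for ip, path, ua in requests:
    uas_by_ip[ip].add(ua)
    if path == "/robots.txt":
        robots_ua_by_ip[ip] = ua

for ip, uas in uas_by_ip.items():
    robots_ua = robots_ua_by_ip.get(ip)
    others = uas - {robots_ua} if robots_ua else set()
    if others:
        print(f"{ip} fetched robots.txt as one UA but crawled as: {others}")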
Phred
My bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.
Without giving away any secrets - was it just not white-listed, did it do something sorta naughty, or was it outright being a very bad bot? I'm close to disallowing UAs containing "bonecho", and to stopping Yahoo outright when, from the same IP, it changes UAs between reading robots.txt and then hitting the website.
Cheers,
Phred
[webmasterworld.com...]
Seeing almost simultaneous daily hits by Slurp and BonEcho for the same pages. I'm done with non-Inktomi Yahoo anything - Yahoo/Slurp/BonEcho/llf in UAs and hostnames - for robots.txt or web page access.
I haven't seen any real Slurp hits outside of these ranges. Is this about it for a Yahoo IP white list?
INKTOMI-BLK-3 66.196.64.0/18 (66.196.64.0 - 66.196.127.255)
INKTOMI-BLK-4 68.142.192.0/18 (68.142.192.0 - 68.142.255.255)
INKTOMI-BLK-5 72.30.0.0/16 (72.30.0.0 - 72.30.255.255)
INKTOMI-BLK-6 74.6.0.0/16 (74.6.0.0 - 74.6.255.255)
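If it's useful to anyone, a quick sketch (Python) of checking a claimed-Slurp IP against those four ranges, exactly as listed above:

import ipaddress

# The Inktomi/Yahoo ranges listed above, as CIDR blocks.
YAHOO_WHITELIST = [ipaddress.ip_network(cidr) for cidr in (
    "66.196.64.0/18",   # INKTOMI-BLK-3
    "68.142.192.0/18",  # INKTOMI-BLK-4
    "72.30.0.0/16",     # INKTOMI-BLK-5
    "74.6.0.0/16",      # INKTOMI-BLK-6
)]

def in_yahoo_whitelist(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in YAHOO_WHITELIST)

print(in_yahoo_whitelist("74.6.17.168"))    # True -- the Slurp hit earlier in the thread
print(in_yahoo_whitelist("67.195.37.108"))  # False -- the BonEcho source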
Phred
I'm done with non-Inktomi Yahoo anything
Phred,
You're certainly perceptive enough to determine what is beneficial to your own websites.
However, denying all of Yahoo is overkill, at least IMO.
You might try a conditional deny based on BOTH the IPs that you've provided and the UA that I specified here:
[webmasterworld.com...]
It hasn't affected my listings with Yahoo, nor should it yours.
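To make the idea concrete, here's a minimal sketch (Python) of a "deny only when BOTH conditions match" rule, so ordinary Slurp keeps indexing. The crawl range below is an assumption based on the 67.195.37.* hits in this thread, not a published Yahoo range -- adjust it to your own logs.

import ipaddress

# Assumed range covering the 67.195.37.* BonEcho hits seen in this thread.
SUSPECT_NETS = [ipaddress.ip_network("67.195.32.0/19")]
BAD_UA_FRAGMENT = "bonecho"

def deny(ip, ua):
    # Deny only on the combination: suspect IP range AND the BonEcho UA.
    addr = ipaddress.ip_address(ip)
    in_suspect_range = any(addr in net for net in SUSPECT_NETS)
    return in_suspect_range and BAD_UA_FRAGMENT in ua.lower()

# The BonEcho hit is denied; a normal Slurp fetch from Inktomi space is not:
print(deny("67.195.37.108", "Mozilla/5.0 (X11; ...) Gecko/20080721 BonEcho/2.0.0.4"))  # True
print(deny("74.6.17.168", "Mozilla/5.0 (compatible; Yahoo! Slurp; ...)"))              # False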
Don
Wget/1.10.2 (Red Hat modified)
I've had this before and fished them out of the IP blacklist. Over 30 IPs blocked this time. I'm tempted to leave them blocked.
The crawler hit one file about a dozen times at roughly 10-second intervals, then did the same again to the same file. Since the file is a contact form, it's banned in robots.txt -- so another black mark. It then went on to hit another file several dozen times, again about 10 seconds apart.
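That pattern -- the same file hammered at roughly fixed intervals -- is easy to spot mechanically. A rough sketch (Python); the window, threshold, IP, and path are all made-up illustration values, not recommendations:

from collections import defaultdict

WINDOW = 300      # seconds -- arbitrary
THRESHOLD = 10    # hits within the window -- arbitrary

hits = defaultdict(list)  # (ip, path) -> timestamps of recent hits

def record(ip, path, ts):
    times = hits[(ip, path)]
    times.append(ts)
    # Drop hits that have fallen out of the sliding window.
    while times and times[0] < ts - WINDOW:
        times.pop(0)
    if len(times) > THRESHOLD:
        print(f"hammering: {ip} hit {path} {len(times)} times in {WINDOW}s")

# A dozen hits ten seconds apart, as described above, trips the check:
for i in range(12):
    record("203.0.113.7", "/contact-form.cgi", 1000000 + 10 * i)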