I haven't seen that. What was the user-agent string?
Reason I ask is that I'm seeing requests which rDNS to llf320032.crawl.yahoo.net, but using the Linux Firefox BonEcho user-agent string:
18.104.22.168 - - [24/Jul/2008:23:15:50 -0400] "GET /mainstyle.css HTTP/1.0" 200 2501 "http://www.example.com/mydir/mypage.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:22.214.171.124) Gecko/20080721 BonEcho/126.96.36.199"
Seems to take an interest in the CSS file for each page it fetches, so my money's on an anti-cloaking 'bot.
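For anyone who wants to confirm that a hit like the one above really belongs to the crawler network its rDNS claims, here is a minimal sketch of a forward-confirmed reverse DNS check. It is not anything Yahoo documents in this thread. The `.crawl.yahoo.net` suffix is taken from the hostname quoted above, while the function name and the injectable `reverse`/`forward` resolver parameters are assumptions made only so the example can be tested without network access.

```python
import socket

def is_verified_crawler(ip, allowed_suffixes=(".crawl.yahoo.net",),
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: the PTR name must end in an allowed
    suffix AND resolve back to the same IP address."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:
        return False                       # no hostname at all
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False                       # hostname is outside the crawler domain
    try:
        addresses = forward(hostname)[2]   # forward lookup of the PTR name
    except OSError:
        return False
    return ip in addresses                 # must round-trip to the same IP
```

Checking only the reverse lookup is not enough, since whoever controls the IP block controls its PTR record. The forward lookup is what ties the name back to the claimed domain.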
I'm currently also watching the behaviour of this Linux Firefox BonEcho from Y! It comes from IP addresses that are also used for Slurp crawling.
7/25/2008 8:36:41 AM----188.8.131.52----/page_a.asp ----Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:184.108.40.206) Gecko/20080721 BonEcho/220.127.116.11----
7/25/2008 8:46:36 AM----18.104.22.168----/page_b.asp ----Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
It only asks for the page, no other files.
Haven't made up my mind yet how to treat it. But if it's an anti-cloaking bot, so much for them cloaking their normal UA, as if their IP weren't a giveaway for who they are.
I've had an OS portion of the UA denied for some years (it's been discussed here previously).
Most of the major SEs use a similar OS UA, for whatever reason.
It has yet to affect either my listings or the normal crawls.
Yes, but since I *do* cloak --or serve user-agent-dependent content, to be more precise (e.g. bot-traps, mobile-device detection, alternate CSS files for IE vs. "other," etc.)-- I thought I'd mention it here.
I'd rather stay off their "hand-check" list, because I don't entirely trust the "hands" that do the checking to make the right call. It depends too much on how smart/how well-trained they are to figure out that there is no intent to deceive anyone -- just a highly-customized "user experience." Having been booted by an over-zealous and evidently-untrained checker at MSN/Live, I'd rather keep my Yahoo traffic. :)
BonEcho is a "code-name" for an old pre-release Firefox browser.
|I haven't seen that. What was the user-agent string? |
Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)
robots.txt at 19:45:14 - no hostname - therefore ip blocked and a 403
webpage at 19:46:26 - no hostname - 403 because ip previously blocked
webpage at 19:47:52 - good hostname (llf..) - 403 because ip previously blocked
then from 22.214.171.124
robots.txt at 01:25:23 - good hostname
Now from today:
126.96.36.199 (all good hostnames)
robots.txt at 02:07:33 UA of: Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)
webpage at 02:07:34 UA of: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:188.8.131.52) Gecko/20080721 BonEcho/184.108.40.206
So in this case robots.txt with "normal" Yahoo UA and the crawl with the X11 UA.
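The pattern in the log excerpt above (robots.txt fetched with one UA, pages fetched with another from the same IP) can be spotted automatically. Here is a rough sketch, assuming the access log has already been parsed into `(ip, path, user_agent)` tuples. The function name and the tuple layout are my own choices, not anything from the posters' setups.

```python
from collections import defaultdict

def ips_switching_user_agent(hits):
    """Return IPs that requested /robots.txt under one user-agent and
    other paths under a user-agent never used for robots.txt.

    hits: iterable of (ip, path, user_agent) tuples, in any order.
    """
    robots_uas = defaultdict(set)   # ip -> UAs seen on /robots.txt
    page_uas = defaultdict(set)     # ip -> UAs seen on everything else
    for ip, path, ua in hits:
        target = robots_uas if path == "/robots.txt" else page_uas
        target[ip].add(ua)
    # Flag an IP when at least one page UA was never used for robots.txt.
    return {ip for ip in robots_uas
            if page_uas.get(ip, set()) - robots_uas[ip]}
```

IPs that never request robots.txt at all are not flagged here. That is a separate problem and would need its own rule.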
I'm not sure what they're doing but my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.
|my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP. |
Without giving away any secrets - was it just not white-listed, did it do something sorta naughty, or was it outright being a very bad bot? I'm close to not allowing UAs containing bonecho, and to outright stopping Yahoo when, coming from the same IP, it changes UAs between reading robots.txt and then hitting the website.
Only SLURP UA's are whitelisted from Yahoo so anything non-SLURP gets torched.
For those interested here's a thread from nov/dec 2007 re: Yahoo lurking around using one UA to get robots and another (bonecho) to scrape. Also keyplr noted that bonecho tried to get a denied directory.
Seeing almost simultaneous, daily hits by Slurp and BonEcho for the same pages. I'm done with non-Inktomi Yahoo anything - Yahoo/Slurp/BonEcho/llf/ in UAs and hostnames - for robots or web page access.
I haven't seen any real Slurp hits outside of these ranges. This about it for a Yahoo ip white list?
INKTOMI-BLK-3 220.127.116.11/18 (18.104.22.168 - 22.214.171.124)
INKTOMI-BLK-4 126.96.36.199/18 (188.8.131.52 - 184.108.40.206)
INKTOMI-BLK-5 220.127.116.11/16 (18.104.22.168 - 22.214.171.124)
INKTOMI-BLK-6 126.96.36.199/16 (188.8.131.52 - 184.108.40.206)
|I'm done with non-Inktomi Yahoo anything |
You're certainly perceptive enough to determine what is beneficial to your own websites.
However, denying all of Yahoo is overkill, at least IMO.
You might try utilizing a conditional deny based on BOTH the IP's that you've provided and the UA that I specified here:
It hasn't affected my listings with Yahoo, nor should it yours.
Phred, in most cases when you see Firefox UA being used by a SE it's making screen shots.
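A conditional deny of that kind (both conditions must hold, IP range AND user-agent) might look roughly like this. It is only a sketch. The network list and the UA tokens are placeholders the caller supplies: the documentation range 198.51.100.0/24 and the "BonEcho" token in the usage lines stand in for whatever ranges and UA pattern you actually settle on, and the function name is hypothetical.

```python
import ipaddress

def should_deny(ip, user_agent, networks, ua_tokens):
    """Deny only when the IP falls inside one of the given networks
    AND the user-agent contains one of the given tokens."""
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in ipaddress.ip_network(net) for net in networks)
    ua = user_agent.lower()
    ua_match = any(token.lower() in ua for token in ua_tokens)
    return in_range and ua_match

# Placeholder values for illustration only.
networks = ["198.51.100.0/24"]
tokens = ["BonEcho"]
print(should_deny("198.51.100.7", "Mozilla/5.0 (X11; U) BonEcho", networks, tokens))
print(should_deny("198.51.100.7", "Mozilla/5.0 (compatible; Yahoo! Slurp)", networks, tokens))
```

Requiring both conditions leaves the regular Slurp crawl from the same ranges untouched, and also leaves real visitors who happen to send a matching UA from other addresses alone.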
Do with that info as you will.
|This about it for a Yahoo ip white list? |
Dan keeps this current: [iplists.com]
|when you see Firefox UA being used by a SE it's making screen shots. |
Bill, any idea what they use these screen shots for?
|Bill, any idea what they use these screen shots for? |
If in fact they are screenshots, I would assume they plan on adding them to the search results just like Ask and Snap did.
FYI, BonEcho has asked for over 7K pages in the last couple of days now.
That's way more data than they would need to check for a little cloaking.
Got several hits from Yahoo crawler hlf*crawl on the 74.6.13.* range this evening with a wget UA, which is banned on my servers...
Wget/1.10.2 (Red Hat modified)
I've had this before and fished them out of the IP blacklist. Over 30 IPs blocked this time. I'm tempted to leave them blocked.
The crawler hit one file about a dozen times, 10 second gap. It then did the same again to the same file. Since the file is a contact form it's banned in robots.txt so another black mark. It then went on to hit another file several dozen times, again about 10 secs apart.
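If you want to count "black marks" like this from your logs, the standard library's robots.txt parser can tell you which requested paths were disallowed. A small sketch follows. The robots.txt content and the `/contact.php` path in the usage lines are made up for illustration, and `disallowed_hits` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def disallowed_hits(robots_txt, paths, agent="*"):
    """Return the requested paths that robots.txt disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from a string
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical robots.txt and request list.
robots = "User-agent: *\nDisallow: /contact.php\n"
requested = ["/index.html", "/contact.php", "/contact.php"]
print(disallowed_hits(robots, requested))
```

Each repeated request is kept in the result, so the length of the returned list gives the number of violations rather than the number of distinct files.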