
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Yahoo Slurp - no hostname
phred




msg:3706048
 7:41 am on Jul 24, 2008 (gmt 0)

Part of my validation when bots access robots.txt is to do an IP --> hostname lookup and then do a hostname --> IP lookup. Yahoo Slurp has always passed the test before, but last night it didn't. Good IP (74.6.17.nnn), but the hostname lookup failed. So he got bounced and the IP got blocked. 30 seconds later he hits the website, but of course the IP is now blocked (still no valid hostname lookup). Then 60 seconds later he hits the website again; the IP is still blocked so he gets bounced again, but this time the hostname lookup yields the normal llf520nnn.crawl.yahoo.net hostname.

Do you think this is a problem with my dirt-cheap hosting server or a problem with the Yahoo name servers? Has anyone else ever seen Yahoo Slurp without a valid hostname?

Cheers,
Phred
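For readers who want to try the check Phred describes (IP --> hostname, then hostname --> IP), here is a minimal sketch in Python. The function name `fcrdns_check`, the `allowed_suffixes` default, and the injectable `reverse`/`forward` parameters are assumptions made for illustration; this is not Phred's actual code. A failed reverse lookup is reported as `(False, None)`, so a caller can tell "no hostname at all" (the case in this post) apart from a hostname that does not match.

```python
import socket

def fcrdns_check(ip, allowed_suffixes=(".crawl.yahoo.net",),
                 reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Return (ok, hostname); ok is True only if ip -> hostname -> ip round-trips."""
    try:
        hostname = reverse(ip)[0]          # IP --> hostname (reverse lookup)
    except OSError:
        return False, None                 # no hostname could be resolved
    if not hostname.lower().endswith(allowed_suffixes):
        return False, hostname             # resolves, but not to an expected domain
    try:
        addresses = forward(hostname)[2]   # hostname --> list of IPs
    except OSError:
        return False, hostname
    return ip in addresses, hostname       # the original IP must be among them
```

Because the lookup in this thread failed only temporarily, one option would be to treat a `(False, None)` result as "unverified, retry later" rather than as grounds for a lasting IP block; that is a design choice, not something the thread settles.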

 

jdMorgan




msg:3707218
 1:22 pm on Jul 25, 2008 (gmt 0)

I haven't seen that. What was the user-agent string?

Reason I ask is that I'm seeing requests which rDNS to llf320032.crawl.yahoo.net, but using the Linux Firefox BonEcho user-agent string:

67.195.37.108 - - [24/Jul/2008:23:15:50 -0400] "GET /mainstyle.css HTTP/1.0" 200 2501 "http://www.example.com/mydir/mypage.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4"

Seems to take an interest in the CSS file for each page it fetches, so my money's on an anti-cloaking 'bot.

Jim

Staffa




msg:3707240
 1:48 pm on Jul 25, 2008 (gmt 0)

I'm currently also watching the behaviour of this Linux Firefox BonEcho from Y! It comes from IP numbers that are also used for Slurp crawling.

7/25/2008 8:36:41 AM----67.195.37.160----/page_a.asp ----Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4----
7/25/2008 8:46:36 AM----67.195.37.160----/page_b.asp ----Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

It only asks for the page, no other files.

I haven't made up my mind yet how to treat it, but if it's an anti-cloaking bot, so much for them cloaking their normal UA, as if their IP isn't a giveaway for who they are.

wilderness




msg:3707244
 1:51 pm on Jul 25, 2008 (gmt 0)

Jim,
I've had an OS portion of the UA denied for some years (it's been discussed here previously).

Most of the major SEs use a similar OS UA for whatever reason.

It has yet to affect either my listings or the normal crawls.

jdMorgan




msg:3707260
 2:08 pm on Jul 25, 2008 (gmt 0)

Yes, but since I *do* cloak --or serve user-agent-dependent content, to be more precise (e.g. bot-traps, mobile-device detection, alternate CSS files for IE vs. "other," etc.)-- I thought I'd mention it here.

I'd rather stay off their "hand-check" list, because I don't entirely trust the "hands" that do the checking to make the right call. It depends too much on how smart/how well-trained they are to figure out that there is no intent to deceive anyone -- just a highly-customized "user experience." Having been booted by an over-zealous and evidently-untrained checker at MSN/Live, I'd rather keep my Yahoo traffic. :)

BonEcho is a "code-name" for an old pre-release Firefox browser.

Jim

phred




msg:3707481
 5:14 pm on Jul 25, 2008 (gmt 0)

I haven't seen that. What was the user-agent string?

UA was:

Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)

From 74.6.17.168

robots.txt at 19:45:14 - no hostname - therefore ip blocked and a 403
webpage at 19:46:26 - no hostname - 403 because ip previously blocked
webpage at 19:47:52 - good hostname (llf..) - 403 because ip previously blocked

then from 67.195.37.95

robots.txt at 01:25:23 - good hostname

Now from today:

67.195.37.95 (all good hostnames)

robots.txt at 02:07:33 UA of: Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)

webpage at 02:07:34 UA of: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4

So in this case robots.txt with "normal" Yahoo UA and the crawl with the X11 UA.

Phred
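The pattern Phred logs here (robots.txt fetched with one UA, then the page fetched from the same IP with another) can be flagged mechanically. The following is a rough sketch only; the class name `UATracker` and its `observe` method are hypothetical, and it keeps state in memory rather than in whatever store a real bot blocker would use.

```python
class UATracker:
    """Remember which UA fetched robots.txt from each IP and flag later mismatches."""

    def __init__(self):
        self._robots_ua = {}  # ip -> UA seen on the most recent robots.txt fetch

    def observe(self, ip, path, user_agent):
        """Return True if this request's UA differs from the robots.txt UA for the same IP."""
        if path == "/robots.txt":
            self._robots_ua[ip] = user_agent
            return False
        seen = self._robots_ua.get(ip)
        # Only flag when a robots.txt fetch from this IP has actually been recorded.
        return seen is not None and seen != user_agent
```

Fed the two requests from 67.195.37.95 listed above, `observe` would return `False` for the robots.txt fetch and `True` for the page fetch with the X11/BonEcho UA.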

incrediBILL




msg:3707761
 10:38 pm on Jul 25, 2008 (gmt 0)

I'm not sure what they're doing but my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.

phred




msg:3707972
 6:24 am on Jul 26, 2008 (gmt 0)

my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.

Without giving away any secrets - was it just not white-listed, did it do something sorta naughty, or was it outright being a very bad bot? I'm close to not allowing UAs containing bonecho, and to outright stopping Yahoo when it changes UAs from the same IP between reading robots.txt and then hitting the website.

Cheers,
Phred

incrediBILL




msg:3707987
 7:09 am on Jul 26, 2008 (gmt 0)

Only SLURP UAs are whitelisted from Yahoo, so anything non-SLURP gets torched.
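A rule of this kind might look roughly like the sketch below. How incrediBILL's blocker actually decides that a request is "from Yahoo" is not stated in the thread; this sketch assumes it is decided by a verified reverse-DNS hostname ending in `.crawl.yahoo.net`, and the function name and return values are made up for illustration.

```python
def yahoo_ua_verdict(hostname, user_agent):
    """Classify a request as 'allow', 'block', or 'not-yahoo'.

    hostname is the (already verified) reverse-DNS name, or None if there is none.
    Assumption: Yahoo crawler hosts end in .crawl.yahoo.net.
    """
    from_yahoo = hostname is not None and hostname.lower().endswith(".crawl.yahoo.net")
    if not from_yahoo:
        return "not-yahoo"  # leave the decision to other rules
    # From a Yahoo crawl host: only a Slurp UA is whitelisted.
    return "allow" if "slurp" in user_agent.lower() else "block"
```

With the hostname jdMorgan reports, `yahoo_ua_verdict("llf320032.crawl.yahoo.net", ua)` would give "block" for the BonEcho UA and "allow" for the Slurp UA.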

phred




msg:3708392
 11:28 pm on Jul 26, 2008 (gmt 0)

For those interested, here's a thread from Nov/Dec 2007 re: Yahoo lurking around using one UA to get robots.txt and another (bonecho) to scrape. Also, keyplr noted that bonecho tried to get a denied directory.

[webmasterworld.com...]

I'm seeing almost simultaneous, daily hits by Slurp and bonecho for the same pages. I'm done with non-Inktomi Yahoo anything - Yahoo/Slurp/BonEcho/llf/ in UAs and hostnames - for robots.txt or web page access.

I haven't seen any real Slurp hits outside of these ranges. Is this about it for a Yahoo IP whitelist?

INKTOMI-BLK-3 66.196.64.0/18 (66.196.64.0 - 66.196.127.255)
INKTOMI-BLK-4 68.142.192.0/18 (68.142.192.0 - 68.142.255.255)
INKTOMI-BLK-5 72.30.0.0/16 (72.30.0.0 - 72.30.255.255)
INKTOMI-BLK-6 74.6.0.0/16 (74.6.0.0 - 74.6.255.255)

Phred
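Checking an address against the four blocks listed above is straightforward with Python's standard `ipaddress` module. This is a sketch under the assumption that the list is used as a plain allow-list; the names `YAHOO_RANGES` and `in_whitelist` are hypothetical.

```python
import ipaddress

# The four INKTOMI blocks exactly as posted above.
YAHOO_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("66.196.64.0/18", "68.142.192.0/18", "72.30.0.0/16", "74.6.0.0/16")
]

def in_whitelist(ip, networks=YAHOO_RANGES):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Note that `in_whitelist("74.6.17.168")` is true, but `in_whitelist("67.195.37.95")` is false: the 67.195.37.x addresses reported earlier in this thread with good crawl.yahoo.net hostnames fall outside these four blocks, so the list as posted would not cover them.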

wilderness




msg:3708397
 11:56 pm on Jul 26, 2008 (gmt 0)

I'm done with non-Inktomi Yahoo anything

phred,
You're certainly perceptive enough to determine what is beneficial to your own websites.

However, denying all of Yahoo is overkill, at least IMO.

You might try utilizing a conditional deny based on BOTH the IPs that you've provided and the UA that I specified here:

[webmasterworld.com...]

It hasn't affected my listings with Yahoo, nor should it affect yours.

Don

incrediBILL




msg:3708403
 12:13 am on Jul 27, 2008 (gmt 0)

Phred, in most cases when you see a Firefox UA being used by an SE, it's making screenshots.

Do with that info as you will.

wilderness




msg:3708422
 12:55 am on Jul 27, 2008 (gmt 0)

This about it for a Yahoo ip white list?

Dan keeps this current: [iplists.com]

Staffa




msg:3708522
 9:00 am on Jul 27, 2008 (gmt 0)

when you see Firefox UA being used by a SE it's making screen shots.

Bill, any idea what they use these screenshots for?

incrediBILL




msg:3708860
 11:47 pm on Jul 27, 2008 (gmt 0)

Bill, any idea what they use these screen shots for ?

If in fact they are screenshots, I would assume they plan on adding them to the search results just like Ask and Snap did.

incrediBILL




msg:3708867
 12:11 am on Jul 28, 2008 (gmt 0)

FYI, BonEcho has asked for over 7K pages in the last couple of days now.

That's way more data than they would need to check for a little cloaking.

dstiles




msg:3712638
 3:13 am on Aug 1, 2008 (gmt 0)

Got several hits from Yahoo crawler hlf*crawl on the 74.6.13.* range this evening with a wget UA, which is banned on my servers...

Wget/1.10.2 (Red Hat modified)

I've had this before and fished them out of the IP blacklist. Over 30 IPs blocked this time. I'm tempted to leave them blocked.

The crawler hit one file about a dozen times at roughly 10-second intervals, then did the same again to the same file. Since the file is a contact form, it's banned in robots.txt, so that's another black mark. It then went on to hit another file several dozen times, again about 10 seconds apart.
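Whether a logged request ignored robots.txt can be checked with the standard library's `urllib.robotparser`. The sketch below is illustrative only: the helper name `violates_robots` is hypothetical, and the rules and path in the usage note are invented stand-ins, since dstiles' actual robots.txt and file names are not given.

```python
from urllib.robotparser import RobotFileParser

def violates_robots(robots_lines, user_agent, path):
    """Return True if robots.txt (given as a list of lines) disallows path for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules directly, no network fetch
    return not parser.can_fetch(user_agent, path)
```

For example, with the hypothetical rules `["User-agent: *", "Disallow: /contact.php"]`, a request for `/contact.php` by `Wget/1.10.2 (Red Hat modified)` would count as a violation, while a request for `/index.html` would not.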


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved