
Forum Moderators: Ocean10000 & incrediBILL


Yahoo Slurp - no hostname

     
7:41 am on Jul 24, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2008
posts:55
votes: 0


Part of my validation when bots access robots.txt is to do an IP --> hostname lookup and then a hostname --> IP lookup. Yahoo Slurp has always passed the test before, but last night it didn't. Good IP (74.6.17.nnn), but the hostname lookup failed. So he got bounced and the IP got blocked. 30 seconds later he hits the website, but of course the IP is now blocked (still no valid hostname lookup). Then 60 seconds later he hits the website again; the IP is still blocked so he gets bounced again, but this time the hostname lookup yields the normal llf520nnn.crawl.yahoo.net hostname.
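For anyone wanting to try the same two-step check, here is a minimal Python sketch of it. It is not my actual script; the function names and the three-way "valid" / "invalid" / "unknown" result are assumptions made for illustration. The only thing taken from the logs is the .crawl.yahoo.net suffix. Returning "unknown" when a lookup fails, rather than treating it as a failed validation, is one possible way to avoid blocking an IP over a transient DNS hiccup.

```python
import socket
from typing import Callable, List, Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses for hostname, or [] if the lookup fails."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def validate_bot(
    ip: str,
    allowed_suffix: str = ".crawl.yahoo.net",
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> str:
    """Forward-confirmed reverse DNS: IP -> hostname -> IP.

    Returns "valid", "invalid", or "unknown" (a lookup failed, so the
    result may be transient and is probably not worth a permanent block).
    """
    hostname = reverse(ip)
    if hostname is None:
        return "unknown"
    if not hostname.rstrip(".").lower().endswith(allowed_suffix):
        return "invalid"
    addresses = forward(hostname)
    if not addresses:
        return "unknown"
    return "valid" if ip in addresses else "invalid"
```

The resolver functions are passed in as parameters so the logic can be exercised without touching real DNS; in normal use you would just call validate_bot(ip) and retry later on "unknown".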

Do you think this is a problem with my dirt-cheap hosting server or with the Yahoo name servers? Has anyone else ever seen Yahoo Slurp without a valid hostname?

Cheers,
Phred

1:22 pm on July 25, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I haven't seen that. What was the user-agent string?

Reason I ask is that I'm seeing requests which rDNS to llf320032.crawl.yahoo.net, but using the Linux Firefox BonEcho user-agent string:

67.195.37.108 - - [24/Jul/2008:23:15:50 -0400] "GET /mainstyle.css HTTP/1.0" 200 2501 "http://www.example.com/mydir/mypage.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4"

Seems to take an interest in the CSS file for each page it fetches, so my money's on an anti-cloaking 'bot.

Jim

1:48 pm on July 25, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


I'm currently also watching the behaviour of this Linux Firefox BonEcho from Y! It comes from IP numbers that are equally used for Slurp crawling.

7/25/2008 8:36:41 AM----67.195.37.160----/page_a.asp ----Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4----
7/25/2008 8:46:36 AM----67.195.37.160----/page_b.asp ----Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

It only asks for the page, no other files.

I haven't made up my mind yet how to treat it, but if it's an anti-cloaking bot, so much for them cloaking their normal UA, as if their IP isn't a giveaway for who they are.

1:51 pm on July 25, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Jim,
I've had an OS portion of the UA denied for some years (it's been discussed here previously).

Most of the major SEs use a similar OS UA for whatever reasons.

It has yet to affect either my listings or the normal crawls.

2:08 pm on July 25, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Yes, but since I *do* cloak --or serve user-agent-dependent content, to be more precise (e.g. bot-traps, mobile-device detection, alternate CSS files for IE vs. "other," etc.)-- I thought I'd mention it here.

I'd rather stay off their "hand-check" list, because I don't entirely trust the "hands" that do the checking to make the right call. It depends too much on how smart/how well-trained they are to figure out that there is no intent to deceive anyone -- just a highly-customized "user experience." Having been booted by an over-zealous and evidently-untrained checker at MSN/Live, I'd rather keep my Yahoo traffic. :)

BonEcho is a "code-name" for an old pre-release Firefox browser.

Jim

5:14 pm on July 25, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2008
posts:55
votes: 0


I haven't seen that. What was the user-agent string?

UA was:

Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)

From 74.6.17.168

robots.txt at 19:45:14 - no hostname - therefore ip blocked and a 403
webpage at 19:46:26 - no hostname - 403 because ip previously blocked
webpage at 19:47:52 - good hostname (llf..) - 403 because ip previously blocked

then from 67.195.37.95

robots.txt at 01:25:23 - good hostname

Now from today:

67.195.37.95 (all good hostnames)

robots.txt at 02:07:33 UA of: Mozilla/5.0 (compatible; Yahoo! Slurp; httx://help.yahoo.com/help/us/ysearch/slurp)

webpage at 02:07:34 UA of: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4

So in this case robots.txt with "normal" Yahoo UA and the crawl with the X11 UA.
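A rough Python sketch of how that UA switch could be flagged automatically. The class and method names are made up for illustration; the idea is just to remember which UA each IP used for robots.txt and compare it with the UA on later page requests.

```python
from typing import Dict


class UaSwitchDetector:
    """Remember the UA each IP used for robots.txt and flag later mismatches."""

    def __init__(self) -> None:
        self._robots_ua: Dict[str, str] = {}

    def observe(self, ip: str, path: str, user_agent: str) -> bool:
        """Record one request.

        Returns True if this IP previously fetched robots.txt with a
        different user-agent string than the one it is using now.
        """
        if path == "/robots.txt":
            self._robots_ua[ip] = user_agent
            return False
        seen = self._robots_ua.get(ip)
        return seen is not None and seen != user_agent
```

Fed the two requests above from 67.195.37.95, the robots.txt hit returns False and the page hit with the X11 UA returns True. An IP that never asked for robots.txt is not flagged by this sketch.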

Phred

10:38 pm on July 25, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


I'm not sure what they're doing but my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.

6:24 am on July 26, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2008
posts:55
votes: 0


my bot blocker has stopped 5K hits by "BonEcho" in the last couple of days from a single IP.

Without giving away any secrets - was it just not white-listed, did it do something sorta naughty, or was it outright being a very bad bot? I'm close to not allowing UAs containing bonecho, and to outright stopping Yahoo when, coming from the same IP, it changes UAs between reading robots.txt and then hitting the website.

Cheers,
Phred

7:09 am on July 26, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Only SLURP UA's are whitelisted from Yahoo so anything non-SLURP gets torched.

11:28 pm on July 26, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2008
posts:55
votes: 0


For those interested, here's a thread from Nov/Dec 2007 re: Yahoo lurking around using one UA to get robots and another (bonecho) to scrape. Also keyplr noted that bonecho tried to get a denied directory.

[webmasterworld.com...]

Seeing almost simultaneous, daily hits by Slurp and bonecho for the same pages. I'm done with non-Inktomi Yahoo anything - Yahoo/Slurp/BonEcho/llf/ in UAs and hostnames - for robots or web page access.

I haven't seen any real Slurp hits outside of these ranges. Is this about it for a Yahoo IP white list?

INKTOMI-BLK-3 66.196.64.0/18 (66.196.64.0 - 66.196.127.255)
INKTOMI-BLK-4 68.142.192.0/18 (68.142.192.0 - 68.142.255.255)
INKTOMI-BLK-5 72.30.0.0/16 (72.30.0.0 - 72.30.255.255)
INKTOMI-BLK-6 74.6.0.0/16 (74.6.0.0 - 74.6.255.255)
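
In case it helps, a small Python sketch of testing an address against those four blocks with the standard ipaddress module. The function name is just for illustration, and the list is only what I've seen myself, not an official Yahoo list.

```python
import ipaddress
from typing import Optional

# The four ranges listed above, keyed by their block names.
INKTOMI_BLOCKS = {
    "INKTOMI-BLK-3": ipaddress.ip_network("66.196.64.0/18"),
    "INKTOMI-BLK-4": ipaddress.ip_network("68.142.192.0/18"),
    "INKTOMI-BLK-5": ipaddress.ip_network("72.30.0.0/16"),
    "INKTOMI-BLK-6": ipaddress.ip_network("74.6.0.0/16"),
}


def inktomi_block(ip: str) -> Optional[str]:
    """Return the name of the block containing ip, or None if it is in none."""
    addr = ipaddress.ip_address(ip)
    for name, network in INKTOMI_BLOCKS.items():
        if addr in network:
            return name
    return None
```

With this list, 74.6.17.168 lands in INKTOMI-BLK-6, while 67.195.37.95 (the IP that switched UAs) is in none of the four blocks.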

Phred

11:56 pm on July 26, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


I'm done with non-Inktomi Yahoo anything

phred,
You're certainly perceptive enough to determine what is beneficial to your own websites.

However, denying all of Yahoo is overkill, at least IMO.

You might try utilizing a conditional deny based on BOTH the IP's that you've provided and the UA that I specified here:

[webmasterworld.com...]

It hasn't affected my listings with Yahoo, nor should it yours.
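
As a rough illustration of the "both conditions" idea, here is a Python sketch rather than my actual rules. The IP ranges are the ones phred posted. The UA token is only a stand-in borrowed from the BonEcho strings quoted in this thread, since the exact UA portion I deny is in the linked post; substitute your own.

```python
import ipaddress

# Ranges posted earlier in this thread.
YAHOO_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("66.196.64.0/18", "68.142.192.0/18", "72.30.0.0/16", "74.6.0.0/16")
]

# Assumption: a placeholder for the OS portion of the UA to deny.
DENIED_UA_TOKEN = "X11; U; Linux i686 (x86_64)"


def should_deny(ip: str, user_agent: str) -> bool:
    """Deny only when the IP is in a listed range AND the UA has the token."""
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in network for network in YAHOO_RANGES)
    return in_range and DENIED_UA_TOKEN in user_agent
```

A normal Slurp UA from the same ranges passes, and so does the same OS token from an unrelated IP, so regular crawls and ordinary Linux visitors are left alone.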

Don

12:13 am on July 27, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Phred, in most cases when you see a Firefox UA being used by a SE, it's making screen shots.

Do with that info as you will.

12:55 am on July 27, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


This about it for a Yahoo ip white list?

Dan keeps [iplists.com] this current

9:00 am on July 27, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


when you see Firefox UA being used by a SE it's making screen shots.

Bill, any idea what they use these screen shots for?

11:47 pm on July 27, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Bill, any idea what they use these screen shots for?

If in fact they are screenshots, I would assume they plan on adding them to the search results just like Ask and Snap did.

12:11 am on July 28, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


FYI, BonEcho has asked for over 7K pages in the last couple of days now.

That's way more data than they would need to check for a little cloaking.

3:13 am on Aug 1, 2008 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


Got several hits from Yahoo crawler hlf*crawl on the 74.6.13.* range this evening with a wget UA, which is banned on my servers...

Wget/1.10.2 (Red Hat modified)

I've had this before and fished them out of the IP blacklist. Over 30 IPs blocked this time. I'm tempted to leave them blocked.

The crawler hit one file about a dozen times, 10 second gap. It then did the same again to the same file. Since the file is a contact form it's banned in robots.txt so another black mark. It then went on to hit another file several dozen times, again about 10 secs apart.
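
For what it's worth, the "banned in robots.txt" black mark can be checked mechanically. Below is a Python sketch using the standard urllib.robotparser module. The robots.txt content and the /contact.asp path are invented for the example; they are not my real file names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the contact form path is made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact.asp
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def violates_robots(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """True if robots.txt disallows this path for this user agent."""
    return not parser.can_fetch(user_agent, path)
```

Run over the access log, a request for the disallowed path with the Wget UA comes back True, which is the black mark described above. Counting the repeated 10-second hits would need a separate per-IP tally.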