
Forum Moderators: Ocean10000 & incrediBILL

Yahoo! Slurp

Two versions; one ignores robots.txt

   
6:31 pm on Sep 10, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



For years -- YEARS -- I've denied Slurp all graphics in robots.txt and I just presumed it was heeding the restriction.

Wrong.

Depending on the Host and UA, the official Yahoo! Slurp apparently does whatever it wants to. Note the subtle differences in the subdomains and UAs...

This morning, the only Host to read/heed robots.txt was:

b3091154.crawl.yahoo.net [67.195.112.189]
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

These retrieved graphics by the pageful, over 60 total:

b5101137.yst.yahoo.net [98.137.72.218]
b5101139.yst.yahoo.net [98.137.72.228]
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

I can't say if this is new and/or MSN-related. I can say I'm irked.
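
For anyone who wants to confirm what a given robots.txt actually tells Slurp before blaming the bot, here is a minimal sketch using Python's standard urllib.robotparser. The rules and paths are invented for illustration (the thread never shows the real file), and the helper name slurp_may_fetch is hypothetical. Note that the stdlib parser only does prefix matching and compares just the part of the agent string before the first "/", so it is given the product token rather than the full UA.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the actual robots.txt is not shown in the thread.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /images/
Disallow: /graphics/

User-agent: *
Disallow:
"""

def slurp_may_fetch(path, rules=ROBOTS_TXT, agent="Slurp"):
    """Return True if `agent` is allowed to fetch `path` under `rules`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    for candidate in ("/images/logo.gif", "/index.html"):
        print(candidate, slurp_may_fetch(candidate))
```

If this reports a path as disallowed and the logs still show Slurp/3.0 fetching it, the file is not the problem.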
3:11 pm on Nov 14, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The IPs hit with a bot-like UA including the word "crawler" on 183.79.63.0/24 - specifically between 183.79.63.77 - 183.79.63.110 but probably most of the /24. There were no proper rDNS entries for the bot IPs.

dstiles, I'm just seeing those today. A surprising amount of redundant activity all of a sudden, all with rDNS and using:

Y!J-BRW/1.0 crawler (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)

no068.trntl.kks.yahoo-net.jp [183.79.63.72]
01:55:53 /robots.txt
no043.trntl.kks.yahoo-net.jp [183.79.63.47]
01:55:53 /robots.txt
no050.trntl.kks.yahoo-net.jp [183.79.63.54]
01:55:58 /robots.txt
no071.trntl.kks.yahoo-net.jp [183.79.63.75]
05:52:23 /robots.txt
no094.trntl.kks.yahoo-net.jp [183.79.63.98]
05:52:23 /robots.txt

FWIW: 183.79.0.0/16 YAHOOJAPAN-BLK [robtex.com...]
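
A quick way to check whether logged IPs fall inside the ranges mentioned above is Python's standard ipaddress module. This is only a sketch: the helper name in_network is made up, and the IP list is simply the five hosts quoted in this post.

```python
import ipaddress

# Ranges named in the thread.
YAHOO_JP_24 = ipaddress.ip_network("183.79.63.0/24")
YAHOO_JP_16 = ipaddress.ip_network("183.79.0.0/16")

def in_network(ip, network):
    """Return True if the dotted-quad string `ip` lies inside `network`."""
    return ipaddress.ip_address(ip) in network

# The five robots.txt requesters listed in this post.
logged = ["183.79.63.72", "183.79.63.47", "183.79.63.54",
          "183.79.63.75", "183.79.63.98"]

all_in_24 = all(in_network(ip, YAHOO_JP_24) for ip in logged)

if __name__ == "__main__":
    print("all in /24:", all_in_24)
```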
10:18 pm on Nov 14, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I suspect the JP version of yahoo is doing its own thing now, what with Western yahoo using bing results. All a bit suspicious, though. :(

Also been seeing a (typical) SE kludge of yahoomobile coming in as a bot and as a proxy. I blocked first as one, now as the other. I do wish these companies would sort out their IP ranges properly. Still, a dual IP usage by yahoo isn't as bad as google's six or eight, I suppose. :(

Apart from the underline strange bot UA and the now-fixed 157 IP bot-range, MS seem to be quite reasonable and consistent on IPs. Which makes a nice change.
6:53 am on Dec 2, 2011 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'm seeing yst.yahoo.net IPs download actual web pages, not just images, so blocking it may not be a good idea.

98.137.72.243 - - [01/Dec/2011:22:37:36 -0800] "GET /somepage.html HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"

host 98.137.72.243
243.72.137.98.in-addr.arpa domain name pointer b5101141.yst.yahoo.net.

However, considering Bing is supplying results, does it really matter at the moment?
7:03 am on Dec 2, 2011 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"However, considering Bing is supplying results, does it really matter at the moment?"

Absolutely. The way I see it, Yahoo may have outsourced web search to Bing at the moment, but obviously they have not stopped crawling sites and harvesting data, seemingly with various yet-to-be-determined objectives.

Yahoo continues as a world player and may just be an innovator in "the next big thing." I continue to treat Y! with respect just as I do Google and Bing.
8:11 am on Dec 2, 2011 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"I continue to treat Y! with respect just as I do Google and Bing."


Personally, I find it hard to treat them with respect when they use SLURP as the user agent and don't follow the proper full-trip DNS validation conventions. They're just making a mess of agreed-upon conventions, and that I don't respect.
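
For reference, "full-trip" validation means: reverse-resolve the IP to a hostname, check that the hostname sits under a domain the crawler is supposed to use, then forward-resolve that hostname and confirm it maps back to the same IP. A rough Python sketch follows. The suffix list is taken from hostnames reported in this thread and is an assumption, not an official Yahoo list; the function names are hypothetical; and the lookups are injectable so the logic can be exercised without touching DNS.

```python
import socket

# Suffixes seen on hosts reported in this thread; treating them as the
# complete set of legitimate Slurp domains is an assumption.
TRUSTED_SUFFIXES = (".crawl.yahoo.net", ".yst.yahoo.net")

def _reverse(ip):
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]

def _forward(host):
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]

def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if ip -> host -> ip round-trips under a trusted suffix."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses, so a
        # missing PTR or failed forward lookup counts as "not verified".
        return False
```

A UA claiming to be Slurp from an IP that fails this check can be treated as a spoof; an IP that passes but has no "crawl" in its hostname is the messier case being complained about here.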

Likewise, having a hard time respecting Bing with all the junk coming from their crawler IPs.

Maybe I just expect too much from these so-called market leaders.
8:57 am on Dec 2, 2011 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




IMO Y! and MSN (et al.) both screw around with rDNS validation. I get hit with MSN ranges all day long, none of which are tagged as crawl. I too expect too much, I guess. After all, it's not really a standard that is scrutinized by any governing agency, and I don't want it to be. The more regulation, the more the "haves" get what they want and the "have nots" lose, IMO.
11:31 pm on Dec 2, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Bill - I wonder what the outcome will be for search bots if/when China buys up yahoo.

But I agree with keyplyr, within reason: let slurp crawl as long as it behaves and returns reasonable DNS. We may be lucky with the next version of the engine! :)
11:25 pm on Dec 6, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



There is a report on the WebmasterWorld yahoo forum that slurp has not been seen for a few days on some sites. I don't have an entry for it for the past 36 hours or so, which is rare. Has slurp finally quit or is it in the garage for an overhaul? :)
12:25 am on Dec 7, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Just today:

llf531031.crawl.yahoo.net
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

12/06 1n:39:46 /robots.txt
12/06 1n:39:48 /

Speaking of... Looks like this might explain why Slurp/3.0 started fetching images -- completely ignoring robots.txt in the process. Emphasis mine:

Yahoo! Image Search adds social sharing and continuous scrolling
Posted December 5th, 2011 at 11:13 am by Yahoo! Search

"We are introducing two new, important features today on Yahoo! Image search to help users get the best out of searching for photos online: social sharing and continuous scrolling. ..." [ysearchblog.com...]
9:01 pm on Dec 7, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Could be just image crawling, then? Could explain why I'm not getting hit at the moment - robots.txt blocks the image folders. Is it really ignoring robots.txt, though? If so my explanation fails.
12:12 am on Dec 8, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I started this thread because Yahoo was bot-running two versions of Slurp and one of them, Slurp/3.0, was reading and flat-out ignoring long-standing robots.txt Disallows for graphics, file types, directories, whathaveyous. Still is.
8:46 pm on Dec 8, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Strange. I've logged about 5 slurp hits to home pages in the past few days, although I cannot say about graphics without delving into individual site logs.
9:56 pm on Dec 8, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



dstiles, you seem determined to dismiss and/or argue with reports of abuse by Slurp/3.0. Why so smitten?
10:19 pm on Dec 8, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Sorry if it looks that way, pfui, just reporting what I see here. :(

And as I noted: I do not check on image accesses, only pages. Life is too short to go through all the site logs...

...But I've just looked at one site over the past few days. Nothing since the 5th but on that day some images were taken that had been banned in robots.txt - but only one page worth, as though it were a preview or page thumbnail. The IP for this was registered to Inktomi - 74.6.13.95 - one of the yst.yahoo.net ones. I wonder if "inktomi" are crawling independently?
5:32 pm on Dec 15, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



FWIW/Recap: All Slurps are Disallowed all graphics by filetype and directory via robots.txt and have been for years. The following is but a snapshot of 100+ hits to graphic files earlier today by --

b5101152.yst.yahoo.net
b5131387.yst.yahoo.net

-- using:

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

07:40:35 /dir/example.gif
07:40:35 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:38 /dir/example.gif
07:40:38 /dir/example.gif
07:40:39 /dir/example.gif
07:40:39 /dir/example.gif
07:40:40 /dir/example.gif
07:40:40 /dir/example.gif
07:40:41 /dir/example.gif
07:40:41 /dir/example.gif
07:40:42 /dir/example.gif
07:40:42 /dir/example.gif
07:40:43 /dir/example.gif
07:40:43 /dir/example.gif
07:40:44 /dir/example.gif
07:40:44 /dir/example.gif
07:40:45 /dir/example.gif
07:40:45 /dir/example.gif
07:40:46 /dir/example.gif
07:40:46 /dir/example.gif
07:40:47 /dir/example.gif
(partial listing)

robots.txt? NO (And not by any yahoo.net/.com Host/IP in the preceding 24 hours.)

Slurp/3.0 is also only allowed access to html files and robots.txt via .htaccess so all those hits (& scores more) were 403'd.
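
For those who don't want to eyeball raw logs, a small Python sketch that pulls Slurp/3.0 requests for image files out of Apache combined-format lines might look like this. The extension list and the function name are assumptions. The first sample line is the one quoted earlier in this thread; the second is made up for illustration (documentation-range IP, invented timestamp offset and size) using the path and 403 status described above.

```python
import re

# Apache "combined" log format, as in the line quoted earlier in the thread.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")  # assumed list

def slurp3_image_hits(lines):
    """Return (time, path, status) for each Slurp/3.0 request for an image."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m or "Slurp/3.0" not in m.group("agent"):
            continue
        # Ignore any query string when testing the file extension.
        path = m.group("path").split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTENSIONS):
            hits.append((m.group("time"), m.group("path"),
                         int(m.group("status"))))
    return hits

SAMPLE = [
    # Line quoted in the thread: a page fetch, so it is not counted.
    '98.137.72.243 - - [01/Dec/2011:22:37:36 -0800] "GET /somepage.html HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"',
    # Made-up line for illustration only.
    '192.0.2.10 - - [15/Dec/2011:07:40:35 -0800] "GET /dir/example.gif HTTP/1.1" 403 539 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"',
]

if __name__ == "__main__":
    for hit in slurp3_image_hits(SAMPLE):
        print(hit)
```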
5:15 pm on Jan 15, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Not actually SLURP but a new (to me) Yahoo range:

124.108.64.0/18
Yahoo Asia - Internet Content Provider
Range based in HK but includes SG, AU and probably others.

Based on the description I'm assuming "content" to mean "servers" and blocked it.
10:36 am on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



NO ID.
No images.
No robots.

209.191.87.214 - - [15/Mar/2012:02:08:51 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
8:38 pm on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page? :) (Can't count how often I've been bitten by the "no images or CSS, must be a robot" until I realize they were rewritten to a page that doesn't have extra files.)

209.191.87.214 ?! What are they doing in the middle of a bunch of apparent humans? I'd got 209.190.44.138 next door* flagged as "possibly some connection to google" but other than that, nothing unusual in the neighborhood.


* Somewhere I picked up 209.184-191 as a group, but can't find the reference now.
9:11 pm on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page?"


lucy,
That's likely the canonical redirect, which it failed to follow.

My main page is 6,103 bytes; of course, that's without images.
403 is 539
404 is 197
10:28 pm on Mar 15, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I have the range 209.191.87.214 - 209.191.87.219 banned. There are (when I last checked DNS) a handful of actual bots in the 209.191 range but most are, as noted above, "human" or non-bot bots - if you see what I mean. :)
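
If your firewall or .htaccess wants CIDR blocks rather than a start-end range, Python's standard ipaddress module can do the conversion. A small sketch for the range above (the variable names are arbitrary):

```python
import ipaddress

# The banned range quoted in this post.
first = ipaddress.ip_address("209.191.87.214")
last = ipaddress.ip_address("209.191.87.219")

# summarize_address_range() yields the smallest set of CIDR networks
# that covers exactly first..last, inclusive.
blocks = [str(net) for net in ipaddress.summarize_address_range(first, last)]

if __name__ == "__main__":
    print(blocks)  # ['209.191.87.214/31', '209.191.87.216/30']
```

The six addresses do not align to a single block, which is why two entries come back.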
