Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 50 message thread spans 2 pages: 1 [2]
Yahoo! Slurp
Two versions; one ignores robots.txt
Pfui




msg:4360954
 6:31 pm on Sep 10, 2011 (gmt 0)

For years -- YEARS -- I've denied Slurp all graphics in robots.txt and I just presumed it was heeding the restriction.

Wrong.

Depending on the Host and UA, the official Yahoo! Slurp apparently does whatever it wants to. Note the subtle differences in the subdomains and UAs...

This morning, the only Host to read/heed robots.txt was:

b3091154.crawl.yahoo.net [67.195.112.189]
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

These retrieved graphics by the pageful, over 60 total:

b5101137.yst.yahoo.net [98.137.72.218]
b5101139.yst.yahoo.net [98.137.72.228]
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

I can't say if this is new and/or MSN-related. I can say I'm irked.
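For comparison, here is a minimal sketch of what a compliant crawler should conclude from a graphics Disallow, using Python's standard `urllib.robotparser`. The robots.txt content is a hypothetical stand-in (the actual file is not shown in this thread) and the paths are made up for illustration. The sketch sticks to plain path prefixes and passes the short product token ("Slurp") rather than the full UA string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real file is not shown in the thread.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /images/
Disallow: /dir/
"""


def is_allowed(agent_token, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets agent_token fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    for path in ("/dir/example.gif", "/index.html"):
        print(path, is_allowed("Slurp", path))
```

Any logged request from a Slurp host for a path where this returns False would be a robots.txt violation, assuming the bot had fetched a current copy of the file.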

 

Pfui




msg:4386743
 3:11 pm on Nov 14, 2011 (gmt 0)

The IPs hit with a bot-like UA including the word "crawler" on 183.79.63.0/24 - specifically between 183.79.63.77 - 183.79.63.110 but probably most of the /24. There were no proper rDNS entries for the bot IPs.

dstiles, I'm just seeing those today. A surprising amount of redundant activity all of a sudden, all with rDNS and using:

Y!J-BRW/1.0 crawler (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)

no068.trntl.kks.yahoo-net.jp [183.79.63.72]
01:55:53 /robots.txt
no043.trntl.kks.yahoo-net.jp [183.79.63.47]
01:55:53 /robots.txt
no050.trntl.kks.yahoo-net.jp [183.79.63.54]
01:55:58 /robots.txt
no071.trntl.kks.yahoo-net.jp [183.79.63.75]
05:52:23 /robots.txt
no094.trntl.kks.yahoo-net.jp [183.79.63.98]
05:52:23 /robots.txt

FWIW: 183.79.0.0/16 YAHOOJAPAN-BLK [robtex.com...]
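A quick way to test logged IPs against the ranges mentioned here is Python's standard `ipaddress` module. This is only a sketch: the two networks are the ones quoted in this post, and the helper name `in_network` is made up for illustration.

```python
import ipaddress

# Ranges quoted in the post above.
YAHOO_JP_SLASH24 = ipaddress.ip_network("183.79.63.0/24")
YAHOO_JP_BLOCK = ipaddress.ip_network("183.79.0.0/16")


def in_network(ip, network):
    """Return True if the dotted-quad string ip falls inside network."""
    return ipaddress.ip_address(ip) in network


if __name__ == "__main__":
    for ip in ("183.79.63.72", "183.79.63.98", "98.137.72.218"):
        print(ip, in_network(ip, YAHOO_JP_SLASH24), in_network(ip, YAHOO_JP_BLOCK))
```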

dstiles




msg:4386920
 10:18 pm on Nov 14, 2011 (gmt 0)

I suspect the JP version of yahoo is doing its own thing now, what with Western yahoo using bing results. All a bit suspicious, though. :(

Also been seeing a (typical) SE kludge of yahoomobile coming in as a bot and as a proxy. I blocked first as one, now as the other. I do wish these companies would sort out their IP ranges properly. Still, a dual IP usage by yahoo isn't as bad as google's six or eight, I suppose. :(

Apart from the underline strange bot UA and the now-fixed 157 IP bot-range, MS seem to be quite reasonable and consistent on IPs. Which makes a nice change.

incrediBILL




msg:4393301
 6:53 am on Dec 2, 2011 (gmt 0)

I'm seeing yst.yahoo.net IPs download actual web pages, not just images, so blocking it may not be a good idea.

98.137.72.243 - - [01/Dec/2011:22:37:36 -0800] "GET /somepage.html HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"

host 98.137.72.243
243.72.137.98.in-addr.arpa domain name pointer b5101141.yst.yahoo.net.

However, considering Bing is supplying results, does it really matter at the moment?
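For anyone wanting to pull these hits out of a log in bulk, here is a rough sketch that parses a combined-format line like the one above and separates versioned from unversioned Slurp. The regexes and the helper names `parse_line` and `slurp_version` are assumptions for illustration; a log with a different format would need a different pattern.

```python
import re

# Combined log format: ip ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
SLURP_PATTERN = re.compile(r"Yahoo! Slurp(?:/(?P<version>[\d.]+))?")

# The log line quoted in the post above.
SAMPLE = (
    '98.137.72.243 - - [01/Dec/2011:22:37:36 -0800] '
    '"GET /somepage.html HTTP/1.1" 200 2373 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def slurp_version(agent):
    """Return e.g. '3.0', '' for unversioned Slurp, or None if not Slurp."""
    match = SLURP_PATTERN.search(agent)
    if match is None:
        return None
    return match.group("version") or ""


if __name__ == "__main__":
    fields = parse_line(SAMPLE)
    print(fields["ip"], fields["path"], slurp_version(fields["agent"]))
```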

keyplyr




msg:4393303
 7:03 am on Dec 2, 2011 (gmt 0)

However, considering Bing is supplying results, does it really matter at the moment?

Absolutely. The way I see it, Yahoo may have outsourced web search to Bing at the moment, but obviously they have not stopped crawling sites and harvesting data, seemingly with various yet-to-be-determined objectives.

Yahoo continues as a world player and may just be an innovator in "the next big thing." I continue to treat Y! with respect just as I do Google and Bing.

incrediBILL




msg:4393323
 8:11 am on Dec 2, 2011 (gmt 0)

I continue to treat Y! with respect just as I do Google and Bing.


Personally, I find it hard to treat them with respect when they use SLURP as the user agent and don't use the proper full-trip DNS validation conventions. They're just making a mess out of agreed-upon conventions, which I don't respect.

Likewise, having a hard time respecting Bing with all the junk coming from their crawler IPs.

Maybe I just expect too much from these so-called market leaders.
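For reference, the full-trip DNS check mentioned above (reverse lookup, hostname suffix check, then forward lookup back to the same IP) can be sketched roughly as follows. The suffix list is an assumption taken from hostnames seen in this thread, not an official Yahoo list, and the function name `verify_crawler` is hypothetical. The resolver functions are injectable so the logic can be exercised without live DNS.

```python
import socket

# Assumed suffixes, taken from hostnames reported in this thread.
ALLOWED_SUFFIXES = (".crawl.yahoo.net", ".yst.yahoo.net")


def verify_crawler(ip, suffixes=ALLOWED_SUFFIXES, reverse=None, forward=None):
    """Return True only if ip -> hostname -> ip round-trips on an allowed suffix."""
    if reverse is None:
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        forward = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        return False  # no rDNS entry at all
    if not host.endswith(tuple(suffixes)):
        return False  # rDNS points outside the expected domains
    try:
        return ip in forward(host)  # forward lookup must include the original IP
    except OSError:
        return False


if __name__ == "__main__":
    print(verify_crawler("98.137.72.243"))  # performs live DNS lookups
```

A bot that passes the suffix check but fails the forward lookup is the case the round trip exists to catch, since anyone controlling their own reverse zone can claim any hostname.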

keyplyr




msg:4393330
 8:57 am on Dec 2, 2011 (gmt 0)


IMO Y! and MSN (et al) both screw around with rDNS validation. I get hit with MSN ranges all day long, none of which are tagged as crawl. I too expect too much I guess. After all it's not really a standard that is scrutinized by any governing agency, and I don't want it to be. The more regulation, the more the "haves" get what they want and the "have nots" lose IMO.

dstiles




msg:4393636
 11:31 pm on Dec 2, 2011 (gmt 0)

Bill - I wonder what the outcome will be for search bots if/when China buys up yahoo.

But I agree with keyplyr, within reason: let slurp crawl as long as it behaves and returns reasonable DNS. We may be lucky with the next version of the engine! :)

dstiles




msg:4394949
 11:25 pm on Dec 6, 2011 (gmt 0)

There is a report on the WebmasterWorld yahoo forum that slurp has not been seen for a few days on some sites. I don't have an entry for it for the past 36 hours or so, which is rare. Has slurp finally quit or is it in the garage for an overhaul? :)

Pfui




msg:4394975
 12:25 am on Dec 7, 2011 (gmt 0)

Just today:

llf531031.crawl.yahoo.net
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

12/06 1n:39:46 /robots.txt
12/06 1n:39:48 /

Speaking of... Looks like this might explain why Slurp/3.0 started fetching images -- completely ignoring robots.txt in the process. Emphasis mine:

Yahoo! Image Search adds social sharing and continuous scrolling
Posted December 5th, 2011 at 11:13 am by Yahoo! Search

"We are introducing two new, important features today on Yahoo! Image search to help users get the best out of searching for photos online: social sharing and continuous scrolling. ..." [ysearchblog.com...]

dstiles




msg:4395359
 9:01 pm on Dec 7, 2011 (gmt 0)

Could be just image crawling, then? Could explain why I'm not getting hit at the moment - robots.txt blocks the image folders. Is it really ignoring robots.txt, though? If so my explanation fails.

Pfui




msg:4395426
 12:12 am on Dec 8, 2011 (gmt 0)

I started this thread because Yahoo was bot-running two versions of Slurp and one of them, Slurp/3.0, was reading and flat-out ignoring long-standing robots.txt Disallows for graphics, file types, directories, whathaveyous. Still is.

dstiles




msg:4395734
 8:46 pm on Dec 8, 2011 (gmt 0)

Strange. I've logged about 5 slurp hits to home pages in the past few days, although I cannot say about graphics without delving into individual site logs.

Pfui




msg:4395766
 9:56 pm on Dec 8, 2011 (gmt 0)

dstiles, you seem determined to dismiss and/or argue with reports of abuse by Slurp/3.0. Why so smitten?

dstiles




msg:4395774
 10:19 pm on Dec 8, 2011 (gmt 0)

Sorry if it looks that way, pfui, just reporting what I see here. :(

And as I noted: I do not check on image accesses, only pages. Life is too short to go through all the site logs...

...But I've just looked at one site over the past few days. Nothing since the 5th but on that day some images were taken that had been banned in robots.txt - but only one page worth, as though it were a preview or page thumbnail. The IP for this was registered to Inktomi - 74.6.13.95 - one of the yst.yahoo.net ones. I wonder if "inktomi" are crawling independently?

Pfui




msg:4398305
 5:32 pm on Dec 15, 2011 (gmt 0)

FWIW/Recap: All Slurps are Disallowed all graphics by filetype and directory via robots.txt and have been for years. The following is but a snapshot of 100+ hits to graphic files earlier today by --

b5101152.yst.yahoo.net
b5131387.yst.yahoo.net

-- using:

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

07:40:35 /dir/example.gif
07:40:35 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:38 /dir/example.gif
07:40:38 /dir/example.gif
07:40:39 /dir/example.gif
07:40:39 /dir/example.gif
07:40:40 /dir/example.gif
07:40:40 /dir/example.gif
07:40:41 /dir/example.gif
07:40:41 /dir/example.gif
07:40:42 /dir/example.gif
07:40:42 /dir/example.gif
07:40:43 /dir/example.gif
07:40:43 /dir/example.gif
07:40:44 /dir/example.gif
07:40:44 /dir/example.gif
07:40:45 /dir/example.gif
07:40:45 /dir/example.gif
07:40:46 /dir/example.gif
07:40:46 /dir/example.gif
07:40:47 /dir/example.gif
(partial listing)

robots.txt? NO (And not by any yahoo.net/.com Host/IP in the preceding 24 hours.)

Slurp/3.0 is also only allowed access to html files and robots.txt via .htaccess so all those hits (& scores more) were 403'd.
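The access rule described above can be modeled in a few lines of Python. This is a sketch of the decision logic only, not the actual .htaccess (which is not shown in this thread). The extension list, the handling of directory requests such as "/", and the function name `status_for` are all assumptions.

```python
import posixpath

# Assumed set of "html" extensions; the real rule is not shown in the thread.
ALLOWED_EXTENSIONS = {".html", ".htm"}


def status_for(user_agent, path):
    """Return the HTTP status this sketch would give: 200 or 403."""
    if "Slurp/3.0" not in user_agent:
        return 200  # the rule in this sketch only targets Slurp/3.0
    if path == "/robots.txt":
        return 200
    if path.endswith("/"):
        return 200  # assumption: directory requests resolve to an HTML index
    extension = posixpath.splitext(path)[1].lower()
    return 200 if extension in ALLOWED_EXTENSIONS else 403


if __name__ == "__main__":
    agent = ("Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; "
             "http://help.yahoo.com/help/us/ysearch/slurp)")
    for path in ("/robots.txt", "/somepage.html", "/dir/example.gif"):
        print(path, status_for(agent, path))
```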

dstiles




msg:4407027
 5:15 pm on Jan 15, 2012 (gmt 0)

Not actually SLURP but a new (to me) Yahoo range:

124.108.64.0/18
Yahoo Asia - Internet Content Provider
Range based in HK but includes SG, AU and probably others.

Based on the description I'm assuming "content" to mean "servers" and blocked it.

wilderness




msg:4429440
 10:36 am on Mar 15, 2012 (gmt 0)

NO ID.
No images.
No robots.

209.191.87.214 - - [15/Mar/2012:02:08:51 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

lucy24




msg:4429702
 8:38 pm on Mar 15, 2012 (gmt 0)

Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page? :) (Can't count how often I've been bitten by the "no images or CSS, must be a robot" until I realize they were rewritten to a page that doesn't have extra files.)

209.191.87.214 ?! What are they doing in the middle of a bunch of apparent humans? I'd got 209.190.44.138 next door* flagged as "possibly some connection to google" but other than that, nothing unusual in the neighborhood.


* Somewhere I picked up 209.184-191 as a group, but can't find the reference now.

wilderness




msg:4429713
 9:11 pm on Mar 15, 2012 (gmt 0)

Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page?


lucy,
That's likely the canonical redirect, which it failed to follow.

my main page is 6,103, course that's without images.
403 is 539
404 is 197

dstiles




msg:4429736
 10:28 pm on Mar 15, 2012 (gmt 0)

I have the range 209.191.87.214 - 209.191.87.219 banned. There are (when I last checked DNS) a handful of actual bots in the 209.191 range but most are, as noted above, "human" or non-bot bots - if you see what I mean. :)
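If a firewall or server config wants CIDR blocks rather than a start-end range, Python's standard `ipaddress.summarize_address_range` will do the conversion. A small sketch using the range quoted above (the wrapper name `range_to_cidrs` is made up for illustration):

```python
import ipaddress


def range_to_cidrs(first, last):
    """Return the smallest list of CIDR blocks covering first..last inclusive."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]


if __name__ == "__main__":
    for block in range_to_cidrs("209.191.87.214", "209.191.87.219"):
        print(block)
```

For this range the result is two blocks, 209.191.87.214/31 and 209.191.87.216/30, since the range does not start on a /29 boundary.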


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved