Pfui

msg:4386743 | 3:11 pm on Nov 14, 2011 (gmt 0) |
| The IPs hit with a bot-like UA including the word "crawler" on 183.79.63.0/24 - specifically between 183.79.63.77 - 183.79.63.110 but probably most of the /24. There were no proper rDNS entries for the bot IPs. |
| dstiles, I'm just seeing those today. A surprising amount of redundant activity all of a sudden, all with rDNS and using:
Y!J-BRW/1.0 crawler (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
no068.trntl.kks.yahoo-net.jp [183.79.63.72] 01:55:53 /robots.txt
no043.trntl.kks.yahoo-net.jp [183.79.63.47] 01:55:53 /robots.txt
no050.trntl.kks.yahoo-net.jp [183.79.63.54] 01:55:58 /robots.txt
no071.trntl.kks.yahoo-net.jp [183.79.63.75] 05:52:23 /robots.txt
no094.trntl.kks.yahoo-net.jp [183.79.63.98] 05:52:23 /robots.txt
FWIW: 183.79.0.0/16 YAHOOJAPAN-BLK [robtex.com...]
|
dstiles

msg:4386920 | 10:18 pm on Nov 14, 2011 (gmt 0) |
I suspect the JP version of yahoo is doing its own thing now, what with Western yahoo using bing results. All a bit suspicious, though. :( Also been seeing a (typical) SE kludge of yahoomobile coming in as a bot and as a proxy. I blocked first as one, now as the other. I do wish these companies would sort out their IP ranges properly. Still, a dual IP usage by yahoo isn't as bad as google's six or eight, I suppose. :( Apart from the underline strange bot UA and the now-fixed 157 IP bot-range, MS seem to be quite reasonable and consistent on IPs. Which makes a nice change.
|
incrediBILL

msg:4393301 | 6:53 am on Dec 2, 2011 (gmt 0) |
I'm seeing yst.yahoo.net IPs download actual web pages, not just images, so blocking it may not be a good idea.
98.137.72.243 - - [01/Dec/2011:22:37:36 -0800] "GET /somepage.html HTTP/1.1" 200 2373 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
host 98.137.72.243
243.72.137.98.in-addr.arpa domain name pointer b5101141.yst.yahoo.net.
However, considering Bing is supplying results, does it really matter at the moment?
|
keyplyr

msg:4393303 | 7:03 am on Dec 2, 2011 (gmt 0) |
| However, considering Bing is supplying results, does it really matter at the moment? |
| Absolutely. The way I see it, Yahoo may have outsourced web search to Bing at the moment, but obviously they have not stopped crawling sites and harvesting data, seemingly with various yet-to-be-determined objectives. Yahoo continues as a world player and may just be an innovator in "the next big thing." I continue to treat Y! with respect just as I do Google and Bing.
|
incrediBILL

msg:4393323 | 8:11 am on Dec 2, 2011 (gmt 0) |
| I continue to treat Y! with respect just as I do Google and Bing. |
| Personally, I find it hard to treat them with respect when they use SLURP as the user agent and don't use the proper full-trip DNS validation conventions, they're just making a mess out of agreed upon conventions, which I don't respect. Likewise, having a hard time respecting Bing with all the junk coming from their crawler IPs. Maybe I just expect too much from these so-called market leaders.
|
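For readers unfamiliar with the full-trip DNS validation mentioned above: the usual idea is to reverse-resolve the visiting IP, check that the hostname sits under a domain the crawler operator is known to use, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal Python sketch follows. The function name `full_trip_dns_ok`, the injectable `reverse`/`forward` resolvers, and the choice of trusted suffixes are assumptions for illustration, not anything Yahoo states in this thread.

```python
import socket


def full_trip_dns_ok(ip, allowed_suffixes, reverse=None, forward=None):
    """Return True if ip passes a reverse-then-forward DNS check.

    reverse/forward are injectable so the logic can be exercised without
    network access; by default they use the system resolver.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse lookup (PTR). No rDNS at all means the check fails.
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False

    # Step 2: the hostname must be the suffix itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    # Step 3: forward lookup must include the original IP.
    try:
        return ip in forward(host)
    except OSError:
        return False
```

Calling `full_trip_dns_ok("98.137.72.243", ["crawl.yahoo.net", "yst.yahoo.net"])` would perform live lookups; which suffixes to trust is a judgment call for each site owner.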
keyplyr

msg:4393330 | 8:57 am on Dec 2, 2011 (gmt 0) |
IMO Y! and MSN (et al) both screw around with rDNS validation. I get hit with MSN ranges all day long, none of which are tagged as crawl. I too expect too much I guess. After all it's not really a standard that is scrutinized by any governing agency, and I don't want it to be. The more regulation, the more the "haves" get what they want and the "have nots" lose IMO.
|
dstiles

msg:4393636 | 11:31 pm on Dec 2, 2011 (gmt 0) |
Bill - I wonder what the outcome will be for search bots if/when China buys up yahoo. But I agree with keyplyr, within reason: let slurp crawl as long as it behaves and returns reasonable DNS. We may be lucky with the next version of the engine! :)
|
dstiles

msg:4394949 | 11:25 pm on Dec 6, 2011 (gmt 0) |
There is a report on the WebmasterWorld yahoo forum that slurp has not been seen for a few days on some sites. I don't have an entry for it for the past 36 hours or so, which is rare. Has slurp finally quit or is it in the garage for an overhaul? :)
|
Pfui

msg:4394975 | 12:25 am on Dec 7, 2011 (gmt 0) |
Just today:
llf531031.crawl.yahoo.net
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
12/06 1n:39:46 /robots.txt
12/06 1n:39:48 /
Speaking of... Looks like this might explain why Slurp/3.0 started fetching images -- completely ignoring robots.txt in the process. Emphasis mine:
Yahoo! Image Search adds social sharing and continuous scrolling
Posted December 5th, 2011 at 11:13 am by Yahoo! Search
"We are introducing two new, important features today on Yahoo! Image search to help users get the best out of searching for photos online: social sharing and continuous scrolling. ..." [ysearchblog.com...]
|
dstiles

msg:4395359 | 9:01 pm on Dec 7, 2011 (gmt 0) |
Could be just image crawling, then? Could explain why I'm not getting hit at the moment - robots.txt blocks the image folders. Is it really ignoring robots.txt, though? If so my explanation fails.
|
Pfui

msg:4395426 | 12:12 am on Dec 8, 2011 (gmt 0) |
I started this thread because Yahoo was bot-running two versions of Slurp and one of them, Slurp/3.0, was reading and flat-out ignoring long-standing robots.txt Disallows for graphics, file types, directories, whathaveyous. Still is.
|
dstiles

msg:4395734 | 8:46 pm on Dec 8, 2011 (gmt 0) |
Strange. I've logged about 5 slurp hits to home pages in the past few days, although I cannot say about graphics without delving into individual site logs.
|
Pfui

msg:4395766 | 9:56 pm on Dec 8, 2011 (gmt 0) |
dstiles, you seem determined to dismiss and/or argue with reports of abuse by Slurp/3.0. Why so smitten?
|
dstiles

msg:4395774 | 10:19 pm on Dec 8, 2011 (gmt 0) |
Sorry if it looks that way, pfui, just reporting what I see here. :( And as I noted: I do not check on image accesses, only pages. Life is too short to go through all the site logs... ...But I've just looked at one site over the past few days. Nothing since the 5th but on that day some images were taken that had been banned in robots.txt - but only one page worth, as though it were a preview or page thumbnail. The IP for this was registered to Inktomi - 74.6.13.95 - one of the yst.yahoo.net ones. I wonder if "inktomi" are crawling independently?
|
Pfui

msg:4398305 | 5:32 pm on Dec 15, 2011 (gmt 0) |
FWIW/Recap: All Slurps are Disallowed all graphics by filetype and directory via robots.txt and have been for years. The following is but a snapshot of 100+ hits to graphic files earlier today by --
b5101152.yst.yahoo.net
b5131387.yst.yahoo.net
-- using:
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
07:40:35 /dir/example.gif
07:40:35 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:36 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:37 /dir/example.gif
07:40:38 /dir/example.gif
07:40:38 /dir/example.gif
07:40:39 /dir/example.gif
07:40:39 /dir/example.gif
07:40:40 /dir/example.gif
07:40:40 /dir/example.gif
07:40:41 /dir/example.gif
07:40:41 /dir/example.gif
07:40:42 /dir/example.gif
07:40:42 /dir/example.gif
07:40:43 /dir/example.gif
07:40:43 /dir/example.gif
07:40:44 /dir/example.gif
07:40:44 /dir/example.gif
07:40:45 /dir/example.gif
07:40:45 /dir/example.gif
07:40:46 /dir/example.gif
07:40:46 /dir/example.gif
07:40:47 /dir/example.gif
(partial listing)
robots.txt? NO (And not by any yahoo.net/.com Host/IP in the preceding 24 hours.)
Slurp/3.0 is also only allowed access to html files and robots.txt via .htaccess so all those hits (& scores more) were 403'd.
|
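To make the complaint above concrete, here is a small Python sketch of how a compliant crawler would be expected to evaluate a directory Disallow before fetching `/dir/example.gif`. The robots.txt content is hypothetical (the actual rules were not posted in this thread), and `is_allowed` is an illustrative helper name. Note that the standard library's `urllib.robotparser` does plain prefix matching, so this sketch uses directory rules rather than wildcard file-type rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, only for illustration: block Slurp from two
# directories and leave every other agent unrestricted.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /dir/
Disallow: /images/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Pass the crawler's product token, not the full browser-style UA string.
    print(is_allowed(ROBOTS_TXT, "Slurp/3.0", "/dir/example.gif"))
    print(is_allowed(ROBOTS_TXT, "Slurp/3.0", "/somepage.html"))
```

Under rules like these, a well-behaved Slurp would skip the image and still be free to fetch the HTML page; the .htaccess 403 described above is the server-side backstop for crawlers that ignore the file.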
dstiles

msg:4407027 | 5:15 pm on Jan 15, 2012 (gmt 0) |
Not actually SLURP but a new (to me) Yahoo range: 124.108.64.0/18 Yahoo Asia - Internet Content Provider Range based in HK but includes SG, AU and probably others. Based on the description I'm assuming "content" to mean "servers" and blocked it.
|
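For anyone keeping their own notes on the ranges reported in this thread, a quick way to test whether a visiting IP falls inside one of them is Python's `ipaddress` module. This is only a sketch: the `KNOWN_RANGES` table and the `matching_range` helper are illustrative names, and the labels are simply the descriptions quoted by posters above.

```python
import ipaddress

# Ranges and labels as reported in this thread; not an authoritative list.
KNOWN_RANGES = {
    "183.79.0.0/16": "YAHOOJAPAN-BLK",
    "124.108.64.0/18": "Yahoo Asia - Internet Content Provider",
}


def matching_range(ip):
    """Return (cidr, label) for the first known range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for cidr, label in KNOWN_RANGES.items():
        if addr in ipaddress.ip_network(cidr):
            return cidr, label
    return None


if __name__ == "__main__":
    for ip in ("183.79.63.72", "124.108.100.1", "98.137.72.243"):
        print(ip, matching_range(ip))
```

A /18 starting at 124.108.64.0 runs through 124.108.127.255, so blocking it covers 16,384 addresses; that is worth bearing in mind before banning on a description alone.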
wilderness

msg:4429440 | 10:36 am on Mar 15, 2012 (gmt 0) |
NO ID. No images. No robots.
209.191.87.214 - - [15/Mar/2012:02:08:51 +0000] "GET / HTTP/1.0" 301 234 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
|
lucy24

msg:4429702 | 8:38 pm on Mar 15, 2012 (gmt 0) |
Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page? :) (Can't count how often I've been bitten by the "no images or CSS, must be a robot" until I realize they were rewritten to a page that doesn't have extra files.)
209.191.87.214 ?! What are they doing in the middle of a bunch of apparent humans? I'd got 209.190.44.138 next door* flagged as "possibly some connection to google" but other than that, nothing unusual in the neighborhood.
* Somewhere I picked up 209.184-191 as a group, but can't find the reference now.
|
wilderness

msg:4429713 | 9:11 pm on Mar 15, 2012 (gmt 0) |
| Is 234 the size of your ancient-browser-handling page or do you have the world's smallest front page? |
| lucy, that's likely the canonical redirect, which it failed to follow. My main page is 6,103 bytes; of course, that's without images. The 403 page is 539 and the 404 is 197.
|
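The status and size being discussed here are the two numeric fields after the quoted request in an Apache-style combined log line. A rough Python sketch for pulling them out is below; the regular expression and the `parse_line` name are assumptions for illustration and will not cover every log variant.

```python
import re

# Loose pattern for the combined log format used in the lines quoted above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no body bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


if __name__ == "__main__":
    sample = (
        '209.191.87.214 - - [15/Mar/2012:02:08:51 +0000] "GET / HTTP/1.0" '
        '301 234 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
    )
    print(parse_line(sample))
```

For the sample line this yields status 301 and size 234, which fits the canonical-redirect explanation: the 234 bytes are the redirect response body, not the front page.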
dstiles

msg:4429736 | 10:28 pm on Mar 15, 2012 (gmt 0) |
I have the range 209.191.87.214 - 209.191.87.219 banned. There are (when I last checked DNS) a handful of actual bots in the 209.191 range but most are, as noted above, "human" or non-bot bots - if you see what I mean. :)
|
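A start-to-end range like the one banned above does not line up with a single CIDR block, so firewall or .htaccess rules that want CIDR notation need it split. Python's `ipaddress.summarize_address_range` does this; the wrapper name `range_to_cidrs` below is just an illustrative helper.

```python
import ipaddress


def range_to_cidrs(first, last):
    """Return the smallest list of CIDR networks covering first..last inclusive."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return list(ipaddress.summarize_address_range(start, end))


if __name__ == "__main__":
    for net in range_to_cidrs("209.191.87.214", "209.191.87.219"):
        print(net)
```

For 209.191.87.214 through 209.191.87.219 this gives two blocks, 209.191.87.214/31 and 209.191.87.216/30, covering exactly the six addresses and nothing else.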