
Search Engine Spider and User Agent Identification Forum

This 50-message thread spans 2 pages; this is page 1.
Yahoo! Slurp
Two versions; one ignores robots.txt

 6:31 pm on Sep 10, 2011 (gmt 0)

For years -- YEARS -- I've denied Slurp all graphics in robots.txt and I just presumed it was heeding the restriction.


Depending on the Host and UA, the official Yahoo! Slurp apparently does whatever it wants to. Note the subtle differences in the subdomains and UAs...

This morning, the only Host to read/heed robots.txt was:

b3091154.crawl.yahoo.net []
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

These retrieved graphics by the pageful, over 60 total:

b5101137.yst.yahoo.net []
b5101139.yst.yahoo.net []
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

I can't say if this is new and/or MSN-related. I can say I'm irked.



 12:48 am on Sep 11, 2011 (gmt 0)

A few minutes ago, another .yst + /3.0 combo tried to slurp graphics:

b5131265.yst.yahoo.net []
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

robots.txt? NO

We shall see if 403'ing (gif|jpg|png) makes any difference...

ALL of the .yst + /3.0 hits have full-URL referrers. The 'original' Slurp never does.

Anyone else seeing this bad behavior?


 10:59 am on Sep 11, 2011 (gmt 0)

FWIW: .crawl.yahoo.net runs "Yahoo! Slurp" and "Yahoo! Slurp/3.0" simultaneously:

llf531068.crawl.yahoo.net []

[00:35:54] Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
[00:35:55] Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)


 12:27 pm on Sep 11, 2011 (gmt 0)

No, I haven't seen that behaviour you describe (as yet) in any of the sites under our control.

However, there was a time not long ago, when I couldn't understand why Yahoo were suddenly spidering the Images folder in one site, when they had been banned (via robots.txt) since day 1 (as they are in all our sites).

When I looked into it deeply, I found that the robots.txt in that one site had somehow become scrambled, such that all rules were on one continuous line, rendering it absolutely useless.

So that experience taught me to have a backup method for all sites, via an .htaccess located in the images folder itself:

RewriteCond %{HTTP_USER_AGENT} (bing|googlebot|msn|slurp|Yahoo) [NC]
RewriteRule .*\.(gif|jpg|jpeg|pdf|png|swf)$ - [F]

Add/subtract User-Agents according to your wishes.


 2:19 pm on Sep 11, 2011 (gmt 0)

I have Yahoo range - blocked.
Can't remember why exactly but it must have violated my robots.txt rules


 6:01 pm on Sep 11, 2011 (gmt 0)

I found that the robots.txt in that one site had somehow become scrambled, such that all rules were on one continuous line
Are you sure they were all on one line? Perhaps the file format was using Unix-style LF line endings and you were using a text editor that required DOS/Windows-style CR/LF line endings to view it?
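A quick way to catch the failure mode described above -- every rule collapsed onto one physical line, or line endings your editor can't display -- is to inspect the raw bytes rather than trust any text editor. A minimal Python sketch (the function name and the "scrambled" heuristic are mine, not from any robots.txt tooling):

```python
def robots_line_report(raw: bytes) -> dict:
    """Report which line-ending styles a robots.txt file uses.

    A file whose directives all sit on one physical line is silently
    useless to most parsers, whatever an editor appears to show.
    """
    crlf = raw.count(b"\r\n")
    bare_cr = raw.count(b"\r") - crlf   # old-Mac endings, invisible on Unix
    bare_lf = raw.count(b"\n") - crlf   # Unix endings
    return {
        "crlf": crlf,
        "bare_cr": bare_cr,
        "bare_lf": bare_lf,
        # No line breaks at all, yet multiple directives: a red flag.
        "looks_scrambled": crlf == 0 and bare_lf == 0 and bare_cr == 0
                           and raw.count(b":") > 1,
    }
```

Run it over the file as bytes (`open("robots.txt", "rb").read()`); a nonzero `bare_cr` count with zero `crlf`/`bare_lf` is the CR-only case Mokita describes, and `looks_scrambled` flags the one-continuous-line case.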

 7:03 pm on Sep 11, 2011 (gmt 0)

1.) Mokita: Good thought about checking robots.txt. Mine checks out A-OK. It's also CGI-generated so I know exactly what rules Yahoo/Slurp gets.

2.) Ironically, even "Yahoo! Slurp China" --


Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)

-- requests robots.txt, although it always ignores it. (So it gets 403'd for every file other than robots.txt.)

And then there's the Yahoo UA that never requests robots.txt, only favicon.ico --


3.) Overall...

I get maybe 10 Yahoo-referred hits a month, and most are to two 'answers.yahoo.com' replies with links, not SERPs per se. Thus there's precious little benefit I can see in allowing anything other than 'plain' "Yahoo! Slurp" from ".crawl.yahoo.net" access to anything other than .html files. YMMV


 9:50 pm on Sep 11, 2011 (gmt 0)

I have all of the YST IPs blocked. I only allow crawls from crawl IPs.

98.137.72.0/24 is pretty much a Slurp range but without the crawl rDNS. I wonder why they do this - perhaps it's an excuse to get images?
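If you filter by IP range in application code rather than in .htaccess, Python's stdlib ipaddress module does the membership test for the 98.137.72.0/24 range mentioned above. A minimal sketch (treat the range itself as a historical example):

```python
from ipaddress import ip_address, ip_network

# The /24 discussed above, written out in full CIDR form.
YST_RANGE = ip_network("98.137.72.0/24")

def in_yst_range(ip: str) -> bool:
    """True if the requesting IP falls inside the suspect range."""
    return ip_address(ip) in YST_RANGE
```

The same pattern extends to a list of networks: check `any(addr in net for net in networks)`.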


 10:14 pm on Sep 11, 2011 (gmt 0)

Or they're developing the equivalent of a snapshot/Google Web Preview thing? Or they just care less and less about robots.txt? ;)

Note that "Yahoo! Slurp/3.0" was a sure-fire culprit at my end and it crawled from .yst and .crawl servers.


 8:12 pm on Sep 12, 2011 (gmt 0)

Interesting about the /3.0. I enabled that one recently on the crawl IPs after finding it blocked on UA. I assumed it was simply an upgrade on the bot. I'll keep an eye on it. Thanks.


 7:39 pm on Sep 16, 2011 (gmt 0)

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

Just scraped an entire 148 page site including html, js, css and image files.


 8:48 pm on Sep 16, 2011 (gmt 0)

Haven't seen anything like this. I have about 300 hits by Slurp/3.0 out of 62,000-ish so far this month (checked only for this month!).

I did get 13 hits from the range 98.137.72.0/24 this month from Slurp/3.0 but those are killed as being non-crawl IPs.

On the other hand, I did get about 50 hits from googlebot within a couple of minutes today, all on the same (permitted) page. That was very odd.


 12:24 am on Sep 17, 2011 (gmt 0)

keyplyr, did that Slurp/3.0 read robots.txt? Did it heed it? (I don't know if you limit bots to certain filetypes or not.)

Also, I just plugged it into robtex to see if it was .yst.yahoo.net or .crawl.yahoo.net and -- and -- it's BOTH.

"b5131249.yst.yahoo.net and llf520009.crawl.yahoo.net point to"

So much for my 'trusting' .crawl.yahoo.net over .yst.yahoo.net.

So it looks like, right now, if you don't want "Slurp/3.0" to scrape, you'll need to curb its conduct via some form of access control, e.g.:

RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REQUEST_URI} !(botbait|robots|html)
RewriteRule .* - [F]

(YMMV. I use that in conjunction with a separate rule limiting Slurp UAs to .yahoo domains. Fake Slurps get 403'd, ditto the longstanding robots.txt-abusing Yahoo! Slurp China.)
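The "trusting .crawl over .yst" problem above is exactly what forward-confirmed reverse DNS addresses: resolve the IP to a hostname, require a known crawler suffix, then resolve that hostname forward and confirm it round-trips to the same IP. A sketch in Python -- the lookup functions are injectable so the logic can be tested offline; for real use you'd rely on the `socket` defaults shown (the allowed-suffix list is an assumption based on the hosts quoted in this thread):

```python
import socket

# Hostname suffixes this sketch treats as legitimate Yahoo crawler hosts.
ALLOWED = (".crawl.yahoo.net", ".yst.yahoo.net")

def verify_crawler(ip,
                   allowed_suffixes=ALLOWED,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                   forward=lambda host: socket.gethostbyname_ex(host)[2]):
    """Forward-confirmed reverse DNS: IP -> host -> IPs must round-trip."""
    try:
        host = reverse(ip)                  # step 1: rDNS lookup
    except OSError:
        return False                        # no PTR record at all
    if not host.endswith(allowed_suffixes):
        return False                        # wrong (or spoofed) domain
    try:
        return ip in forward(host)          # step 2: forward lookup must match
    except OSError:
        return False
```

A fake Slurp that merely copies the UA string fails step 1 or step 2; only a host Yahoo actually controls passes both.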


 9:02 am on Sep 17, 2011 (gmt 0)

keyplyr, did that Slurp/3.0 read robots.txt?

No request for robots.txt, just hit a popular page and took all the files, then another page etc... like a browser would.

With further examination it did *not* take the entire site, just looked that way at first glance :)


 5:22 pm on Sep 17, 2011 (gmt 0)

I have one entry from yesterday for a blocked template image by Yahoo slurp... irritating.

Ignoring robots.txt is becoming all the rage among search engines; Google has also said it will ignore it if you put the +1 button on a page.


 3:37 am on Sep 18, 2011 (gmt 0)

Pardon my ignorance, but how does a search engine profit by ignoring the robots.txt? Do they 'win' something?


 5:21 am on Sep 18, 2011 (gmt 0)

They win data. And the company with the most data -- to harvest, to manipulate, to regurgitate, to advertise, to monetize -- wins.


 11:27 am on Sep 18, 2011 (gmt 0)

If Yahoo's SERPs show Bing crawler results, why allow Slurp to crawl anything?

Just asking.


 4:59 pm on Sep 18, 2011 (gmt 0)

Good question. My answer? Prudence.

Since at least 2006-2007 when MSN started simultaneously running lots and lots and LOTS of bots -- bingbot, msnbot, msnbot/2.0b, msnbot-media, livebot-searchsense, MSNPTC, msrbot, msnbot-Products, msnbot-NewsBlogs, MSNBOT_Mobile, MS Search 4.0 Robot, yadda-yadda -- it's been tough determining which bots data-share with each other, or which blocked bots might impact SERPs.

And MSN runs 'unofficial' bots, too: MSN's many cloaked bots. Again. [webmasterworld.com...]

So now, while Bing and Yahoo hammer out integration/assimilation and which bots may data-share with each other, I'm reluctant to deny any of their bots whole-hog. That's why I limit based on combinations of IP/Host, filetype, and UA, just as I've been doing with Yahoo, MSN, and Google for years.

Speaking of UA-specific access control...

"Yahoo! Slurp/3.0" ignores robots.txt (ditto "Yahoo! Slurp China"). 'Plain' "Yahoo! Slurp" -- no version number -- is complying. At this time...


 3:50 am on Sep 20, 2011 (gmt 0)

Apropos of I dunno what, 'plain' Slurp just cruised through and logged six same-dir index page referrers (out of 20 files hit throughout the day).

I don't recall any Yahoo bot EVER using (or appearing to use) referrers.

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)



 4:45 am on Sep 20, 2011 (gmt 0)

(ditto "Yahoo! Slurp China")

I must be living a charmed life since the quoted above has routinely obeyed my robots.txt. (Scratching head)


 3:03 pm on Sep 28, 2011 (gmt 0)

Unfortunately, Slurp/3.0 apparently retrieved countless pages before I realized it was scraping graphics.

Thus now, every single time it re-visits, it generates 403s galore for all the files it's now denied -- files it was trusted not to 'take' in the first place.

For example, this visit netted 1 html -- and racked up 39 errors (in a curiously slow 16 seconds):

b5131219.yst.yahoo.net []
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

robots.txt? NO


 4:12 pm on Sep 28, 2011 (gmt 0)

Not slurp but closely allied, I think.

Yesterday/today I had a number of hits from a (new to me) IP range allocated to Yahoo Japan -

The IPs hit with a bot-like UA including the word "crawler" on - specifically between - but probably most of the /24. There were no proper rDNS entries for the bot IPs.

I have found little information about the bot outside of a few log reports in SERPs, and the URL in the UA is in Japanese - if anyone can translate it I'd be interested.

UA: Y!J-BRW/1.0 crawler (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)


 10:15 pm on Sep 28, 2011 (gmt 0)

- Did it read, and heed, robots.txt?

- Any rDNS data for any single IP?

- I don't know if that UA is new, or old, or a new name for a new, or old, hybrid, or what. Just that its name recalls a number of UAs Yahoo spawned in recent years, dating back to when Y was very, very picky about UAs being specifically named in robots.txt, so we included them all...

User-agent: YahooMobile
User-agent: YahooCacheSystem
User-agent: Yahoo! Slurp/Site Explorer
User-agent: Mozilla/4.05 [en]
User-agent: LTI/LemurProject
User-agent: Yahoo-Blogs
User-agent: Yahoo-Blogs/v3.9
User-agent: Yahoo-MMCrawler
User-agent: Yahoo-MMCrawler/3.x
User-agent: YahooYSMcm
User-agent: YahooYSMcm/2.0.0
User-agent: Yahoo-Test
User-agent: Yahoo! Mindset
User-agent: Y!J-BSC
User-agent: Y!J-BSC/1.0
User-agent: Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j-bsc
User-agent: y!j-bsc/1.0
User-agent: y!j-bsc/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Y!J
User-agent: Y!J/1.0
User-agent: Y!J/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j
User-agent: y!j/1.0
User-agent: y!j/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Mozilla/4.0 (compatible; Y!J; for robot study; keyoshid)
User-agent: Mozilla/4.0 (compatible; y!j; for robot study; keyoshid)
User-agent: Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)
User-agent: Mozilla/5.0 (compatible; Yahoo! DE Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
User-agent: Mozilla/5.0 (Yahoo-Test/4.0 mailto:vertical-crawl-support@yahoo-inc.com)
Disallow: /

(Hmm. I could probably just cut all those lines because now I only allow User-agent: Slurp (w/ specific rules) and mod_rewrite/whitelist (or blacklist) everything else from Y. It's --

User-agent: *
Disallow: /

-- or bust:)
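For reference, the "Slurp only, everyone else denied" policy described in that aside can be written as two robots.txt records; per the exclusion protocol, a crawler obeys the most specific record that names it, so Slurp follows its own rules and every other (compliant) bot gets the catch-all. (The Disallow path here is a placeholder.)

```
# Slurp gets in, with its own restrictions
User-agent: Slurp
Disallow: /images/

# Everyone else: nothing
User-agent: *
Disallow: /
```

As this thread demonstrates, that only constrains bots that honor robots.txt; the mod_rewrite layer is still what does the enforcing.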


 10:31 pm on Sep 28, 2011 (gmt 0)

I know you don't need them, but (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html) lists a whole swag of UAs, many not in your list:

# Y!J-VSC/ViSe
# Y!J SearchMonkey
# YahooSeeker


 1:13 pm on Sep 29, 2011 (gmt 0)

Speaking of Yahoo-specific UAs, this circa-2010 app is run by everyone other than Yahoo:

Yahoo! (iPhone Inquisitor; 1.0)

Thus far I've only seen it go for favicons, akin to its kin:


(Search for that UA on this site for info and threads.)


 6:53 pm on Sep 29, 2011 (gmt 0)

Pfui - my report is from my security logs, generated by the ASP page itself, not from the full site logs, so without more investigation than I have time for at present I do not know if robots.txt was accessed.

Mokita - without being able to read Japanese there is no way of knowing what those UAs are - and they are not, judging from my example, complete.


 10:54 pm on Sep 29, 2011 (gmt 0)

dstiles - I don't read Japanese either, but Yahoo's Babelfish does a reasonable job of translating. The output is somewhat tortured, but understandable.

Prefacing the majority of user agents is:

* About the crawler
It goes around the web page, the system which you collect & accumulate the contents it calls “the crawler”. Yahoo! JAPAN, the crawler which had the following kind of user agent, with purpose such as utilization and research and development with search service, has done the collection and accumulation of the web page.

Prefacing the last five is this para:

* About the web page verification system
Yahoo!JAPAN The system which had the following kind of user agent for verifying accesses the web page in the web page which is linked from each page of Yahoo!JAPAN. These systems do not do the collection accumulation of the web page.

Is it a complete list? I have no idea. I was only quoting what is on that page in the hope that it is of use to someone.


 8:00 pm on Sep 30, 2011 (gmt 0)

Thanks, Mokita. Not really helpful as it goes, is it? :)

I'm in two minds as to blocking it but at present I'm leaving it: the site the bot hit has business from Japan.


 5:14 am on Oct 13, 2011 (gmt 0)

FYI re:

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

AKA: llf520099.crawl.yahoo.net

robots.txt? NO

This file was allowed in robots.txt:

20:48:13 /dir/filename.html

The graphics on that page were not (& were 403'd via .htaccess):

20:48:16 /dir/filename.gif
20:48:16 /dir/filename.jpg
20:48:17 /dir/filename.jpg
20:48:17 /dir/filename.jpg
20:48:17 /dir/filename.jpg
20:48:17 /dir/filename.jpg
20:48:18 /dir/filename.jpg
20:48:18 /dir/filename.jpg
20:48:19 /dir/filename.jpg
20:48:19 /dir/filename.jpg
20:48:20 /dir/filename.jpg
20:48:20 /dir/filename.jpg
20:48:21 /dir/filename.jpg
20:48:21 /dir/filename.jpg
20:48:22 /dir/filename.jpg
20:48:22 /dir/filename.jpg
20:48:23 /dir/filename.jpg
20:48:23 /dir/filename.jpg
20:48:23 /dir/filename.jpg
20:48:23 /dir/filename.jpg
20:48:24 /dir/filename.jpg
20:48:24 /dir/filename.jpg
20:48:25 /dir/filename.jpg
20:48:25 /dir/filename.jpg
20:48:26 /dir/filename.gif
20:48:26 /dir/filename.gif
20:48:27 /dir/filename.gif
20:48:27 /dir/filename.gif
20:48:28 /dir/filename.gif

(Also disallowed, and unexecuted, was an "#include virtual" .js file.)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved