Yahoo! Crawlers - A response from Yahoo! Search
Response from Yahoo!
Yahoo_Mike
msg:3006511 - 2:47 am on Jul 14, 2006 (gmt 0)

(continuing from here [webmasterworld.com])

Thanks Gary for your support and communication. We appreciate the feedback and input from you and the community on this thread. We are working with the various groups within Yahoo! to make sure they are clearer about their user agents and compliant with robots.txt. We are also looking at our Slurp crawl to see if we can make some improvements to reduce webmaster load without affecting coverage and freshness. For Yahoo! Slurp crawl issues, please write to us at [help.yahoo.com...] to report any problems.

The mp3spider operated by Yahoo! China is deploying an update to follow 'mp3spider' rules in /robots.txt.

The Yahoo! China Slurp was indeed following 'Slurp' user-agent rules in preference to 'Slurp China' rules. The Yahoo! China team have corrected that, so China Slurp will now observe its own specific rules instead of Slurp rules.

The Yahoo! Mindset agent reads pages for an 'Intent-driven Search' beta at [mindset.research.yahoo.com...] The Mindset agent visits pages already included in Yahoo! search result listings and does not do any extraction crawling nor content refreshes. In any case, the Mindset team has taken their robot out of service until it is corrected to observe /robots.txt exclusions.

Visits from clients at proxyn.search.dcn.yahoo.net or proxyn.search.acd.yahoo.net are not crawler activity but proxies for browser page views using a 'translate this page' link from Yahoo! search results. Babelfish provides the automated translation. The Babelfish team will be modifying their proxy headers to more accurately reflect the page access.

The 'Mozilla/4.0' UA from Overture Services (like 66.228.173.150 as described in [webmasterworld.com...] ) is an agent for editorial checking from Yahoo! Search Marketing (Overture). This editorial agent only reads URLs submitted by advertisers for sponsored search listing by Yahoo! Search Marketing (YSM); it does no extraction and is not a public content crawler. Information about the editorial agent is in the terms of service documents for advertisers listing with Yahoo! Search Marketing. jdMorgan's post on this subject in [webmasterworld.com...] is excellent. This YSM agent was updated the week of June 5 to rate limit its activity better after an advertiser does a batch submit.

Note also that we've retired the ystfeedback@yahoo.com and webmasterworldfeedback@yahoo.com email addresses. We've created simple forms via which you can provide us feedback.
We announced this on the Yahoo! Search blog a while back. More details are at [ysearchblog.com...]

Webmaster resources are available at the following URL:
[help.yahoo.com...]

I hope this information helps. Thanks.


jdMorgan
msg:3006687 - 5:35 am on Jul 14, 2006 (gmt 0)

Yahoo_Mike,

Thanks for the feedback, and for the actions your teams have taken.

> The Yahoo! China Slurp was indeed following 'Slurp' user-agent rules in preference to 'Slurp China' rules. The Yahoo! China team have corrected that, so China Slurp will now observe its own specific rules instead of Slurp rules.

I hope that statement was an intentional simplification, as the correct behaviour would be to accept robots.txt records with User-agent tokens in the following priority (an example follows the list):

  • "Yahoo! Slurp China"
  • "Yahoo Slurp China"
  • "Slurp China"
  • "Yahoo! China" or "Yahoo China"
  • "Yahoo!", "Yahoo", or "Slurp"
  • "*"

> The Yahoo! Mindset agent reads pages for an 'Intent-driven Search' beta at [mindset.research.yahoo.com...] The Mindset agent visits pages already included in Yahoo! search result listings and does not do any extraction crawling nor content refreshes. In any case, the Mindset team has taken their robot out of service until it is corrected to observe /robots.txt exclusions.

Is there a reason it cannot use the previously-collected Slurp dataset? Or the same crawler? Yahoo! is a big company, but may I humbly and earnestly suggest that you guard against the proliferation of too many agents and too many user-agent strings? I would generally object on principle to the additional "Slurp China" user-agent string, except for the 'special circumstances' that China requires.

Same thing for "Slurp DE" -- can it share the dataset? This one at least needs a more-specific name: my original thought was that it was from Germany (country-code "DE") -- perhaps also in response to that country's special requirements concerning certain search results. "DE" gave me no clue that it was related to the Yahoo! directory.

> Visits from clients at proxyn.search.dcn.yahoo.net or proxyn.search.acd.yahoo.net are not crawler activity but proxies for browser page views using a 'translate this page' link from Yahoo! search results. Babelfish provides the automated translation. The Babelfish team will be modifying their proxy headers to more accurately reflect the page access.

This kind of agent (language and markup translator) is the most problematic. Because most webmasters only have access to the HTTP headers available in NCSA extended/combined log format (Referer and User-agent), I'd argue in favor of 'injecting' an additional parameter into the translation proxy user's User-agent string, or modifying the Referer header. This is because most webmasters won't even be aware of the proxy-related headers like "HTTP_VIA", "HTTP_X_FORWARDED_FOR", etc. -- or have the knowledge or capability to test them.

We had a discussion here years ago with a spider author who was upset about being massively banned, because he thought he'd done his due diligence by providing identifying and contact info in the "HTTP_FROM" header. But the fact is that almost no webmasters can see that header in their logs; it requires a custom log format, and that level of configuration is simply not available on common name-based virtual hosting accounts.
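
For those who do control their server config, exposing those headers is a one-line change -- a sketch using standard Apache directives (the format nickname "combined_proxy" is made up here):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{From}i\" \"%{Via}i\" \"%{X-Forwarded-For}i\"" combined_proxy
CustomLog logs/access_log combined_proxy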

So anyway, I'll toss that UA-string-injection idea out for your team to consider: Find the last substring in the user's UA string bounded by ";" and ")", insert "; Yahoo! Babelfish/1.0" ahead of the ")" and you're good to go. (Of course, that assumes that the user's UA string is valid to begin with.) I'm suggesting this only if it doesn't make more sense to modify the referer, and only for 'translatable objects', not for images, stylesheets, and client-side scripts, which I presume just pass through the translation proxy unmodified.
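
In code terms, that injection might look something like this (a sketch only -- Babelfish's actual proxy code is unknown to me, and the token is the suggested string above, not an existing Yahoo! UA):

def inject_proxy_token(ua, token="Yahoo! Babelfish/1.0"):
    # Locate the closing ")" of the last parenthesized comment in the UA.
    pos = ua.rfind(")")
    # Only inject if the UA actually has a ";"-delimited comment to extend.
    if pos == -1 or ua.rfind(";", 0, pos) == -1:
        return ua  # bare or malformed UA: pass it through unmodified
    return ua[:pos] + "; " + token + ua[pos:]

# inject_proxy_token("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
# returns "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Yahoo! Babelfish/1.0)"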

> The 'Mozilla/4.0' UA from Overture Services (like 66.228.173.150 as described in [webmasterworld.com...] ) is an agent for editorial checking from Yahoo! Search Marketing (Overture). This editorial agent only reads URLs submitted by advertisers for sponsored search listing by Yahoo! Search Marketing (YSM); it does no extraction and is not a public content crawler. Information about the editorial agent is in the terms of service documents for advertisers listing with Yahoo! Search Marketing. [...] This YSM agent was updated the week of June 5 to rate limit its activity better after an advertiser does a batch submit.

I'm not familiar with the 'environment' in which this Overture User-agent operates -- to be blunt, whether you can 'trust' the target sites. But for the majority of activity, I really come down on the side of those who say that all automated agents should identify themselves, even if just with a "(compatible; Yahoo! Sponsored Search)" or "(compatible; Overture)" in there. The plain "Mozilla/4.0" User-agent is commonly used for site scraping and other exploits, and is persona non grata on many sites. Yahoo can use it on my sites, but only because the blocking rule contains IP and remote-host exclusions to let Yahoo! in (a sketch of such a rule follows). So, I'd say identify yourself, spot-check for underhanded UA-based cloaking (if necessary) using a valid browser UA, and drop any subscriber that breaks your TOS. Rule of law and all that...
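
Such a blocking rule might look like this in mod_rewrite (a sketch only; the host patterns and the 66.228.173 range are drawn from this thread, and the REMOTE_HOST tests assume hostname lookups are enabled -- verify against your own logs before using anything like it):

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0$
# Let verified Yahoo!/Inktomi hosts and the known Overture range through:
RewriteCond %{REMOTE_HOST} !\.yahoo\.com$ [NC]
RewriteCond %{REMOTE_HOST} !\.inktomisearch\.com$ [NC]
RewriteCond %{REMOTE_ADDR} !^66\.228\.173\.
RewriteRule .* - [F]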

I hope these comments are useful. Thanks again to you and your teams for listening to feedback and taking action on our concerns.

Jim

incrediBILL
msg:3007996 - 6:48 pm on Jul 14, 2006 (gmt 0)

> Is there a reason it cannot use the previously-collected Slurp dataset? Or the same crawler?

I'm 100% behind this sentiment as there are ZERO reasons that we have to subject ourselves to more than one Yahoo! crawler. They should all share a common data set internally and not continue to burn up our bandwidth over and over and over.

If that common data is too old for the task that wants it, fine, queue it up for a refresh crawl, but only ONE refresh crawl, not FIVE.

Just my $0.02 worth.

incrediBILL
msg:3008348 - 10:49 pm on Jul 14, 2006 (gmt 0)

Mike,

BTW, you didn't explain this:

> 207.126.225.132 [feed18.shop.corp.yahoo.com.] requested 1 pages as "libwww-perl/5.69"

Mokita
msg:3008468 - 11:39 pm on Jul 14, 2006 (gmt 0)

> looking at our Slurp crawl to see if we can make some improvements to reduce webmaster load without affecting coverage and freshness

Thank you! That would be a wonderful relief. One of our sites has 110 pages available for indexing, but only 18 pages are likely to change - the rest are static information pages or product pages which don't normally change once they've been posted. Crawling those once per month would be quite sufficient for freshness.

In the first 14 days of this month the Yahoo! Slurp bot has requested 737 pages and robots.txt 449 times - roughly 32 times per day. This is really taking freshness to the extreme! Compare Slurp figures with Googlebot, which in the same period has requested 172 pages and robots.txt 16 times - slightly more than once per day. I am perfectly happy with the freshness of our results in Google.

Does Slurp take any notice of the "revisit-after" metatag?
<meta name="revisit-after" content="30 days">

If it does, I'd be more than happy to use it.

bobothecat
msg:3009883 - 9:57 pm on Jul 15, 2006 (gmt 0)

What about this one:

66.228.165.143 - - [15/Jul/2006:15:54:12 -0600] "GET / HTTP/1.1" 301 344 "-" "Mozilla/4.5 [en] (Win98; I)"

I have nothing to do with Overture, or YSM ( or any other PPC for that matter), yet the crawls continue.

Pfui
msg:3010781 - 9:38 pm on Jul 16, 2006 (gmt 0)

1.) I'm pleased and appreciative that Yahoo_Mike took the time to respond to a number of concerns raised in the original post:

Naughty Yahoo User Agents
[webmasterworld.com...]

Addressed in his response are:

Yahoo! China
Yahoo! Mindset
proxyn.search.dcn.yahoo.net
proxyn.search.acd.yahoo.net
Mozilla/4.0 (Overture).

.
2.) Alas, still unaddressed are the majority of entries I listed in what is now message "#:400187" including the following (with still more new ones, below; plus cut-to-the-chase stats in #6):

>>
-----
dp131.data.yahoo.com
Mozilla/4.0

rlx-1-2-1.labs.corp.yahoo.com
Mozilla/4.0

-----
r17.mk.cnb.yahoo.com
m23.mk.cnb.yahoo.com
(multi)
Gaisbot/3.0+(robot05@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)
Gaisbot/3.0+(robot06@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)

-----
urlc1.mail.mud.yahoo.com
urlc2.mail.mud.yahoo.com
urlc3.mail.mud.yahoo.com
urlc4.mail.mud.yahoo.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

-----
ts2.test.mail.mud.yahoo.com
(68.142.203.133)
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

-----
203.141.52.37
203.141.52.39
203.141.52.44
(multi)
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/blog-search/)

203.141.52.47
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)

ont211014008240.yahoo.co.jp
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)

-----
mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)

-----
proxy1.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1)
proxy1.search.dcn.yahoo.net
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

-----
msfp01.search.mud.yahoo.com
(side-scroll edited)
Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0
(compatible; Windows CE; Blazer/4.0; PalmSource; MOT-V300; SEC-SGHE315;
YahooSeeker/MA-R2D2;mobile-search-customer-care AT yahoo-inc dot com)

mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)

opnprc1.search.mud.yahoo.com
Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; [help.yahoo.com...] )

-----
oc4.my.dcn.yahoo.com
YahooFeedSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5; [publisher.yahoo.com...]

-----
All referers beginning: "http://rds.yahoo.com/"
<<

.
3.) And here's another Yahoo Host/UA that never asks for robots.txt during its weekly visit:

morgue2.corp.yahoo.com
Mozilla/4.05 [en]

And what's this new referer? "http://yq.search.yahoo.com/"

.
4.) And then there are these new -- licensees? Fakes? And again, no robots.txt by either:

www.io.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
07/06 09:36:11 /

phad.cc.umanitoba.ca
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
07/15 12:19:20 /

.
5.) I try to cooperate with Yahoo but I'm repeatedly abused by them, by their scores of obvious and covert bots and UAs and IPs, by their retrieving info and licensing it -- e.g., thumbnails to Viewpoint -- info that they're not supposed to retrieve in the first place.

Where's the Help page with the code to prevent a thumbnail grab? Because apparently this list missed the little sucker:

RewriteCond %{REMOTE_HOST} \.inktomi\.com$ [NC,OR]
RewriteCond %{REMOTE_HOST} \.inktomisearch\.com$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Slurp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Yahoo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Yahoo-Robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Yahoo-MMCrawler [NC,OR]
RewriteCond %{REMOTE_HOST} \.yahoo\.com$ [NC,OR]
RewriteCond %{REMOTE_HOST} \.search\.mud\.yahoo\.com$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule \.(cgi|pl|mid|wav|hqx|ZIP|xml|ico|jpg|gif|txt)$ - [F,L]


.
6.) Bottom Line:

Month after month, I simply don't get enough traffic from Yahoo to justify the ever-increasing work, guesswork, and bandwidth. From July 1 to date, for Yahoo --

.inktomisearch.com HITS: 4499
(incl. 1425 robots.txt)
search.yahoo.com REFERERS: 419

And Google -- 3,406 fewer hits and 2,436 more ID'd referers:

.googlebot.com HITS: 1093
(incl. 261 robots.txt)
google... /search? REFERERS: 2855

'Nuff said. Sorry, Yahoo.

Mokita
msg:3012730 - 6:17 am on Jul 18, 2006 (gmt 0)

> The Yahoo! China Slurp was indeed following 'Slurp' user-agent rules in preference to 'Slurp China' rules. The Yahoo! China team have corrected that, so China Slurp will now observe its own specific rules instead of Slurp rules.

Yahoo! Slurp China has just violated our robots.txt, which contains both the following entries (to be sure one of them will work!)

User-agent: Yahoo! Slurp China
User-agent: Slurp China
Disallow: /

Logged:

lj910157.inktomisearch.com - - [18/Jul/2006:14:08:13 +1000] "GET /robots.txt HTTP/1.0" 200 1815 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj910193.inktomisearch.com - - [18/Jul/2006:14:08:23 +1000] "GET / HTTP/1.0" 403 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

Luckily I still had mod_rewrite blocking it.

Yahoo_Mike: Please would you confirm what is the correct syntax to successfully block Yahoo! Slurp China via robots.txt. Thanks.

--
P.S. I wrote to Yahoo! via the help form on their site more than two weeks ago, asking the same thing. I haven't had any reply, not even an automated one.

incrediBILL
msg:3012819 - 8:33 am on Jul 18, 2006 (gmt 0)

Mokita, this is why you need a dynamic robots.txt, served up per user agent: show a suspect bot a robots.txt in which all bots are blocked, and if it continues crawling anyway -- which rules out any UA-token mixup -- you know for sure it's broken and can drop it in .htaccess.
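
A minimal sketch of the idea (illustrative only -- the script path and bot-token list are examples for suspect agents, not a vetted blocklist):

# .htaccess: hand robots.txt requests to a script
RewriteEngine On
RewriteRule ^robots\.txt$ /cgi-bin/robots.py [L]

#!/usr/bin/env python3
# robots.py -- CGI sketch: show any bot-like User-Agent a full-disallow
# robots.txt. A bot that fetches this and keeps crawling has provably
# ignored robots.txt (no UA-token mixup possible) and can be banned.
import os

BOT_TOKENS = ("slurp", "yahoo", "googlebot", "msnbot", "crawler", "spider")
ua = os.environ.get("HTTP_USER_AGENT", "").lower()

print("Content-Type: text/plain")
print()
if any(token in ua for token in BOT_TOKENS):
    print("User-agent: *")
    print("Disallow: /")
else:
    print("User-agent: *")
    print("Disallow:")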
