- Search Engines
-- Search Engine Spider and User Agent Identification
---- Yahoo! Crawlers - A response from Yahoo! Search


jdMorgan - 5:35 am on Jul 14, 2006 (gmt 0)


Yahoo_Mike,

Thanks for the feedback, and for the actions your teams have taken.

The Yahoo! China Slurp was indeed following 'Slurp' user-agent rules in preference to 'Slurp China' rules. The Yahoo! China team have corrected that, so China Slurp will now observe its own specific rules instead of Slurp rules.

I hope that statement was an intentional simplification, as the correct behaviour would be to accept robots.txt records with User-agent tokens in the following priority:
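One plausible order, assumed here for illustration rather than taken from this post, is the most specific token first ("Slurp China"), then the shared family token ("Slurp"), then the wildcard ("*"). A minimal sketch of that selection follows; `select_record`, the sample records, and the priority list are all hypothetical names and values.

```python
def select_record(records, priority):
    """Return the rules of the first record whose User-agent token
    appears in `priority`, checked in order (case-insensitive)."""
    lowered = {token.lower(): rules for token, rules in records.items()}
    for token in priority:
        rules = lowered.get(token.lower())
        if rules is not None:
            return rules
    return []  # no matching record: nothing is disallowed


# Hypothetical robots.txt records, keyed by User-agent token,
# each mapping to its list of Disallow paths.
records = {
    "Slurp": ["/private/"],
    "Slurp China": ["/"],
    "*": ["/tmp/"],
}

# Assumed priority for the China crawler: own token, family token, wildcard.
china_priority = ["Slurp China", "Slurp", "*"]
china_rules = select_record(records, china_priority)
```

With this order, the China crawler falls back to the "Slurp" record only when no "Slurp China" record exists, and to "*" only when neither is present.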


The Yahoo! Mindset agent reads pages for an 'Intent-driven Search' beta at [mindset.research.yahoo.com...] The Mindset agent visits pages already included in Yahoo! search result listings and does not do any extraction crawling nor content refreshes. In any case, the Mindset team has taken their robot out of service until it is corrected to observe /robots.txt exclusions.

Is there a reason it cannot use the previously-collected Slurp dataset? Or the same crawler? Yahoo! is a big company, but may I humbly and earnestly suggest that you guard against the proliferation of too many agents and too many user-agent strings? I would generally object on principle to the additional "Slurp China" user-agent string, except for the 'special circumstances' that China requires.

Same thing for "Slurp DE" -- can it share the dataset? This one at least needs a more-specific name: My original thought was that it was from Germany (country-code "DE") -- perhaps also in response to that country's special requirements concerning certain search results. "DE" gave me no clue that it was related to the Yahoo! directory.

Visits from clients at proxyn.search.dcn.yahoo.net or proxyn.search.acd.yahoo.net are not crawler activity but proxies for browser page views using a 'translate this page' link from Yahoo! search results. Babelfish provides the automated translation. The Babelfish team will be modifying their proxy headers to more accurately reflect the page access.

This kind of agent (language and markup translator) is the most problematic. Because most webmasters only have access to the HTTP headers available in NCSA extended/combined log format (Referer and User-agent), I'd argue in favor of 'injecting' an additional parameter into the translation proxy user's User-agent string, or modifying the Referer header. This is because most webmasters won't even be aware of the proxy-related headers like "HTTP_VIA", "HTTP_X_FORWARDED_FOR", etc. -- or have the knowledge or capability to test them.

We had a discussion here years ago with a spider author who was upset about being massively banned, because he thought he'd done his due diligence by providing identifying and contact info in the "HTTP_FROM" header. But the fact is that almost no Webmasters can see that header in their logs; it requires a custom log format, and that level of configuration is simply not available on common name-based virtual hosting accounts.
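To make the log-format point concrete, here is a minimal sketch that parses one line in NCSA combined format. The sample line is made up for illustration (it reuses the IP address mentioned later in this post), and `parse_combined` is a hypothetical helper. It assumes no escaped quotes inside the quoted fields.

```python
import re

# NCSA combined format: host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def parse_combined(line):
    """Return the fields of a combined-format log line as a dict, or None."""
    match = COMBINED.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical log entry, not taken from a real server log.
sample = ('66.228.173.150 - - [14/Jul/2006:05:35:00 +0000] '
          '"GET /page.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"')
fields = parse_combined(sample)
```

The only request headers that survive into `fields` are Referer and User-agent. Anything sent in From, Via, or X-Forwarded-For is gone unless the server's log format was customized to record it.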

So anyway, I'll toss that UA-string-injection idea out for your team to consider: Find the last substring in the user's UA string bounded by ";" and ")", insert "; Yahoo! Babelfish/1.0" ahead of the ")" and you're good to go. (Of course, that assumes that the user's UA string is valid to begin with.) I'm suggesting this only if it doesn't make more sense to modify the referer, and only for 'translatable objects', not for images, stylesheets, and client-side scripts, which I presume just pass through the translation proxy unmodified.
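The injection step described above can be sketched as follows. This is only an illustration of the idea, not anything Yahoo! has implemented. The function name is hypothetical, and a UA string with no ";" to ")" run is returned unchanged.

```python
import re

# A run that starts at ';' and ends at ')' with no other ';', '(' or ')' between.
LAST_FIELD = re.compile(r";[^;()]*\)")


def inject_proxy_token(user_agent, token="Yahoo! Babelfish/1.0"):
    """Insert '; <token>' ahead of the ')' that closes the last
    ';'-to-')' substring of the User-agent string."""
    matches = list(LAST_FIELD.finditer(user_agent))
    if not matches:
        return user_agent  # nothing to anchor on; leave the string as-is
    close = matches[-1].end() - 1  # index of the closing ')'
    return user_agent[:close] + "; " + token + user_agent[close:]


original = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
translated = inject_proxy_token(original)
# translated == "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Yahoo! Babelfish/1.0)"
```

Because the token lands inside the existing parenthesized comment, the rest of the string stays intact and the addition shows up in any standard combined-format log.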

The 'Mozilla/4.0' UA from Overture Services (like 66.228.173.150 as described in [webmasterworld.com...] ) is an agent for editorial checking from Yahoo! Search Marketing (Overture). This editorial agent only reads URLs submitted by advertisers for sponsored search listing by Yahoo! Search Marketing (YSM); it does no extraction and is not a public content crawler. Information about the editorial agent is in the terms of service documents for advertisers listing with Yahoo! Search Marketing. [...] This YSM agent was updated the week of June 5 to rate limit its activity better after an advertiser does a batch submit.

I'm not familiar with the 'environment' in which this Overture User-agent operates -- to be blunt, whether you can 'trust' the target sites. But for the majority of activity, I really come down on the side of those who say that all automated agents should identify themselves, even if just with a "(compatible; Yahoo! Sponsored Search)" or "(compatible; Overture)" in there. The plain "Mozilla/4.0" User-agent is commonly used for site scraping and other exploits, and is persona non grata on many sites. Yahoo can use it on my sites, but only because the blocking rule contains IP and remote host exclusions to let Yahoo! in. So, I'd say identify yourself, spot-check for underhanded UA-based cloaking (if necessary) using a valid browser UA, and drop any subscriber that breaks your TOS. Rule of law and all that...
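A blocking rule of the kind described above might look roughly like this. It is a sketch only: the allowed network is just the single example address quoted earlier, the host suffix is a hypothetical placeholder, and `should_block` is not any server's built-in feature. A real rule would also want the remote host confirmed by a forward DNS lookup, which is omitted here.

```python
import ipaddress

# Assumption: only the one example IP from this thread is allowed.
ALLOWED_NETWORKS = [ipaddress.ip_network("66.228.173.150/32")]
# Assumption: hypothetical remote-host suffix for the exclusion.
ALLOWED_HOST_SUFFIXES = (".yahoo.net",)


def should_block(user_agent, remote_ip, remote_host=None):
    """Block the bare 'Mozilla/4.0' User-agent unless the request comes
    from an allowed IP network or an allowed remote host."""
    if user_agent.strip() != "Mozilla/4.0":
        return False  # this rule only targets the plain UA string
    address = ipaddress.ip_address(remote_ip)
    if any(address in network for network in ALLOWED_NETWORKS):
        return False
    if remote_host and remote_host.lower().endswith(ALLOWED_HOST_SUFFIXES):
        return False
    return True
```

For example, `should_block("Mozilla/4.0", "203.0.113.9")` returns `True`, while the same User-agent from `66.228.173.150` is let through.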

I hope these comments are useful. Thanks again to you and your teams for listening to feedback and taking action on our concerns.

Jim


Thread source: http://www.webmasterworld.com/search_engine_spiders/3006509.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com