jdMorgan - 5:35 am on Jul 14, 2006 (gmt 0)

Thanks for the feedback, and for the actions your teams have taken.
"The Yahoo! China Slurp was indeed following 'Slurp' user-agent rules in preference to 'Slurp China' rules. The Yahoo! China team have corrected that, so China Slurp will now observe its own specific rules instead of Slurp rules."
I hope that statement was an intentional simplification, as the correct behaviour would be to accept robots.txt records with User-agent tokens in the following priority:

1. 'Slurp China'
2. 'Slurp'
3. '*'

That is, a compliant crawler obeys only the one record whose User-agent token most specifically matches it, falling back to the less-specific tokens only when no better match exists.
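For example, given a robots.txt like this (the paths are illustrative only), Slurp China should honour the first record and ignore the other two:

User-agent: Slurp China
Disallow: /china-only/

User-agent: Slurp
Disallow: /

User-agent: *
Disallow: /cgi-bin/

If it fell back to the 'Slurp' record here, it would wrongly treat the whole site as off-limits; if it fell back to '*', it would wrongly crawl everything except /cgi-bin/.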
Is there a reason it cannot use the previously-collected Slurp dataset? Or the same crawler? Yahoo! is a big company, but may I humbly and earnestly suggest that you guard against the proliferation of too many agents and too many user-agent strings? I would generally object on principle to the additional "Slurp China" user-agent string, except for the 'special circumstances' that China requires.
Same thing for "Slurp DE" -- can it share the dataset? This one at least needs a more-specific name: my original thought was that it was from Germany (country-code "DE") -- perhaps also in response to that country's special requirements concerning certain search results. "DE" gave me no clue that it was related to the Yahoo! directory.
We had a discussion here years ago with a spider author who was upset about being massively banned, because he thought he'd done his due diligence by providing identifying and contact info in the "HTTP_FROM" header. But the fact is that almost no Webmasters can see that header in their logs; it requires a custom log format, and that level of configuration is simply not available on common name-based virtual hosting accounts.
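For what it's worth, surfacing that header under Apache takes a custom log format along these lines (the format name "fromlog" and the log path are mine, just for illustration):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{From}i\"" fromlog
CustomLog logs/access_log fromlog

And that is exactly the kind of server-level change a customer on a name-based virtual-hosting account cannot make.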
So anyway, I'll toss that UA-string-injection idea out for your team to consider: Find the last substring in the user's UA string bounded by ";" and ")", insert "; Yahoo! Babelfish/1.0" ahead of the ")" and you're good to go. (Of course, that assumes that the user's UA string is valid to begin with.) I'm suggesting this only if it doesn't make more sense to modify the referer, and only for 'translatable objects', not for images, stylesheets, and client-side scripts, which I presume just pass through the translation proxy unmodified.
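Here's a minimal sketch of that splice in Python -- the function name and the no-comment fallback are my own assumptions, not anything Yahoo! ships:

def tag_user_agent(ua, token="Yahoo! Babelfish/1.0"):
    # Find the ")" that closes the UA string's parenthesized comment
    # and splice "; <token>" in just ahead of it.
    close = ua.rfind(")")
    if close != -1:
        return ua[:close] + "; " + token + ua[close:]
    # No comment section at all (i.e. the UA isn't valid to begin with,
    # per the caveat above) -- append the token as its own comment.
    return ua + " (" + token + ")"

# tag_user_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
# -> "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Yahoo! Babelfish/1.0)"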
I'm not familiar with the 'environment' in which this Overture User-agent operates -- to be blunt, whether you can 'trust' the target sites. But for the majority of activity, I really come down on the side of those who say that all automated agents should identify themselves, even if just with a "(compatible; Yahoo! Sponsored Search)" or "(compatible; Overture)" in there. The plain "Mozilla/4.0" User-agent is commonly used for site scraping and other exploits, and is persona non grata on many sites. Yahoo! can use it on my sites, but only because the blocking rule contains IP and remote-host exclusions to let Yahoo! in. So, I'd say identify yourself, spot-check for underhanded UA-based cloaking (if necessary) using a valid browser UA, and drop any subscriber that breaks your TOS. Rule of law and all that...
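For the record, the kind of blocking rule I mean looks roughly like this under Apache mod_rewrite (the Yahoo! address range shown is illustrative only -- check the currently-published ranges before using anything like it):

RewriteEngine On
# Refuse the bare "Mozilla/4.0" User-agent...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0$
# ...unless the request comes from a Yahoo! address range
RewriteCond %{REMOTE_ADDR} !^68\.142\.
# or resolves to a yahoo.com host (needs HostnameLookups enabled)
RewriteCond %{REMOTE_HOST} !\.yahoo\.com$ [NC]
RewriteRule .* - [F]

All three conditions must hold for the request to be forbidden, so a matching IP address or host name passes straight through.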
I hope these comments are useful. Thanks again to you and your teams for listening to feedback and taking action on our concerns.