-- Search Engine Spider and User Agent Identification
---- Naughty Yahoo User Agents
jdMorgan - 6:41 pm on Jun 9, 2006 (gmt 0)
I've also been told that all user agents with Slurp in them respect robots.txt. If any of them do not, please post as much detail as possible so it can be investigated and corrected. I've been told that someone from Inktomi will be looking at the individual threads I referenced.
I'm told that "Yahoo! Slurp DE", "Yahoo! Slurp China" and "Yahoo! Slurp" do recognize distinct User-Agent rules if provided.
Apparently Yahoo! Slurp DE is the crawler for a (D)irectory (E)ngine service that crawls preferred content explicitly listed by Yahoo! Search content service partners.
Slurp DE will respect robots.txt rules for User-Agent: Slurp DE or User-Agent: Yahoo! Slurp DE. If neither of those user agents is listed, Slurp DE will obey the User-Agent: Slurp rules.
Yahoo! Slurp China also obeys robots.txt rules for User-Agent: Slurp China or User-Agent: Yahoo! Slurp China. Again, if there is no explicit Slurp China rule it will follow the more generic User-Agent: Slurp rule.
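If the fallback behavior described above holds, a robots.txt along these lines (a hypothetical example, not from the original post) should block Slurp China while leaving the international crawler unrestricted, since the explicit Slurp China record takes precedence over the generic Slurp record for that crawler:

User-agent: Slurp China
Disallow: /

User-agent: Slurp
Disallow: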
My previous experience was that if you put
User-agent: Slurp China
Disallow: /
in robots.txt, Slurp (international, non-China) would parse just the "Slurp" substring, accept the record as applicable to itself, and go away. The order of the "Slurp China" and "Slurp" User-agent records made no difference.
As an act of good faith, I will try again (hoping not to dump my sites from Yahoo international) and report back. So it was not a problem of Slurp China obeying robots.txt per se, but rather that Slurp (international) wouldn't differentiate "Slurp China" from plain "Slurp": any attempt to disallow or restrict Slurp China was also seen as disallowing or restricting Yahoo Slurp (international).
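The failure mode described above can be sketched as a naive substring match. This is only an illustration of the reported behavior, not Yahoo's actual matching code: if a crawler treats a User-agent record as applicable whenever its own name appears anywhere in the record's token, then international Slurp will claim the "Slurp China" record as its own.

```python
def record_applies(record_token: str, crawler_name: str) -> bool:
    """Naive substring match: the record applies if the crawler's own
    name appears anywhere in the record's User-agent token."""
    return crawler_name.lower() in record_token.lower()

# The problematic case from the post: international Slurp ("Slurp")
# matches the "Slurp China" record, so its Disallow: / shuts out
# BOTH crawlers, not just the Chinese one.
print(record_applies("Slurp China", "Slurp"))        # True
print(record_applies("Slurp China", "Slurp China"))  # True
print(record_applies("Slurp", "Slurp China"))        # False
```

A correct implementation would instead prefer the longest (most specific) matching record, which is why the reported fix of recognizing "Slurp China" and "Slurp DE" as distinct tokens resolves the conflict.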