Forum Moderators: open
Through a side project of mine I have a contact at Yahoo! Engineering whom I contacted yesterday. He forwarded my e-mail to someone in search ops. That person requested I send him a list of user agents that aren't respecting robots.txt.
To me this is a unique opportunity to see if Yahoo! is serious about addressing this increasingly annoying issue. And thanks to Dan I have permission to deviate from our usual format to compile this list.
Thanks in advance for your help.
Here's an odd one I'd like to know more about:
66.94.237.140 "" "/mypage.html" Proxy Detected -> VIA=1.1 proxy1.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
66.94.237.140 "" "/mypage2.html" Proxy Detected -> VIA=1.1 proxy1.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
66.94.237.142 "" "/mypage3.html" Proxy Detected -> VIA=1.1 proxy3.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
Looks like something crawling through a Yahoo proxy from the AltaVista IP block. Is this internal? A real crawler? Just someone surfing?
Any possibility we work it from the other side?
Could Yahoo give us a list of all their user agents and IP ranges where they are supposed to crawl from?
I don't mind letting Yahoo crawl, but it would be nice to know it's really Yahoo and not WAP, a customer crawling via proxy, a cache, or something else spoofing them, as I have a bunch of items like this setting off alarms and getting "errors" from my traps.
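One common way to answer "is it really Yahoo?" is forward-confirmed reverse DNS: reverse-resolve the IP to a host name, check the domain, then resolve that name forward again and make sure it round-trips to the same IP. A rough Python sketch; the accepted suffix list is an assumption based on the host names seen in this thread, not an official Yahoo list:

```python
import socket

# Domain suffixes assumed from the host names reported in this thread.
YAHOO_SUFFIXES = (".yahoo.com", ".yahoo.net", ".inktomisearch.com")

def is_real_yahoo_crawler(ip):
    """Forward-confirmed reverse DNS check for a claimed Yahoo crawler IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse lookup: IP -> name
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith(YAHOO_SUFFIXES):
        return False
    try:
        addrs = socket.gethostbyname_ex(host)[2]  # forward lookup: name -> IPs
    except socket.gaierror:
        return False
    return ip in addrs                            # the round trip must match
```

A spoofed UA from a random IP fails either the suffix check or the forward round trip, so it never gets treated as the real crawler.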
This is what I've managed to find on my own and the threads where I found them:
[webmasterworld.com...]
Yahoo! Mindset
66.228.182.177
66.228.182.183
66.228.182.185
66.228.182.187
66.228.182.188
66.228.182.190
[webmasterworld.com...]
Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]
72.30.142.24
[webmasterworld.com...]
Mozilla/4.0
66.228.173.150
[webmasterworld.com...]
YRL_ODP_CRAWLER
Sorry no IP Address but rDNS: rlx-1-1-1.labs.corp.yahoo.com
[webmasterworld.com...]
mp3Spider cn-search-devel at yahoo-inc dot com
202.165.102.179
I'm a pragmatist, but if Yahoo sees fit to have separate spiders for these markets, then they should realize that Webmasters might like to handle them differently, too. But unfortunately, this doesn't work:
User-agent: Slurp China
Disallow: /

User-agent: Slurp DE
Disallow: /cgi-bin

User-agent: Slurp
Disallow: /cgi-bin
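For reference, here is how a standards-style parser resolves records like those above when each is on its own lines. This is a sketch using Python's urllib.robotparser; example.com and the paths are placeholders:

```python
from urllib.robotparser import RobotFileParser

# The three records, each separated by a blank line.
rules = """\
User-agent: Slurp China
Disallow: /

User-agent: Slurp DE
Disallow: /cgi-bin

User-agent: Slurp
Disallow: /cgi-bin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Slurp China", "http://example.com/page.html"))  # False
print(rp.can_fetch("Slurp", "http://example.com/page.html"))        # True
print(rp.can_fetch("Slurp", "http://example.com/cgi-bin/x"))        # False
```

Note that urllib.robotparser uses the first record whose agent name matches (by substring), so the more specific Slurp China and Slurp DE records should come before the generic Slurp record.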
Jim
1.) Here's a TON of Yahoo-related User-agents (plus fellow travelers) ignoring robots.txt over the past 6-8 months, pretty much excluding most (I hope) of the Yahoo UAs I've already posted about on WW -- from umpteen Host names, IPs, and countries, all pointed toward a single site.
2.) I prefer Apache to show Host names so where you see IPs is where there was no Host used.
3.) Some of these are MyWeb2 (caching bookmarker) and phone-related, but Yahoo still didn't ask for robots.txt before using a UA like libwww-perl.
4.) The sort is more by Host/IP than not but isn't alpha, sorry, because neither IPs, Hosts nor UAs lent themselves to a sensible listing -- plus there were too danged many of each!
5.) Keeping up with all of these and still more Yahoo UAs, robots.txt-compliant or not, is really too much work for the traffic Yahoo sends. Way too much.
-----
-----
lj9054.inktomisearch.com
(multi)
Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
-----
dp131.data.yahoo.com
Mozilla/4.0
rlx-1-2-1.labs.corp.yahoo.com
Mozilla/4.0
-----
r17.mk.cnb.yahoo.com
m23.mk.cnb.yahoo.com
(multi)
Gaisbot/3.0+(robot05@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)
Gaisbot/3.0+(robot06@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)
-----
urlc1.mail.mud.yahoo.com
urlc2.mail.mud.yahoo.com
urlc3.mail.mud.yahoo.com
urlc4.mail.mud.yahoo.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
-----
rlx-2-2-10.labs.corp.yahoo.com
Yahoo! Mindset
q02.yrl.dcn.yahoo.com
Yahoo! Mindset
-----
ts2.test.mail.mud.yahoo.com
(68.142.203.133)
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
-----
203.141.52.37
203.141.52.39
203.141.52.44
(multi)
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/blog-search/)
203.141.52.47
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
ont211014008240.yahoo.co.jp
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
-----
mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)
-----
proxy1.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1)
proxy1.search.dcn.yahoo.net
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
-----
proxy1.search.dcn.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), libwww-perl/5.69
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
proxy2.search.dcn.yahoo.net
PostFavorites
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)
proxy3.search.dcn.yahoo.net
PostFavorites
Mozilla/5.0 (Windows; U; Win 9x 4.90; es-AR; rv:1.7.12) Gecko/20050919 Firefox/1.0.7
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
-----
msfp01.search.mud.yahoo.com
(side-scroll edited)
Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0
(compatible; Windows CE; Blazer/4.0; PalmSource; MOT-V300; SEC-SGHE315;
YahooSeeker/MA-R2D2;mobile-search-customer-care AT yahoo-inc dot com)
mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)
opnprc1.search.mud.yahoo.com
Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; [help.yahoo.com...] )
-----
oc4.my.dcn.yahoo.com
YahooFeedSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5; [publisher.yahoo.com...]
-----
-----
TROUBLE:
All referers beginning: "http://rds.yahoo.com/"
-----
-----
SPOOFS?
207-234-129-8.ptr.primarydns.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
202.139.36.72.reverse.layeredtech.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
(Aside: .layeredtech.com (now 403'd) routinely sends forth plagues of bots. Six different kinds just last month alone.)
###
but I have absolutely nothing to offer to China
I thought the same, but check your traffic before making such a rash response: China, Hong Kong, and other similar regions surprisingly have more people who can read and speak English than the rest of the so-called English-speaking countries combined.
SPOOFS?
Cloaked is more like it - I found a boatload of proxy sites that cloak directories to search engines and allow search engines to crawl thru their proxy and index pages which can result in some of your listings being hijacked. I've blocked layeredtech as a result, too much nonsense to deal with, and let legit crawlers thru one at a time.
Just to be clear: the site would absolutely be off-limits to mainland Chinese citizens under the current censorship rules. While a tiny percentage of them might have an academic interest in the site, they would likely go to jail if caught accessing it. And they would certainly not be allowed to have a practical interest in the subject it covers.
Therefore, in order to avoid 'tempting' them to commit a crime by accessing unauthorized information, and in order to save some bandwidth, I'd just as soon stay off the Slurp China spidering list. It's a cool-headed business and ethical decision -- Nothing rash about it, hsieh hsieh ni ("Thank you" in Mandarin).
From a technical viewpoint, I don't like the inconsistency of "Allowing" Slurp in robots.txt, and then 403'ing Slurp China, which assumes that it is also "Allowed" -- even though it feeds a separate index and uses a distinct User-agent.
Sorry for the OT post, and back to Yahoo 'bots...
Jim
Pfui, pardon the awful pun but about all I can think of is, phew, what a boatload of user agents. I'm sure it will be very helpful. Thanks
Please pass this criticism along. I've found MANY native-language bots that have multi-lingual text on the page discussing the bot; Yahoo should know better.
(Then again with "Naughty" in the title, Yahoo!China may have a problem;)
Hopefully someone from Yahoo! or Inktomi will join WW and be able to read this thread. If not at least they have the essence of it.
If anyone else has user agents to report please let me know and I'll forward those too.
Thanks and I hope everyone has a great weekend.
The Y! China team admits that the mp3spider does not yet observe robots.txt. They are willing to accept requests to be excluded from the crawl by sending an e-mail to cn-search-devel@yahoo-inc.com. They are also being pushed hard to start respecting robots.txt.
I've also been told that all user agents with Slurp in them respect robots.txt. If any of them do not do this please post as much detail as possible so it can be investigated and corrected. I've been told that someone from Inktomi will be looking at the individual threads I referenced.
I'm told that "Yahoo! Slurp DE", "Yahoo! Slurp China" and "Yahoo! Slurp" do recognize distinct User-Agent rules if provided.
Apparently Yahoo! Slurp DE is the crawler for a (D)irectory (E)ngine service that crawls preferred content explicitly listed by Yahoo! Search content service partners.
Slurp DE will respect robots.txt rules for User-Agent: Slurp DE or User-Agent: Yahoo! Slurp DE. If those user agents are not listed Slurp DE will obey User-Agent: Slurp.
Yahoo! Slurp China also obeys robots.txt rules for User-Agent: Slurp China or User-Agent: Yahoo! Slurp China. Again, if there is no explicit Slurp China rule it will follow the more generic User-Agent: Slurp rule.
If the above is not the case please post as many details as you can about the offense so it can be investigated and corrected.
I was pleasantly surprised to see them admit that with Y! growing so fast it's hard to maintain consistent central control over every division of the company. They're very much aware of this problem and are working hard to correct it.
I know it might sound like they're trying to placate us, but having worked in a large corporation for 20 years (I'm now retired) I can tell you it's often hard to coordinate policies and enforcement across all departments.
A directive might come down from higher up, but it's up to each department to implement the directive as they understand it.
Ideally there's someone in charge of oversight, but it often takes complaints about lack of adherence to the directive before anything is done about it.
Now that Y! is aware of these problems from a group of experienced webmasters I hope they'll do something about it.
And frankly, all we can do is hope this will be the case. IMO it's good that Y! is willing to listen to our complaints and appears to be attempting to do something to address them.
I hope someone from Y! or Inktomi will show up here and try to work with us. That might be unrealistic, but a guy can hope can't he? :)
Finally, for now at least, and in the interest of full disclosure, my contact at Y! Engineering has offered to send me a Y! shirt. I think that's very nice of him and I accepted the offer.
My previous experience was that if you put

User-agent: Slurp China
Disallow: /

in robots.txt, Slurp (international) would stop crawling too. So it was not a problem of Slurp China obeying robots.txt per se, but rather that Slurp (international) wouldn't differentiate "Slurp China" from just "Slurp", and any attempt to disallow/restrict Slurp China would be seen as disallowing/restricting Yahoo Slurp (international). As an act of good faith, I will try again (hoping not to dump my sites from Yahoo (international)) and report back.
Jim
Jim, if you can give me specific examples from your log files of this behavior I will forward them to Warren. Based on what you're reporting it seems as if there might be some flaw in the programming logic and that is something they can probably fix if we document it well enough. You're mighty brave to test this and I pray it doesn't backfire on you.
I mentioned that Google has several reps who are members of WW and often comment on confusing issues. I'm hoping that will offer some incentive for Y! to do the same thing. :)
Apparently, they've done some work on Slurp, but (understandably) didn't care to acknowledge it in your correspondence with them. Slurp (international) now seems to recognize "Slurp China" as a separate User-agent, although it demonstrably did not do this in the past - as recently as two months ago. I don't have the time to go dig back through all those raw logs to find the days where I saw Slurp (international) get confused and stop fetching because of a Slurp China Disallow record, but the following now seems to work for Slurp.
I will have to keep an eye on the server for the next Slurp China visit to see if the robots.txt is still correctly handled on the Slurp China side. To reiterate, my problem before was that I could not Disallow Slurp China without also inadvertently Disallowing Slurp (international), because Slurp (international) got confused and thought the Slurp China record applied to it.
Current robots.txt:
User-agent: Slurp China
Disallow: /
# Use meta robots

User-agent: Slurp
Crawl-delay: 3
Disallow: /dir1/file-prefix1
Disallow: /dir1/file-prefix2
Disallow: /dir1/file-prefix3
Disallow: /dir2/
Disallow: /dir3/subdir1/
Disallow: /dir4/
Disallow: /dir5/
Disallow: /dir6/subdir1/
Disallow: /dir7/subdir1/
Disallow: /dir8-prefix
Resulting raw log:
68.142.249.74 - - [09/Jun/2006:16:29:07 -0500] "GET /robots.txt HTTP/1.0" 200 11316 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
68.142.249.140 - - [09/Jun/2006:16:30:53 -0500] "GET /my_page.html HTTP/1.0" 304 - "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
All filepath info above has been obscured.
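For anyone compiling similar evidence, combined-format lines like the excerpt above can be pulled apart with a short script. This is a rough sketch that assumes the standard Apache combined LogFormat:

```python
import re

# Rough parser for Apache combined-format log lines.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('68.142.249.74 - - [09/Jun/2006:16:29:07 -0500] '
        '"GET /robots.txt HTTP/1.0" 200 11316 "-" '
        '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
        'http://help.yahoo.com/help/us/ysearch/slurp)"')

m = LOG_RE.match(line)
print(m.group("host"), m.group("status"))  # 68.142.249.74 200
print(m.group("agent"))
```

Grouping the extracted agent strings by host or IP makes it much easier to produce the kind of UA-per-host listing posted earlier in this thread.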
Jim
Well sorry to say, but Slurp China is still naughty, and ignores the "Disallow: /" record in my previous post.
202.160.180.79 - - [11/Jun/2006:12:39:35 -0400] "GET /robots.txt HTTP/1.0" 200 11316 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
202.160.180.129 - - [11/Jun/2006:12:39:38 -0400] "GET /some_page.html HTTP/1.0" 403 737 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
Interestingly, Slurp China is now uniquely identifying itself as it fetches robots.txt; I don't think I've noticed it doing that before. I believe it used to fetch robots.txt as "Slurp" and then proceed to spider the site with the "Slurp China" user-agent string.
But we still have a problem with Slurp China not respecting robots.txt -- at least with the "Slurp China" name in the robots.txt record.
Jim
But at first, you're right, Jim, it didn't ask for robots.txt by itself, but always within seconds of regular Slurp asking for the same. That was when I first saw Slurp China, back in November 2005:
lj9118.inktomisearch.com - - [17/Nov/2005:02:23:03 -0800] "GET /robots.txt HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
lj9083.inktomisearch.com - - [17/Nov/2005:02:23:05 -0800] "GET /file.html HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
Within a few weeks, it started asking for just robots.txt, using its own ID:
lj9119.inktomisearch.com - - [01/Dec/2005:08:41:09 -0800] "GET /robots.txt HTTP/1.0" 200 3990 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
And thereafter, robots.txt plus a single (and robots.txt-Disallowed) file, akin to your excerpt:
lj9119.inktomisearch.com - - [14/Feb/2006:23:42:30 -0800] "GET /robots.txt HTTP/1.0" 200 6401 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj9062.inktomisearch.com - - [14/Feb/2006:23:42:37 -0800] "GET /dir/file.html HTTP/1.0" 200 29109 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
I tried just letting it have robots.txt and Forbidding (not just Disallowing) all else but it didn't miss a beat. So for months now, I've 403'd it re everything.
Yet still it comes:
lj910179.inktomisearch.com - - [11/Jun/2006:23:37:14 -0700] "GET /robots.txt HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj910053.inktomisearch.com - - [11/Jun/2006:23:37:21 -0700] "GET /dir/file.html HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
And as mentioned, from Day One, its info page has been inaccessible to the Chinese font- and language-challenged.
If Slurp China hadn't been a Yahoo spawn, it would've been a goner six months ago. But because of its heritage, I tried to work with it, and/or around it. However, as a direct result of its behavior, nowadays I have far less tolerance for all of Yahoo's countless UAs/IPs/Hosts and their seemingly Yahoo-beneficial screw-ups.
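The "let it have robots.txt, Forbid all else" tactic described above can be sketched as a tiny WSGI app. This is a simplification: the UA substring check is deliberately naive, and a real deployment would usually do this in the server configuration instead:

```python
# Minimal WSGI sketch: allow robots.txt, 403 everything else for a
# targeted user agent. The substring match on the UA is naive by design.
def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/")
    if "Slurp China" in ua and path != "/robots.txt":
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

As the logs above show, a bot that ignores even a 403'd robots.txt will keep coming back regardless; this only stops it from fetching content.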
P.S.
A few more oddities for your compilation, GaryK:
dcf1.labs.corp.yahoo.com
NO UA
demo03.labs.corp.yahoo.com
NO UA
search1.labs.corp.yahoo.com
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
And one more curious lineage [webmasterworld.com] --
Pfui, I'll send your comments, and Jim's as well to the guy at Inktomi.
I really wish I could get them to read this thread. Maybe they have. Maybe they just need more time. I get the impression there are a lot of different divisions of Y! involved with this so it might take some time for each division to fix their own problems.
I'm wondering what to do in the interim. I have no desire to let any user agents visit my sites from China. I wonder sometimes if that's a fair thing to do. If someone wants to visit one of my sites from China shouldn't it be that person's responsibility to deal with the ramifications? I'm providing a service, but not a babysitting service. Therefore if Y! will be well-behaved I almost see it as discrimination if I don't let them index my sites.
There are a few dynamic robots.txt Perl scripts out on the net, with instructions on implementation, if anyone wants to give it a try ;)
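The scripts mentioned are Perl, but the idea is simple enough to sketch in Python as a CGI-style handler: serve a stricter file to user agents reported as misbehaving. The blocked list and rule text here are illustrative only, not a recommendation:

```python
import os

# Illustrative rule sets for a dynamic robots.txt.
STRICT = "User-agent: *\nDisallow: /\n"
NORMAL = "User-agent: *\nDisallow: /cgi-bin/\n"

def robots_for(user_agent):
    # UAs reported as non-compliant in this thread; purely illustrative.
    blocked = ("Slurp China", "mp3Spider")
    if any(b.lower() in user_agent.lower() for b in blocked):
        return STRICT
    return NORMAL

if __name__ == "__main__":  # run as a CGI script mapped to /robots.txt
    print("Content-Type: text/plain\r\n")
    print(robots_for(os.environ.get("HTTP_USER_AGENT", "")), end="")
```

Of course, this only helps against crawlers that actually read and obey robots.txt in the first place.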
No, I'm not going to experiment with a live site beyond what I've done -- Yahoo said it should work with "Slurp" and/or "Slurp China" and that's been disproven in my test. The ball is in their court, unless they want to hire me as a consultant <joke>.
They need a serious re-write of their robots.txt page to fully explain the various crawler variants, their functions, their robots.txt names, and the interactions between robots.txt records when using multiple Slurp control records. This page should be as universal as possible, and as noted by IncrediBill and myself above, should be available in multiple languages.
Jim
Thanks, fiestagirl. Looks like I've got to bug Warren again. I'll keep bugging him, politely of course. Would it be too egotistical to state that if enough webmasters from WW ban Y! it will hurt them more than it will hurt us? Probably, but that's how I feel right now. :)
Jim, I appreciate your concerns about testing things on live sites. What can I do to attract some of these user agents? I've got a few sites that I use for testing things and it doesn't matter to me what happens to them. None of the major SEs ever visit those sites. I know I could get them to visit if I provide a link from one of my real sites but like you I'm hesitant to take that risk.
That 'attention-getting' tactic just isn't likely to work. I'm willing to accept that Yahoo! and all the other major search engines make a good-faith effort to comply with robots.txt, but that coding errors, bugs, database disconnects, and misunderstandings of the 'protocol' do happen.
The only reason I ban any major 'bot from any page or cloak any page is to keep that page out of the index. And the only reasons I do that are:
Bottom line is that I'm a realist and a pragmatist; This is business. So I don't ban anybody out of malice or spite. I just decide if I need their traffic or not, and if not, 403. If Yahoo! were to publish a statement that they intended to disregard robots.txt in the future, I still wouldn't ban them. But they'd be seeing a heckuva lot more in the Vary: User-agent class... ;)
I posted the exact structure of robots.txt that Slurp China is choking on above, with the URLs obscured to comply with the WebmasterWorld TOS and my own desire for privacy. But other than those changes, the example is a letter-perfect rendition of my actual code. I think Yahoo! can easily test it themselves, if they're so inclined.
Also, the problem is in parsing User-agent names, most likely. Anybody could do a 'less risky' test by disallowing just a single URL-path to Slurp China if they wanted to. I suspect they'd see the same failure I did.
Jim
I personally hope they ignore all of the whining and hand-wringing that goes on in this and other forums; the heading on this site says it's for "Web professionals," after all. The important part is to extract the technical information (if any), address that (publicly or privately), and ignore the emotional stuff.
"Experience is what allows you to recognize a mistake when you make it again"... :)
It's not so important that they answer, but rather that they look into this Slurp China problem and either fix their code or publish comprehensive and correct robots.txt info for Webmasters wishing to distinguish between the various versions of Slurp. It's always surprising to me to find problems with robots and their documentation; without a working robot and Webmasters who understand its needs/requirements with respect to robots.txt files, all the rest of the overlaying search engine effort is potentially crippled.
Hope for the best, plan for the worst, and don't let it get you down... (It's only castles burning (Yeah, showing my 'experience' there, too)).
Jim