Forum Moderators: open
Through a side project of mine I have a contact at Yahoo! Engineering whom I contacted yesterday. He forwarded my e-mail to someone in search ops. That person requested I send him a list of user agents that aren't respecting robots.txt.
To me this is a unique opportunity to see if Yahoo! is serious about addressing this increasingly annoying issue. And thanks to Dan I have permission to deviate from our usual format to compile this list.
Thanks in advance for your help.
Here's an odd one I'd like to know more about:
66.94.237.140 "" "/mypage.html" Proxy Detected -> VIA=1.1 proxy1.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
66.94.237.140 "" "/mypage2.html" Proxy Detected -> VIA=1.1 proxy1.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
66.94.237.142 "" "/mypage3.html" Proxy Detected -> VIA=1.1 proxy3.search.scd.yahoo.net:80 (squid/2.5.STABLE9) FORWARD=209.73.169.nnn CONNECT=
Looks like something crawling through a Yahoo proxy from the AltaVista IP block. Is this internal? A real crawler? Just someone surfing?
Any possibility we work it from the other side?
Could Yahoo give us a list of all their user agents and IP ranges where they are supposed to crawl from?
I don't mind letting Yahoo crawl, but it would be nice to know it's really Yahoo and not WAP, a customer crawling via proxy, a cache, or something else spoofing them, as I have a bunch of items like this setting off alarms and getting "errors" from my traps.
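One common way to answer "is it really Yahoo?" is forward-confirmed reverse DNS: reverse-resolve the IP to a host name, check the domain, then resolve that name forward again and make sure it round-trips to the same IP. A rough Python sketch; the accepted suffix list is an assumption based on the host names seen in this thread, not an official Yahoo list:

```python
import socket

# Domain suffixes assumed from the host names reported in this thread.
YAHOO_SUFFIXES = (".yahoo.com", ".yahoo.net", ".inktomisearch.com")

def is_real_yahoo_crawler(ip):
    """Forward-confirmed reverse DNS check for a claimed Yahoo crawler IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse lookup: IP -> name
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith(YAHOO_SUFFIXES):
        return False
    try:
        addrs = socket.gethostbyname_ex(host)[2]  # forward lookup: name -> IPs
    except socket.gaierror:
        return False
    return ip in addrs                            # the round trip must match
```

A spoofed UA from a random IP fails either the suffix check or the forward round trip, so it never gets treated as the real crawler.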
This is what I've managed to find on my own and the threads where I found them:
[webmasterworld.com...]
Yahoo! Mindset
66.228.182.177
66.228.182.183
66.228.182.185
66.228.182.187
66.228.182.188
66.228.182.190
[webmasterworld.com...]
Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]
72.30.142.24
[webmasterworld.com...]
Mozilla/4.0
66.228.173.150
[webmasterworld.com...]
YRL_ODP_CRAWLER
Sorry no IP Address but rDNS: rlx-1-1-1.labs.corp.yahoo.com
[webmasterworld.com...]
mp3Spider cn-search-devel at yahoo-inc dot com
202.165.102.179
I'm a pragmatist, but if Yahoo sees fit to have separate spiders for these markets, then they should realize that Webmasters might like to handle them differently, too. But unfortunately, this doesn't work:
User-agent: Slurp China
Disallow: /

User-agent: Slurp DE
Disallow: /cgi-bin

User-agent: Slurp
Disallow: /cgi-bin
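For reference, here is how a standards-style parser resolves records like those above when each is on its own lines. This is a sketch using Python's urllib.robotparser; example.com and the paths are placeholders:

```python
from urllib.robotparser import RobotFileParser

# The three records, each separated by a blank line.
rules = """\
User-agent: Slurp China
Disallow: /

User-agent: Slurp DE
Disallow: /cgi-bin

User-agent: Slurp
Disallow: /cgi-bin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Slurp China", "http://example.com/page.html"))  # False
print(rp.can_fetch("Slurp", "http://example.com/page.html"))        # True
print(rp.can_fetch("Slurp", "http://example.com/cgi-bin/x"))        # False
```

Note that urllib.robotparser uses the first record whose agent name matches (by substring), so the more specific Slurp China and Slurp DE records should come before the generic Slurp record.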
Jim
1.) Here's a TON of Yahoo-related User-agents (plus fellow travelers) ignoring robots.txt over the past 6-8 months, pretty much excluding most (I hope) of the Yahoo UAs I've already posted about on WW -- from umpteen Host names, IPs, and countries, all pointed toward a single site.
2.) I prefer Apache to show Host names so where you see IPs is where there was no Host used.
3.) Some of these are MyWeb2 (caching bookmarker) and phone-related, but Yahoo still didn't ask for robots.txt before using a UA like libwww-perl.
4.) The sort is more by Host/IP than not but isn't alpha, sorry, because neither IPs, Hosts nor UAs lent themselves to a sensible listing -- plus there were too danged many of each!
5.) Keeping up with all of these and still more Yahoo UAs, robots.txt-compliant or not, is really too much work for the traffic Yahoo sends. Way too much.
-----
-----
lj9054.inktomisearch.com
(multi)
Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
-----
dp131.data.yahoo.com
Mozilla/4.0
rlx-1-2-1.labs.corp.yahoo.com
Mozilla/4.0
-----
r17.mk.cnb.yahoo.com
m23.mk.cnb.yahoo.com
(multi)
Gaisbot/3.0+(robot05@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)
Gaisbot/3.0+(robot06@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)
-----
urlc1.mail.mud.yahoo.com
urlc2.mail.mud.yahoo.com
urlc3.mail.mud.yahoo.com
urlc4.mail.mud.yahoo.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
-----
rlx-2-2-10.labs.corp.yahoo.com
Yahoo! Mindset
q02.yrl.dcn.yahoo.com
Yahoo! Mindset
-----
ts2.test.mail.mud.yahoo.com
(68.142.203.133)
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
-----
203.141.52.37
203.141.52.39
203.141.52.44
(multi)
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/blog-search/)
203.141.52.47
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
ont211014008240.yahoo.co.jp
Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
-----
mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)
-----
proxy1.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1)
proxy1.search.dcn.yahoo.net
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy2.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie)
proxy3.search.scd.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
-----
proxy1.search.dcn.yahoo.net
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0), libwww-perl/5.69
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
proxy2.search.dcn.yahoo.net
PostFavorites
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)
proxy3.search.dcn.yahoo.net
PostFavorites
Mozilla/5.0 (Windows; U; Win 9x 4.90; es-AR; rv:1.7.12) Gecko/20050919 Firefox/1.0.7
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
-----
msfp01.search.mud.yahoo.com
(side-scroll edited)
Nokia6682/2.0 (3.01.1) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 configuration/CLDC-1.1 UP.Link/6.3.0.0.0
(compatible; Windows CE; Blazer/4.0; PalmSource; MOT-V300; SEC-SGHE315;
YahooSeeker/MA-R2D2;mobile-search-customer-care AT yahoo-inc dot com)
mmcrm4070.search.mud.yahoo.com
Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)
opnprc1.search.mud.yahoo.com
Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; [help.yahoo.com...] )
-----
oc4.my.dcn.yahoo.com
YahooFeedSeeker/1.0 (compatible; Mozilla 4.0; MSIE 5.5; [publisher.yahoo.com...]
-----
-----
TROUBLE:
All referers beginning: "http://rds.yahoo.com/"
-----
-----
SPOOFS?
207-234-129-8.ptr.primarydns.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
202.139.36.72.reverse.layeredtech.com
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
(Aside: .layeredtech.com (now 403'd) routinely sends forth plagues of bots. Six different kinds just last month alone.)
###
but I have absolutely nothing to offer to China
I thought the same, but check your traffic before making such a rash response: China, Hong Kong, and other similar regions surprisingly have more people who can read and speak English than the rest of the so-called English-speaking countries combined.
SPOOFS?
Cloaked is more like it - I found a boatload of proxy sites that cloak directories to search engines and allow search engines to crawl thru their proxy and index pages which can result in some of your listings being hijacked. I've blocked layeredtech as a result, too much nonsense to deal with, and let legit crawlers thru one at a time.
Just to be clear: the site would absolutely be off-limits to mainland Chinese citizens under the current censorship rules. While a tiny percentage of them might have an academic interest in the site, they would likely go to jail if caught accessing it. And they would certainly not be allowed to have a practical interest in the subject it covers.
Therefore, in order to avoid 'tempting' them to commit a crime by accessing unauthorized information, and in order to save some bandwidth, I'd just as soon stay off the Slurp China spidering list. It's a cool-headed business and ethical decision -- Nothing rash about it, hsieh hsieh ni ("Thank you" in Mandarin).
From a technical viewpoint, I don't like the inconsistency of "Allowing" Slurp in robots.txt, and then 403'ing Slurp China, which assumes that it is also "Allowed" -- even though it feeds a separate index and uses a distinct User-agent.
Sorry for the OT post, and back to Yahoo 'bots...
Jim
Pfui, pardon the awful pun but about all I can think of is, phew, what a boatload of user agents. I'm sure it will be very helpful. Thanks
Please pass this criticism along. I've found MANY native-language bots that have multi-lingual text on the page discussing the bot; Yahoo should know better.
(Then again with "Naughty" in the title, Yahoo!China may have a problem;)
Hopefully someone from Yahoo! or Inktomi will join WW and be able to read this thread. If not at least they have the essence of it.
If anyone else has user agents to report please let me know and I'll forward those too.
Thanks and I hope everyone has a great weekend.
The Y! China team admits that the mp3spider does not yet observe robots.txt. They are willing to accept requests to be excluded from the crawl by sending an e-mail to cn-search-devel@yahoo-inc.com. They are also being pushed hard to start respecting robots.txt.
I've also been told that all user agents with Slurp in them respect robots.txt. If any of them do not do this please post as much detail as possible so it can be investigated and corrected. I've been told that someone from Inktomi will be looking at the individual threads I referenced.
I'm told that "Yahoo! Slurp DE", "Yahoo! Slurp China" and "Yahoo! Slurp" do recognize distinct User-Agent rules if provided.
Apparently Yahoo! Slurp DE is the crawler for a (D)irectory (E)ngine service that crawls preferred content explicitly listed by Yahoo! Search content service partners.
Slurp DE will respect robots.txt rules for User-Agent: Slurp DE or User-Agent: Yahoo! Slurp DE. If those user agents are not listed Slurp DE will obey User-Agent: Slurp.
Yahoo! Slurp China also obeys robots.txt rules for User-Agent: Slurp China or User-Agent: Yahoo! Slurp China. Again, if there is no explicit Slurp China rule it will follow the more generic User-Agent: Slurp rule.
If the above is not the case please post as many details as you can about the offense so it can be investigated and corrected.
I was pleasantly surprised to see them admit that with Y! growing so fast it's hard to maintain consistent central control over every division of the company. They're very much aware of this problem and are working hard to correct it.
I know it might sound like they're trying to placate us, but having worked in a large corporation for 20 years (I'm now retired) I can tell you it's often hard to coordinate policies and enforcement across all departments.
A directive might come down from higher up, but it's up to each department to implement the directive as they understand it.
Ideally there's someone in charge of oversight, but it often takes complaints about lack of adherence to the directive before anything is done about it.
Now that Y! is aware of these problems from a group of experienced webmasters I hope they'll do something about it.
And frankly, all we can do is hope this will be the case. IMO it's good that Y! is willing to listen to our complaints and appears to be attempting to do something to address them.
I hope someone from Y! or Inktomi will show up here and try to work with us. That might be unrealistic, but a guy can hope can't he? :)
Finally, for now at least, and in the interest of full disclosure, my contact at Y! Engineering has offered to send me a Y! shirt. I think that's very nice of him and I accepted the offer.
My previous experience was that if you put

User-agent: Slurp China
Disallow: /

in robots.txt, Slurp (international) would stop crawling too. So it was not a problem of Slurp China obeying robots.txt per se, but rather that Slurp (international) wouldn't differentiate "Slurp China" from just "Slurp", and any attempt to disallow/restrict Slurp China would be seen as disallowing/restricting Yahoo Slurp (international). As an act of good faith, I will try again (hoping not to dump my sites from Yahoo (international)) and report back.
Jim
Jim, if you can give me specific examples from your log files of this behavior I will forward them to Warren. Based on what you're reporting it seems as if there might be some flaw in the programming logic and that is something they can probably fix if we document it well enough. You're mighty brave to test this and I pray it doesn't backfire on you.
I mentioned that Google has several reps who are members of WW and often comment on confusing issues. I'm hoping that will offer some incentive for Y! to do the same thing. :)
Apparently, they've done some work on Slurp, but (understandably) didn't care to acknowledge it in your correspondence with them. Slurp (international) now seems to recognize "Slurp China" as a separate User-agent, although it demonstrably did not do this in the past - as recently as two months ago. I don't have the time to go dig back through all those raw logs to find the days where I saw Slurp (international) get confused and stop fetching because of a Slurp China Disallow record, but the following now seems to work for Slurp.
I will have to keep an eye on the server for the next Slurp China visit to see if the robots.txt is still correctly handled on the Slurp China side. To reiterate, my problem before was that I could not Disallow Slurp China without also inadvertently Disallowing Slurp (international), because Slurp (international) got confused and thought the Slurp China record applied to it.
Current robots.txt:
User-agent: Slurp China
Disallow: /
# Use meta robots

User-agent: Slurp
Crawl-delay: 3
Disallow: /dir1/file-prefix1
Disallow: /dir1/file-prefix2
Disallow: /dir1/file-prefix3
Disallow: /dir2/
Disallow: /dir3/subdir1/
Disallow: /dir4/
Disallow: /dir5/
Disallow: /dir6/subdir1/
Disallow: /dir7/subdir1/
Disallow: /dir8-prefix
Resulting raw log:
68.142.249.74 - - [09/Jun/2006:16:29:07 -0500] "GET /robots.txt HTTP/1.0" 200 11316 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
68.142.249.140 - - [09/Jun/2006:16:30:53 -0500] "GET /my_page.html HTTP/1.0" 304 - "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
All filepath info above has been obscured.
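For anyone compiling similar evidence, combined-format lines like the excerpt above can be pulled apart with a short script. This is a rough sketch that assumes the standard Apache combined LogFormat:

```python
import re

# Rough parser for Apache combined-format log lines.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('68.142.249.74 - - [09/Jun/2006:16:29:07 -0500] '
        '"GET /robots.txt HTTP/1.0" 200 11316 "-" '
        '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
        'http://help.yahoo.com/help/us/ysearch/slurp)"')

m = LOG_RE.match(line)
print(m.group("host"), m.group("status"))  # 68.142.249.74 200
print(m.group("agent"))
```

Grouping the extracted agent strings by host or IP makes it much easier to produce the kind of UA-per-host listing posted earlier in this thread.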
Jim
Well sorry to say, but Slurp China is still naughty, and ignores the "Disallow: /" record in my previous post.
202.160.180.79 - - [11/Jun/2006:12:39:35 -0400] "GET /robots.txt HTTP/1.0" 200 11316 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
202.160.180.129 - - [11/Jun/2006:12:39:38 -0400] "GET /some_page.html HTTP/1.0" 403 737 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
Interestingly, Slurp China is now uniquely identifying itself as it fetches robots.txt; I don't think I've noticed it doing that before. I believe it used to fetch robots.txt as "Slurp" and then proceed to spider the site with the "Slurp China" user-agent string.
But we still have a problem with Slurp China not respecting robots.txt -- at least with the "Slurp China" name in the robots.txt record.
Jim
But at first, you're right, Jim, it didn't ask for robots.txt by itself, but always within seconds of regular Slurp asking for the same. That was when I first saw Slurp China, back in November 2005:
lj9118.inktomisearch.com - - [17/Nov/2005:02:23:03 -0800] "GET /robots.txt HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
lj9083.inktomisearch.com - - [17/Nov/2005:02:23:05 -0800] "GET /file.html HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
Within a few weeks, it started asking for just robots.txt, using its own ID:
lj9119.inktomisearch.com - - [01/Dec/2005:08:41:09 -0800] "GET /robots.txt HTTP/1.0" 200 3990 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
And thereafter, robots.txt plus a single (and robots.txt-Disallowed) file, akin to your excerpt:
lj9119.inktomisearch.com - - [14/Feb/2006:23:42:30 -0800] "GET /robots.txt HTTP/1.0" 200 6401 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj9062.inktomisearch.com - - [14/Feb/2006:23:42:37 -0800] "GET /dir/file.html HTTP/1.0" 200 29109 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
I tried just letting it have robots.txt and Forbidding (not just Disallowing) all else but it didn't miss a beat. So for months now, I've 403'd it re everything.
Yet still it comes:
lj910179.inktomisearch.com - - [11/Jun/2006:23:37:14 -0700] "GET /robots.txt HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj910053.inktomisearch.com - - [11/Jun/2006:23:37:21 -0700] "GET /dir/file.html HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
And as mentioned, from Day One, its info page has been inaccessible to the Chinese font- and language-challenged.
If Slurp China hadn't been a Yahoo spawn, it would've been a goner six months ago. But because of its heritage, I tried to work with it, and/or around it. However, as a direct result of its behavior, nowadays I have far less tolerance for all of Yahoo's countless UAs/IPs/Hosts and their seemingly Yahoo-beneficial screw-ups.
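The "let it have robots.txt, Forbid all else" tactic described above can be sketched as a tiny WSGI app. This is a simplification: the UA substring check is deliberately naive, and a real deployment would usually do this in the server configuration instead:

```python
# Minimal WSGI sketch: allow robots.txt, 403 everything else for a
# targeted user agent. The substring match on the UA is naive by design.
def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/")
    if "Slurp China" in ua and path != "/robots.txt":
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

As the logs above show, a bot that ignores even a 403'd robots.txt will keep coming back regardless; this only stops it from fetching content.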
P.S.
A few more oddities for your compilation, GaryK:
dcf1.labs.corp.yahoo.com
NO UA
demo03.labs.corp.yahoo.com
NO UA
search1.labs.corp.yahoo.com
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
And one more curious lineage [webmasterworld.com] --
Pfui, I'll send your comments, and Jim's as well to the guy at Inktomi.
I really wish I could get them to read this thread. Maybe they have. Maybe they just need more time. I get the impression there are a lot of different divisions of Y! involved with this so it might take some time for each division to fix their own problems.
I'm wondering what to do in the interim. I have no desire to let any user agents visit my sites from China. I wonder sometimes if that's a fair thing to do. If someone wants to visit one of my sites from China shouldn't it be that person's responsibility to deal with the ramifications? I'm providing a service, but not a babysitting service. Therefore if Y! will be well-behaved I almost see it as discrimination if I don't let them index my sites.
There are a few dynamic robots.txt Perl scripts out on the net, with instructions on implementation, if anyone wants to give it a try ;)
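The scripts mentioned are Perl, but the idea is simple enough to sketch in Python as a CGI-style handler: serve a stricter file to user agents reported as misbehaving. The blocked list and rule text here are illustrative only, not a recommendation:

```python
import os

# Illustrative rule sets for a dynamic robots.txt.
STRICT = "User-agent: *\nDisallow: /\n"
NORMAL = "User-agent: *\nDisallow: /cgi-bin/\n"

def robots_for(user_agent):
    # UAs reported as non-compliant in this thread; purely illustrative.
    blocked = ("Slurp China", "mp3Spider")
    if any(b.lower() in user_agent.lower() for b in blocked):
        return STRICT
    return NORMAL

if __name__ == "__main__":  # run as a CGI script mapped to /robots.txt
    print("Content-Type: text/plain\r\n")
    print(robots_for(os.environ.get("HTTP_USER_AGENT", "")), end="")
```

Of course, this only helps against crawlers that actually read and obey robots.txt in the first place.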
No, I'm not going to experiment with a live site beyond what I've done -- Yahoo said it should work with "Slurp" and/or "Slurp China" and that's been disproven in my test. The ball is in their court, unless they want to hire me as a consultant <joke>.
They need a serious re-write of their robots.txt page to fully explain the various crawler variants, their functions, their robots.txt names, and the interactions between robots.txt records when using multiple Slurp control records. This page should be as universal as possible, and as noted by IncrediBill and myself above, should be available in multiple languages.
Jim
Thanks, fiestagirl. Looks like I've got to bug Warren again. I'll keep bugging him, politely of course. Would it be too egotistical to state that if enough webmasters from WW ban Y! it will hurt them more than it will hurt us? Probably, but that's how I feel right now. :)
Jim, I appreciate your concerns about testing things on live sites. What can I do to attract some of these user agents? I've got a few sites that I use for testing things and it doesn't matter to me what happens to them. None of the major SEs ever visit those sites. I know I could get them to visit if I provide a link from one of my real sites but like you I'm hesitant to take that risk.
That 'attention-getting' tactic just isn't likely to work. I'm willing to accept that Yahoo! and all the other major search engines make a good-faith effort to comply with robots.txt, but that coding errors, bugs, database disconnects, and misunderstandings of the 'protocol' do happen.
The only reason I ban any major 'bot from any page or cloak any page is to keep that page out of the index. And the only reasons I do that are:
Bottom line is that I'm a realist and a pragmatist; This is business. So I don't ban anybody out of malice or spite. I just decide if I need their traffic or not, and if not, 403. If Yahoo! were to publish a statement that they intended to disregard robots.txt in the future, I still wouldn't ban them. But they'd be seeing a heckuva lot more in the Vary: User-agent class... ;)
I posted the exact structure of robots.txt that Slurp China is choking on above, with the URLs obscured to comply with the WebmasterWorld TOS and my own desire for privacy. But other than those changes, the example is a letter-perfect rendition of my actual code. I think Yahoo! can easily test it themselves, if they're so inclined.
Also, the problem is in parsing User-agent names, most likely. Anybody could do a 'less risky' test by disallowing just a single URL-path to Slurp China if they wanted to. I suspect they'd see the same failure I did.
Jim
I personally hope they ignore all of the whining and hand-wringing that goes on in this and other forums; the heading on this site says it's for "Web professionals," after all. The important part is to extract the technical information (if any), address that (publicly or privately), and ignore the emotional stuff.
"Experience is what allows you to recognize a mistake when you make it again"... :)
It's not so important that they answer, but rather that they look into this Slurp China problem and either fix their code or publish comprehensive and correct robots.txt info for Webmasters wishing to distinguish between the various versions of Slurp. It's always surprising to me to find problems with robots and their documentation; without a working robot and Webmasters who understand its needs/requirements with respect to robots.txt files, all the rest of the overlaying search engine effort is potentially crippled.
Hope for the best, plan for the worst, and don't let it get you down... (It's only castles burning (Yeah, showing my 'experience' there, too)).
Jim