
Yahoo Robots Going Crazy

     
10:31 am on Mar 23, 2006 (gmt 0)
New User | 10+ Year Member | joined: Feb 9, 2006 | posts: 16 | votes: 0

Hi Guys,

For the past few weeks, since I added AdSense actually, the Yahoo robots have been going mad! Yet I'm still not getting placed very well in Yahoo's search engine!

Sessions with tag Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...] - 4452
Is this normal? How can I capitalise on this?

R

4:04 pm on Apr 23, 2006 (gmt 0)
Full Member | 10+ Year Member | joined: Oct 25, 2005 | posts: 307 | votes: 0


Anyone have any clue how to get Slurp to actually listen to the ban in robots.txt?

We have all spiders banned on our .ca version (a blanket block like the one sketched below), yet Yahoo and MSN still index the site.
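For reference, the blanket ban is nothing fancy: just the wildcard user-agent with a root disallow. A minimal sketch, not our actual .ca file:

# Ban every compliant crawler from the whole site
User-agent: *
Disallow: /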

They claim adherence to this protocol, yet in reality they don't.

Oh - as for Slurp going crazy, that's just the way Slurp is. It's always the most active spider on our site. Just the way they do it. A waste of everyone's bandwidth, but what can you do?

7:39 pm on Apr 23, 2006 (gmt 0)
Senior Member | WebmasterWorld Senior Member 10+ Year Member | joined: Nov 5, 2005 | posts: 2038 | votes: 1


belfastboy --

AdSense is a G thing, with its own crawlers (Mediapartners-Google*; Mediapartners-Google/2.1), and its presence or absence shouldn't affect Yahoo's crawlers.

belfastboy and marketingmagic --

1.) On my sites, all of Yahoo's crawlers respect robots.txt except one (which I then block via mod_rewrite; a rough sketch of that rule is below). Here's the bad one:

Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
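For what it's worth, the mod_rewrite rule is along these lines (a simplified .htaccess sketch keyed on that user-agent, not my exact rule):

# Send 403 Forbidden to the "Yahoo! Slurp China" crawler
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Slurp China" [NC]
RewriteRule .* - [F]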

2.) The following robots.txt instructions work for the rest of Yahoo's crawlers (of which there are a LOT). There is some question as to the upper- and lower-case names of a number of them, so I include multiple versions just in case.

These instructions are excerpted from my robots.txt file, and include three parts: reference notes to myself at the top (the # means they're to be ignored by crawlers), all disallowed Yahoo-related crawlers in the middle, and then "Slurp" -- the only one I allow, and even then only with very specific instructions.

Again, I find the following effectively shuts out all of Y!'s crawlers (except for "China" mentioned above) and also successfully controls "Slurp":

#
# YAHOO-related
# Slurp: [help.yahoo.com...]
# HOST: .inktomisearch.com; .mail.mud.yahoo.com
# Slurp CHINA: [misc.yahoo.com.cn...]
# Slurp DE: [help.yahoo.com...]
# Blogs: [help.yahoo.com...]
# MM: mms dash mmcrawler dash support at yahoo dash inc dot com
# Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
# Yahoo! Mindset (http://mindset.research.yahoo.com/)
#

User-agent: Yahoo-Blogs
User-agent: Yahoo-Blogs/v3.9
User-agent: Yahoo-MMCrawler
User-agent: Yahoo-MMCrawler/3.x
User-agent: YahooYSMcm
User-agent: YahooYSMcm/2.0.0
User-agent: Yahoo! Mindset
User-agent: Y!J-BSC
User-agent: Y!J-BSC/1.0
User-agent: Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j-bsc
User-agent: y!j-bsc/1.0
User-agent: y!j-bsc/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Y!J
User-agent: Y!J/1.0
User-agent: Y!J/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j
User-agent: y!j/1.0
User-agent: y!j/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Mozilla/4.0 (compatible; Y!J; for robot study; keyoshid)
User-agent: Mozilla/4.0 (compatible; y!j; for robot study; keyoshid)
User-agent: Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
User-agent: Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]
Disallow: /

User-agent: Slurp
Crawl-delay: 30
Disallow: /dir
Disallow: /filename.cgi
Disallow: /cgi-bin
Disallow: /dir1/filename1.html
Disallow: /dir1/filename2.html
Disallow: /dir1/filename1.txt
Disallow: /dir1/filename2.txt
Disallow: /dir1/filename3.txt
Disallow: /dir1/filename4.txt
Disallow: /dir1/filename5.txt
Disallow: /dir1/filename6.txt
Disallow: /dir1/sub-dir
Disallow: /dir1/filename3.html
Disallow: /dir1/filename4.html
Disallow: /dir1/filename5.html
Disallow: /dir1/filename6.html
Disallow: /dir2/sub-dir1
Disallow: /dir2/sub-dir2
Disallow: /dir2/sub-dir3
Disallow: /dir3
Disallow: /dir4
Disallow: /dir5
Disallow: /dir6
Disallow: /dir7
Disallow: /dir8
Disallow: /filename.txt
Disallow: /filename.html
Disallow: /dir9/sub-dir
Disallow: /dir10/filename.html
Disallow: /dir11/sub-dir
Disallow: /dir12/filename.html
Disallow: /dir13/sub-dir
Disallow: /dir14/filename.html

#

P.S.
I've also found that no one obeys crawl delays, even when their info pages say they do. I include them anyway -- hope springs eternal :)