Sitemaps, Meta Data, and robots.txt Forum

Yahoo Robots Going Crazy
belfastboy (5+ Year Member)
Msg#: 895 posted 10:31 am on Mar 23, 2006 (gmt 0)

Hi Guys,

For the past few weeks (since I added AdSense, actually), the Yahoo robots have been going mad! Yet I'm still not placed very well in Yahoo's search results!

Sessions with user-agent Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]): 4452

Is this normal? How can I capitalise on this?

R

 

marketingmagic (5+ Year Member)
Msg#: 895 posted 4:04 pm on Apr 23, 2006 (gmt 0)

Anyone have any clue how to get Slurp to actually obey the ban in robots.txt?

We have all spiders banned on our .ca version, yet Yahoo and MSN still index the site.

They claim adherence to the protocol, yet in reality they don't.

Oh, and as for Slurp going crazy: that's just the way Slurp is. It's always the most active spider on our site. Just the way they do it. A waste of everyone's bandwidth, but what can you do?

Pfui (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 895 posted 7:39 pm on Apr 23, 2006 (gmt 0)

belfastboy --

AdSense is a G thing, with its own crawlers (Mediapartners-Google*; Mediapartners-Google/2.1), and its presence or absence shouldn't affect Yahoo's crawlers.
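(Side note, strictly a sketch: if you ever wanted to address the AdSense crawler in your own robots.txt, the record would look something like the lines below. An empty Disallow means "allow everything" -- actually disallowing Mediapartners-Google would stop it reading your pages for ad targeting, so most people leave it wide open.)

# Sketch only: explicitly allow the AdSense crawler site-wide
User-agent: Mediapartners-Google
Disallow: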

belfastboy and marketingmagic --

1.) On my sites, all of Yahoo's crawlers respect robots.txt except one (which I then block via mod_rewrite). Here's the bad one:

Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
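For anyone who wants to do the same, a bare-bones version of such a mod_rewrite block for an Apache .htaccess looks something like this; it keys on the "Slurp China" portion of that UA string (the dot in the pattern matches the space):

# Sketch: return 403 Forbidden to Yahoo! Slurp China, matched by user-agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Slurp.China [NC]
RewriteRule .* - [F,L]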

2.) The following robots.txt instructions work for the rest of Yahoo's crawlers (of which there are a LOT). There is some question as to the upper- and lower-case names of a number of them, so I include multiple versions just in case.

These instructions are excerpted from my robots.txt file, and include three parts: reference notes to myself at the top (the # means they're to be ignored by crawlers), all disallowed Yahoo-related crawlers in the middle, and then "Slurp" -- the only one I allow, and even then only with very specific instructions.

Again, I find the following effectively shuts out all of Y!'s crawlers (except for "China" mentioned above) and also successfully controls "Slurp":

#
# YAHOO-related
# Slurp: [help.yahoo.com...]
# HOST: .inktomisearch.com; .mail.mud.yahoo.com
# Slurp CHINA: [misc.yahoo.com.cn...]
# Slurp DE: [help.yahoo.com...]
# Blogs: [help.yahoo.com...]
# MM: mms dash mmcrawler dash support at yahoo dash inc dot com
# Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
# Yahoo! Mindset (http://mindset.research.yahoo.com/)
#
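# Note: all of the User-agent lines in the next group share the
# single "Disallow: /" rule at the end of the group.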

User-agent: Yahoo-Blogs
User-agent: Yahoo-Blogs/v3.9
User-agent: Yahoo-MMCrawler
User-agent: Yahoo-MMCrawler/3.x
User-agent: YahooYSMcm
User-agent: YahooYSMcm/2.0.0
User-agent: Yahoo! Mindset
User-agent: Y!J-BSC
User-agent: Y!J-BSC/1.0
User-agent: Y!J-BSC/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j-bsc
User-agent: y!j-bsc/1.0
User-agent: y!j-bsc/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Y!J
User-agent: Y!J/1.0
User-agent: Y!J/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: y!j
User-agent: y!j/1.0
User-agent: y!j/1.0 (http://help.yahoo.co.jp/help/jp/search/indexing/indexing-15.html)
User-agent: Mozilla/4.0 (compatible; Y!J; for robot study; keyoshid)
User-agent: Mozilla/4.0 (compatible; y!j; for robot study; keyoshid)
User-agent: Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
User-agent: Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]
Disallow: /

User-agent: Slurp
Crawl-delay: 30
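# Crawl-delay is non-standard; Yahoo reads it as the number of
# seconds Slurp is asked to wait between fetches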
Disallow: /dir
Disallow: /filename.cgi
Disallow: /cgi-bin
Disallow: /dir1/filename1.html
Disallow: /dir1/filename2.html
Disallow: /dir1/filename1.txt
Disallow: /dir1/filename2.txt
Disallow: /dir1/filename3.txt
Disallow: /dir1/filename4.txt
Disallow: /dir1/filename5.txt
Disallow: /dir1/filename6.txt
Disallow: /dir1/sub-dir
Disallow: /dir1/filename3.html
Disallow: /dir1/filename4.html
Disallow: /dir1/filename5.html
Disallow: /dir1/filename6.html
Disallow: /dir2/sub-dir1
Disallow: /dir2/sub-dir2
Disallow: /dir2/sub-dir3
Disallow: /dir3
Disallow: /dir4
Disallow: /dir5
Disallow: /dir6
Disallow: /dir7
Disallow: /dir8
Disallow: /filename.txt
Disallow: /filename.html
Disallow: /dir9/sub-dir
Disallow: /dir10/filename.html
Disallow: /dir11/sub-dir
Disallow: /dir12/filename.html
Disallow: /dir13/sub-dir
Disallow: /dir14/filename.html

#

P.S.
I've also found that no one obeys crawl delays, even when their info pages say they do. I include them anyway; hope springs eternal. :)
