Yahoo Slurp misbehaving?
How Your Robots.txt May Play a Role
tigertom
msg:3209802
9:28 pm on Jan 5, 2007 (gmt 0)

User Agent : Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Server time : 20:43:23 1/05/2007
Apparent IP : 74.6.86.48
Remote Host : ct501212.inktomisearch.com

I have modified my robots.txt file recently, but not in a way that should cause this, I think.

BUT I did put in a rule that mentioned Slurp specifically; would that mean it won't read the User-agent: * directives that come later in the file?

User-agent: Googlebot-Image
Disallow: /
#
# User-agent: Mediapartners-Google
# Disallow: /
#
User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/

 

Staffa
msg:3210714
9:43 pm on Jan 6, 2007 (gmt 0)

I noticed the same today:

74.6.86.210 went for a page in a disallowed directory, but one which is linked to from outside that directory.

74.6.86.70 went for a page in the same disallowed directory which is not linked to from anywhere outside that directory.

74.6.87.71 went for the default page of the same disallowed directory - which does not exist - and got a 403.

The site has one User-agent: * in the robots.txt disallowing that directory to all.

jdMorgan
msg:3210757
10:54 pm on Jan 6, 2007 (gmt 0)

You don't really have those "#" characters between records, do you?

A blank line is required after each record, whether or not you have comment lines starting with "#". There's even one 'bot that used to consider a robots.txt file to be invalid without a blank line at the very end.
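
For illustration, a sketch of the same file with blank lines separating the records (directives as posted above, with the commented-out Mediapartners-Google record left out):

User-agent: Googlebot-Image
Disallow: /

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/

User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/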

Just a thought.

Jim

tigertom
msg:3211167
12:24 pm on Jan 7, 2007 (gmt 0)

Ha, that's it!

I thought I was being clever.

Thank you Jim.

tigertom
msg:3211494
8:53 pm on Jan 7, 2007 (gmt 0)

Um, no it isn't. It did it again, after I'd changed my robots.txt file.

Note that other SE bots, including other Slurp bots, are not tripping the filter.

Just this one:

Apparent IP : 74.6.86.48
Remote Host : ct501212.inktomisearch.com

Perhaps it takes time to digest the revised file?

tigertom
msg:3215387
11:12 pm on Jan 10, 2007 (gmt 0)

Nope, it's still doing it:

User Agent : Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Server time : 21:25:30 1/09/2007
Apparent IP : 74.6.86.220
Remote Host : ct501087.inktomisearch.com

Marcia
msg:3221974
9:31 am on Jan 17, 2007 (gmt 0)

I've been having the exact same thing happen, but it's hammering the directory that's only used if a site is suspended, which is ridiculous.

Are you on virtual hosting?

tigertom
msg:3222050
11:01 am on Jan 17, 2007 (gmt 0)

Yes, though I have to say I haven't had a notification in the last 48 hours.
It may have got the message.

Tim
msg:3222606
6:54 pm on Jan 17, 2007 (gmt 0)

Tigertom stated: "BUT I did put in a rule that mentioned Slurp specifically; would that mean it won't read the User-agent: * directives that come later in the file?"

That is correct. When there is an agent-specific rule, the crawler applies that rule, not the generic rule. The "User-Agent: *" rule is used only if no other rule matches, not applied in addition to other matching rules.

His listed /robots.txt file says:
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/

So Slurp is disallowed from /bloop/ and /blop/, but not disallowed from /shop/, /forum/ and /cgi-bin/.
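
One way to sanity-check this offline is with a robots.txt parser. A minimal sketch using Python's standard urllib.robotparser, which likewise applies a named User-agent record in preference to the generic one (paths taken from the file quoted above; example.com and SomeOtherBot are just placeholders):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Slurp matches its own record, so only /bloop/ and /blop/ apply to it.
print(rp.can_fetch("Slurp", "http://example.com/shop/item.html"))   # True (allowed)
print(rp.can_fetch("Slurp", "http://example.com/bloop/page.html"))  # False (disallowed)

# A bot with no named record falls back to the User-agent: * rules.
print(rp.can_fetch("SomeOtherBot", "http://example.com/shop/item.html"))  # False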

tigertom
msg:3222624
7:07 pm on Jan 17, 2007 (gmt 0)

I had changed it to the directive order below, and the Bad Slurp Bot still accessed the disallowed sub-directory. As I said, it seems to have stopped now, but is the following correct?

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

i.e. will all bots avoid the bottrap directory, and Slurp also avoid the specified ones, if they're obedient little bots?

jimbeetle
msg:3222637
7:20 pm on Jan 17, 2007 (gmt 0)

If I read Tim correctly (and that's Yahoo's Tim, by the way), then no, Slurp will only obey the Slurp-specified directive.

Now, big question and possibly big can of worms: Do other 'bots process specific and generic directives the same as Slurp? Or, do some obey both specific and generic? I've never happened to run across anything on this.

tigertom
msg:3222857
10:25 pm on Jan 17, 2007 (gmt 0)

Ok, a simple solution.

To make sure, just add the extra Disallow directives to the Slurp block (and the other Bots'). Problem solved.

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

I think you put the wildcard record first, for all the other, less important bots, then the bot-specific records after it.

If a bot comes across its name, it'll stop there.

I _think_ that's right :)

My objective here: to stop the most important SE bots from indexing the insubstantial CMS pages in /bloop/ and /blop/, except the AdSense bot :).

Thanks for the help.

Marcia
msg:3223214
6:16 am on Jan 18, 2007 (gmt 0)

The reason I asked if it's virtual hosting is because mine had nothing to do with my site; there was ample evidence of it being an issue with the server configuration.

Tech support found the problem and I was just notified that there was, indeed, a problem with the A-name setup and the error has been corrected.

It looked like an infinite loop was happening, the way Slurp was hammering away, and I believe that there's still something Yahoo needs to check into, since theirs is the only crawler that ran into this mess.

I filled out the support form for Search yesterday with as much detail as I could at the time. I wish there had at least been an auto-response, so that I could get some more details from the host, in addition to what I've already found out, and give them a follow-up, because if it's happened to some now it could well happen to others in the future.

tigertom
msg:3223430
11:47 am on Jan 18, 2007 (gmt 0)

I think I was wrong again: Googlebot tripped the filter despite my latest robots.txt file. I'm now trying this configuration:

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

The idea is that the main bots get to their own directives and stop, while the rest carry on to the wildcard record.

jimbeetle
msg:3223672
3:47 pm on Jan 18, 2007 (gmt 0)

You can shorten this up a bit:

User-agent: Slurp
User-agent: msnbot
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
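
As a quick check that one record with several User-agent lines behaves the same as three separate copies, a sketch (again with Python's urllib.robotparser; example.com and SomeOtherBot are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Slurp
User-agent: msnbot
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
""".splitlines())

# The three named bots share one record and are all kept out of /bloop/;
# an unnamed bot only gets the User-agent: * rules, which don't mention /bloop/.
for bot in ("Slurp", "msnbot", "googlebot", "SomeOtherBot"):
    print(bot, rp.can_fetch(bot, "http://example.com/bloop/page.html"))
# Slurp False, msnbot False, googlebot False, SomeOtherBot True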

tigertom
msg:3223860
5:27 pm on Jan 18, 2007 (gmt 0)

Ah, interesting. Thank you, Jim.

Marcia
msg:3224184
9:07 pm on Jan 18, 2007 (gmt 0)

Let me reiterate that Slurp can be (and was) tripped up by a server misconfiguration in virtual hosting. That being the case, nothing can be done on the site itself; it's a server issue.

That is EXACTLY what happened in my case, and once my host identified and corrected the issue, it stopped completely and has been 100% back to normal.

I've received an exceptionally nice response from the support team, including a reference to published information on Yahoo's site about what to do about crawl issues:

How can I reduce the number of requests you make on my web site? [help.yahoo.com]

That would apply under normal circumstances, but in my case it was a situation of being caught in an endless loop - now fixed, but it took some digging to find the cause. I also got a ton of referrers from other sites on the same server, which showed up in Webalizer, of all places.
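
Slurp also supports a Crawl-delay extension in robots.txt for throttling its request rate; a minimal sketch, with an illustrative value of five seconds:

User-agent: Slurp
Crawl-delay: 5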

More information on the YSearch Blog

[ysearchblog.com...]

[edited by: Marcia at 9:09 pm (utc) on Jan. 18, 2007]
