


Yahoo Slurp misbehaving?

How Your Robots.txt May Play a Role

   
9:28 pm on Jan 5, 2007 (gmt 0)

10+ Year Member



User Agent : Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Server time : 20:43:23 1/05/2007
Apparent IP : 74.6.86.48
Remote Host : ct501212.inktomisearch.com

I have modified my robots.txt file recently, but not in a way that should cause this, I think.

BUT I did put in a rule that mentioned Slurp specifically; would that mean it won't read the
User-agent: * directives that come later in the file?

User-agent: Googlebot-Image
Disallow: /
#
# User-agent: Mediapartners-Google
# Disallow: /
#
User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/

9:43 pm on Jan 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I noticed the same today :

74.6.86.210 went for a page in a disallowed directory but which is linked to from outside that directory

74.6.86.70 went for a page in the same disallowed directory which is not linked to from anywhere outside that directory

74.6.87.71 went for the default of the same disallowed directory - which does not exist - and got a 403.

The site has one User-agent: * in the robots.txt disallowing that directory to all.

10:54 pm on Jan 6, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You don't really have those "#" characters between records do you?

A blank line is required after each record, whether or not you have comment lines starting with "#". There's even one 'bot that used to consider a robots.txt file to be invalid without a blank line at the very end.
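
For example, the Slurp and wildcard records from the file above, each followed by a blank line as described, would be laid out something like:

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/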

Just a thought.

Jim

12:24 pm on Jan 7, 2007 (gmt 0)

10+ Year Member



Ha, that's it!

I thought I was being clever.

Thank you Jim.

8:53 pm on Jan 7, 2007 (gmt 0)

10+ Year Member



Um, no it isn't. It did it again, after I'd changed my robots.txt file.

Note that other SE bots, including other Slurp bots, are not tripping the filter.

Just this one:

Apparent IP : 74.6.86.48
Remote Host : ct501212.inktomisearch.com

Perhaps it takes time to digest the revised file?

11:12 pm on Jan 10, 2007 (gmt 0)

10+ Year Member



Nope, it's still doing it:

User Agent : Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Server time : 21:25:30 1/09/2007
Apparent IP : 74.6.86.220
Remote Host : ct501087.inktomisearch.com

9:31 am on Jan 17, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I've been having exactly the same thing happening, but it's hammering the directory that's only used if a site is suspended, which is ridiculous.

Are you on virtual hosting?

11:01 am on Jan 17, 2007 (gmt 0)

10+ Year Member



Yes, though I have to say I haven't got a notification in the last 48 hours.
It may have got the message.

Tim

6:54 pm on Jan 17, 2007 (gmt 0)

10+ Year Member



Tigertom stated: "BUT I did put in a rule that mentioned Slurp specifically; would that mean it won't read the User-agent: * directives that come later in the file?"

That is correct. When there is an agent-specific rule, the crawler applies that rule, not the generic rule. The "User-Agent: *" rule is used only if no other rule matches, not applied in addition to other matching rules.

His listed /robots.txt file says:
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/

So Slurp is disallowed from /bloop/ and /blop/, but not disallowed from /shop/, /forum/ and /cgi-bin/.
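
For anyone who wants to sanity-check this locally, Python's standard-library robots.txt parser follows the same convention Tim describes: a matching named record is used on its own, and the * record is only a fallback. A minimal sketch (the test paths are made up):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Slurp matches its own record, so only /bloop/ and /blop/ are blocked for it.
print(rp.can_fetch("Slurp", "/bloop/page.html"))        # False
print(rp.can_fetch("Slurp", "/shop/page.html"))         # True - the * record is ignored
# A bot with no record of its own falls back to User-agent: *
print(rp.can_fetch("SomeOtherBot", "/shop/page.html"))  # False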

7:07 pm on Jan 17, 2007 (gmt 0)

10+ Year Member



I had changed it to the directive order below, and the Bad Slurp Bot still accessed the disallowed sub-directory. As I said, it seems to have stopped now, but is the following correct?:

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

i.e. will all bots avoid the bottrap directory, and Slurp also avoid the specified ones, if they're obedient little bots?

7:20 pm on Jan 17, 2007 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If I read Tim correctly (and that's Yahoo's Tim, by the way), then no, Slurp will only obey the Slurp-specified directive.

Now, big question and possibly big can of worms: Do other 'bots process specific and generic directives the same as Slurp? Or, do some obey both specific and generic? I've never happened to run across anything on this.

10:25 pm on Jan 17, 2007 (gmt 0)

10+ Year Member



Ok, a simple solution.

To make sure, just add the extra Disallow directives to the Slurp block (and the other Bots'). Problem solved.

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

I think you put the wildcard record first, for all the other, less important bots, and then the bot-specific records.

If a bot comes across its name, it'll stop there.

I _think_ that's right :)

My objective here: To stop the most important SE bots indexing the insubstantial CMS pages in /bloop/ and /blop/, except the Adsense bot :).
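
If it helps, that objective can be checked with the same standard-library parser as in the earlier sketch (test paths made up; only the relevant records included). Mediapartners-Google matches no named record here, so it falls through to User-agent: *, which does not block /bloop/:

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Slurp", "/bloop/page.html"))                 # False
print(rp.can_fetch("Slurp", "/shop/page.html"))                  # False - now in Slurp's record
print(rp.can_fetch("Mediapartners-Google", "/bloop/page.html"))  # True - only * applies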

Thanks for the help.

6:16 am on Jan 18, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The reason I asked if it's virtual hosting is that mine had nothing to do with my site; there was ample evidence of it being an issue with the server configuration.

Tech support found the problem and I was just notified that there was, indeed, a problem with the A-name setup and the error has been corrected.

It looked like an infinite loop was happening, the way Slurp was hammering away, and I believe that there's still something Yahoo needs to check into, since theirs is the only crawler that ran into this mess.

I filled out the support form for Search yesterday with as much detail as I could at the time. I wish there had at least been an auto-response, so that I could get some more details from the host, in addition to what I've already found out, and give them a follow-up, because if it's happened to some now it could well happen to others in the future.

11:47 am on Jan 18, 2007 (gmt 0)

10+ Year Member



I think I was wrong again. The GoogleBot tripped my latest robots.txt file. I'm now trying this configuration:

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

The idea is that the main bots get to their own directives and stop, while the rest carry on to the wildcard.

3:47 pm on Jan 18, 2007 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You can shorten this up a bit:

User-agent: Slurp
User-agent: msnbot
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
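
One note on this shortened form: consecutive User-agent lines above a single set of Disallow lines form one record that each of those bots matches, and Python's built-in parser reads it the same way. A trimmed sketch (shortened rule list, made-up test path):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Slurp
User-agent: msnbot
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# All three named bots share the combined record.
for bot in ("Slurp", "msnbot", "googlebot"):
    print(bot, rp.can_fetch(bot, "/bloop/page.html"))    # False for each
print(rp.can_fetch("SomeOtherBot", "/bloop/page.html"))  # True - only the * record applies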

5:27 pm on Jan 18, 2007 (gmt 0)

10+ Year Member



Ah, interesting. Thank you, Jim.

9:07 pm on Jan 18, 2007 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Let me reiterate that Slurp can be (and was) tripped up by a server misconfiguration in virtual hosting. That being the case, nothing can be done on the site itself; it's a server issue.

That is EXACTLY what happened in my case, and once my host identified and corrected the issue, it stopped completely and has been 100% back to normal.

I've received an exceptionally nice response from the support team, including a reference to published information on Yahoo's site about what to do about crawl issues:

How can I reduce the number of requests you make on my web site? [help.yahoo.com]

That would apply under normal circumstances, but in my case it was a situation of being caught in an endless loop - now fixed, but it took some digging to find the cause. I also got a ton of referrers from other sites on the same server, which showed up in Webalizer, of all places.

More information on the YSearch Blog

[ysearchblog.com...]
