Forum Moderators: goodroi


Why is msnbot ignoring this robots.txt rule?

Wildcard disallow ignored


bouncybunny

1:45 am on Jun 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



MSNBOT is crawling URLs like this:

http://www.example.com/content/index.php?action=page;u=000

even though I have excluded them in my robots.txt file for several years using:

User-agent: *
Disallow: /content/*action=

As msnbot is supposed to handle wildcards, I thought that this would work (the rule works fine with Yahoo and Google). Is there anything else that I could try? I've checked the IP addresses of the bots and they are coming from Microsoft, so it's not someone spoofing msnbot.

I thought that maybe I ought to target msnbot directly and try one (or all) of these alternatives:

User-agent: msnbot
Disallow: /content/index.php?action=*
Disallow: /content/*action=*
Disallow: /content/*action=
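A quick way to check which of these patterns would match the URL in question, under Google-style wildcard semantics (this is just a sketch of the matching rules, not how any particular bot actually parses robots.txt):

```python
import re

def is_disallowed(rule: str, url_path: str) -> bool:
    """Match a Disallow rule against a URL path + query string,
    treating * as 'any run of characters' and a trailing $ as an
    end-of-URL anchor (assumed Google-style wildcard semantics)."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    # Rules match from the start of the URL path
    return re.match(pattern, url_path) is not None

url = "/content/index.php?action=page;u=000"
for rule in ("/content/*action=",
             "/content/index.php?action=*",
             "/content/*action=*"):
    print(rule, "->", is_disallowed(rule, url))
```

Under those semantics all three variants match the example URL, so if msnbot honoured wildcards the way Google does, any of them should work.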

Any suggestions?

jdMorgan

2:23 am on Jun 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> As msnbot is supposed to handle wildcards

According to their documentation, they only claim to handle wildcards in the context of disallowing by filetype, giving, for example:

Disallow: /*.jpeg$

to prevent JPEG image files from being crawled. However, I see nothing to indicate that they'd support a "wildcard in the middle and at the end" such as what you want/need. Note that they require the "$" at the end as an "end-anchor", which wouldn't work with your application, since your query isn't fully specified and end-anchored.

Unfortunately, it looks like it's time for "Plan B" if you have one, such as using search-engine-friendly URLs devoid of query strings. If you do that, then you can redirect any requested query-string URL to the proper static URL, or outright 403 or 410 them.
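If the 403/410 route is an option, one minimal sketch for Apache with mod_rewrite (assuming the dynamic URLs all carry an "action" query parameter and live under /content/, as in the example above; adjust names to suit):

```apache
# .htaccess (or vhost config) -- requires mod_rewrite
RewriteEngine On
# Any request under /content/ whose query string contains an action= parameter
RewriteCond %{QUERY_STRING} (^|[&;])action=
RewriteRule ^content/ - [G]
```

The [G] flag returns 410 Gone; swap it for a redirect to the static equivalent if you'd rather 301 than 410.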

Jim

bouncybunny

12:17 pm on Jun 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Jim.

Unfortunately, this is plan B. The main pages already use search-engine-friendly URLs; the wildcard rule was my way of excluding all the 'dynamic' URLs. It works for Google and Yahoo, but...

I wonder. Is it possible to exclude file extensions on a directory basis for msnbot?

The reason I ask is this. The main part of my site uses .php extensions and needs to be indexed, but all the forum URLs that I want to exclude also use .php extensions - the forum's search-engine-friendly URLs use .html.

So if the forum sits in a directory called 'forum', could I have a rule something like:

Disallow: /forum/*.php$
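One caveat worth testing before relying on that rule: a trailing $ anchors at the very end of the URL, query string included, so a dynamic URL like /forum/index.php?action=page no longer ends in .php and would slip past an end-anchored rule. A sketch, assuming Google-style wildcard matching (* = any run of characters, trailing $ = end-of-URL anchor):

```python
import re

def is_disallowed(rule: str, url_path: str) -> bool:
    # Assumed Google-style semantics: * matches any run of characters,
    # a trailing $ anchors the rule at the very end of the URL.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

print(is_disallowed("/forum/*.php$", "/forum/index.php"))               # True: blocked
print(is_disallowed("/forum/*.php$", "/forum/index.php?action=page"))   # False: query string defeats the $ anchor
```

If the goal is to catch the dynamic URLs too, dropping the $ (Disallow: /forum/*.php) avoids that trap - assuming, of course, that msnbot honours the wildcard at all, which is the open question in this thread.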