msnbot keeps crawling dynamic URLs like:
http://www.example.com/content/index.php?action=page;u=000
even though I have excluded them in my robots.txt file for several years using:
User-agent: *
Disallow: /content/*action=
As msnbot is supposed to handle wildcards, I thought that this would work (the rule works fine with Yahoo and Google). I'm wondering if there is anything else that I could try? I've checked the IP addresses of the bots and they are coming from MS, so it's not someone spoofing msnbot.
I thought that maybe I ought to specify msnbot directly and try one (or all) of these alternatives:
User-agent: msnbot
Disallow: /content/index.php?action=*
Disallow: /content/*action=*
Disallow: /content/*action=
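If I went that route, I suppose the whole file would end up looking something like this (just a sketch of what I have in mind; I'd repeat the rule under the msnbot group, since as I understand it a bot only obeys the most specific User-agent group that matches it, not the * group as well):
User-agent: *
Disallow: /content/*action=

# msnbot reads only its own group, so the rule is repeated here
User-agent: msnbot
Disallow: /content/*action=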
Any suggestions?
According to their documentation, they only claim to handle wildcards in the context of disallowing by filetype, giving, for example:
Disallow: /*.jpeg$
to prevent JPEG image files from being crawled. However, I see nothing to indicate that they'd support a "wildcard in the middle and at the end" such as what you want/need. Note that they require the "$" at the end as an "end-anchor", which wouldn't work with your application, since your query isn't fully specified and end-anchored.
Unfortunately, it looks like it's time for "Plan B" if you have one, such as using search-engine-friendly URLs devoid of query strings. If you do that, then you can redirect any requested query-string URL to the proper static URL, or outright 403 or 410 them.
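For what it's worth, on Apache with mod_rewrite, something along these lines in .htaccess would serve a 410 for those query-string URLs (just a sketch; the "action=" test is based only on your example URL, so adjust it to your own setup):
RewriteEngine On
# If the query string carries an "action=" parameter...
RewriteCond %{QUERY_STRING} (^|&)action= [NC]
# ...answer requests for the forum script with 410 Gone
RewriteRule ^content/index\.php$ - [G]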
Jim
Unfortunately, this is Plan B. The main pages already use SE-friendly URLs; with that wildcard rule I was hoping to exclude all the 'dynamic' URLs. It works for Google and Yahoo, but...
I wonder. Is it possible to exclude file extensions on a directory basis for msnbot?
The reason I ask is this: the main part of my site uses .php extensions and needs to be indexed, but all the forum URLs that I want to exclude also use .php extensions, while the forum's SE-friendly URLs use .html.
So if the forum sits in a directory called 'forum', could I have a rule something like:
Disallow: /forum/*.php$
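So the msnbot group would end up something like this (just a sketch, assuming the forum lives under /forum/ and its SE-friendly pages are all .html):
User-agent: msnbot
# hoping this blocks the forum's .php URLs while leaving the
# forum's .html pages and the main site's .php pages alone
Disallow: /forum/*.php$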