Forum Moderators: goodroi

Wildcards in robots.txt for forum software

         

Asia_Expat

8:05 am on Oct 31, 2006 (gmt 0)

10+ Year Member



I've been trying very hard to manage my IPB installation carefully with consideration for SEO. I have put together the following robots file...

User-agent: *
Disallow: /advertise/
Disallow: /forum/index.php?act=idx
Disallow: /forum/index.php?act=Login
Disallow: /forum/index.php?act=Search
Disallow: /forum/index.php?act=Shoutbox
Disallow: /forum/index.php?act=Reg
Disallow: /forum/index.php?act=Msg
Disallow: /forum/index.php?act=Mail
Disallow: /forum/index.php?act=Forward
Disallow: /forum/index.php?act=Track
Disallow: /forum/index.php?act=Post
Disallow: /forum/index.php?act=Print
Disallow: /forum/index.php?act=ST
Disallow: /forum/index.php?act=boardrules
Disallow: /forum/index.php?act=Help
Disallow: /forum/index.php?act=Stats
Disallow: /forum/index.php?act=Members
Disallow: /forum/index.php?act=Online
Disallow: /forum/index.php?act=calendar
Disallow: /forum/index.php?act=SR
Disallow: /forum/index.php?act=ICQ
Disallow: /forum/index.php?act=MSN
Disallow: /forum/index.php?act=AOL
Disallow: /forum/index.php?act=AIM
Disallow: /forum/index.php?act=SC
Disallow: /forum/index.php?act=task
Disallow: /forum/index.php?act=findpost
Disallow: /forum/index.php?act=UserCP
Disallow: /forum/index.php?&act=
Disallow: /forum/index.php?act=report
Disallow: /forum/index.php?act=buddy
Disallow: /forum/index.php?act=legends
Disallow: /forum/index.php?CODE=
Disallow: /forum/index.php?automodule
Disallow: /forum/index.php?act=attach
Disallow: /forum/index.php?&&CODE=
Disallow: /forum/index.php?&debug=1
Disallow: /forum/index.php?act=Profile
Disallow: /forum/index.php?showuser

... but I'm confused about wildcards, and no matter how much I read, I don't understand what can and can't be done. Basically, I also need to exclude URLs that carry '&mode=threaded', like the one below (as well as many other parameters that get appended to the URL)...

http://www.#######.com/forum/index.php?showtopic=695&mode=threaded

Please help me understand what can be done with robots.txt, if anything, to stop indexing all these extensions.

Thanks.

goodroi

4:05 pm on Nov 1, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Most search engine bots do not support wildcards, but Google does. With a wildcard rule in your robots.txt you can, for example, block all files ending in .gif, which prevents Google from accessing those images.

If you want a robots.txt that works for all search engines, you should not rely on wildcards, since not all engines support them.

Asia_Expat

8:38 pm on Nov 1, 2006 (gmt 0)

10+ Year Member



But is there any harm in me using the wildcard for the bots that DO support it?
How do wildcards work? What's the syntax I should use?

goodroi

1:14 pm on Nov 2, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There is no harm in using it for bots that do support it. Here is the information on how to do it for Google. [google.com...]

Just remember to also address your site issues for search engines that do not support wildcards aka pattern matching.
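
For readers wondering what pattern matching looks like in practice, here is a minimal Python sketch of how a Google-style rule might be evaluated. It assumes that rules match from the start of the URL path, that `*` matches any run of characters, and that a trailing `$` anchors the end of the URL. The function name `rule_matches` and the sample rules are illustrative only; they are not taken from the thread or from Google's page.

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a Disallow rule with Google-style wildcards covers path.

    Assumed semantics: rules match from the start of the path, '*' matches
    any sequence of characters, and a trailing '$' anchors the end.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

topic = "/forum/index.php?showtopic=695"

# Hypothetical rule aimed at the threaded view from the original question.
print(rule_matches("/forum/*&mode=threaded", topic + "&mode=threaded"))  # True
print(rule_matches("/forum/*&mode=threaded", topic))                     # False

# Hypothetical rule blocking every URL that ends in .gif.
print(rule_matches("/*.gif$", "/forum/style/logo.gif"))      # True
print(rule_matches("/*.gif$", "/forum/style/logo.gif?x=1"))  # False
```

Under these assumptions, a line such as `Disallow: /forum/*&mode=threaded` placed in a `User-agent: Googlebot` section would cover the threaded view while leaving the plain topic URL crawlable. Bots without wildcard support would most likely treat the `*` as a literal character and never match it.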

sssweb

7:57 pm on Nov 13, 2006 (gmt 0)

10+ Year Member



Hi Asia_Expat,

I'm attempting a similar task for a phpBB forum. If you want to disallow bots on ALL index.php pages that end in variable strings, you can simply use the following:

Disallow: /forum/index.php?

As I understand it (someone please correct me if I'm wrong), that disallows any page that STARTS with /forum/index.php? -- which would be all your variable pages.

At the same time, it allows /forum/index.php (without a variable string), because that URL lacks the trailing ? and so does not begin with the disallowed string.
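
The prefix behaviour described above can be checked with a small Python sketch. It assumes a bot with no wildcard support that simply tests whether the URL path starts with the Disallow value; `is_disallowed` is a hypothetical helper written for this illustration, not part of any crawler.

```python
def is_disallowed(rule: str, path: str) -> bool:
    """Plain prefix match, assuming no wildcard support in the bot."""
    return path.startswith(rule)

rule = "/forum/index.php?"

for path in (
    "/forum/index.php",                # bare index, no query string
    "/forum/index.php?act=Login",      # utility page
    "/forum/index.php?showtopic=695",  # topic page
):
    verdict = "blocked" if is_disallowed(rule, path) else "allowed"
    print(path, "->", verdict)
```

Only the bare index comes out as allowed. Note that the topic URL is blocked as well, because it also begins with `/forum/index.php?`.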

Asia_Expat

1:40 am on Nov 14, 2006 (gmt 0)

10+ Year Member



Hi there, thanks... but doing that would disallow every topic on IPB, since topic URLs also begin with /forum/index.php? (for example, index.php?showtopic=695).