Google Robots.txt Wildcard Feature

Please take a look at my robots.txt

seunosewa

7:09 pm on May 30, 2005 (gmt 0)

10+ Year Member



My robots.txt reads:
--------------
User-agent: *
Disallow: /forum
Disallow: /nigeria?
Disallow: /?
Disallow: /index.php

User-agent: Googlebot
Disallow: /forum
Disallow: /index.php
Disallow: /nigeria?
Disallow: /?
Disallow: /*msg
---------------
I expected the last directive to prevent Googlebot from downloading any URL with the string 'msg' in it, but Googlebot seems to be ignoring that directive and downloading files like /forum/topic-12.msg45454.html. What could be wrong?

Dijkgraaf

10:12 pm on Jun 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Read
[robotstxt.org...]

You have made several mistakes.
1) It should be Disallow: /forum/ (note the trailing slash).
2) Wildcards such as /?, /*msg and /nigeria? are not part of the standard (some bots do allow them, but Google isn't one of them). See the relevant section of the standard mentioned above:
"Note also that regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif"."
3) It probably pays to put the User-agent: * section last, as some bots act on whichever matches first: their own User-agent section or the * section.
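
To illustrate point 2, here is a minimal sketch using Python's standard-library urllib.robotparser, which (as far as I know) does plain prefix matching in the spirit of the original standard and has no wildcard support. The bot name ExampleBot and the reduced rule set are made up for the demonstration; only the wildcard-style lines are kept so that no other rule interferes.

```python
from urllib.robotparser import RobotFileParser

# Reduced, hypothetical rule set: just the wildcard-style lines.
rules = """\
User-agent: *
Disallow: /tmp/*
Disallow: /*msg
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path):
    # ExampleBot is an invented name; it falls under the "*" group.
    return parser.can_fetch("ExampleBot", path)


# A prefix-only parser reads '*' as an ordinary character, so neither
# rule is a prefix of these paths and neither path is blocked.
print(allowed("/forum/topic-12.msg45454.html"))
print(allowed("/tmp/file.html"))
```

With a parser of this kind both paths come back as allowed, which is what the quoted passage warns about.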

Reid

6:58 am on Jun 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot does allow wildcards, but it may be following the User-agent: * section instead of the Googlebot one.
User-agent: * should always be the last section, since most bots will follow either their own section or the * section, whichever comes first.
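
A rough sketch of why the order could matter. Both selection policies below are hypothetical (first_match and specific_first are names invented here), and this does not claim to describe how any particular bot really picks its section. It only shows that a "first match wins" bot would land on the * section when that section comes first.

```python
def matches(token, bot_name):
    """A section applies if its token is '*' or appears in the bot's name."""
    return token == "*" or token.lower() in bot_name.lower()


def first_match(groups, bot_name):
    """Hypothetical policy: take the first section in file order that applies."""
    for token, rules in groups:
        if matches(token, bot_name):
            return token, rules
    return None


def specific_first(groups, bot_name):
    """Hypothetical policy: prefer a named section, fall back to '*'."""
    for token, rules in groups:
        if token != "*" and matches(token, bot_name):
            return token, rules
    for token, rules in groups:
        if token == "*":
            return token, rules
    return None


# The sections from the robots.txt in the first post, in their original order.
original_order = [
    ("*", ["/forum", "/nigeria?", "/?", "/index.php"]),
    ("Googlebot", ["/forum", "/index.php", "/nigeria?", "/?", "/*msg"]),
]
reordered = list(reversed(original_order))

print(first_match(original_order, "Googlebot")[0])     # *
print(first_match(reordered, "Googlebot")[0])          # Googlebot
print(specific_first(original_order, "Googlebot")[0])  # Googlebot
```

Putting User-agent: * last gives the same result under either policy, which is why it is the safe ordering.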

disallow: /forums
will disallow a directory called forums and a file called forums in the root directory

disallow: /forums/
will disallow a directory called forums but not a file called forums.
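
The trailing-slash difference comes down to prefix matching, which can be sketched like this. is_disallowed is a made-up helper and the sample paths (including /forums.html) are only for illustration.

```python
def is_disallowed(rule, path):
    """Original-standard style check: the rule is a plain path prefix."""
    return path.startswith(rule)


# Sample paths, invented for the demonstration.
paths = ["/forums", "/forums/", "/forums/topic-1.html", "/forums.html"]

for rule in ("/forums", "/forums/"):
    blocked = [p for p in paths if is_disallowed(rule, p)]
    print(rule, "blocks", blocked)
```

Note that without the trailing slash the rule also catches anything else that merely starts with /forums, such as /forums.html.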

Reid

7:02 am on Jun 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I expected the last directive to prevent Googlebot from downloading any URL with the string 'msg' in it, but Googlebot seems to be ignoring that directive and downloading files like /forum/topic-12.msg45454.html. What could be wrong?

This should work for Googlebot. It must be using the User-agent: * section because that comes first.

Never use wildcards in the directives under User-agent: *, because most bots can't handle wildcards in the Disallow line, but Googlebot definitely can.
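
For comparison, a minimal sketch of how a wildcard-aware bot might evaluate a rule like /*msg. This is not Google's actual code: wildcard_to_regex and is_blocked are invented names, '*' is taken to mean "any run of characters", a trailing '$' is assumed to anchor the end of the URL, and '?' is treated as an ordinary character.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a Disallow value with '*' (and optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    # re.match anchors at the start of the path, like a prefix rule.
    return wildcard_to_regex(pattern).match(path) is not None


path = "/forum/topic-12.msg45454.html"
print(is_blocked("/*msg", path))   # True: wildcard-aware matching
print(path.startswith("/*msg"))    # False: plain prefix matching
```

A bot that only does prefix matching sees /*msg as a literal path that never occurs, so the rule is harmless there but also useless, which is a good reason to keep it out of the User-agent: * section.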