Dijkgraaf

msg:1528815 | 8:44 pm on May 18, 2006 (gmt 0) |
No, wildcards aren't part of the standard and aren't supported by most bots. However, in your example you could use Disallow: /page.asp?page or even Disallow: /page.asp? which would allow /page.asp but not /page.asp?page=1 etc., because a Disallow rule blocks any URL beginning with what you specify.
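As a rough illustration of that prefix behaviour, here is a minimal Python sketch. The helper name is_disallowed is made up for this example, and it assumes plain prefix matching with no wildcard support, as described above. It is not any particular bot's code.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path begins with any non-empty Disallow value.

    Assumption: plain prefix matching, no wildcards. Empty values are
    skipped because a blank Disallow blocks nothing.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


rules = ["/page.asp?"]  # i.e. "Disallow: /page.asp?"

for candidate in ("/page.asp", "/page.asp?page=1", "/page.asp?sort=asc"):
    print(candidate, "blocked" if is_disallowed(candidate, rules) else "allowed")
```

With the shorter rule /page.asp? every query-string variant is caught, while the bare /page.asp stays crawlable.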
|
Reid

msg:1528816 | 6:43 am on May 30, 2006 (gmt 0) |
Wildcards are valid in User-agent, and for googlebot they are also valid in Disallow:

User-agent:googlebot
Disallow:/page.asp?*

User-agent:*
Disallow:

This allows everything for all user agents except googlebot, which is disallowed from every page.asp URL containing a ?. Make sure you put the googlebot directive first, or else googlebot will follow the User-agent:* directive.
|
Pfui

msg:1528817 | 8:58 pm on May 30, 2006 (gmt 0) |
1.) For any newcomers following along, be sure you include a space after any colon:

User-agent: Badbot
Disallow: /file.html
Disallow: /directory

2.) A blank Disallow is the same as saying Allow. Adding a forward slash will turn away all crawlers inclined to heed the most basic robots.txt:

User-agent: *
Disallow: /

3.) Nowadays bots' preferences can be (head-bangingly) unique and specific, so it's always a good idea to go to the source:

[search.msn.com...] See also: EXAMPLES [search.msn.com] (including wildcards)
[robotstxt.org...]
[google.com...] See also: EXAMPLES [google.com] (including wildcards)
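For anyone who wants to check point 2 without waiting for a crawler, here is a small sketch using Python's standard-library urllib.robotparser. The helper name allowed and the sample path are just for illustration, and this parser is only one implementation, so real bots may behave differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, path):
    """Parse robots.txt lines and ask whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


blank_disallow = ["User-agent: *", "Disallow:"]    # blank value: nothing blocked
slash_disallow = ["User-agent: *", "Disallow: /"]  # slash: everything blocked

print(allowed(blank_disallow, "Badbot", "/file.html"))
print(allowed(slash_disallow, "Badbot", "/file.html"))
```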
|
Pfui

msg:1528818 | 9:03 pm on May 30, 2006 (gmt 0) |
4.) Googlebot's info doesn't have to come first. From the EXAMPLES page, above:

When creating your robots.txt file, please keep the following in mind: When deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".

(Odd. I couldn't edit my just-made post to add that bit.)
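Taken literally, that quoted rule amounts to a two-pass lookup, which can be sketched in Python as below. The function name select_record, the (user_agent, rules) record layout, and the case-insensitive comparison are all assumptions made for this example. It is a reading of the quoted text, not Googlebot's actual code.

```python
def select_record(records, bot_name):
    """Pick the rules a bot would obey under the quoted wording.

    records: list of (user_agent, disallow_rules) tuples in file order.
    Pass 1: first record whose User-agent starts with bot_name.
    Pass 2: if none matched, first record whose User-agent is "*".
    """
    wanted = bot_name.lower()  # assumption: matching ignores case
    for agent, rules in records:
        if agent.lower().startswith(wanted):
            return rules
    for agent, rules in records:
        if agent == "*":
            return rules
    return None  # no applicable record at all


# Generic block first, named bot second.
records = [
    ("*", ["/"]),
    ("Googlebot", ["/page.asp?"]),
]

print(select_record(records, "Googlebot"))
print(select_record(records, "SomeOtherBot"))
```

Under this reading the named record wins even when the generic block comes first in the file. Whether every bot works that way is a separate question.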
|
Reid

msg:1528819 | 11:40 pm on May 30, 2006 (gmt 0) |
I've seen other posts on WW from people having problems with googlebot using the wrong directive. It turns out they had User-agent: * before User-agent: googlebot. It's always safe to assume that any bot will follow the first valid directive it finds, so User-agent: * (meaning all bots that have not been named directly) should always be at the end of the robots.txt file. Just good coding practice anyway.
|
Reid

msg:1528820 | 12:05 am on May 31, 2006 (gmt 0) |
I just took a closer look at the examples here from MSN and Google. You have to be careful.

Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".

This statement, if read carefully, does not say what will happen when BOTH directives are used. Bots do not double back and check the robots.txt file twice; a bot will likely follow the first valid directive, either User-agent: * or User-agent: googlebot (whichever comes first).

Also, the MSN example does NOT say that psbot will recognize any *, and for MSNbot it ONLY shows the file extension wildcard (note that $ must be present). It does not say that MSNbot allows the use of wildcards in any other way than with file extensions. Example:

Disallow: *.gif$ ($ must be used)
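To make the * and $ notation concrete, here is a hedged Python sketch of one common interpretation: * stands for any run of characters and a trailing $ pins the match to the end of the URL. The helper names are invented for this example, and the semantics are an assumption. Each bot's own documentation remains the authority on what it accepts.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow value into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the rule applies to path, matching from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


for rule, path in [
    ("*.gif$", "/images/logo.gif"),
    ("*.gif$", "/images/logo.gif?size=2"),
    ("/page.asp?*", "/page.asp?page=1"),
    ("/page.asp?*", "/page.asp"),
]:
    print(rule, path, matches(rule, path))
```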
|
Pfui

msg:1528821 | 12:53 am on May 31, 2006 (gmt 0) |
FWIW, my robots.txt files (multiple sites) begin with the generic block -- User-agent: * Disallow: / -- followed by all other entries in alphabetical order, some very detailed (msnbot, Googlebot, Slurp), some simple. All specifically identified major SE bots find 'their' instructions and follow them 99% of the time. Now if only they'd all follow the same conventions! (See: AdsBot-Google's robots.txt specs [webmasterworld.com].)
|