DaveAtIFG

msg:1528259 | 2:31 pm on Oct 21, 2003 (gmt 0) |
Brett made a copy of the Robots.txt File Exclusion Standard and Format [searchengineworld.com] at SEW and it's the only copy I can find anymore! First, note these statements: It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots. |
| Next, note the Format. Wild cards are only acceptable in the User-Agent field. You may block entire subdirectories or individual files. ONLY! Since this is a convention, some SEs MAY have extended the capabilities for their own spiders. If so, this information will be available on their webmaster pages. I've never seen any indication that any SE has done this.
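To make the format concrete, here is a minimal Python sketch that splits a robots.txt file into records. The sample file, its paths, and the `parse_records` helper are illustrative assumptions, not part of the standard document, and only the `User-agent` and `Disallow` fields discussed above are handled.

```python
def parse_records(text):
    """Split robots.txt text into records of (user_agents, disallow_values)."""
    records = []
    agents, disallows = [], []
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped.startswith("#"):
            # Comment-only lines are discarded and do not end a record.
            continue
        line = stripped.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            # A blank line ends the current record.
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records


# Hypothetical file: "*" acts as a wildcard only in the User-agent field.
# The Disallow values are a plain subdirectory and a plain file.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/notes.html
"""

print(parse_records(SAMPLE))
```

The Disallow values are kept as literal strings; nothing in this sketch treats them as patterns.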
|
Nick_W

msg:1528260 | 3:00 pm on Oct 21, 2003 (gmt 0) |
So the general syntax bar the regex is okay? Nick
|
DrDoc

msg:1528261 | 3:20 pm on Oct 21, 2003 (gmt 0) |
I would suggest something like this:
Disallow: /viewtopic.php?
No asterisk, and no dollar sign. I believe that would work...
|
DaveAtIFG

msg:1528262 | 3:21 pm on Oct 21, 2003 (gmt 0) |
Nope. There is no provision for regex anywhere. There is a provision for a wildcard in the User-agent field: you can name a specific user-agent or use "*" to apply the record to all of them. As for Disallow, the document says: "The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved." |
| Dr Doc's suggestion will block every file named "/viewtopic.php?" but will not block any file named "/viewtopic.php?something-after-the-question-mark" according to the protocol doc. [edited by: DaveAtIFG at 3:28 pm (utc) on Oct. 21, 2003]
|
Nick_W

msg:1528263 | 3:25 pm on Oct 21, 2003 (gmt 0) |
Actually, as we're talking about Google in this particular instance, take a look at this: [google.com] I have tried that, but to no avail, so I wondered if there was something else wrong with my syntax. I do have a whole bunch of stuff shamelessly copied from BT's robots file above those statements I've posted here? Nick
|
DrDoc

msg:1528264 | 3:28 pm on Oct 21, 2003 (gmt 0) |
Well, looking at Google's own robots.txt, they are using the question mark, but without a regexp style look.
|
jdMorgan

msg:1528265 | 3:32 pm on Oct 21, 2003 (gmt 0) |
Nick_W, Also watch out for the order that you put your two User-agent records in. A robot will accept the first record which matches its User-agent name or "*" -- whichever comes first. So, your "Googlebot" record must be first, followed by the "*" record. Googlebot will find its record, read it, and leave. Others will find the Googlebot record, ignore it because it does not match their User-agent name, and then accept the "*" record as a match. AFAIK, the only "big" search engine that supports extensions to the Standard for Robots Exclusion is Google, as documented in their Webmaster Help section. Jim
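The ordering point can be sketched in a few lines of Python. This models only the first-match behaviour described in this post; individual crawlers may select records differently. The record contents and the `select_record` helper are hypothetical.

```python
def select_record(records, robot_name):
    """Return the Disallow list of the first record that names the robot or '*'."""
    name = robot_name.lower()
    for agents, disallows in records:
        for agent in agents:
            # Case-insensitive substring match on the robot name, or the wildcard.
            if agent == "*" or agent.lower() in name:
                return disallows
    return []  # no matching record: nothing is disallowed


# Hypothetical records as (user_agents, disallow_values) pairs.
GOOGLEBOT_FIRST = [(["Googlebot"], ["/private/"]), (["*"], ["/"])]
WILDCARD_FIRST = list(reversed(GOOGLEBOT_FIRST))

print(select_record(GOOGLEBOT_FIRST, "Googlebot/2.1"))  # ['/private/']
print(select_record(WILDCARD_FIRST, "Googlebot/2.1"))   # ['/']
print(select_record(GOOGLEBOT_FIRST, "SomeOtherBot"))   # ['/']
```

With the "*" record first, a first-match robot never reaches its own record.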
|
Nick_W

msg:1528266 | 3:32 pm on Oct 21, 2003 (gmt 0) |
Yes, and it's on directories not files so that's hard to apply to my situation... Nick
|
DrDoc

msg:1528267 | 3:35 pm on Oct 21, 2003 (gmt 0) |
It doesn't matter if it's on directories or files... /foobar? could be either the directory 'foobar', or even a file named 'foobar'. The important thing to remember is what Dave said: | any URL that starts with this value will not be retrieved |
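A minimal Python sketch of that "starts with" rule, assuming plain string prefix comparison with no wildcard handling. The `is_disallowed` helper and the test paths are made up for illustration.

```python
def is_disallowed(url_path, disallow_values):
    """True if url_path starts with any non-empty Disallow value."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(value and url_path.startswith(value) for value in disallow_values)


rules = ["/viewtopic.php?"]
for path in ["/viewtopic.php?", "/viewtopic.php?t=42", "/viewtopic.php", "/index.php?t=42"]:
    print(path, is_disallowed(path, rules))
```

Under this reading the first two paths are blocked and the last two are not, because only they begin with the full value including the question mark.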
|
|
Nick_W

msg:1528268 | 3:38 pm on Oct 21, 2003 (gmt 0) |
Yeah. I got that, but it was not my practical experience of it ;) Guess I'll just strip down that file and run some more tests... Thanks guys Nick
|
DrDoc

msg:1528269 | 3:38 pm on Oct 21, 2003 (gmt 0) |
Any time ;) I also found this: [google.com...] | 12. How do I tell Googlebot not to crawl dynamically generated pages on my site? The following robots.txt file will achieve this.
User-agent: Googlebot
Disallow: /*? |
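For anyone wondering how a pattern like `/*?` differs from a plain prefix, here is a rough Python sketch. It assumes `*` matches any run of characters and a trailing `$` anchors the end of the URL, which is one reading of the Googlebot extension and not something the original standard defines. The helper names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow pattern with '*' and an optional trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where each "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(url_path, pattern):
    # re.match anchors at the start of the path, mirroring the prefix rule.
    return pattern_to_regex(pattern).match(url_path) is not None


for path in ["/viewtopic.php?t=42", "/viewtopic.php", "/index.html"]:
    print(path, is_blocked(path, "/*?"))
```

In this sketch `/*?` blocks any path that contains a question mark, so only the first path above is blocked.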
|
|
Nick_W

msg:1528270 | 3:40 pm on Oct 21, 2003 (gmt 0) |
Ooooooooh! It's like Christmas morning! nick makes a dive for a shell and Vim.... ;-) Nick
|
|