I'm running a small forum and using mod_rewrite to alter the URLs a little.
I can't seem to stop spiders from grabbing redundant files though. The relevant bit of my robots.txt will probably explain better:
## This bit just for Google as I 'thought'
## it would help
Is there something wrong with my general syntax? G is still picking up viewtopic?t=23&etc and similar URLs.
Many thanks for any insight, I'm truly at the hair pulling stage ;)
First, note these statements:
It is not an official standard backed by a standards body, or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that all current and future robots will use it.
Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.
Next, note the format. Wildcards are only acceptable in the User-agent field.
You may block ONLY entire subdirectories or individual files.
Since this is a convention, some SEs MAY have extended the capabilities for their own spiders. If so, this information will be available on their webmaster pages. I've never seen any indication that any SE has done this.
As to disallow:
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
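The prefix rule quoted above can be sketched in a few lines of Python. This is only an illustration of "any URL that starts with this value will not be retrieved"; the function name and the sample rules are hypothetical, and real crawlers may apply extra normalisation that is not modelled here.

```python
def is_disallowed(url_path, disallow_values):
    """Return True if url_path starts with any non-empty Disallow value."""
    # An empty Disallow value disallows nothing, so it is skipped.
    return any(value and url_path.startswith(value) for value in disallow_values)


# Hypothetical rules, for illustration only.
rules = ["/viewtopic.php?", "/private/"]

print(is_disallowed("/viewtopic.php?t=23", rules))   # True: starts with "/viewtopic.php?"
print(is_disallowed("/viewtopic.php", rules))        # False: no "?" so the prefix does not match
print(is_disallowed("/private/notes.html", rules))   # True: inside the blocked directory
```

Under plain prefix matching, a value ending in "?" still matches longer URLs that carry a query string after it.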
Dr Doc's suggestion will block every URL that starts with "/viewtopic.php?", which includes any URL of the form "/viewtopic.php?something-after-the-question-mark", according to the protocol doc.
[edited by: DaveAtIFG at 3:28 pm (utc) on Oct. 21, 2003]
I have tried that, but to no avail, so I wondered if there was something else wrong with my syntax.
I do have a whole bunch of stuff, shamelessly copied from BT's robots file, above the statements I've posted here. Could that be the cause?
Also watch out for the order that you put your two User-agent records in. A robot will accept the first record which matches its User-agent name or "*" -- whichever comes first. So, your "Googlebot" record must be first, followed by the "*" record. Googlebot will find its record, read it, and leave. Others will find the Googlebot record, ignore it because it does not match their User-agent name, and then accept the "*" record as a match.
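To see why the order matters under the first-match behaviour described above, here is a minimal Python sketch. It is an assumption-laden model, not any crawler's real parser: the function names, the sample paths, and the "Slurp" robot name are all made up for illustration, and individual robots may select records differently.

```python
def parse_records(text):
    """Split robots.txt text into (user_agents, disallows) records, in file order."""
    records, agents, disallows = [], [], []
    for raw in text.splitlines():
        if not raw.strip():
            # A blank line ends the current record.
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: ignore without ending the record
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records


def select_record(records, robot_name):
    """Return the Disallow list of the FIRST record naming this robot or '*'."""
    name = robot_name.lower()
    for agents, disallows in records:
        if any(a == "*" or (a and a in name) for a in agents):
            return disallows
    return []


GOOGLEBOT_FIRST = """\
User-agent: Googlebot
Disallow: /viewtopic.php?

User-agent: *
Disallow: /admin/
"""

STAR_FIRST = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /viewtopic.php?
"""

print(select_record(parse_records(GOOGLEBOT_FIRST), "Googlebot"))  # ['/viewtopic.php?']
print(select_record(parse_records(GOOGLEBOT_FIRST), "Slurp"))      # ['/admin/']
print(select_record(parse_records(STAR_FIRST), "Googlebot"))       # ['/admin/']
```

With the "*" record first, a first-match robot named Googlebot never reaches its own record, which is the failure mode the ordering advice guards against.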
AFAIK, the only "big" search engine that supports extensions to the Standard for Robots Exclusion is Google, as documented in their Webmaster Help section.
I also found this:
[google.com...] How do I tell Googlebot not to crawl dynamically generated pages on my site?
The following robots.txt file will achieve this.
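The file from Google's page is not reproduced in this thread. As a rough sketch only, assuming the extension in question lets "*" in a Disallow value stand for any run of characters, a Googlebot-only rule such as "Disallow: /*?" (an assumed example, not quoted from the page) would match any URL containing a question mark. The helper names below are hypothetical.

```python
import re


def wildcard_to_regex(disallow_value):
    """Translate a Disallow value containing '*' wildcards into a compiled regex."""
    # Escape the literal pieces and join them with '.*' where each '*' stood.
    return re.compile(".*".join(re.escape(part) for part in disallow_value.split("*")))


def is_blocked(url_path, disallow_value):
    """True if the pattern matches from the start of url_path (prefix semantics)."""
    return wildcard_to_regex(disallow_value).match(url_path) is not None


# Assumed pattern: block every URL that contains a query string.
pattern = "/*?"

print(is_blocked("/viewtopic?t=23&start=15", pattern))  # True: contains "?"
print(is_blocked("/viewtopic-23.html", pattern))        # False: no "?"
```

Because robots that follow only the original convention would read "/*?" as a literal prefix, a rule like this belongs in a record addressed to Googlebot alone.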