Robots.txt - Am I Missing Something? - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld

Forum Moderators: goodroi


Robots.txt - Am I Missing Something?

Query strings and general syntax are suspected...

Nick_W

9:30 am on Oct 21, 2003 (gmt 0)

Hi everyone,

I'm running a small forum and using mod_rewrite to alter the URLs a little.

I can't seem to stop spiders from grabbing redundant files though. The relevant bit of my robots.txt will probably explain better:

User-agent: *
Disallow: /posting.php
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /search.php

## This bit just for Google as I 'thought'
## it would help

User-agent: Googlebot
Disallow: /index.php?*$
Disallow: /posting.php?*$
Disallow: /viewtopic.php?*$
Disallow: /viewforum.php?*$
Disallow: /privmsg.php?*$
Disallow: /profile.php?*$
Disallow: /search.php?*$

Is there something wrong with my general syntax? G is still picking up viewtopic?t=23&etc and similar URLs.

Many thanks for any insight, I'm truly at the hair pulling stage ;)

Nick

DaveAtIFG

2:31 pm on Oct 21, 2003 (gmt 0)

Brett made a copy of the Robots.txt File Exclusion Standard and Format [searchengineworld.com] at SEW and it's the only copy I can find anymore!

First, note these statements:

It is not an official standard backed by a standards body, or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that all current and future robots will use it.
Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.

Next, note the Format. Wild cards are only acceptable in the User-Agent field.

You may block entire subdirectories or individual files. ONLY!

Since this is a convention, some SEs MAY have extended the capabilities for their own spiders. If so, this information will be available on their webmaster pages. I've never seen any indication that any SE has done this.

Nick_W

3:00 pm on Oct 21, 2003 (gmt 0)

So the general syntax bar the regex is okay?

Nick

DrDoc

3:20 pm on Oct 21, 2003 (gmt 0)

I would suggest something like this:

Disallow: /viewtopic.php?

No asterisk, and no dollar sign. I believe that would work...

DaveAtIFG

3:21 pm on Oct 21, 2003 (gmt 0)

Nope. There is no provision for regex anywhere. There is a provision for a wildcard in the User-agent field: you can name a specific user-agent or use "*" to address them all.

As to disallow:

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.

Dr Doc's suggestion will block every file named "/viewtopic.php?" but will not block any file named "/viewtopic.php?something-after-the-question-mark" according to the protocol doc.

[edited by: DaveAtIFG at 3:28 pm (utc) on Oct. 21, 2003]

Nick_W

3:25 pm on Oct 21, 2003 (gmt 0)

Actually, as we're talking about Google in this particular instance, take a look at this... [google.com]

I have tried that, but to no avail, so I wondered if there was something else wrong with my syntax.

I do have a whole bunch of stuff shamelessly copied from BT's robots file above those statements I've posted here?

Nick

DrDoc

3:28 pm on Oct 21, 2003 (gmt 0)

Well, looking at Google's own robots.txt, they are using the question mark, but not in a regexp style.

jdMorgan

3:32 pm on Oct 21, 2003 (gmt 0)

Nick_W,

Also watch out for the order that you put your two User-agent records in. A robot will accept the first record which matches its User-agent name or "*" -- whichever comes first. So, your "Googlebot" record must be first, followed by the "*" record. Googlebot will find its record, read it, and leave. Others will find the Googlebot record, ignore it because it does not match their User-agent name, and then accept the "*" record as a match.

AFAIK, the only "big" search engine that supports extensions to the Standard for Robots Exclusion is Google, as documented in their Webmaster Help section.

Jim
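The record-order point can be sketched in a few lines of Python. This is only a model of the "first matching record wins" behaviour described in the post above, not any real crawler's parser; the function names and the cut-down sample files are hypothetical, and some robots may instead pick the most specific record regardless of order.

```python
def parse_records(text: str) -> list[tuple[list[str], list[str]]]:
    """Split robots.txt text into (user_agents, disallow_values) records."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue  # comment-only lines do not end a record
        line = stripped.split("#", 1)[0].strip()
        if not line:
            # A blank line closes the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def first_matching_record(records, robot_name: str) -> list[str]:
    """Return the Disallow values of the first record naming the robot or '*'."""
    for agents, rules in records:
        for agent in agents:
            if agent == "*" or agent.lower() in robot_name.lower():
                return rules
    return []


# Hypothetical cut-down versions of the file discussed in this thread.
STAR_FIRST = """User-agent: *
Disallow: /viewtopic.php

User-agent: Googlebot
Disallow: /viewtopic.php?
"""

GOOGLEBOT_FIRST = """User-agent: Googlebot
Disallow: /viewtopic.php?

User-agent: *
Disallow: /viewtopic.php
"""

print(first_matching_record(parse_records(STAR_FIRST), "Googlebot"))
print(first_matching_record(parse_records(GOOGLEBOT_FIRST), "Googlebot"))
```

Under this model, with the "*" record first, Googlebot never reaches its own record; with the Googlebot record first, it does, and other robots still fall through to "*".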

Nick_W

3:32 pm on Oct 21, 2003 (gmt 0)

Yes, and it's on directories not files so that's hard to apply to my situation...

Nick

DrDoc

3:35 pm on Oct 21, 2003 (gmt 0)

It doesn't matter if it's on directories or files...

/foobar? could be either the directory 'foobar', or even a file named 'foobar'. The important thing to remember is what Dave said:

any URL that starts with this value will not be retrieved
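To make the "starts with" rule concrete, here is a minimal Python sketch. It assumes the robot compares each Disallow value against the path plus query string as a plain prefix, with no wildcard handling; the helper name is_disallowed is hypothetical, and /viewforum.php?f=2 is a made-up URL used only for contrast.

```python
def is_disallowed(url_path: str, disallow_rules: list[str]) -> bool:
    """Return True if url_path starts with any non-empty Disallow value."""
    return any(bool(rule) and url_path.startswith(rule) for rule in disallow_rules)


rules = ["/viewtopic.php?"]

# Prefix comparison: anything after the "?" is still covered by the rule.
print(is_disallowed("/viewtopic.php?t=23", rules))
# The bare script name lacks the "?", so this rule does not cover it.
print(is_disallowed("/viewtopic.php", rules))
# A different script is unaffected.
print(is_disallowed("/viewforum.php?f=2", rules))
```

An empty Disallow value is skipped here because the standard treats it as "nothing is disallowed".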

Nick_W

3:38 pm on Oct 21, 2003 (gmt 0)

Yeah. I got that, but it was not my practical experience of it ;)

Guess I'll just strip down that file and run some more tests...

Thanks guys

Nick

DrDoc

3:38 pm on Oct 21, 2003 (gmt 0)

Any time ;)

I also found this:

[google.com...] How do I tell Googlebot not to crawl dynamically generated pages on my site?

The following robots.txt file will achieve this.

User-agent: Googlebot
Disallow: /*?
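As a rough illustration of how such a wildcard rule could be evaluated, the Python sketch below treats "*" as "any run of characters" and matches from the start of the path. That reading, and the handling of a trailing "$" as an end-of-URL anchor, are assumptions about Google's extension rather than something shown in this thread; the function name and the static test path are hypothetical.

```python
import re


def google_style_match(pattern: str, url_path: str) -> bool:
    """Match url_path against a Disallow pattern with '*' wildcards.

    Assumed semantics: '*' matches any sequence of characters, a trailing
    '$' anchors the end of the URL, otherwise the pattern is a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the string, giving prefix behaviour.
    return re.match(regex, url_path) is not None


# "/*?" covers any URL containing a query string...
print(google_style_match("/*?", "/viewtopic.php?t=23"))
# ...but leaves a static-looking path alone.
print(google_style_match("/*?", "/viewtopic-23.html"))
```

With this reading, a single "Disallow: /*?" line would cover every dynamic URL, so the per-script rules in the Googlebot record become unnecessary.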

Nick_W

3:40 pm on Oct 21, 2003 (gmt 0)

Ooooooooh!

It's like Christmas morning!

nick makes a dive for a shell and Vim....

;-)

Nick