
Forum Moderators: goodroi


Robots.txt - Am I Missing Something?

Query strings and general syntax are suspected...

   
9:30 am on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hi everyone,

I'm running a small forum and using mod_rewrite to alter the URLs a little.

I can't seem to stop spiders from grabbing redundant files though. The relevant bit of my robots.txt will probably explain better:

User-agent: *
Disallow: /posting.php
Disallow: /viewtopic.php
Disallow: /viewforum.php
Disallow: /privmsg.php
Disallow: /profile.php
Disallow: /search.php

## This bit just for Google as I 'thought'
## it would help

User-agent: Googlebot
Disallow: /index.php?*$
Disallow: /posting.php?*$
Disallow: /viewtopic.php?*$
Disallow: /viewforum.php?*$
Disallow: /privmsg.php?*$
Disallow: /profile.php?*$
Disallow: /search.php?*$

Is there something wrong with my general syntax? Google is still picking up viewtopic?t=23&etc and similar URLs.

Many thanks for any insight, I'm truly at the hair pulling stage ;)

Nick

2:31 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brett made a copy of the Robots.txt File Exclusion Standard and Format [searchengineworld.com] at SEW and it's the only copy I can find anymore!

First, note these statements:

It is not an official standard backed by a standards body, or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that all current and future robots will use it.
Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

Next, note the Format. Wild cards are only acceptable in the User-Agent field.

You may block entire subdirectories or individual files. ONLY!

Since this is a convention, some SEs MAY have extended the capabilities for their own spiders. If so, this information will be available on their webmaster pages. I've never seen any indication that any SE has done this.

3:00 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



So the general syntax bar the regex is okay?

Nick

3:20 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member drdoc is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I would suggest something like this:

Disallow: /viewtopic.php?

No asterisk, and no dollar sign. I believe that would work...

3:21 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nope. There is no provision for regex anywhere. There is a provision for a wildcard in the User-agent field: you can specify a specific user-agent, or use "*" to match them all.

As to disallow:

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.

Dr Doc's suggestion will block every file named "/viewtopic.php?" but will not block any file named "/viewtopic.php?something-after-the-question-mark" according to the protocol doc.

[edited by: DaveAtIFG at 3:28 pm (utc) on Oct. 21, 2003]

3:25 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Actually, as we're talking about Google in this particular instance. Take a look at this.... [google.com]

I have tried that, but to no avail, so I wondered if there was something else wrong with my syntax.

I do have a whole bunch of stuff shamelessly copied from BT's robots file, above the statements I've posted here.

Nick

3:28 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member drdoc is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Well, looking at Google's own robots.txt, they are using the question mark, but without any regexp-style syntax.
3:32 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Nick_W,

Also watch out for the order that you put your two User-agent records in. A robot will accept the first record which matches its User-agent name or "*" -- whichever comes first. So, your "Googlebot" record must be first, followed by the "*" record. Googlebot will find its record, read it, and leave. Others will find the Googlebot record, ignore it because it does not match their User-agent name, and then accept the "*" record as a match.
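So, taking Jim's point and a couple of the paths from Nick's file, the corrected layout would be something like this (just a sketch, trimmed for illustration):

User-agent: Googlebot
Disallow: /search.php

User-agent: *
Disallow: /posting.php
Disallow: /search.php

Googlebot stops at its own record; every other robot skips the Googlebot record and falls through to the "*" record.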

AFAIK, the only "big" search engine that supports extensions to the Standard for Robots Exclusion is Google, as documented in their Webmaster Help section.

Jim

3:32 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yes, and it's on directories not files so that's hard to apply to my situation...

Nick

3:35 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member drdoc is a WebmasterWorld Top Contributor of All Time 10+ Year Member



It doesn't matter if it's on directories or files...

/foobar? could be either the directory 'foobar', or even a file named 'foobar'. The important thing to remember is what Dave said:

any URL that starts with this value will not be retrieved
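That prefix rule is easy to model. A minimal Python sketch (hypothetical helper name; it illustrates the original standard's plain prefix match only, no wildcards or regex):

```python
def is_disallowed(path, disallow_values):
    """Per the original robots.txt standard, a URL path is blocked
    if it starts with any Disallow value -- a plain prefix match,
    with no special meaning for '*' or '$'."""
    return any(path.startswith(value) for value in disallow_values if value)

# A plain prefix blocks the script and every query-string variant:
print(is_disallowed("/viewtopic.php?t=23", ["/viewtopic.php"]))    # True

# But "/viewtopic.php?*$" only blocks URLs that literally start with
# that string, '*' and '$' included -- which real URLs never do:
print(is_disallowed("/viewtopic.php?t=23", ["/viewtopic.php?*$"])) # False
```

Which would explain why Nick's "?*$" lines never matched anything for a standard-only robot.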
3:38 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yeah. I got that, but it was not my practical experience of it ;)

Guess I'll just strip down that file and run some more tests...

Thanks guys

Nick

3:38 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member drdoc is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Any time ;)

I also found this:

[google.com...] How do I tell Googlebot not to crawl dynamically generated pages on my site?

The following robots.txt file will achieve this.

User-agent: Googlebot
Disallow: /*?
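Under Google's extension, "*" matches any run of characters and "$" anchors the end of the URL. A rough Python model of that matching (my own approximation for illustration, not Google's actual implementation):

```python
import re

def google_rule_matches(rule, path):
    # Approximate Googlebot's extended Disallow matching by translating
    # the rule into a regex: '*' becomes '.*', '$' anchors the end,
    # everything else is matched literally.
    pattern = ""
    for ch in rule:
        if ch == "*":
            pattern += ".*"
        elif ch == "$":
            pattern += "$"
        else:
            pattern += re.escape(ch)
    return re.match(pattern, path) is not None

# Google's suggested rule blocks any URL carrying a query string:
print(google_rule_matches("/*?", "/viewtopic.php?t=23"))  # True
print(google_rule_matches("/*?", "/viewtopic.php"))       # False

# Nick's original "?*$" rules also behave as intended under this extension:
print(google_rule_matches("/viewtopic.php?*$", "/viewtopic.php?t=23"))  # True
```

Of course, this only applies to Googlebot; a standard-only robot reads those characters literally.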

3:40 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Ooooooooh!

It's like Christmas morning!

nick makes a dive for a shell and Vim....

;-)

Nick

 
