Forum Moderators: goodroi

Robots.txt specific query

     
12:50 pm on Sep 24, 2001 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator ianturner is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 19, 2001
posts:3449
votes: 10


If a robots.txt file has

User-agent: *
Disallow: /this.htm

will this disallow robots from all instances of this.htm within the site, or just from the one in the home directory?

1:09 pm on Sept 24, 2001 (gmt 0)

Moderator from DK 

WebmasterWorld Administrator 10+ Year Member

joined:Oct 23, 2000
posts:2530
votes: 1


IMO the syntax you describe would disallow the file this.htm no matter where you place it.
But let's see what the experts say ;)

In the meantime you can take a look at WebmasterWorld's own

Robots Checker [searchengineworld.com]

It validates a robots.txt file and has some nice info on robots.txt.

1:21 pm on Sept 24, 2001 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38047
votes: 11


This is a question where ambiguity reigns. According to the spec, a URL is blocked only if it STARTS with the Disallow value:

This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html

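User-agent: *
Disallow: /this.htm

That would get only the root-level file:
/this.htm

But it would also get anything that starts with it, such as:
/this.htm/rocks

It would not get /dir/this.htm, since that URL does not start with /this.htm.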

The problem is that I don't believe all spiders follow the spec that way. They tend to do sliding regexes.
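For anyone who wants to check the prefix rule by hand, here is a minimal Python sketch of the matching as the spec text quoted above describes it. The function name is_disallowed and the sample paths are made up for illustration, and this is only the by-the-spec behaviour, not what any particular spider actually does.

```python
def is_disallowed(path, disallow_values):
    """Return True if `path` starts with any non-empty Disallow value.

    Sketch of the plain prefix match described in the spec text;
    real spiders may match differently. Hypothetical helper name.
    """
    for value in disallow_values:
        # An empty Disallow value blocks nothing, so skip it.
        if value != "" and path.startswith(value):
            return True
    return False


if __name__ == "__main__":
    rules = ["/this.htm"]
    for path in ["/this.htm", "/this.htm/rocks", "/dir/this.htm"]:
        print(path, "blocked" if is_disallowed(path, rules) else "allowed")
```

Under this reading, /dir/this.htm stays allowed, which is why a copy of the file in a subdirectory would need its own Disallow line.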

1:27 pm on Sept 24, 2001 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator ianturner is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 19, 2001
posts:3449
votes: 10


Hmm, and I thought this was going to be easy for an expert on robots.txt files.

Oh well, off to the W3C again, though this probably won't help.

6:50 pm on Sept 24, 2001 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2001
posts:562
votes: 0


It is my understanding that spiders work in very different (and most of the time mysterious) ways. Sometimes it even seems that some spiders "eat" in directories they are not allowed in.

Regards
Kim (snipped URL.....please no signatures)

(edited by: agerhart at 6:51 pm (gmt) on Sep. 24, 2001)

6:55 pm on Sept 24, 2001 (gmt 0)

Senior Member

WebmasterWorld Senior Member agerhart is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 29, 2001
posts:2945
votes: 0


Bufferzone,

What I think you are referring to are rogue spiders, the ones that do not abide by robots.txt, which is not the case for the major search engines.

7:16 pm on Sept 24, 2001 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 24, 2001
posts:562
votes: 0


agerhart>> Thanks, and sorry for the URL. I now know it is not allowed.

Kim

7:19 pm on Sept 24, 2001 (gmt 0)

Senior Member

WebmasterWorld Senior Member agerhart is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 29, 2001
posts:2945
votes: 0


No problem Kim.....I hope that you enjoy the forums. There is a whole lot of great information here.
 
