Robots.txt question, disallowing files

     
7:08 pm on Oct 29, 2005 (gmt 0)

10+ Year Member



Is it OK to disallow individual files in robots.txt, such as

Disallow: widgets.php

A syntax checker told me that Disallow should only be used for directories.

8:54 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For starters, you should always include a leading /, since every requested URL path starts with one, so it should be

Disallow: /widgets.php

You can disallow any resource. In fact, what you are disallowing is any URL starting with that string, so you could have

Disallow: /widgets

and it would disallow widgets.php, widgets.html, widgets.gif etc.
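To illustrate the prefix rule described above, here is a minimal Python sketch. The function name is_disallowed is made up for this example, and it assumes only plain prefix matching on the URL path, which is a simplification of what a real crawler does.

```python
def is_disallowed(url_path, disallow_rules):
    """Return True if url_path starts with any non-empty Disallow value.

    Hypothetical helper: plain prefix matching only, no wildcards,
    no Allow lines, no per-user-agent groups.
    """
    return any(rule != "" and url_path.startswith(rule) for rule in disallow_rules)


# "/widgets" blocks every path that begins with that string.
print(is_disallowed("/widgets.php", ["/widgets"]))
print(is_disallowed("/widgets.gif", ["/widgets"]))

# Without the leading slash the rule can never be a prefix of a URL path,
# because every requested path starts with "/".
print(is_disallowed("/widgets.php", ["widgets.php"]))
```

Under these assumptions the first two checks report a match and the last one does not, which is why the leading / matters.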

1:56 am on Oct 31, 2005 (gmt 0)

5+ Year Member



Would this line:

Disallow: /widgets

also disallow something like widgets-and-stuff.php? In other words, is putting /widgets in robots.txt equivalent to ls -l /widgets* at the OS prompt?

2:21 am on Oct 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes.
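
One way to check this yourself is with the robots.txt parser in Python's standard library. This is only a sketch: the user-agent name ExampleBot and the helper blocked are made up for the example, and other crawlers may interpret rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Feed the rules in directly instead of fetching a live robots.txt.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /widgets",
])


def blocked(path):
    """Hypothetical helper: True if the parser refuses this path."""
    return not parser.can_fetch("ExampleBot", path)


for path in ("/widgets.php", "/widgets-and-stuff.php", "/gadgets.php"):
    print(path, "blocked" if blocked(path) else "allowed")
```

The comparison is made against the URL path as a string, so unlike ls it says nothing about which files actually exist on disk.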
10:41 pm on Nov 1, 2005 (gmt 0)

5+ Year Member



Do you need to close it with a / too? Like Disallow: /widget.php/

11:02 pm on Nov 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> do you need to close it with a / too? like Disallow: /widget.php/

No, that would not be correct, since a filename can't really end with /. But even if it's a directory, it is wise NOT to include the trailing /.

11:37 pm on Nov 1, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



> [even if its a] directory it is wise to NOT include last /'.

An interesting comment. What is the reason for this recommendation?

Jim

1:00 am on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think he is recommending this in case a bot finds a link to /adirectory without a trailing slash. That URL won't match Disallow: /adirectory/, so the bot will request it and will either be redirected to /adirectory/ or be served content from that directory directly. It is possible that some bots request the URL given in a redirect without checking the new URL against robots.txt.

1:21 am on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> An interesting comment. What is the reason for this recommendation?

Dijkgraaf is spot on with this one. To add: some web servers seem NOT to issue a redirect, so the bot never gets a chance to re-check the new URL (with the slash) against robots.txt and thus unintentionally "violates" robots.txt. I ran into a few of these and ended up stripping trailing slashes from robots.txt Disallow directives to ensure that my bot won't crawl URLs that the webmaster clearly did not want crawled, even though technically it would have been the webmaster's fault.

Not specifying slashes is the wisest way because it catches all possibilities.
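
A small Python sketch of the difference, assuming plain prefix matching. The helper matches and the sample paths are made up for illustration and are not taken from any real bot.

```python
def matches(rule, url_path):
    """Hypothetical helper: a Disallow value matches if it is a prefix of the path."""
    return url_path.startswith(rule)


paths = ["/adirectory", "/adirectory/", "/adirectory/page.html"]

# With the trailing slash, the bare "/adirectory" URL slips through.
with_slash = [p for p in paths if matches("/adirectory/", p)]

# Without it, all three forms are caught.
without_slash = [p for p in paths if matches("/adirectory", p)]

print("Disallow: /adirectory/ blocks", with_slash)
print("Disallow: /adirectory  blocks", without_slash)
```

The trade-off is that the shorter rule also catches unrelated paths sharing the prefix, such as /adirectory-old.html, so it is worth checking that nothing you want crawled starts with the same string.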

1:37 am on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Thanks for the very good points.

Fault-tolerant is good!

Jim

 
