Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt question, disallowing files
mikeD




msg:1528906
 7:08 pm on Oct 29, 2005 (gmt 0)

Is it OK to disallow files such as

Disallow: widgets.php

in your robots.txt? I ask because a syntax checker said you should only disallow directories.

 

Dijkgraaf




msg:1528907
 8:54 pm on Oct 30, 2005 (gmt 0)

For starters, you should always have a leading /, since every requested URL path starts with one, so it should be

Disallow: /widgets.php

You can disallow any resource. In fact, what you are disallowing is any URL starting with that string, so you could have

Disallow: /widgets

and it would disallow widgets.php, widgets.html, widgets.gif etc.
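The prefix rule described above can be sketched in a few lines of Python. This is only an illustration of simple "starts with" matching, not how any particular crawler is implemented; the function name `is_disallowed` is hypothetical, and wildcard extensions supported by some search engines are deliberately ignored.

```python
def is_disallowed(path, rules):
    """Return True if the URL path starts with any Disallow value."""
    # An empty Disallow value blocks nothing, so empty rules are skipped.
    return any(rule and path.startswith(rule) for rule in rules)


rules = ["/widgets"]
for path in ["/widgets.php", "/widgets.html", "/widgets.gif", "/gadgets.php"]:
    print(path, "blocked" if is_disallowed(path, rules) else "allowed")

# Without the leading slash the rule never matches under plain prefix
# matching, because every requested path begins with "/".
print(is_disallowed("/widgets.php", ["widgets.php"]))
```

Under this simple model, `Disallow: widgets.php` is silently ineffective rather than an error, which is probably why a leading / is the safer habit.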

h8dk70




msg:1528908
 1:56 am on Oct 31, 2005 (gmt 0)

Would this line:

Disallow: /widgets

also disallow something like widgets-and-stuff.php? In other words, is putting /widgets in robots.txt equivalent to ls -l /widgets* at the OS prompt?

Dijkgraaf




msg:1528909
 2:21 am on Oct 31, 2005 (gmt 0)

Yes.
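One way to check this for yourself is with Python's standard-library `urllib.robotparser`, which applies the same prefix matching. The sketch below is an assumption-laden test harness, not a statement about how any given search engine behaves: the user-agent name `ExampleBot` and the helper `blocked` are made up for illustration. One difference from the shell analogy is worth noting: the prefix also covers anything under a /widgets/ directory, which `ls /widgets*` would not list recursively.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# A rule group needs a User-agent line before its Disallow lines.
parser.parse([
    "User-agent: *",
    "Disallow: /widgets",
])


def blocked(path):
    """True if the hypothetical ExampleBot may not fetch this path."""
    return not parser.can_fetch("ExampleBot", path)


for path in ["/widgets-and-stuff.php", "/widgets/sub/page.html", "/gadgets.php"]:
    print(path, "blocked" if blocked(path) else "allowed")
```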

Animated




msg:1528910
 10:41 pm on Nov 1, 2005 (gmt 0)

Do you need to close it with a / too? Like Disallow: /widget.php/

Lord Majestic




msg:1528911
 11:02 pm on Nov 1, 2005 (gmt 0)

> do you need to close it with a / too? like Disallow: /widget.php/

No. It would not be correct, since a filename can't really end with /. But even if it's a directory, it is wise NOT to include the trailing /.

jdMorgan




msg:1528912
 11:37 pm on Nov 1, 2005 (gmt 0)

> [even if it's a] directory it is wise to NOT include last /.

An interesting comment. What is the reason for this recommendation?

Jim

Dijkgraaf




msg:1528913
 1:00 am on Nov 2, 2005 (gmt 0)

I think he is recommending this in case a bot finds a link to /adirectory without a trailing slash. That URL won't match Disallow: /adirectory/, so the bot will request it and will either be redirected to /adirectory/ or be served content from that directory directly. It is also possible that some bots request the URL given in a redirect without checking the new URL against robots.txt.

Lord Majestic




msg:1528914
 1:21 am on Nov 2, 2005 (gmt 0)

> An interesting comment. What is the reason for this recommendation?

Dijkgraaf is spot on with this one. To add: some web servers seem NOT to issue a redirect, so the bot won't get a chance to re-check the new URL (with slash) against robots.txt and can thus unintentionally "violate" robots.txt. I had a few of these and ended up removing trailing slashes from robots.txt's Disallow directives to ensure that my bot won't crawl URLs that the webmaster clearly did not want crawled, even though technically it would have been the webmaster's fault.

Not specifying trailing slashes is the wisest way because it catches all possibilities.
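The difference between the two forms can be seen with a small Python sketch of prefix matching. It assumes plain "starts with" semantics and uses a hypothetical helper named `matches`; the directory name /adirectory is taken from the posts above, while /adirectory-other.html is an invented path added to show the trade-off.

```python
def matches(rule, path):
    """Plain prefix match between a Disallow value and a URL path."""
    return path.startswith(rule)


paths = [
    "/adirectory",             # link without the trailing slash
    "/adirectory/",            # the directory itself
    "/adirectory/page.html",   # a file inside it
    "/adirectory-other.html",  # unrelated file sharing the prefix
]

for rule in ["/adirectory/", "/adirectory"]:
    hits = [p for p in paths if matches(rule, p)]
    print(f"Disallow: {rule} matches {hits}")
```

With the trailing slash, the bare /adirectory URL slips through. Without it, everything is caught, but so is any unrelated URL that happens to share the prefix, so the fault-tolerant form is also the broader one.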

jdMorgan




msg:1528915
 1:37 am on Nov 2, 2005 (gmt 0)

Thanks for the very good points.

Fault-tolerant is good!

Jim


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved