
Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt question, disallowing files
mikeD

10+ Year Member



 
Msg#: 769 posted 7:08 pm on Oct 29, 2005 (gmt 0)

Is it OK to disallow individual files, such as

Disallow: widgets.php

in your robots.txt? I ask because a syntax checker said you should only disallow directories.

 

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 769 posted 8:54 pm on Oct 30, 2005 (gmt 0)

For starters, you should always have a leading /, since every requested URL path starts with one, so it should be

Disallow: /widgets.php

You can disallow any resource. In fact, what you are disallowing is any URL starting with that item, so you could have

Disallow: /widgets

and it would disallow widgets.php, widgets.html, widgets.gif etc.
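The prefix behaviour described above can be sketched in a few lines of Python. This is only an illustration of plain prefix matching, not how any particular crawler is implemented; the function name `is_disallowed` and the sample paths are made up for the example, and real crawlers may layer extensions such as wildcards on top of this.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the URL path starts with any non-empty Disallow value.

    Hypothetical helper: it models only simple prefix matching.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


rules = ["/widgets"]
for path in ["/widgets.php", "/widgets.html", "/widgets.gif", "/gadgets.php"]:
    print(path, is_disallowed(path, rules))

# Without the leading slash the rule cannot be a prefix of a request path,
# because every request path begins with "/".
print(is_disallowed("/widgets.php", ["widgets.php"]))
```

Under this model, the rule without the leading slash simply never matches, which is one reason to always write it with the slash.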

h8dk70

5+ Year Member



 
Msg#: 769 posted 1:56 am on Oct 31, 2005 (gmt 0)

Would this line:

Disallow: /widgets

also disallow something like widgets-and-stuff.php? In other words, is putting /widgets in robots.txt equivalent to ls -l /widgets* at the OS prompt?

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 769 posted 2:21 am on Oct 31, 2005 (gmt 0)

Yes.
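As a rough check of that equivalence, the sketch below compares a plain prefix test with the pattern /widgets* using Python's standard `fnmatch` module. The sample paths are invented for illustration. One caveat: in `fnmatch` the `*` also matches "/" characters, which a real shell glob does not, so the comparison is with the pattern idea rather than with `ls` itself.

```python
from fnmatch import fnmatchcase


def prefix_match(path, rule):
    """Simple robots.txt-style test: does the path start with the rule?"""
    return path.startswith(rule)


def glob_match(path, rule):
    """The same rule treated as a pattern with a trailing wildcard."""
    return fnmatchcase(path, rule + "*")


sample_paths = [
    "/widgets",
    "/widgets.php",
    "/widgets-and-stuff.php",
    "/widgets/archive/old.php",
    "/my-widgets.php",
]

for path in sample_paths:
    print(path, prefix_match(path, "/widgets"), glob_match(path, "/widgets"))
```

Both tests agree on every sample path, including the nested one, so a Disallow value behaves like a pattern with an implied trailing wildcard that also covers everything inside a matching directory.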

Animated

5+ Year Member



 
Msg#: 769 posted 10:41 pm on Nov 1, 2005 (gmt 0)

Do you need to close it with a / too? Like Disallow: /widget.php/

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 769 posted 11:02 pm on Nov 1, 2005 (gmt 0)

> do you need to close it with a / too? like Disallow: /widget.php/

No, that would not be correct, since a filename can't really end with /. But even if it's a directory, it is wise NOT to include the last /.

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 769 posted 11:37 pm on Nov 1, 2005 (gmt 0)

> [even if it's a] directory it is wise to NOT include last /.

An interesting comment. What is the reason for this recommendation?

Jim

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 769 posted 1:00 am on Nov 2, 2005 (gmt 0)

I think he is recommending this in case a bot finds a link to /adirectory without a trailing slash. That URL won't match Disallow: /adirectory/, so the bot will request it and will either be redirected to /adirectory/ or be served content from that directory directly. It is also possible that some bots request the URL given in a redirect without checking the new URL against robots.txt.

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 769 posted 1:21 am on Nov 2, 2005 (gmt 0)

> An interesting comment. What is the reason for this recommendation?

Dijkgraaf is spot on with this one. To add to it: some web servers seem NOT to issue a redirect, so the bot never gets a chance to re-check the new URL (with the slash) against robots.txt and thus unintentionally "violates" robots.txt. I ran into a few of these and ended up stripping trailing slashes from robots.txt Disallow directives, to ensure that my bot won't crawl URLs the webmaster clearly did not want crawled, even though technically it would have been the webmaster's fault.

Not specifying slashes is the wisest way because it catches all possibilities.
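The difference between the two forms can be tried out with Python's standard `urllib.robotparser`, which does simple prefix matching on Disallow values. This is a sketch only: the helper `blocked_paths`, the agent name "ExampleBot" and the sample paths are assumptions made for the example, and other crawlers may interpret rules differently.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(disallow_value, paths, agent="ExampleBot"):
    """Return the paths a robots.txt with one Disallow line would block.

    Hypothetical helper: it builds a two-line robots.txt in memory.
    """
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return [p for p in paths if not parser.can_fetch(agent, p)]


paths = ["/adirectory", "/adirectory/", "/adirectory/page.html"]

with_slash = blocked_paths("/adirectory/", paths)
without_slash = blocked_paths("/adirectory", paths)

print("Disallow: /adirectory/ blocks", with_slash)
print("Disallow: /adirectory  blocks", without_slash)
```

With the trailing slash, the bare /adirectory URL slips through; without it, all three are blocked. The trade-off is that the slash-less rule is broader and would also block an unrelated URL that merely starts the same way, such as /adirectory-other.html.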

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 769 posted 1:37 am on Nov 2, 2005 (gmt 0)

Thanks for the very good points.

Fault-tolerant is good!

Jim
