User-agent: *
Disallow: /widgets/
I am obviously trying to block all bots from the widgets directory.
I have created a new directory that I want search engines to visit. That directory is:
/widgets-for-sale/
Can someone tell me if my robots.txt will prevent search engines from visiting /widgets-for-sale/?
I would have thought not, but I put up the new directory (/widgets-for-sale/) and linked to it, and search engines have not visited the new pages in about 10 days.
Thanks.
Robots.txt Disallow rules match URL-paths by prefix. Therefore, with:
User-agent: *
Disallow: /widgets/
It will have no effect on URL-paths that *start with* "/widgets-for-sale/"
See A Standard for Robots Exclusion [robotstxt.org].
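You can check this for yourself with Python's standard-library urllib.robotparser, which matches Disallow rules as URL-path prefixes (the hostname www.example.com is just a stand-in here):

```python
from urllib.robotparser import RobotFileParser

# Build a parser from the poster's exact robots.txt rules.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /widgets/",
])

# Anything whose path starts with "/widgets/" is blocked...
print(rp.can_fetch("*", "http://www.example.com/widgets/blue.htm"))   # False
# ...but "/widgets-for-sale/" does not start with "/widgets/", so it is allowed.
print(rp.can_fetch("*", "http://www.example.com/widgets-for-sale/"))  # True
```

So by a strict reading of the standard, that robots.txt should not keep bots out of /widgets-for-sale/.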
Jim
I checked and made sure that I linked to the directory with a trailing '/'.
If you have a trailing '/' in robots.txt, then it is possible that some bots will request that directory without the slash, which is technically 100% legal as far as robots.txt is concerned. It is therefore a good idea NOT to have a trailing slash in Disallow directives - this covers all eventualities.
In your case, however, dropping the slash would exclude the other directory too, since it starts with the same prefix. A tough dilemma - the best solution would probably be to rename the directory to avoid the clash.
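To illustrate the dilemma: under prefix matching, a slash-less Disallow: /widgets rule blocks both directories. A quick check with Python's standard-library urllib.robotparser (www.example.com stands in for the poster's site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /widgets",  # no trailing slash
])

# The slash-less rule now covers every eventuality for /widgets/...
print(rp.can_fetch("*", "http://www.example.com/widgets"))            # False
print(rp.can_fetch("*", "http://www.example.com/widgets/"))           # False
# ...but it also blocks the directory the poster WANTS crawled.
print(rp.can_fetch("*", "http://www.example.com/widgets-for-sale/"))  # False
```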
it is therefore a good idea NOT to have a trailing slash in Disallow directives - this covers all eventualities
I'm not sure this is accurate. If I want to disallow the directory /widgets.htm/ but NOT the page /widgets.htm I need the trailing slash. Surely omitting it just makes the whole thing less accurate?
The only page a robot could mistakenly access would be the index page for the directory /widgets.htm/, if it asked for just /widgets.htm - but surely this would be incorrect behaviour on the part of the bot, since a request for /widgets is asking for a file called widgets, whereas a request for /widgets/ is asking for the root of the directory?
I'm not sure this is accurate.
It depends on your point of view. Consider the following example:
-----------------------
User-agent: *
Disallow: /dir/
-----------------------
The intention here is to disallow crawling of anything inside /dir/, including the root of the directory. Looks fine, but don't forget that bots check whether the URL they are about to retrieve starts with the Disallow value. Still sounds good? Not really - it is perfectly valid to request the root of a directory without a / at the end, i.e. http://www.example.com/dir - that URL won't be matched by the robots.txt above, and it's all perfectly valid.
Some or even most webservers will issue a redirect to the proper URL - http://www.example.com/dir/ - which gives the bot a chance to match the URL, and it should do so; however, re-evaluating robots.txt on redirect is not supported by all bots.
So, the conclusion is that a trailing / in Disallow directives may lead to some bots supposedly violating robots.txt when in reality they followed it strictly, and it is the webmaster who is at fault.
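The /dir vs /dir/ mismatch is easy to demonstrate with Python's standard-library urllib.robotparser, which does plain prefix matching:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /dir/",
])

# The slash-less form of the very same directory is NOT matched by "/dir/":
print(rp.can_fetch("*", "http://www.example.com/dir"))   # True - crawlable
# Only the trailing-slash form is blocked:
print(rp.can_fetch("*", "http://www.example.com/dir/"))  # False
```

A bot that requests /dir, never sees the redirect re-checked, and fetches the directory index has followed the robots.txt to the letter.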
it is perfectly valid to request the root of a directory without specifying / at the end, i.e. http://www.example.com/dir
I think this is the statement I disagree with - http://www.example.com/dir is a request for a file called 'dir' in the root of www.example.com. Most web servers (if a file called 'dir' does not exist) perform a 'courtesy' redirect to the actual directory URL - /dir/ - but robots should not request directories without the slash, since that is a request for a file, not for a directory index.
I don't know if the index file for a directory is an HTTP spec thing - I suppose, more properly, the links would be to /dir/index.htm.
"Directories require a trailing slash"
[httpd.apache.org...]
"IIS first treats [dir] as a file that it should give back to the browser. If this file cannot be found, IIS checks to see if there is a directory with this name"
[support.microsoft.com...]
I think this is the statement I disagree with - http://www.example.com/dir is a request for a file called 'dir' in the root of www.example.com.
This is not correct - http://www.example.com/dir is a URI. The client that requests it has no clue whether it is a directory or a file or anything else - that is entirely up to the server to decide. As far as the client is concerned, it requested http://www.example.com/dir, and that is a perfectly valid request.
Indeed, some or even many webservers will issue a redirect - but not all bots re-check robots.txt upon redirect, and some servers don't issue a redirect at all: they just detect that it's a directory and avoid the cost of the redirect.
It is therefore better to avoid a trailing slash in Disallow statements. This is not theory - I have personally come across a few webmasters who claimed my bot violated their robots.txt, but on checking it transpired that the situation described above had happened, and it was technically their fault.
Recapping the situation. My robots.txt:
User-agent: *
Disallow: /widgets/
My directory was called:
/widgets-for-sale/
but it was not being indexed (and I wanted it to be). So I changed the directory to:
/SynonymForWidgets-for-sale/
The directory was crawled the next day by both googlebot and msn, neither of which had crawled the /widgets-for-sale/ directory in the previous 10 days.
This leads me to believe that my initial robots.txt was somehow telling bots not to visit that directory. I don't understand how these things work - just passing on what I saw, in the hope that it helps someone else as well.
So, in the case of my bot, it would skip both directories, and it seems logical to me that other search engines may take the same approach.
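For anyone wondering how a bot could end up skipping both directories under a Disallow: /widgets/ rule: a matcher that strips the trailing slash from each Disallow rule before prefix-matching would behave that way. This is a hypothetical sketch - I don't know the actual internals of any particular crawler:

```python
def is_disallowed(path, disallow_rules):
    """Prefix-match `path` against Disallow rules, ignoring any
    trailing slash on the rule (hypothetical bot behaviour)."""
    for rule in disallow_rules:
        prefix = rule.rstrip("/")
        if prefix and path.startswith(prefix):
            return True
    return False

# "Disallow: /widgets/" with the slash stripped blocks BOTH directories:
print(is_disallowed("/widgets/blue.htm", ["/widgets/"]))   # True
print(is_disallowed("/widgets-for-sale/", ["/widgets/"]))  # True
```

That would explain the observed behaviour: renaming the directory so it no longer shares the /widgets prefix made the clash go away.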